I've started looking into CouchDB of late; I think it may be a great solution to some of our common issues around collecting "survey" style data from users. Our current approach uses a big nasty hierarchy of Hibernate entities.
I wanted to have some real, and real-world, data. Since I'm a big board gamer, I thought it would be nice to have a local copy of the BoardGameGeek database. I wrote a program that scrapes data from the BGG site and loads it into CouchDB.
#!/usr/bin/ruby
require 'rubygems'
require 'hpricot'
require 'open-uri'
require 'thread'
require 'couchrest'
# Extension to Hpricot
class Hpricot::Elem
  # Content at expr. If a block is given, the return value comes from
  # yielding the actual content to the block.
  def cat(expr)
    result = self.at(expr)
    return nil unless result
    content = result.inner_html.fixup
    return (yield content) if block_given?
    return content
  end

  # Content-at converted to an int (via to_i).
  def cati(expr)
    self.cat(expr) { |content| content.to_i }
  end

  def search_to_text(expr)
    self.search(expr).map { |e| e.inner_text.fixup }
  end
end
$BGG = "http://boardgamegeek.com"
$catalog_pages = 0
$game_pages = 0
$games_added = 0
# Access to couch db. From what I can tell, the Database is multithreaded.
$DB = CouchRest.database!("http://127.0.0.1:5984/board-game-geek")
# The thread pool.
$POOL_SIZE = 10
$QUEUE = Queue.new
workers = (1..$POOL_SIZE).map do |i|
  Thread.new("worker #{i}") do
    # Pull actions off the queue until it drains. The termination test is
    # racy (the queue may refill after a worker exits), but it's good
    # enough for a one-shot loader.
    begin
      action = $QUEUE.deq
      action.call()
    end until $QUEUE.empty?
  end
end

def enqueue &action
  $QUEUE << action
end
class String
  # Undo the HTML entity escaping that creeps into BGG content.
  def fixup()
    self.gsub("&#039;", "'").gsub("&amp;", "&")
  end
end
def parse_and_load_game(game_id)
  $game_pages += 1
  page = Hpricot.XML(open("#$BGG/xmlapi/boardgame/#{game_id}?comments=1&stats=1"))
  bg = page.at("//boardgame")
  ratings = bg.at("//ratings")
  comments = (bg/"comment").map do |e|
    {
      "user" => e[:username].fixup,
      "comment" => e.inner_text.fixup
    }
  end
  doc = {
    "_id" => game_id,
    "title" => bg.cat("name[@primary='true']"),
    "description" => bg.cat("description"),
    "designers" => bg.search_to_text("boardgamedesigner"),
    "artists" => bg.search_to_text("boardgameartist"),
    "publishers" => bg.search_to_text("boardgamepublisher"),
    "published" => bg.cati("yearpublished"),
    "categories" => bg.search_to_text("boardgamecategory"),
    "mechanics" => bg.search_to_text("boardgamemechanic"),
    "images" => {
      "url" => bg.cat("image"),
      "thumbnailUrl" => bg.cat("thumbnail")
    },
    "players" => {
      "min" => bg.cati("minplayers"),
      "max" => bg.cati("maxplayers"),
      "age" => bg.cati("age")
    },
    "stats" => {
      "rank" => ratings.cati("rank"),
      "averageRating" => ratings.cat("average") { |content| content.to_f },
      "ownedCount" => ratings.cati("owned")
    },
    "comments" => comments
  }
  enqueue do
    $games_added += 1
    $DB.save_doc(doc)
  end
end
def process_game(game_id)
  begin
    doc = $DB.get(game_id)
    # Found, so there's nothing to do.
  rescue RestClient::ResourceNotFound
    # Not in the database yet, so fire off a request to parse its page.
    enqueue { parse_and_load_game game_id }
  end
end
def parse_catalog_page(url)
  puts("[%24s] %4d catalog pages, %4d/%4d games parsed/added (%4d actions queued)" %
    [Time.now.ctime, $catalog_pages, $game_pages, $games_added, $QUEUE.length])
  doc = Hpricot(open(url))
  $catalog_pages += 1
  doc.search("//table[@id='collectionitems']/tr/td[3]//a") do |elem|
    href = elem[:href]
    game_id = href.split('/').last
    enqueue { process_game game_id }
  end
  next_page_link = doc.at("//a[@title='next page']")
  return unless next_page_link
  next_page = next_page_link[:href]
  # Add this last, to give the other actions a chance to operate.
  # A better mechanism would be a priority-based queue, where the catalog
  # page parse action is lower priority than the other actions.
  enqueue { parse_catalog_page($BGG + next_page) }
end
# Kick it off
start_page = ARGV[0] || ($BGG + "/browse/boardgame")
enqueue { parse_catalog_page start_page }
# Wait for all the workers to complete
workers.each { |th| th.join }
puts "BoardGameGeek loader complete."
puts "Queue not empty!" unless $QUEUE.empty?
The key pieces here are Hpricot, to parse the HTML and XML, and CouchRest, to get the data into CouchDB.
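Stripped of everything else, the pattern those two libraries give you is just a few lines; a distilled sketch (the URL and database name are placeholders):

require 'rubygems'
require 'hpricot'
require 'open-uri'
require 'couchrest'

# Parse any page, pull out a value, and store it as a document.
db  = CouchRest.database!("http://127.0.0.1:5984/scratch")  # created if missing
doc = Hpricot(open("http://example.com/"))
db.save_doc("_id" => "example", "title" => doc.at("title").inner_text)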
Did I go overboard? I don't think so ... this runs in about two hours and creates a database of nearly 40,000 documents (about 340MB). Using a thread pool just seemed to make sense, since (outside of the XML parsing) every aspect of this is I/O bound: pulling data from BGG or pushing data to CouchDB.
The code is naive about some threading issues and simply crashes if there's an error. Oh well.
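If it ever needs to be sturdier, the minimal fix would be a rescue inside each worker, so one bad page doesn't take out a whole thread. A sketch of that change (untested):

Thread.new("worker #{i}") do
  begin
    action = $QUEUE.deq
    begin
      action.call()
    rescue => e
      # Log the failure and let the worker keep draining the queue.
      puts "worker #{i} error: #{e.message}"
    end
  end until $QUEUE.empty?
end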
Next up: learning how to build views against this data and deciding how to use it all. I'm thinking Cappuccino.
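I haven't built any views yet, but my understanding is that a view is just a design document holding a JavaScript map function. Something like this (a guess at the shape, using CouchRest and the fields stored above; the names are mine) would index games by mechanic:

$DB.save_doc({
  "_id" => "_design/games",
  "views" => {
    "by_mechanic" => {
      "map" => <<-JS
        function(doc) {
          (doc.mechanics || []).forEach(function(mechanic) {
            emit(mechanic, doc.title);
          });
        }
      JS
    }
  }
})

# Later, query it with something like:
$DB.view("games/by_mechanic", :key => "Tile Placement")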
5 comments:
Man... where the hell you find time to play around like you do?
I was digging CouchDB and Cappuccino at JSConf2009 last weekend. Cool stuff.
Massimo: I find time to skim the surface of things I'd prefer to dive into deeply, and then usually on the weekend, crowded around everything else. Still, I have more freedom than most.
This should be really easy to do using Groovy + Tagsoup and the database of your choice. Maybe even using GORM as I think GORM is now more modularized and could be used in a standard Groovy app with a little research. The Groovy XML parser is great and with Tagsoup scraping HTML is quite easy.
kdorff: If I wanted to dirty my hands with the JVM, I would have used Clojure :-)