I've started looking into CouchDB of late; I think it may be a great solution to some of our common issues around collecting "survey" style data from users. Our current approach uses a big nasty hierarchy of Hibernate entities.
I wanted to have some real, and real-world, data. Since I'm a big board gamer, I thought it would be nice to have a local copy of the BoardGameGeek database. I wrote a program that scrapes data from the BGG site and loads it into CouchDB.
#!/usr/bin/ruby
require 'rubygems'
require 'hpricot'
require 'open-uri'
require 'thread'
require 'couchrest'
# Extensions to Hpricot
class Hpricot::Elem
  # Content at the given expression. If a block is given, the return value
  # comes from yielding the actual content to the block.
  def cat(expr)
    result = self.at(expr)
    return nil unless result
    content = result.inner_html.fixup
    return (yield content) if block_given?
    return content
  end

  # Content-at converted to an int (via to_i).
  def cati(expr)
    self.cat(expr) { |content| content.to_i }
  end

  # Inner text of every element matching the expression.
  def search_to_text(expr)
    self.search(expr).map { |e| e.inner_text.fixup }
  end
end
$BGG = "http://boardgamegeek.com"
$catalog_pages = 0
$game_pages = 0
$games_added = 0
# Access to CouchDB. From what I can tell, the Database object is thread-safe.
$DB = CouchRest.database!("http://127.0.0.1:5984/board-game-geek")
# The thread pool. Each worker repeatedly pulls an action (a Proc) off the
# shared queue and invokes it. Note that Queue#deq blocks while the queue is
# empty, and the empty? termination check is racy: a worker can exit while
# another worker is about to enqueue more work. In practice the queue stays
# full until the crawl winds down.
$POOL_SIZE = 10
$QUEUE = Queue.new
workers = (1..$POOL_SIZE).map do |i|
  Thread.new("worker #{i}") do
    begin
      action = $QUEUE.deq
      action.call
    end until $QUEUE.empty?
  end
end

# Wrap the block as an action and add it to the queue.
def enqueue(&action)
  $QUEUE << action
end
class String
  # Convert the HTML entities that BGG leaves in its text back to plain
  # characters.
  def fixup
    self.gsub("&#039;", "'").gsub("&amp;", "&")
  end
end
def parse_and_load_game(game_id)
  $game_pages += 1
  page = Hpricot.XML(open("#$BGG/xmlapi/boardgame/#{game_id}?comments=1&stats=1"))
  bg = page.at("//boardgame")
  ratings = bg.at("//ratings")
  comments = (bg/"comment").map do |e|
    {
      "user" => e[:username].fixup,
      "comment" => e.inner_text.fixup
    }
  end
  doc = {
    "_id" => game_id,
    "title" => bg.cat("name[@primary='true']"),
    "description" => bg.cat("description"),
    "designers" => bg.search_to_text("boardgamedesigner"),
    "artists" => bg.search_to_text("boardgameartist"),
    "publishers" => bg.search_to_text("boardgamepublisher"),
    "published" => bg.cati("yearpublished"),
    "categories" => bg.search_to_text("boardgamecategory"),
    "mechanics" => bg.search_to_text("boardgamemechanic"),
    "images" => {
      "url" => bg.cat("image"),
      "thumbnailUrl" => bg.cat("thumbnail")
    },
    "players" => {
      "min" => bg.cati("minplayers"),
      "max" => bg.cati("maxplayers"),
      "age" => bg.cati("age")
    },
    "stats" => {
      "rank" => ratings.cati("rank"),
      "averageRating" => ratings.cat("average") { |content| content.to_f },
      "ownedCount" => ratings.cati("owned")
    },
    "comments" => comments
  }
  # Save the document from a queued action, so that only worker threads
  # talk to CouchDB.
  enqueue do
    $games_added += 1
    $DB.save_doc(doc)
  end
end
def process_game(game_id)
  begin
    $DB.get(game_id)
    # Found, do nothing
  rescue RestClient::ResourceNotFound
    # Not in the database yet, so fire off a request to parse its page.
    enqueue { parse_and_load_game game_id }
  end
end
def parse_catalog_page(url)
  puts("[%24s] %4d catalog pages, %4d/%4d games parsed/added (%4d actions queued)" %
    [Time.now.ctime, $catalog_pages, $game_pages, $games_added, $QUEUE.length])
  doc = Hpricot(open(url))
  $catalog_pages += 1
  doc.search("//table[@id='collectionitems']/tr/td[3]//a") do |elem|
    href = elem[:href]
    game_id = href.split('/').last
    enqueue { process_game game_id }
  end
  next_page_link = doc.at("//a[@title='next page']")
  return unless next_page_link
  next_page = next_page_link[:href]
  # Add this last, to give the other actions a chance to operate.
  # A better mechanism would be a priority-based queue, where the catalog
  # page parse action is lower priority than the other actions.
  enqueue { parse_catalog_page($BGG + next_page) }
end
# Kick it off
start_page = ARGV[0] || ($BGG + "/browse/boardgame")
enqueue { parse_catalog_page start_page }

# Wait for all the workers to complete
workers.each { |th| th.join }

puts "BoardGameGeek loader complete."
puts "Queue not empty!" unless $QUEUE.empty?
The key pieces here are Hpricot, to parse the HTML and XML, and CouchRest, to get the data into CouchDB.
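Stripped of the thread pool and the field mapping, the whole round trip is only a few lines. A minimal sketch (the game id 13 and the scratch database name are arbitrary, just for illustration):
require 'rubygems'
require 'hpricot'
require 'open-uri'
require 'couchrest'

# Fetch a single game from the BGG XML API and store it in CouchDB.
db = CouchRest.database!("http://127.0.0.1:5984/bgg-scratch")
page = Hpricot.XML(open("http://boardgamegeek.com/xmlapi/boardgame/13"))
bg = page.at("//boardgame")
db.save_doc(
  "_id" => "13",
  "title" => bg.at("name[@primary='true']").inner_html
)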
Did I go overboard? I don't think so ... this runs in about two hours and creates a database of nearly 40,000 documents (about 340MB). Using a thread pool just seemed to make sense, since (outside of the XML parsing) every aspect of this is I/O bound: pulling data from BGG or pushing data to CouchDB.
The code is naive about some threading issues and simply crashes if there's an error. Oh well.
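If I ever clean it up, the likely fix is a sentinel-based shutdown plus a rescue inside the worker loop. A rough sketch, not what the script above does (and with self-enqueuing actions it would also need a count of outstanding work before sending the sentinels):
# Sketch: each worker exits when it dequeues :stop; the main thread enqueues
# one :stop per worker once all the real work has been queued.
STOP = :stop
queue = Queue.new
workers = (1..10).map do
  Thread.new do
    while (action = queue.deq) != STOP
      begin
        action.call
      rescue => e
        warn "action failed: #{e.message}"  # log and move on instead of crashing
      end
    end
  end
end
# ... enqueue the real work here ...
workers.size.times { queue << STOP }
workers.each { |th| th.join }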
Next up: learning how to build views against this data and deciding how to use it all. I'm thinking Cappuccino.
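A first view might be as simple as mapping games by the year they were published. A rough, untested sketch via CouchRest (the design document and view names are my own invention):
# A design document with one view, keyed on each game's year published.
$DB.save_doc(
  "_id" => "_design/games",
  "language" => "javascript",
  "views" => {
    "by_year" => {
      "map" => <<-JS
        function(doc) {
          if (doc.published) emit(doc.published, doc.title);
        }
      JS
    }
  }
)
Querying it would then be something like $DB.view("games/by_year", :key => 2005); deciding what to do with the results is the harder part.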
Man... where the hell do you find time to play around like you do?
I was digging CouchDB and Cappuccino at JSConf2009 last weekend. Cool stuff.
Massimo: I find time to skim the surface of things I'd prefer to dive into deeply, and then usually on the weekend, crowded around everything else. Still, I have more freedom than most.
This should be really easy to do using Groovy + TagSoup and the database of your choice. Maybe even using GORM, as I think GORM is now more modularized and could be used in a standard Groovy app with a little research. The Groovy XML parser is great, and with TagSoup, scraping HTML is quite easy.
kdorff: If I wanted to dirty my hands with the JVM, I would have used Clojure :-)