I've started looking into CouchDB of late; I think it may be a great solution to some of our common issues around collecting "survey" style data from users. Our current approach uses a big nasty hierarchy of Hibernate entities.
I wanted to have some real, and real-world, data. Since I'm a big board gamer, I thought it would be nice to have a local copy of the BoardGameGeek database. I wrote a program that scrapes data from the BGG site and loads it into CouchDB.
#!/usr/bin/ruby

require 'rubygems'
require 'hpricot'
require 'open-uri'
require 'thread'
require 'couchrest'

# Extensions to Hpricot
class Hpricot::Elem

  # content-at: returns the content of the element matched by expr.
  # If a block is given, the return value comes from yielding the
  # actual content to the block.
  def cat(expr)
    result = self.at(expr)
    return nil unless result
    content = result.inner_html.fixup
    return (yield content) if block_given?
    return content
  end

  # content-at converted to int (via to_i)
  def cati(expr)
    self.cat(expr) { |content| content.to_i }
  end

  def search_to_text(expr)
    self.search(expr).map { |e| e.inner_text.fixup }
  end
end

$BGG = "http://boardgamegeek.com"

$catalog_pages = 0
$game_pages = 0
$games_added = 0

# Access to CouchDB. From what I can tell, the Database is multithreaded.
$DB = CouchRest.database!("http://127.0.0.1:5984/board-game-geek")

# The thread pool.
$POOL_SIZE = 10
$QUEUE = Queue.new

workers = (1..$POOL_SIZE).map do |i|
  Thread.new("worker #{i}") do
    begin
      action = $QUEUE.deq
      action.call()
    end until $QUEUE.empty?
  end
end

def enqueue(&action)
  $QUEUE << action
end

class String
  # Decode the HTML entities that commonly show up in BGG content.
  def fixup()
    self.gsub("&#039;", "'").gsub("&amp;", "&")
  end
end

def parse_and_load_game(game_id)
  $game_pages += 1
  page = Hpricot.XML(open("#$BGG/xmlapi/boardgame/#{game_id}?comments=1&stats=1"))
  bg = page.at("//boardgame")
  ratings = bg.at("//ratings")
  comments = (bg / "comment").map do |e|
    { "user" => e[:username].fixup, "comment" => e.inner_text.fixup }
  end
  doc = {
    "_id" => game_id,
    "title" => bg.cat("name[@primary='true']"),
    "description" => bg.cat("description"),
    "designers" => bg.search_to_text("boardgamedesigner"),
    "artists" => bg.search_to_text("boardgameartist"),
    "publishers" => bg.search_to_text("boardgamepublisher"),
    "published" => bg.cati("yearpublished"),
    "categories" => bg.search_to_text("boardgamecategory"),
    "mechanics" => bg.search_to_text("boardgamemechanic"),
    "images" => {
      "url" => bg.cat("image"),
      "thumbnailUrl" => bg.cat("thumbnail")
    },
    "players" => {
      "min" => bg.cati("minplayers"),
      "max" => bg.cati("maxplayers"),
      "age" => bg.cati("age")
    },
    "stats" => {
      "rank" => ratings.cati("rank"),
      "averageRating" => ratings.cat("average") { |content| content.to_f },
      "ownedCount" => ratings.cati("owned")
    },
    "comments" => comments
  }
  enqueue do
    $games_added += 1
    $DB.save_doc(doc)
  end
end

def process_game(game_id)
  begin
    doc = $DB.get(game_id)
    # Found, do nothing
  rescue RestClient::ResourceNotFound
    # Not in the database yet, so fire off a request to parse its page.
    enqueue { parse_and_load_game game_id }
  end
end

def parse_catalog_page(url)
  puts("[%24s] %4d catalog pages, %4d/%4d games parsed/added (%4d actions queued)" %
       [Time.now.ctime, $catalog_pages, $game_pages, $games_added, $QUEUE.length])
  doc = Hpricot(open(url))
  $catalog_pages += 1
  doc.search("//table[@id='collectionitems']/tr/td[3]//a") do |elem|
    href = elem[:href]
    game_id = href.split('/').last
    enqueue { process_game game_id }
  end
  next_page_link = doc.at("//a[@title='next page']")
  return unless next_page_link
  next_page = next_page_link[:href]
  # Add this last, to give the other actions a chance to operate.
  # A better mechanism would be a priority-based queue, where the catalog
  # page parse action is lower priority than the other actions.
  enqueue { parse_catalog_page($BGG + next_page) }
end

# Kick it off
start_page = ARGV[0] || ($BGG + "/browse/boardgame")
enqueue { parse_catalog_page start_page }

# Wait for all the workers to complete
workers.each { |th| th.join }

puts "BoardGameGeek loader complete."
puts "Queue not empty!" unless $QUEUE.empty?
The key pieces here are Hpricot, to parse the HTML and XML, and CouchRest, to get the data into CouchDB.
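The whole round trip is pleasantly small once you strip away the pooling and bookkeeping. A sketch of the essentials (game id 13 is just an example, and it assumes CouchDB on the default local port):

require 'rubygems'
require 'hpricot'
require 'open-uri'
require 'couchrest'

# Pull one game's XML from the BGG API and parse it ...
page = Hpricot.XML(open("http://boardgamegeek.com/xmlapi/boardgame/13"))
bg = page.at("//boardgame")

# ... then store what we care about as a CouchDB document.
db = CouchRest.database!("http://127.0.0.1:5984/board-game-geek")
db.save_doc(
  "_id" => "13",
  "title" => bg.at("name[@primary='true']").inner_html
)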
Did I go overboard? I don't think so ... this ran in about two hours and created a database of nearly 40,000 documents (about 340MB). Using a thread pool just seemed to make sense, since (outside of the XML parsing) every aspect of this is I/O bound: pulling data from BGG or pushing data to CouchDB.
The code is naive about some threading issues and simply crashes if there's an error. Oh well.
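If I wanted to harden it, the obvious first step would be to wrap each dequeued action in a rescue, so a bad page or a failed save logs the error instead of killing the worker thread. A sketch of the revised pool (same $QUEUE and $POOL_SIZE as above):

workers = (1..$POOL_SIZE).map do |i|
  Thread.new("worker #{i}") do
    begin
      action = $QUEUE.deq
      begin
        action.call
      rescue => e
        # Report the failure and move on to the next queued action.
        puts "worker #{i}: #{e.class}: #{e.message}"
      end
    end until $QUEUE.empty?
  end
end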
Next up: learning how to build views against this data and deciding how to use it all. I'm thinking Cappuccino.
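As a first cut at views, a simple map function could index the games by designer. A sketch using CouchRest (the design document name, view name, and example key are my own inventions, not anything BGG-specific):

# The map function itself is JavaScript, stored as a string.
map_fn = <<-JS
function(doc) {
  if (doc.designers) {
    doc.designers.forEach(function(designer) {
      emit(designer, doc.title);
    });
  }
}
JS

# Store it in a design document ...
$DB.save_doc(
  "_id" => "_design/games",
  "views" => { "by_designer" => { "map" => map_fn } }
)

# ... then query it: every game credited to a given designer.
result = $DB.view("games/by_designer", :key => "Reiner Knizia")
result["rows"].each { |row| puts row["value"] }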