I've started looking into CouchDB of late; I think it may be a great solution to some of our common issues around collecting "survey" style data from users. Our current approach uses a big nasty hierarchy of Hibernate entities.
I wanted to have some real, and real-world, data. Since I'm a big board gamer, I thought it would be nice to have a local copy of the BoardGameGeek database. I wrote a program that scrapes data from the BGG site and loads it into CouchDB.
#!/usr/bin/ruby

require 'rubygems'
require 'hpricot'
require 'open-uri'
require 'thread'
require 'couchrest'

# Extension to Hpricot
class Hpricot::Elem

  # content at
  # if block given, return value comes from yielding
  # actual content to the block
  def cat(expr)
    result = self.at(expr)
    return nil unless result
    content = result.inner_html.fixup
    return (yield content) if block_given?
    return content
  end

  # content-at converted to int (via to_i)
  def cati(expr)
    self.cat(expr) { |content| content.to_i }
  end

  def search_to_text(expr)
    self.search(expr).map { |e| e.inner_text.fixup }
  end
end

$BGG = "http://boardgamegeek.com"

$catalog_pages = 0
$game_pages = 0
$games_added = 0

# Access to CouchDB. From what I can tell, the Database is multithreaded.
$DB = CouchRest.database!("http://127.0.0.1:5984/board-game-geek")

# The thread pool.
$POOL_SIZE = 10
$QUEUE = Queue.new

# Naive: a worker exits as soon as it sees an empty queue, even if another
# worker is about to enqueue more work.
workers = (1..$POOL_SIZE).map do |i|
  Thread.new("worker #{i}") do
    begin
      action = $QUEUE.deq
      action.call
    end until $QUEUE.empty?
  end
end

def enqueue(&action)
  $QUEUE << action
end

class String
  # Undo the HTML entity escaping applied by BGG.
  def fixup()
    self.gsub("&#039;", "'").gsub("&amp;", "&")
  end
end

def parse_and_load_game(game_id)
  $game_pages += 1
  page = Hpricot.XML(open("#{$BGG}/xmlapi/boardgame/#{game_id}?comments=1&stats=1"))
  bg = page.at("//boardgame")
  ratings = bg.at("//ratings")
  comments = (bg/"comment").map do |e|
    { "user" => e[:username].fixup, "comment" => e.inner_text.fixup }
  end
  doc = {
    "_id" => game_id,
    "title" => bg.cat("name[@primary='true']"),
    "description" => bg.cat("description"),
    "designers" => bg.search_to_text("boardgamedesigner"),
    "artists" => bg.search_to_text("boardgameartist"),
    "publishers" => bg.search_to_text("boardgamepublisher"),
    "published" => bg.cati("yearpublished"),
    "categories" => bg.search_to_text("boardgamecategory"),
    "mechanics" => bg.search_to_text("boardgamemechanic"),
    "images" => {
      "url" => bg.cat("image"),
      "thumbnailUrl" => bg.cat("thumbnail")
    },
    "players" => {
      "min" => bg.cati("minplayers"),
      "max" => bg.cati("maxplayers"),
      "age" => bg.cati("age")
    },
    "stats" => {
      "rank" => ratings.cati("rank"),
      "averageRating" => ratings.cat("average") { |content| content.to_f },
      "ownedCount" => ratings.cati("owned")
    },
    "comments" => comments
  }
  enqueue do
    $games_added += 1
    $DB.save_doc(doc)
  end
end

def process_game(game_id)
  $DB.get(game_id)
  # Found, do nothing
rescue RestClient::ResourceNotFound
  # Not in the database yet, so fire off a request to parse its page.
  enqueue { parse_and_load_game game_id }
end

def parse_catalog_page(url)
  puts("[%24s] %4d catalog pages, %4d/%4d games parsed/added (%4d actions queued)" %
    [Time.now.ctime, $catalog_pages, $game_pages, $games_added, $QUEUE.length])
  doc = Hpricot(open(url))
  $catalog_pages += 1
  doc.search("//table[@id='collectionitems']/tr/td[3]//a") do |elem|
    href = elem[:href]
    game_id = href.split('/').last
    enqueue { process_game game_id }
  end
  next_page_link = doc.at("//a[@title='next page']")
  return unless next_page_link
  next_page = next_page_link[:href]
  # Add this last, to give the other actions a chance to operate.
  # A better mechanism would be a priority-based queue, where the catalog
  # page parse action is lower priority than the other actions.
  enqueue { parse_catalog_page($BGG + next_page) }
end

# Kick it off
start_page = ARGV[0] || ($BGG + "/browse/boardgame")
enqueue { parse_catalog_page start_page }

# Wait for all the workers to complete
workers.each { |th| th.join }

puts "BoardGameGeek loader complete."
puts "Queue not empty!" unless $QUEUE.empty?
The key pieces here are Hpricot, to parse the HTML and XML, and CouchRest, to get the data into CouchDB.
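If you haven't seen either library, the core of each is tiny. A minimal sketch of the two together (the URL and document fields here are placeholders, not anything from the loader above):

require 'rubygems'
require 'hpricot'
require 'open-uri'
require 'couchrest'

# Parse a page and pull out the text of the first title element.
doc = Hpricot(open("http://example.com/"))
title = doc.at("//title").inner_text

# Create (or open) a database and save a document into it.
db = CouchRest.database!("http://127.0.0.1:5984/scratch")
db.save_doc("_id" => "example", "title" => title)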
Did I go overboard? I don't think so ... this ran in about two hours and created a database of nearly 40,000 documents (about 340MB). Using a thread pool just seemed to make sense, since (outside of the XML parsing) every aspect of this is I/O bound: pulling data from BGG or pushing data to CouchDB.
The code is naive about some threading issues and simply crashes if there's an error. Oh well.
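If I ever harden it, the first step would be a rescue inside the worker loop, so one bad page doesn't kill a thread. A sketch, not what the script above does:

workers = (1..$POOL_SIZE).map do |i|
  Thread.new("worker #{i}") do
    begin
      $QUEUE.deq.call
    rescue => e
      # Log the failure and keep the thread alive for the next action.
      puts "worker #{i}: #{e.class}: #{e.message}"
    end until $QUEUE.empty?
  end
end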
Next up: learning how to build views against this data and deciding how to use it all. I'm thinking Cappuccino.
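Views, for the uninitiated, are design documents holding JavaScript map (and optionally reduce) functions. A rough sketch of the kind of view I might start with; the "games" design document and "by_designer" view names are mine, nothing BGG- or CouchRest-specific:

map_fn = <<-JS
  function(doc) {
    if (doc.designers) {
      doc.designers.forEach(function(d) { emit(d, doc.title); });
    }
  }
JS

$DB.save_doc(
  "_id"      => "_design/games",
  "language" => "javascript",
  "views"    => { "by_designer" => { "map" => map_fn } }
)

# Then query it, e.g.:
# $DB.view("games/by_designer", :key => "Reiner Knizia")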
Man... where the hell you find time to play around like you do?
I was digging CouchDB and Cappuccino at JSConf2009 last weekend. Cool stuff.
Massimo: I find time to skim the surface of things I'd prefer to dive into deeply, and then usually on the weekend, crowded around everything else. Still I have more freedom than most.
This should be really easy to do using Groovy + Tagsoup and the database of your choice. Maybe even using GORM as I think GORM is now more modularized and could be used in a standard Groovy app with a little research. The Groovy XML parser is great and with Tagsoup scraping HTML is quite easy.
kdorff: If I wanted to dirty my hands with the JVM, I would have used Clojure :-)