Tapestry Training -- From The Source

Let me help you get your team up to speed in Tapestry ... fast. Visit howardlewisship.com for details on training, mentoring and support!

Monday, April 27, 2009

Ruby and CouchDB

I've started looking into CouchDB of late; I think it may be a great solution to some of our common issues around collecting "survey" style data from users. Our current approach uses a big nasty hierarchy of Hibernate entities.

I wanted to have some real, and real-world, data. Since I'm a big board gamer, I thought it would be nice to have a local copy of the BoardGameGeek database. I wrote a program that scrapes data from the BGG site and loads it into CouchDB.


require 'rubygems'
require 'hpricot'
require 'open-uri'
require 'thread'
require 'couchrest'

# Extension to Hpricot

class Hpricot::Elem
  # content at
  # if block given, return value comes from yielding
  # actual content to the block
  def cat(expr)
    result = self.at(expr)

    return nil unless result

    content = result.inner_html.fixup

    return (yield content) if block_given?

    return content

  # content-at converted to int (via to_i)

  def cati(expr)
    self.cat(expr) { |content| content.to_i }

  def search_to_text(expr)
    self.search(expr).map { |e| e.inner_text.fixup}

$BGG = "http://boardgamegeek.com"

$catalog_pages = 0
$game_pages = 0
$games_added = 0

# Access to couch db.  From what I can tell, the Databsae is multithreaded.

$DB = CouchRest.database!("")

# The thread pool.


$QUEUE = Queue.new

workers = (1..$POOL_SIZE).map do |i|
  Thread.new("worker #{i}") do
      proc = $QUEUE.deq
    end until $QUEUE.empty?

def enqueue &action
  $QUEUE << action

def parse_browser_page(url)

class String
  def fixup()
    self.gsub("&#039;", "'").gsub("&amp;", "&")

def parse_and_load_game(game_id)

  $game_pages += 1

  page = Hpricot.XML(open("#$BGG/xmlapi/boardgame/#{game_id}?comments=1&stats=1"))

  bg = page.at("//boardgame")

  ratings = bg.at("//ratings")

  comments = (bg/"comment").map do |e|
            "user" => e[:username].fixup,
            "comment" => e.inner_text.fixup

  doc = {
          "_id" => game_id,
          "title" => bg.cat("name[@primary='true']"),
          "description" => bg.cat("description"),
          "designers" => bg.search_to_text("boardgamedesigner"),
          "artists" => bg.search_to_text("boardgameartist"),
          "publishers" => bg.search_to_text("boardgamepublisher"),
          "published" => bg.cati("yearpublished"),
          "categories" => bg.search_to_text("boardgamecategory"),
          "mechanics" => bg.search_to_text("boardgamemechanic"),
          "images" =>{
                  "url", bg.cat("image"),
                  "thumbnailUrl", bg.cat("thumbnail"),
          "players" => {
                  "min" => bg.cati("minplayers"),
                  "max" => bg.cati("maxplayers"),
                  "age" => bg.cati("age")
          "stats" => {
                  "rank" => ratings.cati("rank"),
                  "averageRating" => ratings.cat("average") { |content| content.to_f },
                  "ownedCount" => ratings.cati("owned")
          "comments" => comments

  enqueue do
    $games_added += 1



def process_game(game_id)

    doc = $DB.get(game_id)

    # Found, do nothing

  rescue RestClient::ResourceNotFound

    # Not in the database yet, so fire off a request to parse its page.

    enqueue { parse_and_load_game game_id }


def parse_catalog_page(url)

  puts("[%24s] %4d catalog pages, %4d/%4d games parsed/added (%4d actions queued)" % [Time.now.ctime, $catalog_pages,
       $game_pages, $games_added, $QUEUE.length])

  doc = Hpricot(open(url))

  $catalog_pages += 1

  doc.search("//table[@id='collectionitems']/tr/td[3]//a") do |elem|
    href = elem[:href]
    game_id = href.split('/').last()
    enqueue { process_game game_id }

  next_page_link = doc.at("//a[@title='next page']")

  return unless next_page_link

  next_page = next_page_link[:href]

  # Add this last, to give the other actions a chance to operate.
  # A better mechanism would be a priority-based queue, where the catalog
  # page parse action is lower priority than the other actions.

  enqueue { parse_catalog_page($BGG + next_page) }


# Kick it off

start_page = $ARGV[0] || ($BGG + "/browse/boardgame")

enqueue { parse_catalog_page start_page }

# Wait for all the workers to complete

workers.each { |th| th.join }

puts "BoardGameGeek loader complete."

puts "Queue not empty!" unless $QUEUE.empty?

Key pieces of this is Hpricot to parse HTML and XML, and CouchRest to get the data into CouchDB.

Did I go overboard? I don't think so ... this runs in about two hours, and created a database of nearly 40,000 documents (about 340MB). Using a thread pool just seemed to make sense, since (outside of the XML parsing), every aspect of this is I/O bound: pulling data from BGG or pushing data to CouchDB.

The code is naive about some threading issues and simply crashes if there's an error. Oh well.

Next up: learning how to build views against this data and deciding how to use it all. I'm thinking Cappuccino.


Massimo said...

Man... where the hell you find time to play around like you do?

Jamie Orchard-Hays said...

I was digging CouchDB and Cappuccino at JSConf2009 last weekend. Cool stuff.

Howard said...

Massimo: I find time to skim the surface of things I'd prefer to dive into deeply, and then usually on the weekend, crowded around everything else. Still I have more freedom than most.

kdorff said...

This should be really easy to do using Groovy + Tagsoup and the database of your choice. Maybe even using GORM as I think GORM is now more modularized and could be used in a standard Groovy app with a little research. The Groovy XML parser is great and with Tagsoup scraping HTML is quite easy.

Howard said...

kdorff: If I wanted to dirty my hands with the JVM, I would have used Clojure :-)