Monday, April 27, 2009

Ruby and CouchDB

I've started looking into CouchDB of late; I think it may be a great solution to some of our common issues around collecting "survey" style data from users. Our current approach uses a big nasty hierarchy of Hibernate entities.

I wanted to have some real, and real-world, data. Since I'm a big board gamer, I thought it would be nice to have a local copy of the BoardGameGeek database. I wrote a program that scrapes data from the BGG site and loads it into CouchDB.

#!/usr/bin/ruby

require 'rubygems'
require 'hpricot'
require 'open-uri'
require 'thread'
require 'couchrest'


# Extension to Hpricot

class Hpricot::Elem
  # content at
  # if block given, return value comes from yielding
  # actual content to the block
  def cat(expr)
    result = self.at(expr)

    return nil unless result

    content = result.inner_html.fixup

    return (yield content) if block_given?

    return content
  end

  # content-at converted to int (via to_i)

  def cati(expr)
    self.cat(expr) { |content| content.to_i }
  end

  def search_to_text(expr)
    self.search(expr).map { |e| e.inner_text.fixup}
  end
end

$BGG = "http://boardgamegeek.com"


$catalog_pages = 0
$game_pages = 0
$games_added = 0

# Access to couch db.  From what I can tell, the Databsae is multithreaded.

$DB = CouchRest.database!("http://127.0.0.1:5984/board-game-geek")

# The thread pool.

$POOL_SIZE = 10

$QUEUE = Queue.new

workers = (1..$POOL_SIZE).map do |i|
  Thread.new("worker #{i}") do
    begin
      proc = $QUEUE.deq
      proc.call()
    end until $QUEUE.empty?
  end
end

def enqueue &action
  $QUEUE << action
end


def parse_browser_page(url)
end

class String
  def fixup()
    self.gsub("&#039;", "'").gsub("&amp;", "&")
  end
end

def parse_and_load_game(game_id)

  $game_pages += 1

  page = Hpricot.XML(open("#$BGG/xmlapi/boardgame/#{game_id}?comments=1&stats=1"))

  bg = page.at("//boardgame")

  ratings = bg.at("//ratings")

  comments = (bg/"comment").map do |e|
    {
            "user" => e[:username].fixup,
            "comment" => e.inner_text.fixup
    }
  end

  doc = {
          "_id" => game_id,
          "title" => bg.cat("name[@primary='true']"),
          "description" => bg.cat("description"),
          "designers" => bg.search_to_text("boardgamedesigner"),
          "artists" => bg.search_to_text("boardgameartist"),
          "publishers" => bg.search_to_text("boardgamepublisher"),
          "published" => bg.cati("yearpublished"),
          "categories" => bg.search_to_text("boardgamecategory"),
          "mechanics" => bg.search_to_text("boardgamemechanic"),
          "images" =>{
                  "url", bg.cat("image"),
                  "thumbnailUrl", bg.cat("thumbnail"),
                  },
          "players" => {
                  "min" => bg.cati("minplayers"),
                  "max" => bg.cati("maxplayers"),
                  "age" => bg.cati("age")
          },
          "stats" => {
                  "rank" => ratings.cati("rank"),
                  "averageRating" => ratings.cat("average") { |content| content.to_f },
                  "ownedCount" => ratings.cati("owned")
          },
          "comments" => comments
  }

  enqueue do
    $games_added += 1

    $DB.save_doc(doc)
  end

end


def process_game(game_id)

  begin
    doc = $DB.get(game_id)

    # Found, do nothing

  rescue RestClient::ResourceNotFound

    # Not in the database yet, so fire off a request to parse its page.

    enqueue { parse_and_load_game game_id }

  end
end

def parse_catalog_page(url)

  puts("[%24s] %4d catalog pages, %4d/%4d games parsed/added (%4d actions queued)" % [Time.now.ctime, $catalog_pages,
       $game_pages, $games_added, $QUEUE.length])

  doc = Hpricot(open(url))

  $catalog_pages += 1


  doc.search("//table[@id='collectionitems']/tr/td[3]//a") do |elem|
    href = elem[:href]
    game_id = href.split('/').last()
    enqueue { process_game game_id }
  end

  next_page_link = doc.at("//a[@title='next page']")

  return unless next_page_link

  next_page = next_page_link[:href]

  # Add this last, to give the other actions a chance to operate.
  # A better mechanism would be a priority-based queue, where the catalog
  # page parse action is lower priority than the other actions.

  enqueue { parse_catalog_page($BGG + next_page) }

end


# Kick it off

start_page = $ARGV[0] || ($BGG + "/browse/boardgame")

enqueue { parse_catalog_page start_page }

# Wait for all the workers to complete

workers.each { |th| th.join }

puts "BoardGameGeek loader complete."

puts "Queue not empty!" unless $QUEUE.empty?

Key pieces of this is Hpricot to parse HTML and XML, and CouchRest to get the data into CouchDB.

Did I go overboard? I don't think so ... this runs in about two hours, and created a database of nearly 40,000 documents (about 340MB). Using a thread pool just seemed to make sense, since (outside of the XML parsing), every aspect of this is I/O bound: pulling data from BGG or pushing data to CouchDB.

The code is naive about some threading issues and simply crashes if there's an error. Oh well.

Next up: learning how to build views against this data and deciding how to use it all. I'm thinking Cappuccino.

5 comments:

Massimo said...

Man... where the hell you find time to play around like you do?

Jamie Orchard-Hays said...

I was digging CouchDB and Cappuccino at JSConf2009 last weekend. Cool stuff.

Howard said...

Massimo: I find time to skim the surface of things I'd prefer to dive into deeply, and then usually on the weekend, crowded around everything else. Still I have more freedom than most.

kdorff said...

This should be really easy to do using Groovy + Tagsoup and the database of your choice. Maybe even using GORM as I think GORM is now more modularized and could be used in a standard Groovy app with a little research. The Groovy XML parser is great and with Tagsoup scraping HTML is quite easy.

Howard said...

kdorff: If I wanted to dirty my hands with the JVM, I would have used Clojure :-)

Blog Archive