Monday, April 27, 2009

Ruby and CouchDB

I've started looking into CouchDB of late; I think it may be a great solution to some of our common issues around collecting "survey" style data from users. Our current approach uses a big nasty hierarchy of Hibernate entities.

I wanted to have some real, and real-world, data. Since I'm a big board gamer, I thought it would be nice to have a local copy of the BoardGameGeek database. I wrote a program that scrapes data from the BGG site and loads it into CouchDB.

#!/usr/bin/ruby

require 'rubygems'
require 'hpricot'
require 'open-uri'
require 'thread'
require 'couchrest'


# Extension to Hpricot

class Hpricot::Elem
  # content at
  # if block given, return value comes from yielding
  # actual content to the block
  def cat(expr)
    result = self.at(expr)

    return nil unless result

    content = result.inner_html.fixup

    return (yield content) if block_given?

    return content
  end

  # content-at converted to int (via to_i)

  def cati(expr)
    self.cat(expr) { |content| content.to_i }
  end

  def search_to_text(expr)
    self.search(expr).map { |e| e.inner_text.fixup}
  end
end

$BGG = "http://boardgamegeek.com"


$catalog_pages = 0
$game_pages = 0
$games_added = 0

# Access to couch db.  From what I can tell, the Databsae is multithreaded.

$DB = CouchRest.database!("http://127.0.0.1:5984/board-game-geek")

# The thread pool.

$POOL_SIZE = 10

$QUEUE = Queue.new

workers = (1..$POOL_SIZE).map do |i|
  Thread.new("worker #{i}") do
    begin
      proc = $QUEUE.deq
      proc.call()
    end until $QUEUE.empty?
  end
end

def enqueue &action
  $QUEUE << action
end


def parse_browser_page(url)
end

class String
  def fixup()
    self.gsub("&#039;", "'").gsub("&amp;", "&")
  end
end

def parse_and_load_game(game_id)

  $game_pages += 1

  page = Hpricot.XML(open("#$BGG/xmlapi/boardgame/#{game_id}?comments=1&stats=1"))

  bg = page.at("//boardgame")

  ratings = bg.at("//ratings")

  comments = (bg/"comment").map do |e|
    {
            "user" => e[:username].fixup,
            "comment" => e.inner_text.fixup
    }
  end

  doc = {
          "_id" => game_id,
          "title" => bg.cat("name[@primary='true']"),
          "description" => bg.cat("description"),
          "designers" => bg.search_to_text("boardgamedesigner"),
          "artists" => bg.search_to_text("boardgameartist"),
          "publishers" => bg.search_to_text("boardgamepublisher"),
          "published" => bg.cati("yearpublished"),
          "categories" => bg.search_to_text("boardgamecategory"),
          "mechanics" => bg.search_to_text("boardgamemechanic"),
          "images" =>{
                  "url", bg.cat("image"),
                  "thumbnailUrl", bg.cat("thumbnail"),
                  },
          "players" => {
                  "min" => bg.cati("minplayers"),
                  "max" => bg.cati("maxplayers"),
                  "age" => bg.cati("age")
          },
          "stats" => {
                  "rank" => ratings.cati("rank"),
                  "averageRating" => ratings.cat("average") { |content| content.to_f },
                  "ownedCount" => ratings.cati("owned")
          },
          "comments" => comments
  }

  enqueue do
    $games_added += 1

    $DB.save_doc(doc)
  end

end


def process_game(game_id)

  begin
    doc = $DB.get(game_id)

    # Found, do nothing

  rescue RestClient::ResourceNotFound

    # Not in the database yet, so fire off a request to parse its page.

    enqueue { parse_and_load_game game_id }

  end
end

def parse_catalog_page(url)

  puts("[%24s] %4d catalog pages, %4d/%4d games parsed/added (%4d actions queued)" % [Time.now.ctime, $catalog_pages,
       $game_pages, $games_added, $QUEUE.length])

  doc = Hpricot(open(url))

  $catalog_pages += 1


  doc.search("//table[@id='collectionitems']/tr/td[3]//a") do |elem|
    href = elem[:href]
    game_id = href.split('/').last()
    enqueue { process_game game_id }
  end

  next_page_link = doc.at("//a[@title='next page']")

  return unless next_page_link

  next_page = next_page_link[:href]

  # Add this last, to give the other actions a chance to operate.
  # A better mechanism would be a priority-based queue, where the catalog
  # page parse action is lower priority than the other actions.

  enqueue { parse_catalog_page($BGG + next_page) }

end


# Kick it off

start_page = $ARGV[0] || ($BGG + "/browse/boardgame")

enqueue { parse_catalog_page start_page }

# Wait for all the workers to complete

workers.each { |th| th.join }

puts "BoardGameGeek loader complete."

puts "Queue not empty!" unless $QUEUE.empty?

Key pieces of this is Hpricot to parse HTML and XML, and CouchRest to get the data into CouchDB.

Did I go overboard? I don't think so ... this runs in about two hours, and created a database of nearly 40,000 documents (about 340MB). Using a thread pool just seemed to make sense, since (outside of the XML parsing), every aspect of this is I/O bound: pulling data from BGG or pushing data to CouchDB.

The code is naive about some threading issues and simply crashes if there's an error. Oh well.

Next up: learning how to build views against this data and deciding how to use it all. I'm thinking Cappuccino.

5 comments:

  1. Man... where the hell you find time to play around like you do?

    ReplyDelete
  2. I was digging CouchDB and Cappuccino at JSConf2009 last weekend. Cool stuff.

    ReplyDelete
  3. Massimo: I find time to skim the surface of things I'd prefer to dive into deeply, and then usually on the weekend, crowded around everything else. Still I have more freedom than most.

    ReplyDelete
  4. This should be really easy to do using Groovy + Tagsoup and the database of your choice. Maybe even using GORM as I think GORM is now more modularized and could be used in a standard Groovy app with a little research. The Groovy XML parser is great and with Tagsoup scraping HTML is quite easy.

    ReplyDelete
  5. kdorff: If I wanted to dirty my hands with the JVM, I would have used Clojure :-)

    ReplyDelete

Please note that this is not a support forum for Tapestry. Requests for help will be deleted. Please subscribe to the Tapestry user mailing list if you are in need of support, or contact me directly for professional (for pay) support.

Spammers: Don't bother. I delete your comments and it's a waste of time for both of us. 垃圾邮件发送者:不要打扰。我删除您的评论和它的时间对我们双方的浪费