Monday, April 27, 2009

Ruby and CouchDB

I've started looking into CouchDB of late; I think it may be a great solution to some of our common issues around collecting "survey" style data from users. Our current approach uses a big nasty hierarchy of Hibernate entities.

I wanted to have some real, and real-world, data. Since I'm a big board gamer, I thought it would be nice to have a local copy of the BoardGameGeek database. I wrote a program that scrapes data from the BGG site and loads it into CouchDB.

#!/usr/bin/ruby

require 'rubygems'
require 'hpricot'
require 'open-uri'
require 'thread'
require 'couchrest'


# Extension to Hpricot

class Hpricot::Elem
  # content at
  # if block given, return value comes from yielding
  # actual content to the block
  def cat(expr)
    result = self.at(expr)

    return nil unless result

    content = result.inner_html.fixup

    return (yield content) if block_given?

    return content
  end

  # content-at converted to int (via to_i)

  def cati(expr)
    self.cat(expr) { |content| content.to_i }
  end

  def search_to_text(expr)
    self.search(expr).map { |e| e.inner_text.fixup}
  end
end

$BGG = "http://boardgamegeek.com"


$catalog_pages = 0
$game_pages = 0
$games_added = 0

# Access to CouchDB.  From what I can tell, the Database is multithreaded.

$DB = CouchRest.database!("http://127.0.0.1:5984/board-game-geek")

# The thread pool.

$POOL_SIZE = 10

$QUEUE = Queue.new

workers = (1..$POOL_SIZE).map do |i|
  Thread.new("worker #{i}") do
    begin
      proc = $QUEUE.deq
      proc.call()
    end until $QUEUE.empty?
  end
end

def enqueue &action
  $QUEUE << action
end


def parse_browser_page(url)
end

class String
  def fixup()
    self.gsub("&#039;", "'").gsub("&amp;", "&")
  end
end

def parse_and_load_game(game_id)

  $game_pages += 1

  page = Hpricot.XML(open("#$BGG/xmlapi/boardgame/#{game_id}?comments=1&stats=1"))

  bg = page.at("//boardgame")

  ratings = bg.at("//ratings")

  comments = (bg/"comment").map do |e|
    {
            "user" => e[:username].fixup,
            "comment" => e.inner_text.fixup
    }
  end

  doc = {
          "_id" => game_id,
          "title" => bg.cat("name[@primary='true']"),
          "description" => bg.cat("description"),
          "designers" => bg.search_to_text("boardgamedesigner"),
          "artists" => bg.search_to_text("boardgameartist"),
          "publishers" => bg.search_to_text("boardgamepublisher"),
          "published" => bg.cati("yearpublished"),
          "categories" => bg.search_to_text("boardgamecategory"),
          "mechanics" => bg.search_to_text("boardgamemechanic"),
          "images" => {
                  "url" => bg.cat("image"),
                  "thumbnailUrl" => bg.cat("thumbnail")
          },
          "players" => {
                  "min" => bg.cati("minplayers"),
                  "max" => bg.cati("maxplayers"),
                  "age" => bg.cati("age")
          },
          "stats" => {
                  "rank" => ratings.cati("rank"),
                  "averageRating" => ratings.cat("average") { |content| content.to_f },
                  "ownedCount" => ratings.cati("owned")
          },
          "comments" => comments
  }

  enqueue do
    $games_added += 1

    $DB.save_doc(doc)
  end

end


def process_game(game_id)

  begin
    doc = $DB.get(game_id)

    # Found, do nothing

  rescue RestClient::ResourceNotFound

    # Not in the database yet, so fire off a request to parse its page.

    enqueue { parse_and_load_game game_id }

  end
end

def parse_catalog_page(url)

  puts("[%24s] %4d catalog pages, %4d/%4d games parsed/added (%4d actions queued)" % [Time.now.ctime, $catalog_pages,
       $game_pages, $games_added, $QUEUE.length])

  doc = Hpricot(open(url))

  $catalog_pages += 1


  doc.search("//table[@id='collectionitems']/tr/td[3]//a") do |elem|
    href = elem[:href]
    game_id = href.split('/').last()
    enqueue { process_game game_id }
  end

  next_page_link = doc.at("//a[@title='next page']")

  return unless next_page_link

  next_page = next_page_link[:href]

  # Add this last, to give the other actions a chance to operate.
  # A better mechanism would be a priority-based queue, where the catalog
  # page parse action is lower priority than the other actions.

  enqueue { parse_catalog_page($BGG + next_page) }

end


# Kick it off

start_page = ARGV[0] || ($BGG + "/browse/boardgame")

enqueue { parse_catalog_page start_page }

# Wait for all the workers to complete

workers.each { |th| th.join }

puts "BoardGameGeek loader complete."

puts "Queue not empty!" unless $QUEUE.empty?

The key pieces here are Hpricot, to parse the HTML and XML, and CouchRest, to get the data into CouchDB.

Did I go overboard? I don't think so ... this ran in about two hours and created a database of nearly 40,000 documents (about 340MB). Using a thread pool just seemed to make sense, since (outside of the XML parsing) every aspect of this is I/O bound: pulling data from BGG or pushing data to CouchDB.

The code is naive about some threading issues and simply crashes if there's an error. Oh well.
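
One way to tighten up the threading, as a minimal sketch rather than what the loader above actually does: have each worker block on the queue until it receives a stop marker, and rescue errors per job so that one bad page doesn't kill the whole run. The names `run_pool` and `STOP` are made up for this example. It also assumes every job is queued before the stop markers go in; jobs that enqueue further jobs (as the loader's do) would need a count of outstanding work instead.

```ruby
require 'thread'

STOP = :stop  # hypothetical sentinel telling a worker to exit

# Runs every job (anything responding to #call) on a pool of threads.
# Returns the errors raised by failed jobs instead of crashing.
def run_pool(jobs, pool_size = 4)
  queue  = Queue.new
  errors = Queue.new   # Queue is thread-safe, so workers can share it

  jobs.each { |job| queue << job }
  pool_size.times { queue << STOP }   # one stop marker per worker

  workers = (1..pool_size).map do
    Thread.new do
      # Queue#pop blocks until an item is available
      while (job = queue.pop) != STOP
        begin
          job.call
        rescue StandardError => e
          errors << e   # record the failure and keep going
        end
      end
    end
  end

  workers.each(&:join)
  Array.new(errors.size) { errors.pop }
end

# Usage: three jobs, one of which fails.
finished = Queue.new
failures = run_pool([
  lambda { finished << :first },
  lambda { raise "boom" },
  lambda { finished << :second }
], 2)

puts "#{finished.size} jobs finished, #{failures.size} failed"
```

Because the workers only stop when they see a marker, none of them can exit early just because the queue happened to be empty for a moment.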

Next up: learning how to build views against this data and deciding how to use it all. I'm thinking Cappuccino.
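
As a rough idea of what such a view could look like: a CouchDB view is a JavaScript map function stored in a design document. The sketch below only builds that document as a Ruby hash. The names `games`, `by_category` and `category_view_design_doc` are made up, and it assumes the `categories` array and `title` field that the loader writes. Saving the document and querying the view are left out.

```ruby
require 'json'

# Hypothetical helper: builds a design document holding one view that
# emits a (category, title) row for every category of every game document.
def category_view_design_doc
  map_function = <<~JS
    function(doc) {
      if (doc.categories) {
        doc.categories.forEach(function(category) {
          emit(category, doc.title);
        });
      }
    }
  JS

  {
    "_id" => "_design/games",
    "views" => {
      "by_category" => { "map" => map_function }
    }
  }
end

# Print the JSON that would be stored in CouchDB.
puts JSON.pretty_generate(category_view_design_doc)
```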

Wither Derby?

Over lunch last week, my co-workers and I were discussing Oracle's buyout of Sun, how it would affect MySQL, and what other options there would be. Several people (myself included) had some experience with PostgreSQL, but then I brought up Apache Derby ... and no one had used it or even downloaded it, and only a few had even heard of it. I just checked the Derby web site, and the last update was in November 2008. When I last played around with it, it seemed to have all the major features of a real database and the advantage of being pure Java (I can't speak to performance).

How many people are using Derby? And if you aren't, why not?

Monday, April 20, 2009

Tapestry 5.1 is Beta!

We just had a vote, and Tapestry 5.1 (that is, release 5.1.0.3) is a beta release. At this point, everything is stable except for minor changes to the brandy-new URL rewriting APIs. I've been chasing down bugs and improvements for the 5.1.0.4 release, which will probably be created and voted on later this week. My guess is that 5.1.0.4 will be the final/stable release for 5.1.

I'm already thinking about 5.2. Ditch Maven for Buildr? Strong possibility. Start using Groovy for tests? Why not? Finally get Spring Web Flow integration working ... you betcha! Portlets? Shouldn't be a problem. Release date? I think before the end of the year.

Wednesday, April 15, 2009

Tapestry @ JavaOne: Tapestry State of the Union

I'll be presenting a Birds of a Feather session at JavaOne about Tapestry this year. Part of the session will be a presentation about important Tapestry 5 features that many people aren't aware of, and a roadmap for Tapestry 5.2 and beyond. The balance will be Q & A and a chance to get your Tapestry opinions heard.

The session is June 4th from 6:30 to 7:20PM in Esplanade 301. See you there!

Monday, April 13, 2009

Is it time to abandon IDEA for Eclipse?

I've been a big fan of IDEA since I made the switch about two years ago ... but as I've run each new preview release of IDEA it seems to be getting slower and slower. My Tapestry integration tests take about 5 to 6 minutes to run from the command line; inside IDEA the time to run has recently spiked up: they are now taking 15 - 20 minutes ... I just had a run take nearly 30! My suspicion is that it has something to do with all the console output from the tests. I see the launched Java process (which runs TestNG, Selenium and Jetty) consuming 3% of CPU and IDEA consuming 97%. IDEA is almost completely non-responsive while the tests run. I run both IDEA and the launched Java process with 600M max heap.

Two years ago, IDEA was stable and Eclipse was not runnable on Mac OS X. That's what prompted my first switch to IDEA. I'm now finding Eclipse to run faster on OS X and on Ubuntu. Lately, I feel pulled in the opposite direction: IDEA is quickly becoming too painful to use!

Tapestry on Google App Engine

Many people have been working on this exciting subject. Jun Tsai has a Tapestry 5.0.18 application running at http://ganshane.appspot.com. It looks like he took the Tapestry 5.1.0.3 quickstart archetype and just shifted the version number down to 5.0.18.

Tapestry 5.1 deployments haven't been working, because of the shift in XML parser from generic SAX to Woodstox StAX. An issue has been added, and the App Engine folks are looking to add the necessary classes to their "white list", but it's not clear when such support will be added.

Conversational State without Spring Web Flow

I've been trying to find time to write the integration between Tapestry and Spring Web Flow ever since I first met Keith Donald at TheServerSide Symposium in 2005. I really, really, really think it will make it into Tapestry 5.2. In the meantime, a simpler solution to many of those problems is conversational state, and that's now available as tapestry-conversations, from the Trails team.

Sunday, April 12, 2009

Is GZIP compression compatible with XMLHttpRequest?

Does using GZIP on an application/json response stream to an Ajax request cause problems? This question has been coming up a lot in Tapestry, to the point that we had to explicitly turn it off. For some users, on some combination of server and client (not always the ones you'd expect), the response comes out garbled. It's been hard to diagnose, but it almost seems that the Content-Encoding: gzip header is lost, leaving the browser to parse the compressed response as uncompressed (however, this theory is far from confirmed).
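
To make that theory concrete, here is a small sketch of what a client would see if it treated a gzipped JSON body as plain text. It uses only Ruby's standard `zlib` and `json` libraries (Ruby 2.4 or later for `Zlib.gzip`), nothing from Tapestry or Prototype, and the helper name `parses_as_json?` is invented for the example. It illustrates the suspected failure mode; it does not confirm that this is what is happening.

```ruby
require 'json'
require 'zlib'

# Hypothetical helper: true if the body can be parsed as JSON as-is.
def parses_as_json?(body)
  JSON.parse(body)
  true
rescue JSON::ParserError, EncodingError
  false
end

payload    = JSON.generate("status" => "ok", "items" => [1, 2, 3])
compressed = Zlib.gzip(payload)   # the bytes sent with Content-Encoding: gzip

# The original payload parses; the compressed bytes do not, which is
# what a client sees if it never learns the body was gzipped.
puts parses_as_json?(payload)
puts parses_as_json?(compressed)

# Decompressing first (what the header tells the client to do) recovers it.
puts parses_as_json?(Zlib.gunzip(compressed))
```

A gzip stream always starts with the bytes 0x1f 0x8b, so checking the first two bytes of a garbled response is a quick way to tell whether it is compressed data being read as text.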

None of this occurs with traditional, non-XHR, requests and responses. Is it tied to XHR or to the content type (application/json)? Could this be a bug in Prototype? On Windows only? It's been hard to track down the triggering environment for this, but it appears that my initial guess (a proxy server in the middle) doesn't bear fruit.

Have other app or framework developers confirmed this? If so, what was your approach to addressing it? I know I'd like to see full GZIP compression for all that JSON content.

Friday, April 10, 2009

Using Maven Quickstart Archetype from Eclipse

I'm working on a Tapestry article for JavaWorld, and I wanted to discuss setting up Eclipse with Maven and Jetty support, and creating a new Tapestry project from the latest quickstart archetype.

It ended up being much too large to include in the main article, so I've created a full rundown on the Tapestry360 wiki instead.

This covers where to get the plugins, how to deal with an M2Eclipse bug (and the workaround that lets you use the Tapestry quickstart archetype), and some other customizations of the environment.