Monday, July 30, 2012

A little Gotcha with asynch and Streams

I stumbled across a little gotcha using async with Node.js Streams: you can easily corrupt your output if you are not careful.

Node.js Streams are an abstraction of Unix pipes; they let you push or pull data a little bit at a time, never keeping more in memory than its needed. async is a library used to organize all the asynchronous callbacks used in node applications without getting the kind of "Christmas Tree" deep nesting of callbacks that can occur too easily.

I'm working on a little bit of code to pull an image file, stored in MongoDB GridFS, scale the image using ImageMagick, then stream the result down to the browser.

My first pass at this didn't use ImageMagick or streams, and worked perfectly ... but as soon as I added in the use of async (even before adding in ImageMagick), I started getting broken images in the browser, meaning that my streams were getting corrupted.

Before adding async, my code was reasonable:

However, I knew I was going to add a few new steps here to pipe the file content through ImageMagick; that's when I decided to check out the async module.

The logic for handling this request is a waterfall; each step kicks off some work, then passes data to the next step via an asynchronous callback. The async library calls the steps "tasks"; you pass an array of these tasks to async.waterfall(), along with the end-of-waterfall callback. This special callback may be passed an error provided by any task, or the final result from the final task.

With waterfall(), each task is passed a special callback function. If the callback function is passed a non-null error as the first parameter, then remaining tasks are skipped, and the final result handler is invoked immediately, to handle the error.

Otherwise, you pass null as the first parameter, plus any additional result values. The next task is passed the result values, plus the next callback. It's all very clever.

My first pass was to duplicate the behavior of my original code, but to do so under the async model. That means lots of smaller functions; I also introduced an extra step between getting the opened file and streaming its contents to the browser. The extra step is intended for later, where ImageMagick will get threaded in.

The code, despite the extra step, was quite readable:

My style is to create local variables with each function; so openFile kicks off the process; once the file has been retrieved from MongoDB, the readFileContents task will be invoked ... unless there's an error, in which case errorCallback gets invoked immediately.

Inside readFileContents we convert the file to a stream with file.stream(true) (the true means to automatically close the stream once all of the file contents have been read from GridFS).

streamToClient comes next, it takes that stream and pipes it down to the browser via the res (response) object.

So, although its now broken up into more small functions, the logic is the same, as expressed on the very last line: open the file, read its contents as a stream, stream the data down to the client.

However, when I started testing this before moving on to add the image scaling step, it no longer worked. The image data was corrupted. I did quite a bit of thrashing: adding log messages, looking at library source, guessing, and experimenting (and I did pine for a real debugger!).

Eventually, I realized it came down to this bit of code from the async module:

The code on line 7 is the callback function passed to each task; notice that once it decides what to do, on line 21 it defers the execution until the "next tick".

The root of the problem was simply that the "next tick" was a little too late. By the time the next tick came along, and streamToClient got invoked, the first chunk of data had already been read from MongoDB ... but since the call to pipe() had not executed yet, it was simply discarded. The end result was that the stream to the client was missing a chunk at the beginning, or even entirely empty.

The solution was to break things up a bit differently, so that the call to file.stream() happens inside the same task as the call to stream.pipe().

So that's our Leaky Abstraction for today; what looked like an immediate callback was deferred just enough to change the overall behavior. And that, in Node, anything that can be deferred, will be deferred, since that makes the overall application that much zippier.

Monday, July 02, 2012

You Cannot Correctly Represent Change Without Immutability

The title of this blog post is a quote by Rich Hickey, talking about the Datomic database. Its a beautiful statement, at once illuminating and paradoxical. It drives at the heart of the design of both Clojure and Datomic, and embraces the difference between identity and state.

What is change? That seems like an obvious question, but my first attempt at defining it was "some change to a quantifiable set of qualities about some object." Woops, I used change recursively there ... that's not going to help.

In the real world, things change in ways we can observe; the leaf falls from the tree, the water in the pot boils, the minute hand moves ever forward.

How do we recognize that things have changed? We can, in our memories, remember a prior state. We remember when the leaf was green and attached to a branch; we remember when the water came out of the tap, and we remember looking at the clock a few minutes ago. We can hold both states in our mind at the same time, and compare them.

How do we represent change in traditional, object-oriented technologies? Well, we have fields (or columnus) and we change the state in place:

  • leaf.setColor(BROWN).detachFromTree()
  • UPDATE LEAVES SET COLOR = 'BROWN' WHERE ID = ?ID
  • water.setTemperature(212)
  • or we see time advancing via System.currentTimeMillis()

Here's the challenge: given an object, how do you ask it about its prior state? Can you ask leaf.getTreeDetachedFrom()? Generally, you can't unless you've gone to some herculean effort: the new state overwrites the old state in place.

When Rich talks about conflating state with identity, this is what he means. With the identity and state conflated, then after the change in state, the leaf will now-have-always-been fallen from the tree, the water will now-have-always-been boiled, and the clock will now-eternally be at 9:49 AM.

What Clojure does in memory, and Datomic does in the database, is split identity and state. We end up with leaf1 as {:id "a317a439-50bb-4d37-838a-c8eef289e22f" :color :green :attached-to maple-tree} and leaf2 as {:id "a317a439-50bb-4d37-838a-c8eef289e22f" :color :brown :on-ground true}. The id is the same, but the other attributes can vary.

With immutability, changes in state are really new objects; a new version, or "quantifiable set of qualities", that does not affect the original version. It is possible to compare two different iterations of the same object to see the "deltas". In Datomic, you even have more meta-data about when such state changes occur, what else changed within the same transaction, and who is the responsible party for that transaction.

The essence here is not to think of an object as a set of slots you can put new data into. Instead, think of it as a time-line of different configurations of the object. The fact that late in the time-line, the leaf has fallen from the tree does not affect the fact that earlier on the time-line, the leaf was a bud on a branch. The identity of the leaf transcends all those different states.

In the past, I've built systems that required some of the features that Datomic provides; for example, being able to reconstruct the state of the entire database at some prior time, and strong auditing of what changes occurred to what entities at a specific time (or transaction). Rich knows that others have hit this class of problem; part of his selling point is to ask "and who really understands that query" (the one that reconstructs prior state). He knows people have done it, but he also knows no one is very happy about its performance, correctness, or maintainability ... precisely because traditional databases don't understand mutability: they live in that eternal-now, and drag your application into the same world view.

That's why I'm excited by Datomic; it embraces this key idea: separate identity from state by leveraging immutability and from the ensuing design, much goodness is an automatic by-product. Suddenly, we start seeing much of what we take as dogma when developing database-driven applications to be kludges on top of an unstable central idea: mutable state.

For example: read transactions are a way to gain stable view of interrelated data even as the data is being changed (in place); with Datomic, you always have a stable view of all data, because you operate on an immutable view of the entire database at some instance in time. Other transactions may add, change, or replace Datoms in the database, but any code that is reading from the database will be completely unaware of those changes, even as they lazily navigate around the entire database.