Tuesday, May 05, 2009

Clojure: Turns out, parsing is somewhat iterative, not functional

... at least for me. I'm working on some more real-world code in Clojure, and I'm struggling with turning a stream of tokens (representing XML content parsed from a stream) into a DOM-like hierarchy. The tokens represent open tags, attributes, close tags, and text content.

I've written dozens of recursive decent compilers in the past, in Pascal, C, PL/1 and Java. The problem is, that style of compiler works by consuming a stream (of characters or tokens) ... a destructive operation.

Logically, I want code such as:

List parseTemplate(tokens)
{
  result = []

  while (tokens not empty)
  {
    token = pop(tokens)
    if (token.type == text)
    {
      add text token to result;
      continue;
    }

    if (token.type == element)
    {
      e = parse_element (token, tokens)
      add e to result;
      continue;
  }
}

Now, in Clojure I can write a (parse-element) function that takes a bunch of tokens, and it can work its way through the tokens using Clojure's (loop) and (recur) forms ... but to the calling function, tokens will not have changed.

One approach might be to pass along some pretty complex state; literally a state machine (inside a map) so that each function can keep push and popping values into and out of the state before invoking each. This would involve recursing on each token. Anyway, I have a rough idea of what that would look like and it seems complicated.

Instead, I've been pounding my head, looking at Clojure's support for monads. Monad are a way to combine certain types of functions together in a way that appears as if there were mutable state. You provide some base types of monadic functions and, within the context of a monad definition, it streamlines the way those basic monadic functions can be strung together to form more complex monadic functions.

Still trying to wrap my head around the concept however. It gets really confusing because the real power of monads appears when the monadic values, the things returned from a monadic function, are themselves functions. Yep, that's where I get lost too.

Update: passed the initial Monad hurdle; I'm finally beginning to understand them, perhaps even enough to explain them to other people.

5 comments:

  1. Do you know about this?: http://kotka.de/projects/clojure/parser.html

    I haven't used it (indeed, I don't use Clojure) but it looks like the obvious way to parse in Clojure.

    ReplyDelete
  2. Also it looks like the solution you feel most comfortable with, is similar (exactly?) how the clojure.xml parser works.

    ReplyDelete
  3. Actually, clojure.xml/parse runs a SAXParser instance and it's full of vars, (set!), (binding), etc.

    I actually used that as the basis for a more namespace-aware parser that emits the XML token stream.

    I could probably accomplish more of what I want in my parser, but I really want to isolate the non-functional, SAX-based portion and keep the rest pure Clojure, and I want the eventual token-to-tree parser to be truly functional. If I can wrap my head around the state and maybe monads, I think I'll have something.

    ReplyDelete
  4. i don't know if this applies to Clojure but is a very easy reading about Monadic Parsing (asumes no Monads knowledge )

    http://cs.nott.ac.uk/~gmh/monparsing.pdf

    ReplyDelete
  5. I think you're trying to use parser combinators, of which Kotka is an implementation for Clojure.

    If you're on the JVM Scala provides good support, though not very well documented, for parser combinators.

    This is a very good introduction:
    http://www.codecommit.com/blog/scala/formal-language-processing-in-scala

    ReplyDelete

Please note that this is not a support forum for Tapestry. Requests for help will be deleted. Please subscribe to the Tapestry user mailing list if you are in need of support, or contact me directly for professional (for pay) support.

Spammers: Don't bother. I delete your comments and it's a waste of time for both of us. 垃圾邮件发送者:不要打扰。我删除您的评论和它的时间对我们双方的浪费