Tapestry Central: Q: What's your toughest coding challenge? A:...

Monday, March 05, 2007

Q: What's your toughest coding challenge? A:...

I've seen this question before on resumes and other people's blogs. I've never had a specific example that worked well. Previously, I've had vague stories of struggling with, say, Javassist ... but the best story here would involve my own code.

A few weeks ago, I noticed that some of my tests, the integration tests, didn't run very well on the Tapestry Bamboo server (our continuous integration server). Occasionally, even when developing on my Mac, I'd see these anomalous errors.

When things work in unit tests and fail in integration tests, one of the first places to look is for concurrency issues. Tapestry has a lot of code related to concurrency: synchronizing, caching, clearing of caches, on-demand instantiation, the works.

I started seeing things that made me question Java reality. For instance, this method is only invoked from a dynamic proxy, from a synchronized method, and the proxy only invokes the method once, yet I was seeing multiple calls:

public class OneShotServiceCreator implements ObjectCreator
{
    private final ServiceDef _serviceDef;

    private final ObjectCreator _delegate;

    private boolean _locked;

    public OneShotServiceCreator(ServiceDef serviceDef, ObjectCreator delegate)
    {
        _serviceDef = serviceDef;
        _delegate = delegate;
    }

    /**
     * We could make this method synchronized, but in the context of creating a service for a proxy,
     * it will already be synchronized (inside the proxy).
     */
    public Object createObject()
    {
        if (_locked)
            throw new IllegalStateException(IOCMessages.recursiveServiceBuild(_serviceDef));

        _locked = true;

        return _delegate.createObject();
    }

}

The code that calls this looks something like:

private synchronized SomeService _delegate()
{
  if (_delegate == null) {
    _delegate = (SomeService) _creator.createObject();
    _creator = null;
  }

  return _delegate;
}

You can see how this would tend to drive you a bit crazy. Eventually, I realized what was going on ... sometimes a runtime exception (an OutOfMemoryError) would be thrown, so the proxy would never complete _delegate(), and would (later, possibly in a different thread), re-invoke this the createObject() method ... and fail, because the lock was set. Solution:

     */
    public Object createObject()
    {
        if (_locked)
            throw new IllegalStateException(IOCMessages.recursiveServiceBuild(_serviceDef));

        // Set the lock, to ensure that recursive service construction fails.

        _locked = true;

        try
        {
            return _delegate.createObject();
        }
        catch (RuntimeException ex)
        {
            _log.error(IOCMessages.serviceConstructionFailed(_serviceDef, ex), ex);

            // Release the lock on failure; the service is now in an unknown state, but we may
            // be able to continue from here.

            _locked = false;

            throw ex;
        }

    }

I also started seeing bizarre error messages, like: Unable to resolve page 'MyPage' to a component class name. Available page names: Start, MyPage, YourPage.. What the hell was going on there ... was something modifying the underlying CaseInsensitiveMap? But the access methods are synchronized.

Once you start seeing bizarre concurrent behavior, and after listening to a few Brian Goetz talks about concurrency (not to mention his great book) ... well, you can start getting paranoid. Maybe its a bug in the JVM (trust me, it's never going to be a bug in the JVM). Maybe I'm not understanding access to effectively immutable objects outside of synchronized blocks (that's a bit more reasonable).

Then I saw something even more bizarre:

    public  T getService(String serviceId, Class serviceInterface, Module module)
    {
        notBlank(serviceId, "serviceId");
        notNull(serviceInterface, "serviceInterface");
        // module may be null.

        ServiceDef def = _moduleDef.getServiceDef(serviceId);

        if (def == null) throw new IllegalArgumentException(IOCMessages.missingService(serviceId));

        if (notVisible(def, module))
            throw new RuntimeException(IOCMessages.serviceIsPrivate(serviceId));

        Object service = findOrCreate(def);

        try
        {
            return serviceInterface.cast(service);
        }
        catch (ClassCastException ex)
        {
            // This may be overkill: I don't know how this could happen
            // given that the return type of the method determines
            // the service interface.

            throw new RuntimeException(IOCMessages.serviceWrongInterface(serviceId, def
                    .getServiceInterface(), serviceInterface));
        }
    }

See the comment about how that code is not reachable? It was reached! It took a while, but I found out that sometimes I was getting back the wrong ServiceDef object out of the ModuleDef. Rarely, but sometimes.

Then it struck me ... all of this strange behavior was traced to my CaseInsensitiveMap implementation. Sure it has a 100% code coverage test suite ... but when I double checked the code I found some sloppiness: it uses a couple of instance variables as a scratch pad while searching. This means that, under the right timing, even reads of the effectively immutable Map by different threads would interfere with each other, with Thread A getting the result from Thread B's query. Here's a diff of the solution, which basically moved those two scratch variables into their own inner class.

Things are working perfectly (for now), which is a great relief. This bug hunt was a distraction from other things, but better to tackle it now than later. This is also a good example of the need for a continuous integration server ... the fact that the server is on a different OS and JDK ensured that the tests would fail pretty consistently on the CI server even though they mostly worked on my Mac. Brian states this as well is his book ... test on all sorts of hardware with all sorts of memory configurations, because you don't know which one aligns with your bugs and brings them out into the open.

6 comments:

David Illsley said...: Hi,
I've recently been encouraged to look at static analysis. I'd be really interested to know if FindBugs or similar would have picked up on the potential problem?

Looks like a difficult problem, well done getting it fixed.
David; 11:23 PM
David Peterson said...: Hang on, an OutOfMemoryError is an Error, not a RuntimeException!; 12:18 AM
Gabriel Kastenbaum said...: Thanks! Very interesting!!

If you like weird bugs you can also have a scratch on classloading exception :-); 1:01 AM
phoet said...: funny, i ordered that book last saturday :-) http://www.briangoetz.com/

would you might changing the css-color of your not-visited-links to something that differs from the rest of the text, or underline them?; 2:07 AM
Dhanji R. Prasanna said...: a bit off topic. Have you seen google-guice howard? their method DI looks suspicious similar to Tap4 @InjectObject annotations =)

Everyone is singing its praises as people do with anything google. It is supposedly 100x faster than spring (see wiki-page on comparison).

How does it compare to tap5 ioc? I seem to recall lots of annotation-based injection there. Any plans to release tap5-ioc outside of tapestry (or are u sticking with hivemind 2)?

Dhanji.; 10:29 PM
Howard Lewis Ship said...: I've been chatting with "Crazy" Bob Lee, but since each of us in convinced we have the coolest thing since sliced bread, not much is happening of it.

The cool part of Spring is not its IoC container, it's all the stuff you connect together with the IoC container.; 10:55 PM