I've seen this question before on resumes and other people's blogs. I've never had a specific example that worked well. Previously, I've had vague stories of struggling with, say, Javassist ... but the best story here would involve my own code.
A few weeks ago, I noticed that some of my tests, the integration tests, didn't run very well on the Tapestry Bamboo server (our continuous integration server). Occasionally, even when developing on my Mac, I'd see these anomalous errors.
When things work in unit tests and fail in integration tests, one of the first places to look is for concurrency issues. Tapestry has a lot of code related to concurrency: synchronizing, caching, clearing of caches, on-demand instantiation, the works.
I started seeing things that made me question Java reality. For instance, this method is only invoked from a dynamic proxy, from a synchronized method, and the proxy only invokes the method once, yet I was seeing multiple calls:
public class OneShotServiceCreator implements ObjectCreator { private final ServiceDef _serviceDef; private final ObjectCreator _delegate; private boolean _locked; public OneShotServiceCreator(ServiceDef serviceDef, ObjectCreator delegate) { _serviceDef = serviceDef; _delegate = delegate; } /** * We could make this method synchronized, but in the context of creating a service for a proxy, * it will already be synchronized (inside the proxy). */ public Object createObject() { if (_locked) throw new IllegalStateException(IOCMessages.recursiveServiceBuild(_serviceDef)); _locked = true; return _delegate.createObject(); } }
The code that calls this looks something like:
private synchronized SomeService _delegate() { if (_delegate == null) { _delegate = (SomeService) _creator.createObject(); _creator = null; } return _delegate; }
You can see how this would tend to drive you a bit crazy. Eventually, I realized what was going on ... sometimes a runtime exception (an OutOfMemoryError) would be thrown, so the proxy would never complete _delegate(), and would (later, possibly in a different thread), re-invoke this the createObject() method ... and fail, because the lock was set. Solution:
*/ public Object createObject() { if (_locked) throw new IllegalStateException(IOCMessages.recursiveServiceBuild(_serviceDef)); // Set the lock, to ensure that recursive service construction fails. _locked = true; try { return _delegate.createObject(); } catch (RuntimeException ex) { _log.error(IOCMessages.serviceConstructionFailed(_serviceDef, ex), ex); // Release the lock on failure; the service is now in an unknown state, but we may // be able to continue from here. _locked = false; throw ex; } }
I also started seeing bizarre error messages, like: Unable to resolve page 'MyPage' to a component class name. Available page names: Start, MyPage, YourPage.
. What the hell was going on there ... was something modifying the underlying CaseInsensitiveMap? But the access methods are synchronized.
Once you start seeing bizarre concurrent behavior, and after listening to a few Brian Goetz talks about concurrency (not to mention his great book) ... well, you can start getting paranoid. Maybe its a bug in the JVM (trust me, it's never going to be a bug in the JVM). Maybe I'm not understanding access to effectively immutable objects outside of synchronized blocks (that's a bit more reasonable).
Then I saw something even more bizarre:
publicT getService(String serviceId, Class serviceInterface, Module module) { notBlank(serviceId, "serviceId"); notNull(serviceInterface, "serviceInterface"); // module may be null. ServiceDef def = _moduleDef.getServiceDef(serviceId); if (def == null) throw new IllegalArgumentException(IOCMessages.missingService(serviceId)); if (notVisible(def, module)) throw new RuntimeException(IOCMessages.serviceIsPrivate(serviceId)); Object service = findOrCreate(def); try { return serviceInterface.cast(service); } catch (ClassCastException ex) { // This may be overkill: I don't know how this could happen // given that the return type of the method determines // the service interface. throw new RuntimeException(IOCMessages.serviceWrongInterface(serviceId, def .getServiceInterface(), serviceInterface)); } }
See the comment about how that code is not reachable? It was reached! It took a while, but I found out that sometimes I was getting back the wrong ServiceDef object out of the ModuleDef. Rarely, but sometimes.
Then it struck me ... all of this strange behavior was traced to my CaseInsensitiveMap implementation. Sure it has a 100% code coverage test suite ... but when I double checked the code I found some sloppiness: it uses a couple of instance variables as a scratch pad while searching. This means that, under the right timing, even reads of the effectively immutable Map by different threads would interfere with each other, with Thread A getting the result from Thread B's query. Here's a diff of the solution, which basically moved those two scratch variables into their own inner class.
Things are working perfectly (for now), which is a great relief. This bug hunt was a distraction from other things, but better to tackle it now than later. This is also a good example of the need for a continuous integration server ... the fact that the server is on a different OS and JDK ensured that the tests would fail pretty consistently on the CI server even though they mostly worked on my Mac. Brian states this as well is his book ... test on all sorts of hardware with all sorts of memory configurations, because you don't know which one aligns with your bugs and brings them out into the open.