Tapestry Training -- From The Source

Let me help you get your team up to speed in Tapestry ... fast. Visit howardlewisship.com for details on training, mentoring and support!

Wednesday, June 06, 2012

Synchronized Considered Harmful

News flash: concurrency is hard. Any time you have mutable data and multiple threads, you are just asking for abuse, and synchronized is simply not going to be your savior ... but the concurrency facilities of JDK 1.5 just might be.

I was recently contacted by a client who was load testing their Tapestry 5.3.3 application; they were using Tomcat 6.0.32 with 500 worker threads, on a pretty beefy machine: Intel Xeon X7460 @ 2.66GHz, OpenJDK 64-Bit Server VM (14.0-b16, mixed mode). That's a machine with six cores, and 16 MB of L2 cache.

For all that power, they were tapping out at 450 requests per second. That's not very good when you have 500 worker threads ... it means that you've purchased memory and processing power just to see all those worker threads block, and you get to see your CPU utilization stay low. When synchronization is done properly, increasing the load on the server should push CPU utilization to 100%, and response time should be close to linear with load (that is to say, all the threads should be equally sharing the available processing resources) until the hard limit is reached.

Fortunately, these people approached me not with a vague performance complaint, but with a detailed listing of thread contention hotspots.

The goal with Tapestry has always been to build the code right initially, and optimize the code later if needed. I've gone through several cycles of this over the past couple of years, optimizing page construction time, or memory usage, or throughput performance (as here). In general, I follow Brian Goetz's advice: write simple, clean, code and let the compiler and Hotspot figure out the rest.

Another piece of advice from Brian is that "uncontested synchronized calls are very cheap". Many of the hotspots located by my client were, in fact, simple synchronized methods that did some lazy initialization. Here's an example:
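A minimal sketch of the pattern (the field and helper names here are assumed for illustration, based on the discussion that follows):

private Messages messages;

// Lazy initialization guarded by the instance's monitor.
public synchronized Messages getMessages() {
    if (messages == null) {
        messages = elementResources.getMessages(componentModel);
    }

    return messages;
}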

In this example, getting the messages can be relatively time consuming and expensive, and is often not necessary at all. That is, in most instances of the class, the getMessages() method is never invoked. There were a bunch of similar examples of optional things that are often not needed at all ... but are heavily used in the cases where they are needed.

It turns out that "uncontested" really means uncontested: you'd better be sure that no two threads ever hit synchronized methods of the same instance at the same time. I chatted with Brian at the Hacker Bed & Breakfast about this, and he explained that you can quickly go from "extremely cheap" to "asymptotically expensive" when there's any potential for contention. The synchronized keyword is very limited in one area: when exiting a synchronized block, all threads that are waiting for that lock must be unblocked, but only one of those threads gets to take the lock; all the others see that the lock is taken and go back to the blocked state. That's not just a lot of wasted processing cycles: often the context switch to unblock a thread also involves paging memory off the disk, and that's very, very expensive.

Enter ReentrantReadWriteLock: this is an alternative that allows any number of readers to share a lock, but only a single writer. When a thread attempts to acquire the write lock, it blocks until all reader threads have released the read lock. The cost of managing ReentrantReadWriteLock's state is somewhat higher than that of synchronized, but it has the huge advantage of letting multiple reader threads operate simultaneously. That means much, much higher throughput.

In practice, this means you must acquire the shared read lock to look at a field, and acquire the write lock in order to change the field.

ReentrantReadWriteLock is smart about waking only the right thread or threads when either the read lock or the write lock is released. You don't see the same thrash you would with synchronized: if one thread is waiting for the write lock, and the last reader releases the read lock, ReentrantReadWriteLock will (likely) unblock just that one waiting thread.

Using synchronized is easy; with an explicit ReentrantReadWriteLock there's a lot more code to manage:
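A sketch of the reworked version (again with assumed names; the lock field follows the standard java.util.concurrent.locks idiom):

import java.util.concurrent.locks.ReadWriteLock;
import java.util.concurrent.locks.ReentrantReadWriteLock;

private final ReadWriteLock lock = new ReentrantReadWriteLock();

private Messages messages;

public Messages getMessages() {
    lock.readLock().lock();

    try {
        if (messages == null) {
            obtainMessages();
        }

        return messages;
    } finally {
        lock.readLock().unlock();
    }
}

private void obtainMessages() {
    // A read lock can't be upgraded to a write lock, so release it first;
    // this opens a tiny window in which another thread may grab the write lock.
    lock.readLock().unlock();
    lock.writeLock().lock();

    try {
        // Double check: another thread may have initialized the field
        // while this thread was waiting for the write lock.
        if (messages == null) {
            messages = elementResources.getMessages(componentModel);
        }
    } finally {
        // Downgrade: acquire the read lock before releasing the write lock
        // (ReentrantReadWriteLock allows this), so the caller's finally
        // block still has a read lock to release.
        lock.readLock().lock();
        lock.writeLock().unlock();
    }
}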

I like to avoid nested try ... finally blocks, so I broke it out into separate methods.

Notice the "lock dance": it is not possible to acquire the write lock if any thread, even the current thread, has the read lock. This opens up a tiny window where some other thread might pop in, grab the write lock and initialize the messages field. That's why it is desirable to double check, once the write lock has been acquired, that the work has not already been done.

Also notice that things aren't quite symmetrical: with ReentrantReadWriteLock it is allowable for the current thread to acquire the read lock before releasing the write lock. This helps to minimize context switches when the write lock is released, though it isn't strictly necessary.

Is the conversion effort worth it? Well, so far, simply by converting synchronized to ReentrantReadWriteLock, adding a couple of additional caches (also using ReentrantReadWriteLock), plus some work optimizing garbage collection by expanding the eden space, we've seen significant improvements: from 450 req/sec to 2000 req/sec ... and there are still a few minor hotspots to address. I think that's been worth a few hours of work!

6 comments:

jed said...

Yeah, lazy initialisation is painful and easy to get wrong. I got so sick of seeing people get it wrong in various ways that I wrote a utility class encapsulating the right way to do it, along with the fastest access possible (basically volatile semantics).

Declaration is basically:

private final LazyReference<Messages> messages = new LazyReference<Messages>() {
    protected Messages create() {
        return elementResources.getMessages(componentModel);
    }
};

and usage is

Messages m = messages.get();

some docs: https://labs.atlassian.com/wiki/display/CONCURRENT/LazyReference+and+ResettableLazyReference
src is now on bitbucket: https://bitbucket.org/atlassian/atlassian-util-concurrent/
for instance: https://bitbucket.org/atlassian/atlassian-util-concurrent/src/492a00b90972/src/main/java/com/atlassian/util/concurrent/LazyReference.java

There is also a form of memoizing Supplier that supports timed-out caching:
https://bitbucket.org/atlassian/atlassian-util-concurrent/src/492a00b90972/src/main/java/com/atlassian/util/concurrent/Lazy.java

Unknown said...

Why not use a double check like this:

public Map getMessages() {
    if (messages == null) {
        synchronized (mutex) {
            if (messages == null) {
                // load messages here
            }
        }
    }
    return messages;
}

Unknown said...

@Unknown

That's the classic double-checked locking code that was popularized, incorrectly, around 2001. It does not ensure that all threads see a consistent view of the messages field.

Unknown said...

@Jed

LazyReference looks cool; you should do a follow-on blog post about it. In my circumstance, one of the issues is to reduce the number of objects in the system: in some of my client's larger applications, they have dozens (sometimes hundreds) of pages, each with thousands of components, and multiple localizations. Each component consists of a number of objects, and having excess objects can, really, truly, add up.

I should do some testing of volatile (with a single write lock) vs. the technique discussed in the main blog posting. I suspect the volatile approach would be faster for lazy initialization cases (but not for cases where the field changes value more than once).

Unknown said...

Just pulled out Goetz's book. @Unknown: see page 349 (your example is marked "don't do this"). This is DCL ("double checked locking"), broken before 1.5, works in 1.5+ only if the field (messages) is volatile.
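For reference, a corrected sketch of that snippet (same assumed names as above); the change that makes it safe under the 1.5+ memory model is declaring the field volatile:

private volatile Messages messages;
private final Object mutex = new Object();

public Messages getMessages() {
    // First check is unsynchronized; the volatile read guarantees visibility.
    if (messages == null) {
        synchronized (mutex) {
            // Second check: another thread may have won the race.
            if (messages == null) {
                messages = elementResources.getMessages(componentModel);
            }
        }
    }

    return messages;
}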

Unknown said...

A note: I chose the getMessages() example for simplicity; further research has shown that it is safe to dispense with both synchronized and the locks for this particular field: it is write-once, and if multiple threads invoke elementResources.getMessages(), they all receive the same cached value.
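In sketch form (names assumed as before), that leaves a plain unsynchronized getter; the race is benign here only because the underlying call is idempotent and hands every thread the same cached instance:

private Messages messages;

public Messages getMessages() {
    // Benign race: two threads may both see null and both make the call,
    // but they receive the same cached instance, so the field only ever
    // takes on a single value.
    if (messages == null) {
        messages = elementResources.getMessages(componentModel);
    }

    return messages;
}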

However, many of the other methods of InternalComponentResourcesImpl require the full-blown read/write lock technique for correctness and scalability.