Pate's slow load time case
is a really interesting one. I have to say I was dumbstruck about the performance issues he was reporting, especially when he started telling me about his configuration. I thought he must have been feeding me measurements and configuration parameters from two different systems. It just did not make any sense.
Let's review, since Pate and I chatted on IRC quite a bit and you guys probably missed that. From Pate's report on Java-monitor, it was clear that his system was swamped in stop-the-world garbage collects. When I asked a bit more, Pate told me on IRC that his memory configuration was "-Xms512M -Xmx4096M -XX:MaxPermSize=1024M". Well, that's nice and sane for a large production system.
To my confusion he then told me that Java-monitor reported that only 20% of the heap memory was in use. I was unable to fit that into the picture, because Sun's JVM's do not use the stop-the-world garbage collectors until their old generations have filled up to well over 70% of their capacity. And yet, here was a JVM doing stop-the-world garbage collects at 20% heap memory capacity. Bizarre.
On top of this, he reported some really, really strange performance figures for test JSP's he as running. Putting a "hello world" JSP in an empty web app on the same server would be served in a few milliseconds, as expected. Putting that very same JSP into their own application's web app would give response times of several hundreds of milliseconds. Pate noted that there were some servlet filters that would match the JSP, so that explained what Java code was responsible for the slowdown.
As it turns out, the answer was really simple: the application was doing it to itself. An enterprising developer had decided to put in a call to System.gc()
, the call in Java to suggest that the JVM do a garbage collect.
This call does not look so bad in itself. He'd placed it after pruning a large buffer, so he may have thought that that would be a good time to give the JVM a chance to take out the trash. What the developer did not realize was that System.gc() triggers a stop-the-world garbage collect cycle.
Now why is that so bad? It's only a few milliseconds overhead on each HTTP request, right? What harm can 100 ms do when there is lots of XML marshalling also going on? Well, the devil is in the fact that these 100 ms are not imposed on the current thread alone. Since you've just caused a stop-the-world garbage collect, all processing threads have to wait for 100 ms. What's worse, they all call System.gc() too. So each thread gets (number-of-threads * 100 ms) of GC overhead.
As if that was not enough: the number of threads that are in the process of handling an HTTP request grows larger and larger as requests are not processed quickly enough. More requests processing means more requests arriving and starting and calling System.gc(), thus causing yet more delays.
I feel sorry for Pate for having to work weekends to resolve this, but I am glad that he did. I learned a few very valuable lessons this week. I owe him a beer for that.
PS. Here is some reading material on tuning garbage collection