Finding slow resource leaks is key to keeping your application servers up and running for a very long time. Here I explain how I use Java-monitor to spot slow resource leaks and how to verify that they are actual leaks and not just extra pre-allocation in some HTTP connector or database pool. If you follow these steps you can get out of the periodic restart cycle and really start relying on your servers.
Slow resource leaks are a problem because they force you to restart your application server periodically. That in turn means that you always have to keep your eyes on the servers because they might blow up at any time. Ultimately, you cannot rely on servers that suffer from slow resource leaks. Tracking down and resolving slow resource leaks takes time and patience. It is not really hard to do once you have done it a few times.
You also need a monitoring tool that shows statistics from your server over weeks or even months. You can do this with Java-monitor if you enable longer data storage on your servers. To enable this, first start using Java-monitor for your servers. On the page with the graphs of your server, you can enable the longer data by switching to the "1 week" or "1 month" view, as shown below.
The first step is to look at a server's data over the past week. For finding resource leaks, I find that plotting data over a week gives me a better view than Java-monitor's standard two days of data does. Peaks smooth out and the graphs allow me to look at how, for example, memory use develops over time.
In order to predict impending outages and slow resource leaks I look for two patterns. The first pattern to check is that the resource usage does not rise over time, but instead rises and falls with the daily tide of traffic increasing and decreasing. When a day's traffic has subsided, I expect the resource usage to drop back to a nominal level. I check that resource usage drops back to the same level every time.
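If you have the raw numbers behind a graph, the same check can be sketched in code. This is only an illustration and not something Java-monitor provides: the class name, the method names and the sample data are made up, and it assumes evenly spaced samples with a fixed number of samples per day. It takes the lowest value of each day as that day's nominal level and fits a straight line through those levels. A slope that stays clearly above zero means the usage no longer drops back to the same level.

```java
import java.util.Arrays;

// Hypothetical helper, not part of Java-monitor.
final class BaselineTrend {

    /** Returns the lowest sample of each complete day. */
    static double[] dailyMinimums(double[] samples, int samplesPerDay) {
        int days = samples.length / samplesPerDay;
        double[] minimums = new double[days];
        for (int day = 0; day < days; day++) {
            double min = Double.POSITIVE_INFINITY;
            for (int i = day * samplesPerDay; i < (day + 1) * samplesPerDay; i++) {
                min = Math.min(min, samples[i]);
            }
            minimums[day] = min;
        }
        return minimums;
    }

    /** Least-squares slope of the values against their index, so "units per day". */
    static double slope(double[] values) {
        int n = values.length;
        if (n < 2) {
            return 0.0; // not enough days to say anything about a trend
        }
        double meanX = (n - 1) / 2.0;
        double meanY = Arrays.stream(values).average().orElse(0.0);
        double numerator = 0.0;
        double denominator = 0.0;
        for (int i = 0; i < n; i++) {
            numerator += (i - meanX) * (values[i] - meanY);
            denominator += (i - meanX) * (i - meanX);
        }
        return numerator / denominator;
    }

    public static void main(String[] args) {
        // Made-up data: 4 samples per day for 3 days, with a creeping baseline.
        double[] used = {100, 180, 220, 130, 110, 190, 230, 140, 120, 200, 240, 150};
        double[] minimums = dailyMinimums(used, 4);
        System.out.println("daily minimums: " + Arrays.toString(minimums));
        System.out.println("growth per day: " + slope(minimums));
    }
}
```

How large a slope counts as suspicious depends on the resource and on how noisy the server is, so I would treat the number as a hint to go look at the graph, not as an alarm.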
Note that I deliberately use a vague term like 'resource'. In practice I look at all graphs for memory, file descriptors and HTTP connector and database connection pools. All of these are candidates for slow resource leakage.
When I see a resource that seems to be growing slightly, I switch to the monthly view of the data, or even the yearly view. This compresses the data in the graph and makes slow resource leaks much easier to spot.
Here is an example of a suspicious change in resource usage. In this case the garbage collectors on the system suddenly became more active. There was no change in the system or the system's load at the time of this change.
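To make 'resource' a bit more concrete, here is a minimal sketch of how a few of these numbers can be read from inside a JVM using the standard java.lang.management beans. This is not how Java-monitor collects its data. The class and method names are my own, and the file descriptor count relies on com.sun.management.UnixOperatingSystemMXBean, which is only available on Unix-like systems with a HotSpot-style JVM. Connector and database pool sizes are container specific and are left out here.

```java
import java.lang.management.ManagementFactory;
import java.lang.management.OperatingSystemMXBean;

// Hypothetical helper, not part of Java-monitor.
final class ResourceSnapshot {

    /** Heap memory currently in use, in bytes. */
    static long heapUsedBytes() {
        return ManagementFactory.getMemoryMXBean().getHeapMemoryUsage().getUsed();
    }

    /** Number of live threads in this JVM, daemon threads included. */
    static int liveThreads() {
        return ManagementFactory.getThreadMXBean().getThreadCount();
    }

    /** Open file descriptors, or -1 when this JVM does not expose the count. */
    static long openFileDescriptors() {
        OperatingSystemMXBean os = ManagementFactory.getOperatingSystemMXBean();
        if (os instanceof com.sun.management.UnixOperatingSystemMXBean) {
            return ((com.sun.management.UnixOperatingSystemMXBean) os).getOpenFileDescriptorCount();
        }
        return -1;
    }

    public static void main(String[] args) {
        System.out.println("heap used (bytes):     " + heapUsedBytes());
        System.out.println("live threads:          " + liveThreads());
        System.out.println("open file descriptors: " + openFileDescriptors());
    }
}
```

A single snapshot says nothing about leaks. The values only become useful when they are sampled regularly and kept for weeks.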
The second pattern I look out for is steep drops in resource allocation after server restarts. When there is a significant drop in resource usage after an application server was restarted, this may indicate a slow resource leak. After a restart, it is normal to see lower resource usage than before the restart, since all pools and caches begin empty.
Here is an example of a restart-related change in resource usage that does not indicate a problem. For some time after the restart (a few days even) the resource usage has changed, but it then returns to its normal level and stays there.
And here is another one. Even more suspicious, because the resource usage does not grow back after a restart. It just drops and stays down. This is almost certainly a problem.
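The difference between these two restart examples can be expressed as a simple comparison. The sketch below is only an illustration under my own assumptions: the names, the sample data and the 10% tolerance are made up. It averages the most recent samples after a restart and checks whether that average is back within the tolerance of the level seen before the restart.

```java
// Hypothetical helper, not part of Java-monitor.
final class RestartCheck {

    /**
     * Returns true when the average of the last tailSamples values after a restart
     * is within tolerance (a fraction, such as 0.10) of the level before the restart.
     */
    static boolean returnsToPreviousLevel(double levelBefore, double[] afterRestart,
                                          int tailSamples, double tolerance) {
        if (tailSamples <= 0 || tailSamples > afterRestart.length) {
            throw new IllegalArgumentException("tailSamples must be between 1 and the number of samples");
        }
        double sum = 0.0;
        for (int i = afterRestart.length - tailSamples; i < afterRestart.length; i++) {
            sum += afterRestart[i];
        }
        double recentLevel = sum / tailSamples;
        return Math.abs(recentLevel - levelBefore) <= tolerance * levelBefore;
    }

    public static void main(String[] args) {
        double before = 400;
        // Made-up daily levels after a restart.
        double[] recovers = {150, 220, 310, 380, 395, 405};
        double[] staysDown = {150, 155, 160, 158, 162, 160};
        System.out.println("recovers:   " + returnsToPreviousLevel(before, recovers, 2, 0.10));
        System.out.println("stays down: " + returnsToPreviousLevel(before, staysDown, 2, 0.10));
    }
}
```

With this data the first series comes back to the old level and the second does not, which matches the harmless and the suspicious case described above.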
Once I have identified a resource usage pattern as suspicious, I have to confirm that this is actually a slow leak. It may well be that a cache or a pool filled up some more, leading to higher resource usage over time, without being an actual leak. In this step I flip through the graphs of thread and database pools and the graphs for HTTP sessions. I try to explain the elevated resource use from these graphs.
For example, here we see the threads that are active in a JVM, with some strange dips in the number of threads. These dips show around system restarts.
However, if I compare that graph with the HTTP connector pool graph on the same server I see that the number of pre-allocated threads in an HTTP connector pool has similarly wobbled up and down.
This probably means there is no slow resource leak. Most likely, the extra threads in the pool keep state around that causes the other resources to also grow. I might make a note to keep an eye on this server specifically, but I don't start digging into the resource leak just yet.
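Comparing two graphs by eye can be backed up with a number. As a rough sketch, assuming you have both series sampled at the same moments, the Pearson correlation between the JVM thread count and the connector pool size tells you how closely they move together. The class name and the data below are made up for illustration.

```java
// Hypothetical helper, not part of Java-monitor.
final class PoolCorrelation {

    /**
     * Pearson correlation of two equally long series. The result lies between -1 and 1,
     * and is NaN when one of the series is constant.
     */
    static double pearson(double[] x, double[] y) {
        if (x.length != y.length || x.length < 2) {
            throw new IllegalArgumentException("need two series of equal length with at least 2 samples");
        }
        int n = x.length;
        double sumX = 0.0;
        double sumY = 0.0;
        for (int i = 0; i < n; i++) {
            sumX += x[i];
            sumY += y[i];
        }
        double meanX = sumX / n;
        double meanY = sumY / n;
        double covariance = 0.0;
        double varianceX = 0.0;
        double varianceY = 0.0;
        for (int i = 0; i < n; i++) {
            double dx = x[i] - meanX;
            double dy = y[i] - meanY;
            covariance += dx * dy;
            varianceX += dx * dx;
            varianceY += dy * dy;
        }
        return covariance / Math.sqrt(varianceX * varianceY);
    }

    public static void main(String[] args) {
        // Made-up samples taken at the same moments, with dips around restarts.
        double[] jvmThreads = {50, 52, 60, 58, 40, 42};
        double[] connectorPoolThreads = {25, 27, 35, 33, 15, 17};
        System.out.println("correlation: " + pearson(jvmThreads, connectorPoolThreads));
    }
}
```

A value close to 1 suggests that the pool explains the changes in thread count. A value near 0 means the pool does not explain them and the search has to continue elsewhere.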
Only when I cannot explain the extra allocated resources from the pools and the number of active HTTP sessions do I really start digging, as explained in this followup post: how to find the root cause of slow resource leaks.