We looked at how to spot slow memory leaks earlier. In that blog post, we saw how to find slow resource leaks and how to distinguish those from the expected pool growth behaviour.
In this article, I'd like to explain how I go about identifying the root cause of slow resource leaks and ultimately how to fix them. Fixing slow resource leaks almost always requires code changes, so if you are not in a position to change the code, you cannot fix them yourself. In that case, you will have to find someone who can.
The process of identifying the root cause of a slow resource leak is similar to the process of finding a fast leak. The main difference is that time is very much against you. With a fast leak you can easily do a few iterations, but slow leaks may take weeks or months to manifest. There, a single iteration may well take a quarter of a year. For faster leaks, you can get away with some quick code changes and 'see how it does on live'. For slower leaks, that approach just means you'll never find the real problem.
The steps to follow are:
1. identify the leaking resource (done, yay!)
2. gather evidence regarding that resource
3. develop a thesis on what the problem is and see if the evidence supports it
4. present your analysis to someone you know will rip your proof to shreds
5. devise and implement a fix
Note how we usually only quickly do step 3 and then spend all of our energy on step 5, only to find that the real problem was something else. I should know, because I too make that mistake more often than I care to admit.
Once I know what resource is leaking (e.g. memory, file descriptors, CPU cycles, threads), I start gathering evidence about the leak. This evidence comes from several sources. If the leaking resource is something that is used only sparingly in your code, the application source code is a good place to start gathering. If it is something that is used everywhere (like memory, for instance), looking at the source code is usually time-consuming and not very productive. So here's a cheat sheet of where I look for evidence:
----> heap dumps
----> Java-monitor's thread pool graphs and also lsof or sockstat
----> thread dumps
----> Java-monitor's thread pool graphs and thread dumps
----> enable resource leak detection on JDBC driver
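To make the thread-related evidence a little more concrete, here is a minimal sketch that counts live threads per name family from inside the JVM. The names `ThreadCensus`, `family` and `countByFamily` are made up for this example, and the grouping rule (strip a trailing number) is an assumption about how your pools name their threads. It does not replace a full thread dump. It only gives you a number per pool that you can log and compare over weeks.

```java
import java.util.Map;
import java.util.TreeMap;

// Hypothetical helper: a rough census of live threads, grouped by name.
public class ThreadCensus {

    // Assumption: threads of one pool share a name and differ only in a
    // trailing number, e.g. "pool-1-thread-7" and "pool-1-thread-8".
    public static String family(String threadName) {
        return threadName.replaceAll("[-_#\\s]*\\d+$", "");
    }

    // Counts the threads that are alive right now, per name family.
    public static Map<String, Integer> countByFamily() {
        Map<String, Integer> counts = new TreeMap<>();
        for (Thread t : Thread.getAllStackTraces().keySet()) {
            counts.merge(family(t.getName()), 1, Integer::sum);
        }
        return counts;
    }

    public static void main(String[] args) {
        countByFamily().forEach((name, n) -> System.out.println(n + "\t" + name));
    }
}
```

If one family keeps growing between samples while the others stay flat, that is a good place to start reading the real thread dumps.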
When gathering evidence, it is important to keep an open mind. It is very tempting to jump to conclusions about what is happening. Instead, just look at the cold evidence and eliminate possible problems one by one.
A critical step is to present your evidence and line of thought to someone who is not afraid to challenge you. This step guards you against making a mistake in the analysis and having to wait a few weeks before you find out you were wrong. At this stage, someone who just nods and applauds may do more harm than good, however good that person's intentions. Find the right person and tell that person that you need a critical review.
Once your analysis has survived that challenge, you can finally start thinking about a solution. The solution needs to have two parts: 1) it needs to improve logging and monitoring in such a way that the system gives you evidence about the problem you just found and 2) it needs to actually solve the problem.
On the off chance that your solution was not correct, the improved logging will give you fresh new evidence to look at for your second (and hopefully final!) iteration.
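As an illustration of the "improve logging and monitoring" half of a fix, here is a minimal sketch of a ledger that remembers where each resource was acquired and can report the ones that were never released. `ResourceLedger` and its methods are hypothetical names for this example, not part of any library. It assumes you can call `opened()` and `closed()` around the code that acquires and releases the suspect resource.

```java
import java.io.PrintStream;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.atomic.AtomicLong;

// Hypothetical helper: tracks acquired-but-not-released resources.
public final class ResourceLedger {
    private final AtomicLong nextTicket = new AtomicLong();
    private final Map<Long, Throwable> open = new ConcurrentHashMap<>();

    // Call when the resource is acquired; keep the ticket with the resource.
    public long opened() {
        long ticket = nextTicket.incrementAndGet();
        open.put(ticket, new Throwable("acquired here"));
        return ticket;
    }

    // Call when the resource is released.
    public void closed(long ticket) {
        open.remove(ticket);
    }

    public int openCount() {
        return open.size();
    }

    // Logs the count and the call site of every resource that is still open.
    public void report(PrintStream out) {
        out.println("open resources: " + open.size());
        open.forEach((ticket, where) -> {
            StackTraceElement[] frames = where.getStackTrace();
            // Frame 0 is opened() itself; frame 1 is the code that called it.
            Object site = frames.length > 1 ? frames[1] : "unknown";
            out.println("  #" + ticket + " acquired at " + site);
        });
    }
}
```

Calling `report(System.out)` from a scheduled task once a day would be one way to use it. Capturing a stack trace on every acquisition has a cost, so this is something to enable for the suspect resource only.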