Recently, we had some issues where probes would get 'stuck'. Further investigation revealed a mismatch between what the probe thought it was doing (sending data and waiting for confirmation from the server) and what the server thought (it had just rebooted and, as far as it knew, there were no probes).
This falls into the domain of the first of Peter Deutsch's fallacies of distributed computing: The network is reliable.
The problem here was that our probes were relying purely on what the network was telling them about the state of their connection. As the aforementioned fallacy argues, the network is not reliable and we therefore cannot rely solely on the information we get from it.
In our case, the problem was discovered fairly quickly, since some of your probes just stopped working. In other systems, these problems may not be as easy to see. Maybe the 'stuck' network connection is doing some background data loading. In that case, all we'd see is that some minor part of the system ceases to function.
In high volume systems, network connections that stop functioning may cause the feeding processes to back up. They will queue the data that should be going out onto the wire. Such queues are usually in memory. The longer the data has to wait, the more memory is needed to store it. If the data is never sent out onto the wire, it stays in memory until your system dies.
The problem can easily be remedied by specifying time-out values on your network connections. The longer the time-out, the more data may be queued up on the network connection. Specifying an infinite time-out allows the queue to the network connection to grow without bound, until it has used all available memory.
For the Java-monitor probes, which use HttpURLConnection, we do this by specifying the connect and the read timeouts. Instead of getting stuck, the probes will go into their normal error handling cycle and restart automatically.
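As a rough illustration of what that looks like, here is a minimal sketch. It is not the actual probe code: the class name, the helper method and the endpoint URL are made up for this example, and the two-minute value is simply the one this post settles on. Only setConnectTimeout and setReadTimeout are the real HttpURLConnection calls.

```java
import java.io.IOException;
import java.net.HttpURLConnection;
import java.net.URI;
import java.net.URL;

// Hypothetical helper, not taken from the Java-monitor probe.
final class ProbeConnections {
    // Two minutes, expressed in milliseconds as the API expects.
    static final int TIMEOUT_MILLIS = 2 * 60 * 1000;

    /** Prepares (but does not yet open) an HTTP connection with both time-outs set. */
    static HttpURLConnection open(URL url) throws IOException {
        HttpURLConnection connection = (HttpURLConnection) url.openConnection();
        // Give up if the TCP connection cannot be established in time.
        connection.setConnectTimeout(TIMEOUT_MILLIS);
        // Give up if the server goes silent while we wait for its response.
        connection.setReadTimeout(TIMEOUT_MILLIS);
        return connection;
    }

    public static void main(String[] args) throws IOException {
        // Made-up endpoint; nothing goes over the wire until the connection is used.
        URL url = URI.create("http://localhost:8080/probe").toURL();
        HttpURLConnection connection = open(url);
        System.out.println("connect time-out: " + connection.getConnectTimeout() + " ms");
        System.out.println("read time-out: " + connection.getReadTimeout() + " ms");
    }
}
```

When either time-out expires, the call that was waiting throws a java.net.SocketTimeoutException, which is an IOException, so it can flow into whatever error handling the caller already has. Both values default to 0, which means "wait forever".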
The tricky part is to determine the correct timeout value. For Java-monitor that is easy, because data is only 'fresh' for a minute or so and connections only live for a few seconds, max. We use a time-out of two minutes.
For your application, setting the right time-out may need some more thinking. You don't want a time-out that causes connections to fail left and right. Sometimes networks take time (as per fallacy number 2: Latency is zero). On the other hand, infinity is a long time. On LANs a few seconds might be good. On the Internet, you should probably think in minutes.
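To make that trade-off a bit more concrete, here is a small sketch that you can run locally. It is only an illustration under assumptions of my own: it uses the JDK's built-in com.sun.net.httpserver.HttpServer as a deliberately slow server, and the class name, the 300 ms delay and both time-out values are invented for the example.

```java
import com.sun.net.httpserver.HttpServer;
import java.io.IOException;
import java.io.InputStream;
import java.io.OutputStream;
import java.net.HttpURLConnection;
import java.net.InetSocketAddress;
import java.net.SocketTimeoutException;
import java.net.URI;
import java.net.URL;
import java.nio.charset.StandardCharsets;

// Hypothetical demo class; names and numbers are illustrative only.
final class TimeoutTradeoff {

    /** Fetches the URL, returning "ok: <body>" or "timed out". */
    static String fetch(URL url, int timeoutMillis) throws IOException {
        HttpURLConnection connection = (HttpURLConnection) url.openConnection();
        connection.setConnectTimeout(timeoutMillis);
        connection.setReadTimeout(timeoutMillis);
        try (InputStream in = connection.getInputStream()) {
            return "ok: " + new String(in.readAllBytes(), StandardCharsets.UTF_8);
        } catch (SocketTimeoutException e) {
            // This is where a real client would enter its error handling.
            return "timed out";
        } finally {
            connection.disconnect();
        }
    }

    /**
     * Starts a local server that answers after delayMillis, then queries it
     * with a long and a short time-out. Returns { longResult, shortResult }.
     */
    static String[] demo(int delayMillis, int longTimeout, int shortTimeout) throws IOException {
        HttpServer server = HttpServer.create(new InetSocketAddress("127.0.0.1", 0), 0);
        server.createContext("/", exchange -> {
            try {
                Thread.sleep(delayMillis); // simulate a slow network or server
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
            byte[] body = "pong".getBytes(StandardCharsets.UTF_8);
            exchange.sendResponseHeaders(200, body.length);
            try (OutputStream out = exchange.getResponseBody()) {
                out.write(body);
            }
        });
        server.start();
        try {
            int port = server.getAddress().getPort();
            URL url = URI.create("http://127.0.0.1:" + port + "/").toURL();
            String patient = fetch(url, longTimeout);
            String impatient = fetch(url, shortTimeout);
            return new String[] { patient, impatient };
        } finally {
            server.stop(0);
        }
    }

    public static void main(String[] args) throws IOException {
        String[] results = demo(300, 5_000, 50);
        System.out.println("5000 ms time-out: " + results[0]);
        System.out.println("  50 ms time-out: " + results[1]);
    }
}
```

The server is perfectly healthy in both cases; it is just slow. The request with the generous time-out gets its answer, while the one with the 50 ms time-out gives up on a connection that would have worked fine. That is the "fail left and right" situation you want to avoid when picking a value.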
I hope this helps.