30-05-2009, 18:27
kjkoster, Forum Operator
Using time-outs on network connections

Dear All,

Recently, we had some issues with probes getting 'stuck'. Further investigation revealed a mismatch between what the probe thought it was doing (sending data and waiting for confirmation from the server) and what the server thought was going on (it had just rebooted and knew of no probes at all).

This falls squarely under the first of Peter Deutsch's Fallacies of Distributed Computing: "The network is reliable."

The problem here was that our probes relied purely on what the network told them about the state of their connection. A probe that is waiting for a reply that will never arrive may never be told that anything is wrong; it simply keeps waiting. As the fallacy argues, the network is not reliable, so we cannot rely solely on the information we get from it.
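To see the 'stuck' state without a real server, the sketch below (the class name `SilentPeerDemo` is hypothetical) connects over loopback to a local socket that never writes anything. The operating system completes the connection even though the server code never calls `accept()`, so from the client's point of view this looks much like a server that will never answer. Without `setSoTimeout`, the `read()` call would block indefinitely; with it, the read gives up with a `SocketTimeoutException`. The 500 ms value is an arbitrary choice for the demo.

```java
import java.io.IOException;
import java.io.InputStream;
import java.net.InetAddress;
import java.net.ServerSocket;
import java.net.Socket;
import java.net.SocketTimeoutException;

public class SilentPeerDemo {
    /** Returns true if the read gave up with a time-out instead of receiving data. */
    static boolean readTimesOut(int timeoutMillis) throws IOException {
        InetAddress loopback = InetAddress.getLoopbackAddress();
        // A local "server" that lets the OS complete the connection but never sends a byte.
        try (ServerSocket silentServer = new ServerSocket(0, 1, loopback);
             Socket client = new Socket(loopback, silentServer.getLocalPort())) {
            client.setSoTimeout(timeoutMillis); // without this, read() below would wait forever
            InputStream in = client.getInputStream();
            try {
                in.read(); // blocks until data, end-of-stream, or the time-out
                return false;
            } catch (SocketTimeoutException e) {
                return true; // the network told us nothing; our own time-out did
            }
        }
    }

    public static void main(String[] args) throws IOException {
        System.out.println(readTimesOut(500) ? "read timed out" : "got data");
    }
}
```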

In our case, the problem was discovered fairly quickly, because some of your probes simply stopped working. In other systems, such problems may be much harder to spot. Suppose the 'stuck' connection is used for background data loading: all you would see is that some minor part of the system quietly stops functioning.

In high-volume systems, a network connection that stops functioning may cause the processes feeding it to back up. They queue the data that should be going out onto the wire, and such queues usually live in memory. The longer the data has to wait, the more memory is needed to hold it. If the data is never sent out onto the wire, it stays in memory until your system runs out of memory and dies.

The problem is easily remedied by setting time-out values on your network connections. The longer the time-out, the more data may pile up in front of a stalled connection before the time-out fires. With an infinite time-out, that queue can grow without bound until memory runs out.
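A back-of-the-envelope bound makes this trade-off concrete: if data keeps arriving at a steady rate while the connection is stalled, the worst-case backlog is roughly rate × message size × time-out. The class name and the load figures below are made up for illustration.

```java
public class BacklogEstimate {
    /** Worst-case bytes queued in front of a stalled connection before its time-out fires. */
    static long worstCaseBacklogBytes(long messagesPerSecond, long bytesPerMessage, long timeoutSeconds) {
        return messagesPerSecond * bytesPerMessage * timeoutSeconds;
    }

    public static void main(String[] args) {
        // Hypothetical load: 1,000 messages per second of 2 KiB each.
        System.out.println(worstCaseBacklogBytes(1_000, 2_048, 120)); // 245760000 bytes, about 234 MiB
        System.out.println(worstCaseBacklogBytes(1_000, 2_048, 5));   // 10240000 bytes, about 10 MiB
    }
}
```

With no time-out at all, the third factor is unbounded, and so is the backlog.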

For the Java-monitor probes, which use HttpURLConnection, we do this by setting both the connect time-out and the read time-out. Instead of getting stuck, the probes go into their normal error-handling cycle and restart automatically.
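A minimal sketch of this approach, assuming a plain `HttpURLConnection` GET. The class name, endpoint URL, retry count and pause are hypothetical, and this is not the actual Java-monitor probe code. Both setters take milliseconds, and the Javadoc defines zero as an infinite time-out; unless something else configures them, that is typically what you get.

```java
import java.io.IOException;
import java.net.HttpURLConnection;
import java.net.URI;
import java.net.URL;

public class TimedHttpProbe {
    // Two minutes, the value mentioned in this post for Java-monitor.
    static final int TIMEOUT_MILLIS = 2 * 60 * 1000;

    /** Creates a connection object with both time-outs set; nothing goes over the network yet. */
    static HttpURLConnection open(URL url) throws IOException {
        HttpURLConnection conn = (HttpURLConnection) url.openConnection();
        conn.setConnectTimeout(TIMEOUT_MILLIS); // limit on establishing the TCP connection
        conn.setReadTimeout(TIMEOUT_MILLIS);    // limit on each blocking read of the response
        return conn;
    }

    /** One round trip. A time-out surfaces as java.net.SocketTimeoutException, an IOException. */
    static int sendOnce(URL url) throws IOException {
        HttpURLConnection conn = open(url);
        try {
            return conn.getResponseCode(); // connects, sends the request, reads the response status
        } finally {
            conn.disconnect();
        }
    }

    public static void main(String[] args) throws Exception {
        URL url = URI.create("http://localhost:8080/probe").toURL(); // hypothetical endpoint
        for (int attempt = 1; attempt <= 3; attempt++) {
            try {
                System.out.println("HTTP status " + sendOnce(url));
                return;
            } catch (IOException e) {
                // Normal error handling: report, pause, then retry on a brand-new connection.
                System.out.println("attempt " + attempt + " failed: " + e);
                Thread.sleep(1_000);
            }
        }
    }
}
```

Because `SocketTimeoutException` is an `IOException`, a time-out lands on the same error path as any other network failure, so a connection that silently died needs no special case.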

The tricky part is choosing the right time-out value. For Java-monitor that is easy: data stays 'fresh' for only a minute or so, and connections live for a few seconds at most. We use a time-out of two minutes.

For your application, choosing the right time-out may take more thought. You don't want a time-out so short that connections fail left and right; networks sometimes need time (fallacy number 2: "Latency is zero"). On the other hand, infinity is a long time. On LANs a few seconds might be fine; on the Internet, you should probably think in minutes.
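One way to keep that choice explicit is to name it per environment. The class name, the two categories and the five-second and two-minute values below are assumptions that loosely follow the rule of thumb above, not recommendations from Java-monitor.

```java
import java.time.Duration;

public class TimeoutChoice {
    // Hypothetical network categories.
    enum Network { LAN, INTERNET }

    /** Assumed read time-outs: seconds on a LAN, minutes across the Internet. */
    static Duration readTimeoutFor(Network network) {
        switch (network) {
            case LAN:
                return Duration.ofSeconds(5);
            case INTERNET:
                return Duration.ofMinutes(2);
            default:
                throw new IllegalArgumentException("unknown network: " + network);
        }
    }

    public static void main(String[] args) {
        for (Network n : Network.values()) {
            // HttpURLConnection.setReadTimeout takes an int number of milliseconds.
            int millis = Math.toIntExact(readTimeoutFor(n).toMillis());
            System.out.println(n + ": " + millis + " ms"); // LAN: 5000 ms, INTERNET: 120000 ms
        }
    }
}
```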

I hope this helps.

Kees Jan