Story hour came early this week, when a colleague stuck his head in my office and asked why we wait before reusing an address on Windows Server 2003. So began a most harrowing tale, the strange case of the mysterious closing connections.
It came from the beyond
Once upon a time, in a protocol stack far, far away, H.245 sockets were dying inexplicably. Packet captures on each side showed the other side closing the connection immediately after it was opened, causing the call to fail. It only happened under high load, and after further inquiry we discovered that it only happened when a socket was opened immediately after a socket with the same address was closed. Could it be that the socket was communicating with the dead?
Where do dead sockets go?
If they were good sockets, they go to socket heaven, also known as the TIME_WAIT state. In socket heaven, they wait until they are reborn. In Linux socket heaven, they wait 60 seconds. In Windows socket heaven, they wait 120 or 240 seconds, depending on the specific version. In socket terms, these times are an eternity, two eternities and four eternities, respectively. The socket’s address cannot be used during this time, but there is a shortcut: if the socket is set to reuse the address, the address can be used again instantly.
Here’s the catch: if a packet from the old connection is delayed in the network, the new socket using the same addresses treats it as a packet meant for it. If that packet was meant to terminate the dead connection (for instance, because both sides terminated communications at the same time), the new connection is terminated instead. After close examination, we confirmed that this is what was happening in our case: both sides closed the H.245 connection at the same time, and then one of them reused the address and tried to reconnect. It immediately received the close message left over from the previous connection, which closed the new connection and, with it, the call. We had to disable the ‘reuse address’ option, which forces the address to wait out TIME_WAIT, which in turn can cause the ports to run out under high load.
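For readers who have not met it, the ‘reuse address’ option is the SO_REUSEADDR socket option in the Berkeley sockets API. The sketch below is only an illustration in Python, not the code from our stack; the helper name open_listener is made up for the example, and the exact semantics of SO_REUSEADDR differ somewhat between Windows and Unix-like systems.

```python
import socket

def open_listener(port=0, reuse_address=False):
    """Bind a TCP listener on localhost; port=0 lets the OS pick a free port.

    open_listener is a hypothetical helper for illustration only.
    """
    sock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    # SO_REUSEADDR is the 'reuse address' shortcut: with it enabled, bind()
    # may succeed even while an old connection on the same address is still
    # sitting in TIME_WAIT. Leaving it off makes the address wait its turn.
    sock.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEADDR,
                    1 if reuse_address else 0)
    sock.bind(("127.0.0.1", port))
    sock.listen(1)
    return sock

if __name__ == "__main__":
    safe = open_listener(reuse_address=False)
    print("reuse enabled:",
          safe.getsockopt(socket.SOL_SOCKET, socket.SO_REUSEADDR) != 0)
    safe.close()
```

Leaving the option off is the safe choice described above; the price is that each closed address stays out of circulation for the full TIME_WAIT period.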
So what can be done to solve this issue?
1. Increase the size of heaven
Earlier versions of Windows used a default dynamic port range of 1025 through 5000, which isn’t nearly enough. In Windows Vista and in Windows Server 2008, the new default start port is 49152, and the default end port is 65535, which should last until the ports are available again under any reasonable traffic. On Linux the default range is usually larger than the old Windows one, but it depends on the exact kernel used. There is a nice guide here for increasing the port range on any system.
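To get a feel for what "enough" means, here is a back-of-the-envelope estimate in Python. It assumes, as a simplification, that every new connection takes a fresh port from the range and that each port then sits in TIME_WAIT for the full period, so the range divided by the wait time gives a rough ceiling on the sustainable connection rate. The function name is made up for the example.

```python
def sustainable_connection_rate(first_port, last_port, time_wait_seconds):
    """Rough upper bound on new connections per second before ports run out.

    Simplifying assumption: each connection consumes one port from the range,
    and that port is unavailable for the whole TIME_WAIT period afterwards.
    """
    port_count = last_port - first_port + 1  # the range is inclusive
    return port_count / time_wait_seconds

if __name__ == "__main__":
    # The Vista / Server 2008 default range from the text, against the
    # TIME_WAIT values mentioned earlier (60, 120 and 240 seconds).
    for wait in (60, 120, 240):
        rate = sustainable_connection_rate(49152, 65535, wait)
        print(f"TIME_WAIT {wait:>3}s: about {rate:.0f} new connections per second")
```

With the newer range of 16,384 ports and a 120 second wait, that works out to roughly 137 new connections per second. The older range of about 4,000 ports supports only around a quarter of that.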
2. Limit the time in heaven
The fact is that 60 seconds is a lot of time, not to mention 120 or 240 seconds. I recommend configuring Windows to a shorter delay (here), or tuning your Unix/Linux system (here and here). In fact, let me Google that for you. I see no reason to use anything above 60 seconds, and even 30 seconds seems like an eternity in network terms, unless you expect to work in a high-latency network, in which case just take the highest latency expected and double it.
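As a sketch of that rule of thumb on Windows, assuming the setting in question is the TcpTimedWaitDelay registry value (which, as far as I know, accepts 30 to 300 seconds and needs a reboot to take effect), the Python below doubles the highest expected latency, clamps it to that range, and builds the matching reg add command. It only prints the command and does not touch the registry; the function names are made up for the example.

```python
import math

# Assumed valid range for TcpTimedWaitDelay, in seconds.
MIN_DELAY, MAX_DELAY = 30, 300

def time_wait_delay(highest_latency_seconds):
    """Double the highest expected latency, clamped to the allowed range."""
    doubled = math.ceil(2 * highest_latency_seconds)
    return max(MIN_DELAY, min(MAX_DELAY, doubled))

def reg_command(delay_seconds):
    """Build (but do not run) the command that would set the delay."""
    return (r'reg add "HKLM\SYSTEM\CurrentControlSet\Services\Tcpip\Parameters" '
            f'/v TcpTimedWaitDelay /t REG_DWORD /d {delay_seconds} /f')

if __name__ == "__main__":
    # Hypothetical worst-case latency of 45 seconds on a very slow link.
    print(reg_command(time_wait_delay(45)))
```

For any ordinary network the doubled latency falls well under 30 seconds, so the result is simply the 30 second minimum.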