I’ve been working on several things since my last post. After all, the work of a wise developer never ends (for job-security reasons). The team and I had to evaluate some new RFCs for effort estimations, we had to read some PRDs and MRDs, and of course produce our share of documents to be added to the general pile of documents that no one ever reads once they get approved.
While doing all of that, from time to time, we still need to address issues that we receive from our Customer Support department, cases which they cannot resolve without help from the R&D boys and girls. Surprisingly enough, one of those boys is I.
Last week Ran shared a case about TCP connection closing, and so I couldn’t do without sharing an interesting case of my own.
A bit of history
The case we have at hand is a simple case of type conversion. I’m assuming we all know what that is. We sometimes need to convert a 32-bit integer to 16 bits, or from unsigned long to a signed one, and while doing it, we need to be very careful. As the case of Ariane 5 shows, type conversion can sometimes blow up in your face.
The moral of the story is that if you’re going to use the same piece of software written for your old product (let’s say, the Ariane 4 missile) in your new product (Ariane 5, for instance) as well, you should take into consideration that the new product is most likely to be different.
In this case, the latter was much faster, so some velocity values that were never achieved on the previous version were reached on the new one, including a 32-bit integer value that was later converted to a 16-bit one. Of course, some part of the value was cut off by the conversion, leading to a crash – not a code crash, but of a missile crash. OUCH!
Bottom line: when stepping into that field, you aren’t alone. Giants walked there before you, and crashed.
A bit of SIP
SIP has a nagging habit of retransmissions. Instead of warning you that you’re dialing your mother, and by the end of it, you will surely feel guilty of something, it actually insists on fulfilling your wish. Therefore, it sends an INVITE message, and waits half a second for a reply. If none is received, it sends another one, but this time waits a whole second. Then 2, then 4, then it continues trying every 4 seconds, until 32 seconds have passed since the initial INVITE. During all that time it never stops to ask, “Are you sure you are calling your mother?”
Here’s the “official” documentation for it from RFC 3261, section18.104.22.168:
The 2xx response is passed to the transport with an interval that starts at T1 seconds and doubles for each retransmission until it reaches T2 seconds (T1 and T2 are defined in Section 17). Response retransmissions cease when an ACK request for the response is received.
The case in hand
So we got a call from someone in Customer Support. A customer complained that the last INVITE retransmission was sent after 487 milliseconds instead of 4 seconds. In fact, the last INVITE they experienced should have never been sent, as 32 seconds already passed from the first INVITE. Luckily for us, the customer also supplied a log file with the full details. All our toolkits can generate such logs, which helps us a lot during debugging. It lists API calls, message details and lots of other stuff. For each entry, there is also a timestamp, which is also a huge help.
Here’s the gestalt (edited for your reading pleasure):
19:47:35.238112 Sent 200 OK
19:47:35.740140 Sent 200 OK (retransmission)
19:47:36.741197 Sent 200 OK (retransmission)
19:47:38.742312 Sent 200 OK (retransmission)
19:47:58.748456 Sent 200 OK (retransmission)
19:48:02.749685 Sent 200 OK (retransmission)
19:48:06.750914 Sent 200 OK (retransmission)
19:48:07.239942 Sent 200 OK (retransmission that should not happen)
The Drill down
When a timer expires, our code reports it using a callback. The callback checks what time it is, what should happen next, and if needed sets up a new timer. We use 64-bit variables to measure time in nanoseconds. However, our timer APIs, use 32-bit variables (that represent milliseconds). This means that when the timer is triggered, we get its current value in milliseconds, and then convert it to nanoseconds.
So before setting a new timer, we take the value in nanoseconds, and divide them by a million to convert them to milliseconds. Therefore, our timer callback checks the time and calculates the interval for the next timer, if any. After all, if 31 seconds have passed since the first INVITE, there is no point in setting a 4 second timer. When a timer is set, it should also compensate for processing time and all kinds of other stuff being done between timer sets.
You might ask yourself what is the problem with converting 4 seconds, in their nanoseconds representation, to milliseconds. There is none. But what if, due to some compensation, we need to convert an ugly number such as 3,999,999,999 nanoseconds to milliseconds? We end up with 3,999 milliseconds, losing a lot of precision – almost a whole millisecond. Then, when the timer is expired and the callback is called, all the calculations are inaccurate.
This was indeed the case here. The code should have realized that 32 seconds have already passed since the first INVITE, but due to inaccurate calculations, it concluded that another retransmission should be sent. That, and all the compensations that we were doing, yielded the last unneeded message to be sent after 487 milliseconds.
The funny thing is (at least, for those who think those things are funny) that we had all the information we needed to avoid this in the first place. After all, we had the nanosecond values. We just used the wrong numbers in a wrong way – we used the converted milliseconds value instead.
Needless to say, the fix was a single change in a single line of code, using the right numbers in the right way.
I remember reading an excellent and recommended book from MS-press, “Writing Solid Code“. While not addressing this problem specifically, it does deal a lot with type conversions and the funny crashes you will experience if being light headed about it.
Lucky for us, we’re not building missiles… yet.