In this post, we’re going to take a deep dive into a bug that was reported against the 13.9.0 version of Asterisk shortly after its release. This blog entry will be written in a way so that people unfamiliar with VoIP or the Asterisk project can read about this issue and maybe learn a thing or two from this bug.
To start, I would like to give huge thanks to Rob Thomas from Sangoma for helping significantly on this issue by providing me with a test server that was exhibiting this problem. In addition, he also did the work to bisect to determine the commits that introduced the bug.
The Bug Report
The issue report can be found here.
SIP Clients were registering contacts with Asterisk with a certain expiration. Asterisk was then deleting those contacts before they expired. Some reports said that Asterisk was deleting the contact 15-30 seconds before the expiration. Other reports said that Asterisk was deleting the contact nearly instantly after creating it. In the gap between when Asterisk would delete the contact and the client would re-register, that client could receive no calls.
I’m going to gloss over some of the initial details of the bug investigation because they led to dead ends.
As it turned out, there were actually two factors leading to this issue. The first was the one that Rob helped me out with by pin-pointing a commit that caused the problem. As it turned out, between the release of Asterisk 13.9.0 and when I started working on the issue, the issue had apparently been fixed. Rob helped me to find that there was a bug in generating in-memory contact objects based on data stored in the internal sqlite storage backend.
However, there was a second, much more interesting problem that I ran into. I had already determined during my investigation that Asterisk was storing contacts properly in sqlite and had configured the proper expiration time. The problem lay in the code that expired those contacts. Here is a summary of the algorithm used to expire contacts.
- A thread sleeps for a configurable period (default 30 seconds).
- Wake up.
- Search sqlite for all contacts whose expiration timestamp is now or prior to now.
- For each matching contact found, create an in-memory contact object.
- Perform one more sanity check using the created object to ensure we are actually beyond the expiration.
- Delete the contact from sqlite.
Rob had helped to find that step 4 had a bug in it and that we were not properly creating the in-memory contact. He discovered that the bug was introduced in this commit.He also discovered that this problem ended up getting fixed by this commit. If you are curious about what parts of those commits are relevant, look for the changes to the res/res_sorcery_astdb.c. I will leave it as an exercise to the reader to see what was going wrong there.
Once step 4 was fixed, the sanity check in step 5 prevented the contacts from being deleted prematurely.
However, I was wondering why we were ever progressing past step 3. We were storing the contact expiration timestamp in sqlite properly, so why were we thinking that it was time for it to be expired?
A Deeper Look
In order to reproduce the issue, I had set the configurable thread sleeping period in step 1 to 5 seconds. This meant that trying to use a debugger to step through would be a pain since it would potentially break every 5 seconds, and we could not necessarily guarantee we would see the problem happen each time we hit the break point. So instead, I peppered some debug messages in the code and waited to see what happened. And that’s when I saw it. The following line of code is in a function called ast_strings_match():
scan_numeric = (sscanf(left, "%f", &left_num) && sscanf(internal_right, "%f", &right_num));
In case you are unfamiliar with sscanf, it takes a string and converts the contents to some other type. In this case, the “%f” says to try to convert the value to a single-precision floating point value. In this particular case, the variables here are as follows:
- left: The string representation of the contact’s expiration timestamp
- left_num: The converted floating point representation of the contact’s expiration timestame
- internal_right: The string representation of the current timestamp
- right_num: The converted floating point representation of the current timestamp
I added a debug message right after that to see what the result of the conversion was.
[May 19 11:37:49] NOTICE: strings.c:302 ast_strings_match: Left is 1463675925, left_num is 1463675904.000000. internal_right is 1463675869, right_num is 1463675904.000000
WHAT?! That conversion is WAY off!
As you can see, the current timestamp is 1463675869, and the expiration is 1463675925. That means that the expiration is still 56 seconds away. But yet both string representations ended up being converted to the same floating point number. And so the contact was expired long before it was supposed to.
I initially had no idea what to think here. Was there some bug in sscanf? In glibc? Have I forgotten the range of numbers in floating point? How could the conversion to float be that badly off?
An Unlikely Tool
After exhausting my Google-fu trying to find out what might be going on here, I decided to open an unlikely tool: the calculator. I typed the contact expiration timestamp into the calculator
Since I had the calculator in programming mode, the binary representation of the number is shown. Since the conversion seemed to have a loss of precision, I decided to start zeroing the least significant bits. I went one-by-one until I finally had this:
Bingo! It looks like the conversion to floating point is just losing some binary precision. With that, I dusted off my knowledge of floating point numbers to try to get to the bottom of this.
A Review of Floating Point Numbers
For those of you unfamiliar with IEEE-754 single-precision floating point numbers, they’re mapped out as follows:
- Bit 31 is the sign bit. If 0, the number is positive. If 1, the number is negative. We’re going to ignore this for the most part since we are dealing with positive values only.
- Bits 30-23 are the exponent. The number here has 127 subtracted from it in order to determine the exponent of the floating point number.
- Bits 22-0 are the significand. These represent the binary numbers after the decimal point, with an implied 1 preceding the decimal point.
Floating point is a binary form of scientific notation. We might choose to represent the decimal number 12345 as 1.2345 X 104 for instance. With floating point, we do the same thing, except that we represent the number times a power of 2 instead of a power of 10.
The way to read a floating point number is, 1.[significand] X 2exponent-127. As an example, if the exponent were 127, and the significand was 1, then that would be read as
1.1 X 2127-127 or 1.1 X 20 or 1.1, which when converted to decimal is 1.5.
Applying the Knowledge
So let’s examine those conversions again. Let’s start with 1463675925 and see if we can naively come up with a way to represent it as a floating point number. In binary, it is:
101 0111 0011 1101 1110 1100 0001 0101
Now, if we pretend there is a decimal point at the very end of that number, we can move that decimal point to the left one place and we end up with:
101 0111 0011 1101 1110 1100 0001 010.1 x 21
We can continue moving the decimal to the left until we have only a single 1 on the left side of it. We end up with:
1.01 0111 0011 1101 1110 1100 0001 0101 x 230
That looks a lot like how floating point numbers are represented! But, there’s a problem here. Count the number of bits after the decimal place. There are 30. In a single-precision float, we can only have 23 bits in the significand. So the algorithm being used by sscanf appears to just chop off the 7 least significant bits, leaving
1.01 0111 0011 1101 1110 1100 0 x 230
And that, when calculated out and converted to decimal, is 1463675904. That’s what we saw in the log message!
That makes sense, but using that same method won’t work for the current timestamp’s conversion. 1463675869 is less than 1463675904. Chopping bits off the end of it will not make the number go up. Well, let’s take a look at the binary representation of 1463675869. It is:
101 0111 0011 1101 1110 1011 1101 1101
If you use the same method as before for converting to float, you first move the decimal over to the most significant 1:
1.01 0111 0011 1101 1110 1011 1101 1101 X 230
Now this time, instead of just chopping off the final 7 bits, let’s inspect them:
That’s a fairly high number. In fact, since its most significant bit is 1, that means it is more than half-way between 0 and the maximum value that can be expressed in 7 bits. So maybe what’s happening here is that rather than just chopping these bits off when converting to floating point, the conversion is rounding up. If that happens, then that means you end up with
1.01 0111 0011 1101 1110 1100 0 x 230
That is exactly the same as what we saw with the previous conversion! We at least have some sort of logical basis for why these sorts of conversions might be happening.
I once again started performing searches trying to find if there was any evidence to support my findings. I found a document called “What Every Computer Scientist Should Know About Floating-Point Arithmetic“. It is the longest, most boring thing I think I have ever seen. It would not surprise me if the rounding logic were explained somewhere in that paper, although the way it expresses topics, I’d probably read right past it and not realize what I had just read.
Searching further, I found the Wikipedia page on single-precision floating point format. On that page, there is a section called “Precision limits on integer values” that sheds some light on what I was seeing. To paraphrase and to emphasize only the non-negative values,
Integers in the range from 0 to 224 can be represented exactly.
Integers in the range from 224+1 to 225 can be rounded to the nearest multiple of 2 (i.e. the nearest multiple of 21).
Integers in the range from 225+1 to 226 can be rounded to the nearest multiple of 4 (i.e. the nearest multiple of 22).
Following this pattern, we can say that, for every n > 24:
Integers in the range from 2n-1+1 to 2n can be rounded to the nearest multiple of 2n-24
Our timestamps were in the range of 230+1 to 231. This means that the values in that range could be rounded to the nearest multiple of 231-24, or 27, or 128. This lines up exactly with the exercise above in which we used the 7 least significant bits to either round up or down.
Applying this knowledge to our problem
So we now know that given the range of our timestamps, we can expect the value to be rounded to the nearest multiple of 128. The worst cases happen when the expiration contact rounds down and the current timestamp rounds up. In that case, we end up expiring the contact much earlier than expected. The absolute worst case is when the contact is set to expire 63 seconds after the nearest multiple, and the current timestamp is 64 seconds before the current multiple, meaning that we would expire the contact 127 seconds early. Consider what would happen in a busy call center if an agent is unreachable by phone for two minutes…
Furthermore, while the bug that exposed the problem had to do with contact expiration timestamps, the faulty line in question is in a core routine that could be used for nearly anything, so it is vital to get it fixed. Given that this problem had manifested only in the recent 13.9.0 release, we decided that a regression release of 13.9.1 was in order.
The fix for this problem was pretty straight-forward. All we needed was more precision in integer conversions, so we switched from using single-precision floating point numbers to double-precision floating point numbers. Depending on the implementation, you can represent any integer up to 252 or even 264 exactly. With this fix in place, the values were converted precisely, and the issue was now completely gone.
You can view the commit that fixed the issue here
Always be sure you know what you are doing when using floating point numbers. This applies extra when you are using single-precision floating point numbers. I had it drilled into me ever since being introduced to floating point numbers that they require special treatment. However, every lesson I learned was either
- Numbers that are trivial to represent in decimal are not possible to represent in floating point. For instance, 0.3 is simple in decimal, but becomes a repeating infinite decimal in binary, resulting in an imprecise floating point representation.
- Floating point arithmetic, especially equality checks, can result in unexpected results. This means you should always build in tolerances to your floating point checks.
And while I had been shown very clearly that double-precision floats could represent more decimal places than single-precision floats, it never once had occurred to me or been taught to me that some integers might not be representable. I had always thought that since the range of single-precision floats was from -2128 to 2128 that every integer within that range was representable. It obviously makes sense now why that is the case now that I have thought about it more, but I’m a bit surprised that in my years of taking computer science courses and working professionally as a software developer, I never once came across this.