A past post discusses some tips and tricks to employ when encountering a problem. Here we’ll walk through an issue applying some of those techniques. Recently an Asterisk issue came up involving occasional static and/or silence for audio. Broadly speaking the problem can be described as Alice calls Bob (using SIP), Bob answers, the call connects, and then Alice hears white noise, or no audio at all. After some analysis, it was diagnosed to be a problem with DTLS setup initiating, but not fully completing in Asterisk. The developer specific details are below, but I think it’s relevant to first go through how the problem was initially investigated from a user perspective.
Asterisk haz error plz halp!?!
The more information one gives about a problem usually the easier it is to find the solution. Generally, the first place to look are the Asterisk log files. Any warnings, errors, or notices? What about enabling debug and verbose? A SIP call was involved, so any oddity in the SIP trace? Unfortunately in this case there was nothing.
Replication is often paramount, but sometimes can’t be done. Luckily, replication is possible for this issue. Sadly though this particular problem only happens sporadically. But not infrequently enough to be a real problem, so sadness reversed!
Bring the Noise
Alice calls Bob in this instance, and receives an earful of static for her efforts. And it only seems to happen when she calls Bob, and not someone else. Could her phone be the problem? Asterisk? Bob’s phone? Bob himself?!
When dealing with an audio issue it’s always good to rule out transcoding, and usually easy to do. Once that’s done another thing to try is recording the call. Is the problem heard in the recording? In this case it is! That rules out Alice’s endpoint as the problem, so now we only have Asterisk and Bob’s endpoint to deal with.
Bob, here is a WebRTC client. Configuring another WebRTC endpoint similarly to Bob results in the same behavior. That now that excludes Bob as the culprit, which means the problem must be with Asterisk. As well, switching Bob with a non DTLS/SRTP based device does not reproduce the problem.
Developer Time!
At this point it’s safe to say Asterisk sometimes has a problem decrypting RTP packets. Enabling “rtp debug” not only verifies that to be the case, but shows Asterisk is not even trying to decrypt the packet. It’s just passing it along:
Got RTP packet from 10.24.22.255:4062 (type 00, seq 044756, ts 1644329479, len 000160) Sent RTP packet to 10.24.20.47:36685 (via ICE) (type 00, seq 022156, ts 1644329472, len 000160) Got RTP packet from 10.24.20.47:36685 (type 00, seq 016205, ts 2705212720, len 000170) Sent RTP packet to 10.24.22.255:4062 (type 00, seq 022574, ts 2705212720, len 000170)
Hint look at the “len”. 10.24.22.255 is sending Asterisk unencrypted packets as it’s configured to do (It’s a regular non-encrypted SIP endpoint). However you can see Asterisk does not encrypt the packet when sending to the WebRTC client at 10.24.20.47. Similar happens in reverse except Asterisk is receiving and forwarding on the encrypted packet. Otherwise the “len” would change in both examples between receiving and sending.
Since this problem only happens occasionally it most likely means there is a race condition between two threads causing the problem. After instrumenting the code a bit (i.e. adding more log statements) it was narrowed down between the incoming packet read, and ice complete threads.
Basically what’s happening is the packet read thread is in the midst of DTLS handshaking that’s initiated by the client. At the same time ice completes its setup, and in handling that, it too starts the DTLS handshaking process. However, before doing so it resets the underlying SSL object. If the timing is right the packet read thread completes handshaking, but since the SSL object was cleared it doesn’t finalize things. RTP packets are then received, and not decrypted.
Solution
At first is was thought that a flag was needed on the dtls_details object. This flag would indicate that DTLS handshaking had been already initiated or not. Then when ice completed it could check this flag to see if it needed to start the handshaking process. However, after further investigation it was determined that in the ice completed case there was no reason to perform the DTLS setup and reset there. Setup would have already been done, and clearing the underlying SSL object obviously causes problems. So the fix was to simply remove the call to dtls_perform_setup in the on_ice_complete handler.
Hopefully this has been a helpful example of how to walk through and break down an issue. Next time you find yourself having to do similar I do hope some of the ideas discussed here will be of assistance.
One Response
Nice one, thanks