The Story of Asterisk and Keep-Alives
The vast majority of VoIP communication is done via UDP datagrams. UDP is a low-overhead protocol, which makes it fast, and although that also makes it unreliable, the SIP and RTP protocols and our own ears and eyes can tolerate a certain amount of packet loss quite easily. From a signalling perspective, however, there are still cases where TCP’s reliability can make SIP more robust and, when using TLS (sips) over TCP, more secure. This is the subject of this post.
What issues are we trying to solve?
There are two we’re primarily concerned with: keeping a stateful firewall from blocking a connection we’re still using, and knowing whether a peer is dead or alive.
Hey, I’m still using this!
Most firewalls employ connection tracking as part of their defense mechanism, and your home or office network and even your mobile phone are all protected by them. Firewalls that protect servers, like a web server or a PBX, are manually configured to let all connections to a particular port (usually 5061 in our TCP/TLS case) pass through. It’s the firewalls that protect networks or devices where a client resides that we’re most concerned with, since they will not let incoming packets through to a port unless you have already sent traffic from it.
Once the connection tracking mechanism learns that it’s OK to let packets in, it usually also starts a timer of anywhere from a few minutes to a few days. Every time a packet passes through the firewall, the timer is reset, but if no packets are sent or received before the timer expires, the firewall will forget that connection and any further packets will be dropped.
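The timer behaviour described above can be sketched as a toy model. This is only an illustration, not how any particular firewall is implemented; the class name `ConnTracker`, its methods, and the 300-second timeout are all made up for the example.

```python
class ConnTracker:
    """Toy model of a stateful firewall's connection-tracking table."""

    def __init__(self, idle_timeout):
        self.idle_timeout = idle_timeout  # seconds of silence before an entry is forgotten
        self.last_seen = {}               # connection id -> time of the last packet

    def _expire(self, now):
        # Forget every connection that has been idle for the full timeout.
        for conn, seen in list(self.last_seen.items()):
            if now - seen >= self.idle_timeout:
                del self.last_seen[conn]

    def outbound(self, conn, now):
        # Traffic from the protected client creates or refreshes the entry.
        self._expire(now)
        self.last_seen[conn] = now
        return True

    def inbound(self, conn, now):
        # Traffic from outside passes only if the entry still exists,
        # and a packet that passes resets the timer.
        self._expire(now)
        if conn in self.last_seen:
            self.last_seen[conn] = now
            return True
        return False
```

With a 300-second timeout, an inbound packet 200 seconds after the last traffic is let through, while one arriving after 400 seconds of silence is dropped. A small outbound packet every couple of minutes keeps the entry alive indefinitely, which is exactly what a keep-alive is for.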
Are you still there?
The other issue we need to solve is knowing whether a peer is actually still available. Let’s say you’re a road-warrior and have a softphone app on your laptop and you’re using TLS over TCP to connect to your office PBX via either WiFi or a mobile network. When you start your softphone app, it usually opens a TCP/TLS connection to the PBX (opening the firewall) and sends a REGISTER letting the PBX know you’re available and where. Once that exchange happens, the connection is usually idle.
The TCP protocol is such that you don’t know if the connection is still alive until you actually try to send data. So let’s say that there’s a network issue somewhere that’s preventing packets from flowing. As long as the connection is idle, no one is the wiser, but now the PBX needs to send you a call. It tries to use the connection it already knows about and fails because, when it sends the INVITE to your softphone, there’s no TCP acknowledgement, and the call winds up in voicemail. Because the softphone never tried to send any data, it has no idea that it’s not really connected to the PBX any more, and you keep missing calls.
Keep-alives are just a way to keep a TCP connection active. By sending small periodic packets, we can keep the firewall’s connection tracking mechanism from forgetting the connection and if we send a keep-alive and we don’t get an acknowledgement, we’ll know immediately that we’ve lost contact with the peer and can try to re-establish the connection. Limiting ourselves to Asterisk’s chan_pjsip channel driver, we have several ways to implement keep-alives.
OPTIONS messages are full SIP messages that a user agent can send to a peer and expect to get a standard SIP response. You can configure Asterisk to send these messages to a peer by setting the “qualify_frequency” parameter in the peer’s aor object. At that interval, Asterisk will send the OPTIONS and, if it gets a response, will mark the peer’s contact as available and record the round trip time. If not, the contact is marked as unavailable and, in the case of your softphone, Asterisk removes its registration and closes the connection. Since OPTIONS is a SIP-level mechanism, it’s not without overhead. The full SIP stack in both peers has to be invoked and the message itself can be a few hundred bytes, all of which, along with the response, has to be encrypted if you’re using TLS.
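To give a feel for the size of an OPTIONS request, here is a minimal sketch that assembles one by hand. It is not what Asterisk or pjproject actually emit (real requests usually carry more headers, such as User-Agent). The function name `build_options` and all of the addresses, tags and identifiers are placeholders invented for the example.

```python
def build_options(peer_uri, local_uri, local_host, call_id, branch, tag, cseq=1):
    """Assemble a bare-bones SIP OPTIONS request as bytes (illustrative only)."""
    lines = [
        f"OPTIONS {peer_uri} SIP/2.0",
        # RFC 3261 branch values start with the magic cookie "z9hG4bK".
        f"Via: SIP/2.0/TLS {local_host};branch=z9hG4bK{branch}",
        "Max-Forwards: 70",
        f"From: <{local_uri}>;tag={tag}",
        f"To: <{peer_uri}>",
        f"Call-ID: {call_id}",
        f"CSeq: {cseq} OPTIONS",
        f"Contact: <{local_uri}>",
        "Content-Length: 0",
    ]
    # Headers are CRLF-terminated and a blank line ends the header section.
    return ("\r\n".join(lines) + "\r\n\r\n").encode("ascii")


request = build_options(
    peer_uri="sips:alice@client.example.com",
    local_uri="sips:pbx@pbx.example.com",
    local_host="pbx.example.com:5061",
    call_id="a84b4c76e66710",
    branch="776asdhds",
    tag="1928301774",
)
print(len(request), "bytes before TLS framing")
```

Even this stripped-down request is far larger than a two-byte keep-alive, and the peer has to parse it and build a similarly sized response.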
Short REGISTER timeouts
The maximum expiration for registrations can be set fairly low to force peers to re-register often. Like OPTIONS, this involves the full SIP stack and encryption, plus authentication and possibly subscription re-establishment. Not cheap.
The res_pjsip_transport_management module can also send keep-alives in the form of a short two-byte packet containing just a CRLF sequence with no SIP overhead. It’s configured globally in the “global” section of the pjsip.conf file. The peer probably won’t respond because there’s nothing in the packet except the CRLF, but the packet will keep the connection open and, if the packet can’t be delivered, TCP will let us know. Even containing just the CRLF, the packet still has to be encrypted though. It’s also code we have to maintain.
As you probably know, we use pjproject as Asterisk’s underlying SIP protocol engine and it has its own keep-alive mechanism. At the time we adopted pjproject, its keep-alives could only be configured at compile time, which is why we created our own. That’s changed, but the implementation is exactly like our own and has the same requirement that the packet be encrypted even though it’s only two bytes. The code also has to be maintained AND if keep-alives are accidentally turned on in BOTH pjproject and Asterisk, deadlocks can occur.
The TCP protocol has its own keep-alive implementation which is managed by the operating system’s kernel. No code for us to maintain, no deadlocks, AND because it’s done at the TCP level, these keep-alives don’t use resources for encryption. At the present time, Asterisk does not implement TCP keep-alives.
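The idea of a CRLF keep-alive can be sketched at the socket level as follows. This is not the module’s actual code; `send_keepalive` is a hypothetical helper, and for a TLS connection the socket would be the TLS-wrapped one, so the two bytes would be encrypted on the way out.

```python
import socket

CRLF_KEEPALIVE = b"\r\n"  # two bytes, no SIP headers at all


def send_keepalive(sock):
    """Write a bare CRLF to an established connection.

    Returns False if the operating system reports an error. Note that a
    send can succeed locally even when the peer is gone; in that case TCP
    reports the failure on a later send, once retransmissions give up or
    a reset arrives.
    """
    try:
        sock.sendall(CRLF_KEEPALIVE)
        return True
    except OSError:
        return False
```

A caller would invoke this from a periodic timer on each idle connection and tear the connection down when it returns False.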
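For readers who haven’t used them, this is roughly what turning on TCP keep-alives looks like from an application’s point of view. It is a sketch using Python’s standard socket module, not a preview of how Asterisk will expose them. The helper name `enable_tcp_keepalive` and the timing values are arbitrary, and the three tuning options are platform specific, so the code skips any that the local platform doesn’t define.

```python
import socket


def enable_tcp_keepalive(sock, idle=60, interval=10, count=5):
    """Ask the kernel to probe an idle TCP connection on our behalf.

    idle:     seconds of silence before the first probe
    interval: seconds between unanswered probes
    count:    unanswered probes before the connection is declared dead
    Returns the tuning options that were actually applied.
    """
    sock.setsockopt(socket.SOL_SOCKET, socket.SO_KEEPALIVE, 1)
    applied = {}
    for name, value in (("TCP_KEEPIDLE", idle),
                        ("TCP_KEEPINTVL", interval),
                        ("TCP_KEEPCNT", count)):
        option = getattr(socket, name, None)  # not defined on every platform
        if option is not None:
            sock.setsockopt(socket.IPPROTO_TCP, option, value)
            applied[name] = value
    return applied


sock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
print(enable_tcp_keepalive(sock))
sock.close()
```

Once the option is set, the probes are generated and answered entirely by the two kernels. The application, and therefore the TLS layer, never sees them, which is why no encryption work is involved.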
The Bottom Line
Are YOU still there? It’s a long read for a blog post I know, but we’re just about done. With all of those options, what’s the plan forward?
OPTIONS messages are part of the SIP specification and can be used over any transport protocol so they stay available. As for REGISTER, we can’t very well turn those off but they probably shouldn’t be used as keep-alives. Asterisk and pjproject keep-alives have issues of their own. That leaves us with TCP keep-alives. They offer the best solution to the problems we’re trying to solve and it’s our plan to replace both Asterisk and pjproject keep-alives with them. You’ll be able to configure them on a transport-by-transport basis and we’ll let the operating system do the work. This way, if you’ve been using OPTIONS or REGISTER to keep TCP connections alive, you can probably decrease the frequency at which they are performed and save some resources.