In the past month I’ve been fixing an issue with Asterisk and PJSIP that I thought would be fun to share in a blog post. The originally filed issue was for a crash experienced when Asterisk was manipulating the reference count of a PJSIP INVITE session. For those who may be unaware, the INVITE session API in PJSIP is used by Asterisk for placing and receiving calls. It takes care of some of the lower-level SIP work required to do so and implements call-related functionality that we gain for free (such as SIP session timers, SDP negotiation, and more). When originally deciding how to use PJSIP, we chose to leverage the INVITE session API instead of implementing that functionality ourselves.
Upon investigation I determined that the INVITE session was not guaranteed to be valid at the time it was used, because we were not holding our own reference to it. We were relying on the INVITE session to manage its own lifetime, which would not always match ours. In fact, some code had previously gone into chan_pjsip to solve this in some circumstances, but it did not address the core issue. The fix was relatively easy: keep a reference to the INVITE session for as long as we need it. I made the change, tested it, and put it up for review. After review the change went through gates and things broke. A lot.
So what did I miss? I hadn’t built with the MALLOC_DEBUG debugging option enabled. This option ensures that freed memory is overwritten with a specific value, so that accessing it afterwards will likely cause a crash. This uncovered an ordering issue in my fix and something I hadn’t known previously: PJSIP allocates an INVITE session from the memory of the dialog with which it is associated. I was removing my reference to the dialog before removing my reference to the INVITE session, so the INVITE session memory could be freed while I still had to touch it. I swapped the ordering and the issue was resolved! In the end the fix removed more code than it added, but it was a journey to get there.
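To make the ordering problem concrete, here is a minimal sketch in plain C. It does not use the real PJSIP or Asterisk APIs: `toy_dialog`, `toy_session`, `teardown`, and the poison byte are all hypothetical stand-ins, and the "poisoning" only imitates what a debug allocator might do to freed memory. The one property it models is that the session lives inside memory owned by the dialog.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical poison value; a debug allocator would pick its own. */
#define POISON_BYTE 0xDE

/* Toy stand-in for an INVITE session (not the real PJSIP type). */
struct toy_session {
    int refcount;
};

/* Toy stand-in for a dialog. The session is stored inside the
 * dialog's own memory, mirroring the ownership described above. */
struct toy_dialog {
    int refcount;
    struct toy_session session;
};

static struct toy_dialog *toy_dialog_create(void)
{
    struct toy_dialog *dlg = calloc(1, sizeof(*dlg));
    if (dlg) {
        dlg->refcount = 1;         /* our reference to the dialog */
        dlg->session.refcount = 1; /* our reference to the session */
    }
    return dlg;
}

/* Returns 1 if every byte of the session holds the poison value. */
static int toy_session_is_poisoned(const struct toy_session *s)
{
    const unsigned char *p = (const unsigned char *)s;
    for (size_t i = 0; i < sizeof(*s); i++) {
        if (p[i] != POISON_BYTE) {
            return 0;
        }
    }
    return 1;
}

/* Dropping the last dialog reference releases the memory the session
 * lives in. The toy only poisons it, so it can be inspected safely. */
static void toy_dialog_unref(struct toy_dialog *dlg)
{
    assert(dlg->refcount > 0);
    if (--dlg->refcount == 0) {
        memset(&dlg->session, POISON_BYTE, sizeof(dlg->session));
    }
}

/* Returns 0 on success, -1 if the session memory is already gone.
 * A real build would likely crash here instead of returning -1. */
static int toy_session_unref(struct toy_session *s)
{
    if (toy_session_is_poisoned(s)) {
        return -1;
    }
    s->refcount--;
    return 0;
}

/* dialog_first != 0 models the buggy order, 0 models the fixed order.
 * Returns the result of the session unref, or -2 on allocation failure. */
int teardown(int dialog_first)
{
    struct toy_dialog *dlg = toy_dialog_create();
    int res;

    if (!dlg) {
        return -2;
    }
    if (dialog_first) {
        toy_dialog_unref(dlg);
        res = toy_session_unref(&dlg->session);
    } else {
        res = toy_session_unref(&dlg->session);
        toy_dialog_unref(dlg);
    }
    free(dlg); /* the toy frees the backing struct only at the very end */
    return res;
}
```

In this toy, `teardown(1)` reports the failure (the session is touched after its memory was poisoned), while `teardown(0)` succeeds. Without poisoning, the buggy order can appear to work because the stale bytes still look valid, which is roughly why a build without MALLOC_DEBUG did not show the problem.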