Hi folks! Today, let’s take a look at a recent Asterisk issue: ASTERISK-29365.
The graceful shutdown is intended to give every part of the running Asterisk process a chance to shut down, rather than being abruptly halted. The intent is that, given that chance, each running task should be able to de-allocate all of its memory and halt successfully. As taskprocessors are expected to do exactly this kind of thing, their design must adhere to this requirement.
So must the taskprocessor mechanism itself.
In the case of ASTERISK-29365, the taskprocessor mechanism's shutdown process created a race condition with any still-running taskprocessors that could result in a NULL reference on shutdown. As we are trying to shut down gracefully precisely in order to demonstrate that we can, this is quite annoying.
So why was this happening?
Part of the taskprocessor mechanism is a static ao2 container, tps_singletons:
/*! \brief tps_singletons is the astobj2 container for taskprocessor singletons */
static struct ao2_container *tps_singletons;
This container holds references to all running taskprocessors, which are linked to the container upon allocation. As each taskprocessor runs and completes tasks, it checks how many references to itself still exist. Once we get down to the final two (the container and the listener that operates the taskprocessor), the taskprocessor unlinks itself from the container and signals the listener to shut down.
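A simplified sketch of that reference check, paraphrasing the logic in main/taskprocessor.c (the exact threshold, locking, and helper names such as listener_shutdown() differ in the real source), might look like this:

void *ast_taskprocessor_unreference(struct ast_taskprocessor *tps)
{
	if (!tps) {
		return NULL;
	}

	/* ao2_ref() returns the count before the decrement. If anything beyond
	 * our caller, the container, and the listener still holds a reference,
	 * the taskprocessor keeps running. */
	if (ao2_ref(tps, -1) > 3) {
		return NULL;
	}

	/* Only the container and the listener are left: unlink from
	 * tps_singletons and tell the listener to shut the thread down. */
	ao2_unlink(tps_singletons, tps);
	listener_shutdown(tps->listener);

	return NULL;
}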
As each taskprocessor handles its own shutdown, so does the taskprocessor mechanism. To help with this, Asterisk provides a mechanism for registering a callback to be run at shutdown. The taskprocessor mechanism takes advantage of this at initialization, registering a callback for when Asterisk is ready to shut down.
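That registration uses Asterisk's ast_register_cleanup() API; inside the taskprocessor's init routine it amounts to roughly this (simplified, with the unrelated setup elided):

int ast_tps_init(void)
{
	/* ... allocate the tps_singletons container, register CLI commands, etc. ... */

	/* Ask the core to call us back once Asterisk is ready to shut down,
	 * so the mechanism can clean up after itself. */
	ast_register_cleanup(tps_shutdown);

	return 0;
}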
What was happening in this case is that the callback to the registered shutdown function:
static void tps_shutdown(void)
…was cleaning up its reference to the tps_singletons container (as it should) without checking whether any taskprocessors were still running with references to it (which it should have, but didn't). If one of those taskprocessors was still around and attempted to unlink itself from the container after the tps_shutdown callback cleared it, the result was a bad ao2 reference and we weren't able to shut down gracefully, making everyone very upset.
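Stripped of unrelated cleanup, the pre-fix behaviour was essentially the following (a simplified reconstruction, not the verbatim function):

static void tps_shutdown(void)
{
	/* Drop the mechanism's own reference to the container. If a
	 * taskprocessor thread is still winding down and later tries to
	 * ao2_unlink() itself from tps_singletons, it is now operating on a
	 * destroyed object, i.e. a bad ao2 reference. */
	ao2_t_ref(tps_singletons, -1, "Unref tps_singletons in shutdown");
	tps_singletons = NULL;
}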
So in this case we could do a few things. We could check that the container still exists and, if not, abort, but then we would have to check on each and every call, as the container could be removed at any time. We could go and find all taskprocessors and manually kill them, but then their data might not get cleaned up properly if there were still tasks in their queues.
So in this case, we chose to give the taskprocessors a reasonable chance to shut down cleanly and assume that, after that time, any references to tps_singletons that still exist are due to bugs elsewhere.
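In other words, the fix makes tps_shutdown() wait, with a bound, for the container to drain before destroying it. A rough sketch of that idea (the constant, polling interval, and log message here are illustrative, not the actual patch):

/* Illustrative bound on how long we will wait for running taskprocessors
 * to unlink themselves; the real patch defines its own limit. */
#define TPS_SHUTDOWN_MAX_WAIT_SEC 10

static void tps_shutdown(void)
{
	int attempts = TPS_SHUTDOWN_MAX_WAIT_SEC * 10;

	/* Give still-running taskprocessors a reasonable chance to empty
	 * their queues and unlink themselves from the container. */
	while (ao2_container_count(tps_singletons) && attempts--) {
		usleep(100000); /* check every 100ms */
	}

	if (ao2_container_count(tps_singletons)) {
		ast_log(LOG_ERROR, "Taskprocessors still running at shutdown; "
			"remaining references to tps_singletons are bugs elsewhere.\n");
	}

	ao2_t_ref(tps_singletons, -1, "Unref tps_singletons in shutdown");
	tps_singletons = NULL;
}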
So what can we learn from this?
When Asterisk shuts down, all modules are unloaded before the final callbacks run. As these modules unload, they should also terminate their own taskprocessors. When they do this, they can choose to join the taskprocessor thread until it terminates or allow the taskprocessor thread(s) to shut down without joining.
If we are simply unloading a module, it's not a tremendous deal to let the running taskprocessor threads clean themselves up, as they are designed to do; the same is true of most shutdowns. But here we are trying to preserve data. So unless we add the requirement that all modules ensure their running taskprocessors are cleaned up before returning from unload, we have to allow the taskprocessor shutdown process time for this to occur.
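For an individual module, "cleaning up its taskprocessors" usually just means dropping its references on unload. The names below (my_module_tps and the taskprocessor name string) are stand-ins for whatever the module actually creates:

#include "asterisk.h"
#include "asterisk/module.h"
#include "asterisk/taskprocessor.h"

static struct ast_taskprocessor *my_module_tps;

static int load_module(void)
{
	/* Creating the taskprocessor links it into tps_singletons. */
	my_module_tps = ast_taskprocessor_get("my_module_tps", TPS_REF_DEFAULT);
	if (!my_module_tps) {
		return AST_MODULE_LOAD_DECLINE;
	}
	return AST_MODULE_LOAD_SUCCESS;
}

static int unload_module(void)
{
	/* Dropping our reference lets the taskprocessor wind down, but its
	 * thread still has to notice, unlink itself, and exit; that is
	 * exactly the window tps_shutdown() now waits for. */
	my_module_tps = ast_taskprocessor_unreference(my_module_tps);
	return 0;
}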
Locking quasi-global data (static to our loaded mechanism) is generally how we handle keeping data consistent between running threads that share it. But in this case the shutdown performs an action that can’t be locked – freeing the data – so we must make sure that all references to it are removed before proceeding.
This also allows for a clean graceful stop.
Ignoring issues at shutdown is easy; I mean, we are shutting down anyway, right? The problem is that crashing on shutdown causes issues on a production system. In this case, automated tests were intermittently hitting the crash, making them prone to false failures.
Tests that give false failures aren't very useful, and code that generates core files because it can't shut down properly isn't something we want to leave behind.