How much testing do we really do?
A lot, and let’s face it, Asterisk isn’t the easiest of software packages to test. Our continuous integration environment could run over 1000 tests on a single change before it’s merged into the codebase. Unfortunately, without a significant amount of work, mostly around directory and port coordination, the tests can’t be run in parallel in the same operating system instance. When run sequentially, a single run of 600 tests can take the better part of a work day. Given that most changes submitted to Asterisk’s Gerrit affect 3 branches, and that the Testsuite must be run for each branch before a change is merged, it was taking an unacceptable amount of time to get changes cleared through the process.
Earlier in the year, we created a new VM cluster and started breaking the Testsuite tests up into several clumps that could be run in parallel on several VMs. We then configured Jenkins to coordinate the parallel jobs and report back to Gerrit with a final vote. This has worked better, but now we’re running into other issues. We broke the tests up into 5 Jenkins jobs, so a change that affects 3 branches now needs 15 jobs (15 VMs) running in parallel. Because the VM cluster doesn’t have unlimited resources, we still have trouble responding to Gerrit (and the change author) in a timely fashion when there are multiple changes in the queue. It’s just not practical to keep adding resources to the pool when there are alternatives.
The nice thing about the Testsuite is that it doesn’t take a lot of resources to actually run the tests. As mentioned above, the constraint when attempting to run parallel Testsuite runs is on the network and filesystem. This is a perfect use case for running the tests in containers, specifically Docker, on the same host. Even better is that we can build Asterisk only once and have the tests run in parallel against those build products. Even better than that, Jenkins has built-in facilities to help us with container management.
There are some hurdles though:
- Create the Docker Images:
We test Asterisk on CentOS, Fedora and Ubuntu, and we have scripts that can take the official base x86_64 images of those distributions and turn them into full Jenkins-ready Asterisk development environments. There are no official i686 base images, however, and we test in 32-bit environments as well as 64-bit. To create the images, we had to do actual installs of 32-bit CentOS 7 and Ubuntu 14, clean them out, tar the root file systems, then create Docker images from the tarballs. Unfortunately, that’s only half the battle. The architecture a container reports isn’t determined by the packages installed in the image. It’s determined by the host, because the container shares the host’s kernel. If you run a 32-bit CentOS 7 image on a 64-bit CentOS 7 host, then run ‘uname -m’ in the container, you’ll get ‘x86_64’, not ‘i686’. If you try to build and/or run Asterisk in this situation, things are going to get messy. Luckily, you can use the ‘setarch’ command to control what ‘uname -m’ returns, but to use it we had to put logic in our Dockerfile creation scripts and in the scripts that actually build and run Asterisk.
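To make the ‘setarch’ point concrete, here is a minimal sketch of the kind of wrapper logic a build script might need. It is not our actual script: the function name `wrap_for_arch` and its parameters are hypothetical, and it only assumes that ‘setarch’ is invoked as `setarch <arch> <command...>`.

```python
import platform


def wrap_for_arch(command, target_arch, reported_arch=None):
    """Return `command`, prefixed with setarch if the kernel reports
    a different architecture than the one the image was built for.

    `reported_arch` defaults to what the running kernel reports,
    which is normally the same value `uname -m` prints.
    """
    if reported_arch is None:
        reported_arch = platform.machine()
    if reported_arch == target_arch:
        # Nothing to do: the container already sees the right architecture.
        return list(command)
    # Run the command under setarch so that it sees the target architecture.
    return ["setarch", target_arch] + list(command)


if __name__ == "__main__":
    # A 32-bit image on a 64-bit host: the command gets wrapped.
    print(wrap_for_arch(["uname", "-m"], "i686", reported_arch="x86_64"))
    # A 64-bit image on a 64-bit host: the command is left alone.
    print(wrap_for_arch(["uname", "-m"], "x86_64", reported_arch="x86_64"))
```

The resulting list can be handed to something like `subprocess.run`. Keeping the decision in one helper means the build and run scripts don’t each need their own architecture checks.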
- Understand the Docker-Jenkins Relationship:
This is a fairly complicated relationship, mostly due to the serialization Jenkins needs to maintain state and data across multiple slaves. When using the Jenkins Pipeline DSL, we had to be very aware of which Jenkins commands are executed in the Docker container and which on the Docker host (the Jenkins slave). For instance, the built-in Jenkins ‘git’ tools always run on the host, not in the container, but if you run ‘git’ in a Jenkins shell (‘sh’) step, it runs in the container.
- Divide and Coordinate the Workload:
Much experimentation was needed to split the Asterisk Testsuite tests into chunks of roughly equal execution time, but most of the work was around coordinating the git checkout, Asterisk build, Asterisk install and Testsuite runs. The only way to keep both the time and the resource load down was to check out and compile once in one container, then run parallel containers, one for each test chunk, each of which installs the build products from the first container and then runs the Testsuite for its specific subset of tests. Jenkins waits for all the containers to complete, then notifies Gerrit of the result.
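The split-and-wait pattern can be sketched in a few lines of Python. This is an illustration only, not how our Jenkins jobs are written: it uses a simple greedy "longest test first" heuristic to balance chunks, the test names and timings are made up, and `run_chunk` is a hypothetical stand-in for whatever starts a container and runs one chunk.

```python
import heapq
from concurrent.futures import ThreadPoolExecutor


def split_by_duration(durations, chunk_count):
    """Greedily assign tests to chunks so total run times are similar.

    `durations` maps a test name to its measured run time in seconds.
    Tests are placed longest first, each into the currently lightest chunk.
    """
    heap = [(0.0, index) for index in range(chunk_count)]
    heapq.heapify(heap)
    chunks = [[] for _ in range(chunk_count)]
    ordered = sorted(durations.items(), key=lambda item: (-item[1], item[0]))
    for name, seconds in ordered:
        total, index = heapq.heappop(heap)  # lightest chunk so far
        chunks[index].append(name)
        heapq.heappush(heap, (total + seconds, index))
    return chunks


def run_gate(chunks, run_chunk):
    """Run every chunk concurrently, wait for all of them, and
    return True only if every chunk reported success."""
    with ThreadPoolExecutor(max_workers=max(1, len(chunks))) as pool:
        results = list(pool.map(run_chunk, chunks))
    return all(results)


if __name__ == "__main__":
    # Made-up test names and timings, purely for illustration.
    timings = {"a": 50, "b": 40, "c": 30, "d": 20, "e": 10}
    chunks = split_by_duration(timings, 2)
    print(chunks)
    # A fake runner that always passes; a real one would start a container.
    print(run_gate(chunks, lambda chunk: True))
```

A greedy split like this is only a starting point. Real timings vary from run to run, which is part of why finding a good split takes experimentation.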
So did it work?
Well, yeah, kinda. By compiling once then parallelizing the tests into 10 containers, we’ve been able to get a single Gerrit gate (Jenkins job) to run in about 30 minutes on a single modest VM. Even better, we can now run more than 1 Jenkins job on the same VM, so we now have more levels of parallelism: Jenkins can manage multiple slaves (VMs) of course, each slave can now run more than 1 job at a time, and each job can now run more than 1 test at a time. So why the “kinda”? We’ve found about a dozen tests that fail consistently when run in a container but don’t fail when run in a VM. We’re not sure why yet and we’re still investigating.
When does it get rolled out?
We’re working through those failing tests now and have already fixed many ARI tests whose failures were actually the result of a reference leak in Asterisk. When the rest are addressed (over the next few sprints), we should be able to move the public Jenkins instance to the new architecture.
Once we’ve rolled out the new architecture, we’ll publish the details of how and what we did on the Asterisk Wiki.