or… “Watch This Space”
If you’re an Asterisk contributor you’ve probably noticed that we’d been having issues with large numbers of Jenkins test failures during the “gate” phase of the Gerrit review process. Some tests were failing consistently and others seemed to fail at random. After a lot of head scratching we finally figured out the major contributor to the failures. tl;dr: it was the /tmp filesystem. To understand how this was affecting the tests, you have to understand the intricacies of the Asterisk Testsuite and the virtualization environment the Testsuite runs in.
If you’ve had the pleasure of working with the Testsuite, you know that pretty much every test is timing dependent. SIP packets and AMI events have to be received in the expected order and within the expected time frames for a test to pass. The Testsuite is also more disk I/O intensive than most people realize, as it’s constantly writing temporary config files and log files, starting and stopping Asterisk and sipp, etc. For this reason, the availability of disk I/O bandwidth can have a big impact on ordering and timings.
While CPU and memory distribution is a snap to tune in most virtualized environments, disk I/O is one of the hardest things to tune. You can have the fastest SSDs on the planet but if an application has to go through 27 layers to get to them, it won’t matter. In our case, this was the issue.
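To get a rough feel for how many layers a small write has to cross, you can time the kind of work the Testsuite does all day: create a small file, write it, flush it, delete it. The following is a minimal sketch, not the tooling we actually used; the function names `time_small_writes` and `summarize` and the default file count and size are made up for illustration.

```python
import os
import statistics
import tempfile
import time


def time_small_writes(directory, count=200, size=4096):
    """Return per-file latencies (seconds) for create + write + fsync + delete."""
    payload = b"x" * size
    latencies = []
    for _ in range(count):
        start = time.perf_counter()
        # Create a uniquely named file in the directory under test.
        fd, path = tempfile.mkstemp(dir=directory)
        try:
            os.write(fd, payload)
            os.fsync(fd)  # ask the OS to push the data down to the backing store
        finally:
            os.close(fd)
            os.unlink(path)
        latencies.append(time.perf_counter() - start)
    return latencies


def summarize(latencies):
    """Median and worst-case latency in milliseconds."""
    return {
        "median_ms": statistics.median(latencies) * 1000,
        "max_ms": max(latencies) * 1000,
    }


if __name__ == "__main__":
    target = tempfile.gettempdir()  # usually /tmp on Linux
    print(target, summarize(time_small_writes(target)))
```

Running it against a directory on each candidate filesystem and comparing the worst-case numbers is usually more telling than the median, since a timing-dependent test only needs one slow write to fail.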
Here’s what we had…
- Docker Container
  - Asterisk Testsuite
    - /tmp on ext4
      - Docker Host (virtual machine) btrfs filesystem
        - oVirt/libvirt virtio-scsi driver
          - oVirt/libvirt VM host
            - QEMU QCOW2 disk
              - Gluster distributed filesystem
                - 20G dedicated storage network
                  - VM Host 1
                    - Host xfs filesystem
                  - VM Host 2
                    - Host xfs filesystem
                  - VM Host 3
                    - Host xfs filesystem
OK, it’s not 27 layers but it’s still way too many. With that arrangement, we consistently had 15-20 test failures per gate.
Here’s what we have now…
- Docker Container
  - Asterisk Testsuite
    - /tmp on tmpfs (memory backed)
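If you want to confirm what is actually backing /tmp inside a container, the mount table will tell you. Below is a minimal sketch that parses text in the `/proc/mounts` format and reports the filesystem type for a path; the helper name `fs_type_for` and the sample mount table are hypothetical. One common way to get a memory-backed /tmp is the `--tmpfs /tmp` option to `docker run`, though that is only one option and not necessarily how our containers are configured.

```python
import os

# Hypothetical mount table in /proc/mounts format, for illustration only.
SAMPLE_MOUNTS = """\
overlay / overlay rw,relatime 0 0
tmpfs /tmp tmpfs rw,nosuid,nodev 0 0
"""


def fs_type_for(path, mounts_text):
    """Return (mount_point, fs_type) for the deepest mount containing path.

    Note: octal escapes such as \\040 in mount points are not decoded here.
    """
    path = os.path.normpath(path)
    best = None
    for line in mounts_text.splitlines():
        fields = line.split()
        if len(fields) < 3:
            continue
        mount_point, fs_type = fields[1], fields[2]
        prefix = mount_point.rstrip("/") + "/"
        if path == mount_point or (path + "/").startswith(prefix):
            # ">=" lets a later mount on the same point shadow an earlier one.
            if best is None or len(mount_point) >= len(best[0]):
                best = (mount_point, fs_type)
    return best


if __name__ == "__main__":
    print(fs_type_for("/tmp/some-test-dir", SAMPLE_MOUNTS))
    if os.path.exists("/proc/mounts"):  # Linux only
        with open("/proc/mounts") as f:
            print(fs_type_for("/tmp", f.read()))
```

If the second field of the result for /tmp is anything other than `tmpfs`, writes there are still going down through whatever the container's storage is stacked on.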
Surprise! The bulk of the test failures went away. In fact, about half of the gates now have no failures at all and are auto merging, and the ones that do fail usually have fewer than 5 test failures.
We’re not out of the woods yet. As mentioned, there are still some chronic test failures, but we’re taking a hard look at them to see if they’re environmental or just temperamental and need to be rewritten to be more tolerant.