The Continuing Saga of Continuous Integration

or… “Watch This Space”

If you’re an Asterisk contributor, you’ve probably noticed that we’d been having issues with large numbers of Jenkins test failures during the “gate” phase of the Gerrit review process. Some tests were failing consistently and others seemed to fail at random. After a lot of head scratching, we finally figured out the major contributor to the failures. tl;dr: it was the /tmp filesystem. To understand how this was affecting the tests, you have to understand the intricacies of the Asterisk Testsuite and the virtualization environment the Testsuite runs in.

If you’ve had the pleasure of working with the Testsuite, you know that pretty much every test is timing dependent. SIP packets and AMI events have to be received in the expected order and within the expected time frames for a test to pass. The Testsuite is also more disk I/O intensive than most people realize: it’s constantly writing temporary config files and log files, starting and stopping Asterisk and sipp, and so on. For this reason, the availability of disk I/O bandwidth can have a big impact on ordering and timings.
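To get a rough feel for how much small-file I/O costs on a given filesystem, a probe along these lines can help. This is only a sketch and not part of the Testsuite: the function name `small_write_latencies`, the file count, and the 4 KiB payload are all assumptions picked for illustration. It writes a batch of small files, forces each one to storage with `fsync`, and reports how long each write took.

```python
import os
import statistics
import tempfile
import time


def small_write_latencies(directory, count=50, size=4096):
    """Write `count` small files under `directory` and time each one.

    Hypothetical helper for illustration. Each file is flushed and
    fsync'ed so the timing includes the trip through the storage stack,
    not just a copy into the page cache.
    """
    payload = b"x" * size
    latencies = []
    # The scratch directory and everything in it is removed on exit.
    with tempfile.TemporaryDirectory(dir=directory) as workdir:
        for i in range(count):
            path = os.path.join(workdir, f"probe_{i}.conf")
            start = time.perf_counter()
            with open(path, "wb") as handle:
                handle.write(payload)
                handle.flush()
                os.fsync(handle.fileno())
            latencies.append(time.perf_counter() - start)
    return latencies


if __name__ == "__main__":
    samples = small_write_latencies(tempfile.gettempdir())
    print(f"median: {statistics.median(samples) * 1e3:.3f} ms")
    print(f"max:    {max(samples) * 1e3:.3f} ms")
```

Running it against two different directories (for example one on a disk-backed filesystem and one on a memory-backed one) gives a crude comparison. The maximum is often more telling than the median, since a single slow write can be enough to push a timing-sensitive test past its deadline.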

While CPU and memory distribution is a snap to tune in most virtualized environments, disk I/O is one of the hardest things to tune. You can have the fastest SSDs on the planet, but if an application has to go through 27 layers to get to them, it won’t matter. In our case, this was the issue.

Here’s what we had…

  • Docker Container
    • Asterisk Testsuite
      • /tmp on ext4
        • Docker Host (virtual machine) btrfs filesystem
          • oVirt/libvirt virtio-scsi driver
            • oVirt/libvirt VM host
              • QEMU QCOW2 disk
                • Gluster distributed filesystem
                  • 20G dedicated storage network
                    • VM Host 1
                      • Host xfs filesystem
                    • VM Host 2
                      • Host xfs filesystem
                    • VM Host 3
                      • Host xfs filesystem

OK, it’s not 27 layers but it’s still way too many.  With that arrangement, we consistently had 15-20 test failures per gate.

Here’s what we have now…

  • Docker Container
    • Asterisk Testsuite
      • /tmp on tmpfs (memory backed)

Surprise! The bulk of the test failures went away. In fact, about half of the gates now have no failures at all and are auto merging, and the ones that do fail usually have fewer than 5 test failures.
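If you want to confirm what a container’s /tmp is actually sitting on, the mount table will tell you. The sketch below is an illustration, not something taken from our Jenkins setup: `filesystem_type` is a hypothetical helper, and the sample mount table is made up. It parses text in the `/proc/mounts` format and returns the mount point and filesystem type that cover a given path. For reference, one common way to get a memory-backed /tmp in a container is the `--tmpfs /tmp` option to `docker run`; this post doesn’t depend on that being the mechanism used.

```python
import os


def filesystem_type(path, mounts_text):
    """Return (mount_point, fs_type) for the mount that contains `path`.

    `mounts_text` is text in /proc/mounts format: device, mount point,
    filesystem type, options, and two numeric fields per line.
    Hypothetical helper; it does not decode escaped characters such as
    \\040 in mount points.
    """
    target = os.path.normpath(path)
    best = None
    for line in mounts_text.splitlines():
        fields = line.split()
        if len(fields) < 3:
            continue
        mount_point, fs_type = fields[1], fields[2]
        prefix = mount_point.rstrip("/") + "/"
        if target == mount_point or (target + "/").startswith(prefix):
            # Prefer the longest matching mount point; on a tie, the
            # later entry wins because it shadows the earlier one.
            if best is None or len(mount_point) >= len(best[0]):
                best = (mount_point, fs_type)
    return best


# Made-up mount table for illustration only.
SAMPLE_MOUNTS = """\
overlay / overlay rw,relatime 0 0
tmpfs /tmp tmpfs rw,nosuid,nodev 0 0
"""

if __name__ == "__main__":
    mounts = SAMPLE_MOUNTS
    if os.path.exists("/proc/mounts"):
        with open("/proc/mounts") as handle:
            mounts = handle.read()
    print(filesystem_type("/tmp", mounts))
```

A filesystem type of `tmpfs` for /tmp means temporary files never leave memory, which is the whole point of the change described above.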

We’re not out of the woods yet. As mentioned, there are still some chronic test failures, but we’re taking a hard look at them to see whether they’re environmental or just temperamental and need to be rewritten to be more tolerant.

 
