Continuous Integration Update

Back in December, in my blog post The Continuing Saga of Continuous Integration, I wrote about how we reduced the Testsuite’s “27” layers of file system access down to 3 by making the Docker containers’ /tmp filesystem memory backed.  That cut the number of individual test failures considerably, but still only about 20% of the Gerrit reviews submitted to Jenkins for testing were passing and being automatically merged.  After quite a bit of head scratching, Joshua Colp determined that we were still seeing storage I/O latency on the order of seconds, especially when the Testsuite was starting Asterisk.  After even more head scratching, we decided to try changing the underlying VM disk image storage path.

My earlier post showed each VM host using XFS filesystems to store the Gluster bricks (Gluster Overview).  What it didn’t show was that those XFS filesystems sat on top of LVM Logical Volumes, Volume Groups, and Physical Volumes before actually reaching the SSDs.  This is, in fact, the recommended architecture for an oVirt Hyperconverged cluster, but it just didn’t seem optimal.  What were the alternatives?  The most straightforward one was to replace the XFS/LVM stack with Btrfs directly on the SSDs.  Why?  First, Btrfs has built-in optimizations for SSDs, which XFS lacks.  Second, Btrfs’s “chunk” size of 1G fits better with the Gluster “shard” size of 512MB.  Finally, although LVM’s performance penalty is minuscule, Btrfs does its own multi-volume management, so we don’t need the added configuration complexity of LVM.
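The chunk/shard alignment point is simple arithmetic; a quick sketch (treating both sizes as exact powers-of-two byte counts, which is my assumption):

```python
# Btrfs allocates data in 1G chunks; the Gluster volume uses 512MB shards.
# Because the chunk size is an exact multiple of the shard size, a whole
# number of shards fits per chunk and no shard straddles a chunk boundary.
BTRFS_CHUNK = 1024 * 1024 * 1024    # 1 GiB Btrfs data chunk
GLUSTER_SHARD = 512 * 1024 * 1024   # 512 MiB Gluster shard

shards_per_chunk = BTRFS_CHUNK // GLUSTER_SHARD
remainder = BTRFS_CHUNK % GLUSTER_SHARD
print(shards_per_chunk, remainder)  # 2 0
```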

The results:

Using Gluster’s profiling tools, we took before-and-after samples of WRITE operations across the nine Gluster bricks in the cluster.
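For reference, these samples come from Gluster’s built-in profiler (`gluster volume profile <vol> start`, then `gluster volume profile <vol> info`).  A minimal sketch of pulling the per-brick WRITE rows out of that output (the parsing details are an assumption about the column layout, not our exact tooling):

```python
def write_rows(profile_info_text):
    """Extract (%-latency, avg, min, max, calls) tuples for WRITE fops
    from `gluster volume profile <vol> info` output."""
    rows = []
    for line in profile_info_text.splitlines():
        fields = line.split()
        # A WRITE row looks like: pct avg us min us max us calls WRITE
        if len(fields) == 9 and fields[-1] == "WRITE":
            pct, avg, _, mn, _, mx, _, calls, _ = fields
            rows.append((float(pct), float(avg), float(mn),
                         float(mx), int(calls)))
    return rows

sample = "    98.51   77040.25 us      56.57 us  588832.10 us          12556       WRITE"
print(write_rows(sample))
```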

XFS over LVM over SSD

%-latency   Avg-latency   Min-Latency   Max-Latency   No. of calls         Fop
    98.51   77040.25 us      56.57 us  588832.10 us          12556       WRITE
    97.00   67595.11 us      61.44 us  687770.44 us          15702       WRITE
    98.29   83070.62 us      89.23 us  646568.83 us          12175       WRITE
    96.32   16869.02 us      89.15 us  261031.69 us          12175       WRITE
    96.40   20934.36 us      61.23 us  310591.13 us          15707       WRITE
    97.05   17836.49 us      84.40 us  216012.16 us          12563       WRITE
    98.27   85541.49 us      82.60 us  950387.41 us          15269       WRITE
    97.45   64830.92 us     123.76 us  742465.42 us          12175       WRITE
    98.02   64410.35 us     110.84 us  511794.17 us          12566       WRITE

Btrfs over SSD

%-latency   Avg-latency   Min-Latency   Max-Latency   No. of calls         Fop
    70.32    7801.45 us      92.02 us   92014.65 us          10799       WRITE
    75.11    9286.58 us      97.14 us  219555.16 us          12964       WRITE
    75.04    8248.24 us      98.43 us  307658.27 us           8651       WRITE
    78.25   14517.57 us      99.64 us  275937.02 us          12964       WRITE
    79.68   11640.37 us     101.36 us  360742.45 us           8651       WRITE
    78.42   11622.20 us      89.40 us  194180.54 us          10800       WRITE
    78.07   10995.76 us     110.79 us  161713.40 us          12965       WRITE
    73.79    8206.68 us      90.27 us  122550.16 us          10800       WRITE
    79.77   11548.65 us     104.08 us  300292.61 us           8651       WRITE
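Averaging the Avg-latency column across the nine bricks makes the gap concrete (figures copied from the two tables above):

```python
# Per-brick average WRITE latencies (us) from the Gluster profiles above.
xfs_lvm = [77040.25, 67595.11, 83070.62, 16869.02, 20934.36,
           17836.49, 85541.49, 64830.92, 64410.35]
btrfs = [7801.45, 9286.58, 8248.24, 14517.57, 11640.37,
         11622.20, 10995.76, 8206.68, 11548.65]

xfs_mean = sum(xfs_lvm) / len(xfs_lvm)
btrfs_mean = sum(btrfs) / len(btrfs)
print(f"XFS/LVM mean: {xfs_mean:9.2f} us")        # ~55347.62 us
print(f"Btrfs mean:   {btrfs_mean:9.2f} us")      # ~10429.72 us
print(f"Improvement:  {xfs_mean / btrfs_mean:.1f}x")  # ~5.3x
```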

That’s a significant improvement!

So what was the ultimate result from a Gerrit review auto-merge perspective?

The Gerrit auto-merge rate went from 20% to 90%!

Of course it’s not just the auto-merge rate we’re happy about.  Since the tests themselves are more reliable, the results are also more meaningful.  When we see failures, they’re more likely to be real issues with the code rather than artifacts of the testing infrastructure.
