Back in December, in my The Continuing Saga of Continuous Integration blog post, I wrote about how we reduced the Testsuite’s “27” layers of file system access down to 3 by making the Docker containers’ /tmp filesystem memory backed. That cut the number of individual test failures considerably, but still only about 20% of the Gerrit reviews submitted to Jenkins for testing were passing and getting automatically merged. After quite a bit of head scratching, Joshua Colp determined that we were still seeing storage I/O latency on the order of seconds, especially when the Testsuite was starting Asterisk. After even more head scratching, we decided to try changing the underlying VM disk image storage stack.
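For reference, the tmpfs change itself is essentially a one-liner. Here’s a minimal sketch using Docker’s --tmpfs option; the image name, container name, and size limit are placeholders, not our actual values:

```
# Mount /tmp as a RAM-backed tmpfs inside the container instead of
# the copy-on-write container filesystem. Size and names here are
# illustrative only.
docker run --tmpfs /tmp:rw,size=1g --name testsuite asterisk/testsuite
```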
My earlier post showed each VM host using XFS filesystems to store the Gluster bricks (Gluster Overview). What it didn’t show was that the XFS filesystems were sitting on top of LVM Logical Volumes, Volume Groups, and Physical Volumes before actually getting to the SSDs. This is, in fact, the recommended architecture for an oVirt Hyperconverged cluster, but it just didn’t seem optimal. So what were the alternatives? Well, the most straightforward one was to replace the XFS/LVM stack with Btrfs directly on the SSDs. Why? First, Btrfs has built-in optimizations for SSDs, which XFS doesn’t. Second, Btrfs’s “chunk” size of 1GB fits better with the Gluster “shard” size of 512MB. Finally, although LVM’s performance penalty is minuscule, Btrfs does its own multi-volume management, so we don’t need the added configuration complexity of LVM.
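For illustration, here’s roughly what the simpler stack looks like to set up. The device names and mount point below are hypothetical; the key point is that mkfs.btrfs accepts multiple devices directly, which is what lets Btrfs take over the volume management LVM used to do:

```
# Create one Btrfs filesystem directly across the SSDs; Btrfs
# handles the multi-device management itself, so no PVs/VGs/LVs.
# Device names are placeholders.
mkfs.btrfs -d single /dev/sdb /dev/sdc

# Mount it where the Gluster brick lives. Btrfs detects
# non-rotational devices and enables its SSD optimizations
# automatically (they can also be forced with -o ssd).
mount -o noatime /dev/sdb /gluster/brick1
```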
The results:
Using Gluster’s profiling tools, we took before and after samples of WRITE operations across the 9 Gluster bricks in the cluster.
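If you want to collect the same numbers yourself, the profiler is built into Gluster; the volume name below is a placeholder:

```
# Start collecting per-brick FOP latency statistics on the volume.
gluster volume profile vmstore start

# ...run the workload, then dump the stats for each brick.
gluster volume profile vmstore info
```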
XFS over LVM over SSD
%-latency   Avg-latency    Min-Latency   Max-Latency     No. of calls   Fop
    98.51   77040.25 us       56.57 us   588832.10 us           12556   WRITE
    97.00   67595.11 us       61.44 us   687770.44 us           15702   WRITE
    98.29   83070.62 us       89.23 us   646568.83 us           12175   WRITE
    96.32   16869.02 us       89.15 us   261031.69 us           12175   WRITE
    96.40   20934.36 us       61.23 us   310591.13 us           15707   WRITE
    97.05   17836.49 us       84.40 us   216012.16 us           12563   WRITE
    98.27   85541.49 us       82.60 us   950387.41 us           15269   WRITE
    97.45   64830.92 us      123.76 us   742465.42 us           12175   WRITE
    98.02   64410.35 us      110.84 us   511794.17 us           12566   WRITE
Btrfs over SSD
%-latency   Avg-latency    Min-Latency   Max-Latency     No. of calls   Fop
    70.32    7801.45 us       92.02 us    92014.65 us           10799   WRITE
    75.11    9286.58 us       97.14 us   219555.16 us           12964   WRITE
    75.04    8248.24 us       98.43 us   307658.27 us            8651   WRITE
    78.25   14517.57 us       99.64 us   275937.02 us           12964   WRITE
    79.68   11640.37 us      101.36 us   360742.45 us            8651   WRITE
    78.42   11622.20 us       89.40 us   194180.54 us           10800   WRITE
    78.07   10995.76 us      110.79 us   161713.40 us           12965   WRITE
    73.79    8206.68 us       90.27 us   122550.16 us           10800   WRITE
    79.77   11548.65 us      104.08 us   300292.61 us            8651   WRITE
That’s a significant improvement! The average WRITE latency per brick dropped from roughly 17–85 ms to 8–15 ms, and the worst case fell from almost a full second to about a third of a second.
So what was the ultimate result from a Gerrit review auto-merge perspective?
The Gerrit auto-merge rate went from 20% to 90%!
Of course, it’s not just the auto-merge rate we’re happy about. Since the tests themselves run more reliably, the results are also more meaningful: when we see failures, they’re more likely to be real issues in the code rather than artifacts of the testing infrastructure.