A couple of months ago I got a couple of wonderful birthday presents. My lovely geeky girlfriend got me two Western Digital 500 GB SATA 3.0 drives, which were promptly supplemented with a 3ware 9550SXU 4-port hardware RAID card. Immediately I came up with the idea for this article. I had just read up on mdadm software RAID (updated reference), so I thought it would be perfect to benchmark the hardware RAID against the software RAID using all kinds of filesystems, block sizes, chunk sizes, LVM settings, etcetera.
Or so I thought… As it turns out, my (then) limited understanding of RAID and some trouble with my 3ware RAID card meant that I had to scale back my benchmark quite a bit. I only have two disks, so I was going to test RAID 1. Chunk size is not a factor when using RAID 1, so that axis was dropped from my benchmark. Then I found out that LVM (and the size of the extents it uses) is also not a factor, so I dropped another axis. And to top it off, I discovered some nasty problems with 3ware 9550 RAID cards under Linux that quickly made me give up on hardware RAID. See The trouble with 3ware RAID below. I still ended up testing various filesystems using different blocksizes and workloads on an mdadm RAID 1 setup, so the results should still prove interesting.
I ran all of the benchmarks on my home server: an HP ProLiant ML370 G3. Here are the specs:
- Dual Intel Xeon 3.2 GHz with hyperthreading, 512 KiB L2 cache and 1 MiB L3 cache each (giving me four logical CPUs)
- 1 GiB RAM
- Adaptec AIC7XXX SCSI RAID with four 36.4 GiB disks in RAID 5 and two 18.2 GiB disks in RAID 1
- 3ware 9550SXU 4-port SATA 3.0 RAID with two 500 GiB disks in JBOD, running in mdadm software RAID 1
- Debian GNU/Linux 4.0 (Etch)
Yes, it's probably a bit overkill for a home server :-) When doing the write speed benchmarks, the files were read from the RAID 5 unit, which can read at about 150 MiB/s, much faster than the 3ware mdadm RAID 1 is able to write. That means that the RAID 5 read speed should not have affected my benchmarks.
The trouble with 3ware RAID
The hardware RAID card has been giving me a fair bit of trouble. I bought it because I needed a SATA card that was compatible with PCI-X. Most of the "cheap" software RAID cards out there only support regular PCI, which is too slow. As soon as you start using three or four drives you hit the limit of the PCI bus. My server supports 64-bit PCI-X at 100 MHz, which is more than fast enough. Sadly I couldn't find a simple SATA card with PCI-X support, so I bought the expensive but high quality 3ware 9550SXU instead.
My problem with the 3ware card was that I couldn't make it go faster than 5 MiB/s write speed using its hardware RAID 1 features. As it turned out after contacting their help desk, the cards don't go any faster unless you enable the write cache. I had not enabled that feature because 3ware recommends not using it unless you have a battery backup unit to protect the cache in case of a power failure. When I did enable it, the speed was roughly equal to mdadm software RAID, but you run the risk of losing some data on a power failure. In short, the cards are useless unless you buy an extra BBU for them. Strangely enough, if you put the disks in JBOD mode, the card does work at full speed. Even stranger, in JBOD mode the disks are roughly 4 MiB per second faster at writing with the write cache disabled (85 MiB/s as opposed to 81 MiB/s with write cache enabled in JBOD mode).
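For anyone hitting the same wall: the write cache can be toggled per unit with 3ware's `tw_cli` tool. A sketch, assuming the card shows up as controller `/c0` and the array as unit `/u0` (check your own numbering first):

```shell
# List controllers, then the units on controller 0, to confirm the names:
tw_cli show
tw_cli /c0 show

# Toggle the unit's write cache (3ware advises "on" only with a BBU fitted):
tw_cli /c0/u0 set cache=on
tw_cli /c0/u0 set cache=off
```

These commands only make sense on a machine with the actual controller installed, so treat them as a reference, not something to paste blindly.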
I ran a few tests and there was little difference in speed between using JBOD+mdadm or hardware RAID 1 with write cache enabled. I chose to trust Linux more than the hardware and went with mdadm. I also read about a nice little trick which makes mdadm very useful. When a disk fails in hardware RAID you need to replace it with a disk of equal or larger size. This can be a problem, since not all 500 GiB disks have the exact same block count; it varies. If the new disk is smaller, you need to shrink and rebuild the RAID. With mdadm you can create a partition that does not quite fill up the disk. When one disk fails, you can put in another disk, create an equal sized partition and be up and running again quickly. Any variation in block count is absorbed by the unused disk space.
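A sketch of that trick with `parted` and `mdadm`. The device names and the partition end point are examples (a 500 GB disk is roughly 476,900 MiB, so stopping at 475,900 MiB leaves about 1 GiB of slack):

```shell
# DESTRUCTIVE: example device names; run only against the disks you mean.
# Partition both disks identically, stopping short of the end of the disk:
parted -s /dev/sda mklabel gpt
parted -s /dev/sda mkpart primary 1MiB 475900MiB
parted -s /dev/sdb mklabel gpt
parted -s /dev/sdb mkpart primary 1MiB 475900MiB

# Build the mirror on the partitions, not on the raw disks:
mdadm --create /dev/md0 --level=1 --raid-devices=2 /dev/sda1 /dev/sdb1
```

A replacement disk that comes up a few thousand blocks short still fits the same partition size, so the array rebuilds without any shrinking.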
The benchmark setup
At first I tried benchmarking my new RAID unit using Bonnie++ and IOzone to determine what the best setup would be for me. I had some problems with both of these programs. Bonnie++ only tests the hard drive, not the filesystem. It showed me write speeds of over 80 MiB/s, but these don't come near real-world performance. IOzone is able to test the filesystem as well, but I found it far too complicated to interpret the results. I never found a decent explanation of the "record size" attribute that IOzone tests, and since the results varied widely depending on that record size, I could not interpret the results correctly. This is not so much a problem with IOzone as it is with my lack of understanding of the results. IOzone is an expert tool and I am no expert. It produces vast amounts of data but I don't know how to map that to my typical workload.
Update: The IOzone team sent me an email explaining the record size attribute:
Record size is the size of the transfer that is being tested. It is the value of the third parameter to the read() and write() system calls.
So, instead of using a standard benchmarking suite I wrote my own benchmark based on real work that is performed by my server. My server mainly does three things:
- It receives and processes my backups, which are mostly made up of many small files (95% of them are source code files).
- It holds my music collection. Occasionally it receives a large batch of new Ogg files (whenever I have time to sit down and start ripping my CD collection).
- Occasionally I store some images on it (as in disk images). These are multi-gigabyte files.
I decided to do four tests:
- Reading a number of files, by outputting their contents to /dev/null.
- Writing to the filesystem, by copying files from my primary RAID unit to the filesystem. The average read speed of my primary RAID is much faster than 90 MiB/s, which is the theoretical maximum of my software RAID according to Bonnie++, so there should be no interference.
- Reading and writing at the same time, by copying files from the filesystem to a different location on the same filesystem.
- Deleting the files.
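The four operations above can be sketched as plain shell commands. `SRC` and `DEST` are stand-ins for the primary RAID and the filesystem under test; here they are temp directories so the sketch is safe to run as-is:

```shell
#!/bin/sh
# Hypothetical stand-ins for the source RAID and the filesystem under test.
SRC=/tmp/bench-src
DEST=/tmp/bench-dest
mkdir -p "$SRC" "$DEST"
printf 'sample data' > "$SRC/file1"

# 1. Read: stream every file to /dev/null.
find "$SRC" -type f -exec cat {} + > /dev/null

# 2. Write: copy from the source onto the filesystem under test.
cp -r "$SRC" "$DEST/write-test"
sync

# 3. Read + write: copy within the same filesystem.
cp -r "$DEST/write-test" "$DEST/rw-test"
sync

# 4. Delete.
rm -r "$DEST/write-test" "$DEST/rw-test"
```

In the real benchmark each step would be wrapped in `time`, repeated, and separated by `sync` and a short sleep so caches don't skew the numbers.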
I ran all these tests for three collections of files: A collection of 10928 small source code files (maximum size a few KiB each), a collection of 1744 Ogg Vorbis files (around 4 MiB each) and a collection of 3 multi-gigabyte files. Each test was run three times and the results were averaged. In between each test I ran the `sync` and `sleep` commands to make sure that caching would not interfere with the test results. All tests were performed on a 200 GiB mdadm software RAID 1 with the following filesystems:
- XFS with 1 KiB blocksize
- XFS with 4 KiB blocksize
- Ext3 with 1 KiB blocksize
- Ext3 with 4 KiB blocksize
- ReiserFS with 1 KiB blocksize
- ReiserFS with 4 KiB blocksize
- JFS with 4 KiB blocksize
It was not possible to create a JFS filesystem with 1 KiB blocksize, so it's missing from the test.
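The seven combinations can be created with the stock mkfs tools. A sketch, assuming the array is `/dev/md0` (each invocation overwrites the previous filesystem, so this is a reference, not a script to run in one go):

```shell
# Example device name: the mdadm RAID 1 array under test.
mkfs.xfs -f -b size=1024 /dev/md0     # XFS, 1 KiB blocks
mkfs.xfs -f -b size=4096 /dev/md0     # XFS, 4 KiB blocks
mkfs.ext3 -b 1024 /dev/md0            # Ext3, 1 KiB blocks
mkfs.ext3 -b 4096 /dev/md0            # Ext3, 4 KiB blocks
mkreiserfs -b 1024 /dev/md0           # ReiserFS, 1 KiB blocks
mkreiserfs -b 4096 /dev/md0           # ReiserFS, 4 KiB blocks
mkfs.jfs /dev/md0                     # JFS (no blocksize option; 4 KiB only)
```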
Results of the benchmark
Below are the results for all the filesystems, split out per test and per file collection.
Benchmarking small files
The benchmark using 10928 small source code files, a few KiB each at most
Reading speed is about the same across the board. You can see immediately how slow JFS is at writing and deleting all these files. What really strikes me is how slow ReiserFS is. ReiserFS is always touted as well-suited for large numbers of small files, but in my benchmark ReiserFS is outperformed by Ext3. As expected, XFS is a bit slower than ReiserFS and Ext3. XFS is usually touted as well-suited for large files.
Benchmarking medium-sized files
The benchmark using 1744 Ogg Vorbis files, on average about 4 MiB each
From this benchmark you can clearly see that XFS is better suited to larger files. It outperforms all other filesystems on all tests. There are two things that strike me here. First off, Ext3 with a 4 KiB blocksize is much slower at reading than Ext3 with a 1 KiB blocksize. You would expect it the other way around. Second, both Ext3 and ReiserFS are very slow to delete these files when they have a 1 KiB blocksize, but their speed is on par with the rest at a 4 KiB blocksize. I expected some difference, but not this much.
Benchmarking large files
The benchmark using 3 multi-gigabyte files
In this test we again see the abysmally poor performance of Ext3 and ReiserFS with a 1 KiB blocksize when deleting files. The rest of this test isn't much of a surprise. 4 KiB blocksizes are faster than 1 KiB blocksizes, and at 4 KiB the filesystems perform about equally. It seems that the advantage of XFS over Ext3 is lost again with really large files.
I'm not going to make any general conclusions from my benchmark. My tests are pretty non-standard and so is my hardware setup. I chose to use XFS with a 4 KiB blocksize after these tests. It's clearly faster when dealing with my Ogg Vorbis collection and not much slower when handling my backups (small files). These are the two tasks that I do most often on my server.
I'm probably going to get comments from some of you asking why I didn't test Reiser4, Ext4 or some other filesystem. I didn't test these because they are not quick to set up under Debian Etch. They require work, compiling kernel modules, etcetera. I'd like to test them if they are easily available under Debian Lenny after it is released, but I will need to buy some extra hard drives if I am going to do that. There is more data on my software RAID than will fit on my hardware RAID, so I can't shuffle the data around to test again with these disks.
One thing I would like to do in the future (when I have more disks) is to re-run these benchmarks on a RAID 5 array and vary the chunk size. As the Linux Software RAID HOWTO says, the combination of chunk size and block size matters for your performance. For a RAID 1 array this doesn't matter since there is no chunk size to deal with. Then I could compare those tests with RAID 1+0 and RAID 0+1. But all that is for another time.
I hope these benchmarks were in some way useful for you!