Miserable performance of cyrus-imapd 2.3.9 -- seems to be locking issues
Ian G Batten
ian.batten at uk.fujitsu.com
Tue Mar 4 05:01:19 EST 2008
On 28 Feb 08, at 2256, Kenneth Marshall wrote:
> It may be that the software RAID 5 is your problem. Without the
> use of NVRAM for a cache, all of the writes need all 3 disks.
> That will cause quite a bottle-neck.
In general, RAID5 writes require two reads and two writes,
independent of the size of the RAID5 assemblage. To write a given
block, you read the previous contents of the block you are updating
and the associated parity block. You XOR the previous contents with
the parity, thus stripping it out, and then XOR the new contents in.
You then write the new contents to the data block and the updated
parity to the parity block.
New Parity = Old Parity xor Old Contents xor New Contents
In the absence of NVRAM this requires precisely four disk operations,
two reads followed by two writes.
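The parity identity above can be sketched in a few lines of Python. This is purely illustrative (toy block contents, no real I/O); the function name is mine, not from any RAID implementation:

```python
# Sketch of the four-operation read-modify-write described above.
# Byte strings stand in for disk blocks; values are made up.

def rmw_parity(old_data: bytes, old_parity: bytes, new_data: bytes) -> bytes:
    """New Parity = Old Parity xor Old Contents xor New Contents."""
    return bytes(p ^ o ^ n for p, o, n in zip(old_parity, old_data, new_data))

# Toy 3-disk stripe: two data blocks plus their XOR parity.
d0, d1 = b"\x01\x02\x03", b"\x04\x05\x06"
parity = bytes(a ^ b for a, b in zip(d0, d1))

# Overwrite d0; only d0's old contents and the old parity are read.
d0_new = b"\x07\x08\x09"
parity_new = rmw_parity(d0, parity, d0_new)

# The incrementally updated parity matches a full recompute of the stripe.
assert parity_new == bytes(a ^ b for a, b in zip(d0_new, d1))
```

XORing the old contents against the old parity "strips out" the block's contribution; XORing the new contents back in completes the update without touching any other spindle.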
A naive implementation would, as you imply, use all the spindles. It
would read the other data blocks of the stripe from the spindles not
directly involved in the update, compute the new parity block from
scratch, and then write the data block and the new parity. For an
N-disk RAID5 assemblage that's N-2 reads followed by 2 writes, N
operations in total.
Now as it happens, for the pathological case of a 3-disk RAID5
assemblage, the naive implementation is better than the more standard
implementation. I don't know if any real-world code is optimised for
this corner case. I would doubt it: software RAID5 is a performance
disaster area at the best of times unless it can take advantage of
intimate knowledge of the intent log in the filesystem (RAID-Z does
this), and three-disk RAID5 assemblages are a performance disaster
area irrespective of hardware in a failure scenario. The rebuild
will involve taking 50% of the IO bandwidth of the two remaining
disks in order to saturate the new target; rebuild performance ---
contrary to intuition --- improves with larger assemblages as you can
saturate the replacement disk with less and less of the bandwidth of
the surviving spindles.
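The rebuild arithmetic behind that claim, sketched under the simplifying assumption that all spindles have identical bandwidth:

```python
# To feed the replacement disk at full speed during a RAID5 rebuild, each
# of the N-1 surviving disks must contribute 1/(N-1) of its bandwidth to
# the reconstruction stream. Purely illustrative arithmetic.

def survivor_load(n_disks: int) -> float:
    return 1.0 / (n_disks - 1)

for n in (3, 4, 8, 12):
    print(f"{n}-disk RAID5 rebuild: {survivor_load(n):.0%} of each survivor")
```

At N=3 each survivor gives up 50% of its bandwidth; at N=12 only about 9%, which is why rebuild impact shrinks as the assemblage grows.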
For a terabyte, 3x500GB SATA drives in a RAID5 group will be blown
out of the water by 4x500GB SATA drives in a RAID 0+1 configuration
in terms of performance and (especially) latency, especially if it
can do the Solaris trick of not faulting an entire RAID 0 sub-group
if one spindle fails. Rebuild still isn't pretty, mind you.