Miserable performance of cyrus-imapd 2.3.9 -- seems to be locking issues

Ian G Batten ian.batten at uk.fujitsu.com
Tue Mar 4 05:01:19 EST 2008


On 28 Feb 08, at 2256, Kenneth Marshall wrote:

> It may be that the software RAID 5 is your problem. Without the
> use of NVRAM for a cache, all of the writes need all 3 disks.
> That will cause quite a bottle-neck.


In general, a RAID5 write requires two reads and two writes,  
independent of the size of the RAID5 assemblage.  To write a given  
block, you read the previous contents of the block you are updating  
and the associated parity block.  You XOR the previous contents into  
the parity, stripping the old data's contribution out of it, and then  
XOR the new contents in.  You then write the new contents to the data  
block and the updated parity to the parity block.

New Parity = Old Parity xor Old Contents xor New Contents

In the absence of NVRAM this requires precisely four disk operations,  
two reads followed by two writes.
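
As a rough sketch (purely illustrative; the block contents and the
function name are made up, not taken from any real RAID driver), the
read-modify-write update looks like this in Python:

    def rmw_update(old_data: bytes, old_parity: bytes, new_data: bytes) -> bytes:
        """Return the new parity block after overwriting one data block."""
        # New Parity = Old Parity xor Old Contents xor New Contents
        return bytes(p ^ od ^ nd
                     for p, od, nd in zip(old_parity, old_data, new_data))

    # Two reads (old data block, old parity block) ...
    old_data   = bytes([0x11, 0x22, 0x33, 0x44])
    old_parity = bytes([0xA0, 0xB0, 0xC0, 0xD0])
    new_data   = bytes([0x55, 0x66, 0x77, 0x88])

    # ... one XOR in memory ...
    new_parity = rmw_update(old_data, old_parity, new_data)

    # ... then two writes (new data, new parity): four disk operations in all.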

A naive implementation would, as you imply, use all the spindles.  It  
would read the contents of the stripe from the spindles not directly  
involved in the update, compute the new parity block, and then write  
the data block and the new parity.  For an N-disk RAID5 assemblage  
that's N-2 reads followed by 2 writes, N operations in total.
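
To make the trade-off concrete, here is a back-of-the-envelope
comparison of the two strategies (the names are mine, not from any
particular implementation):

    def ops_read_modify_write(n_disks: int) -> int:
        # Read old data + old parity, then write new data + new parity:
        # always four operations, regardless of the number of disks.
        return 4

    def ops_reconstruct_write(n_disks: int) -> int:
        # Read the other N-2 blocks in the stripe, then write data + parity.
        return (n_disks - 2) + 2

    for n in (3, 4, 5, 8):
        print(n, ops_read_modify_write(n), ops_reconstruct_write(n))
    # For N=3 the "naive" reconstruct-write wins (3 ops vs 4); at N=4 they
    # tie, and from N=5 upwards read-modify-write is cheaper.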

Now as it happens, for the pathological case of a 3-disk RAID5  
assemblage, the naive implementation is better than the more standard  
implementation.  I don't know if any real-world code is optimised for  
this corner case.  I would doubt it: software RAID5 is a performance  
disaster area at the best of times unless it can take advantage of  
intimate knowledge of the intent log in the filesystem (RAID-Z does  
this), and three-disk RAID5 assemblages are a performance disaster  
area irrespective of hardware in a failure scenario.  The rebuild  
will involve taking 50% of the IO bandwidth of the two remaining  
disks in order to saturate the new target; rebuild performance ---  
contrary to intuition --- improves with larger assemblages as you can  
saturate the replacement disk with less and less of the bandwidth of  
the surviving spindles.
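
The rebuild claim is simple arithmetic, assuming all spindles have
roughly equal sequential bandwidth: to keep the replacement disk
saturated, each of the N-1 survivors has to give up 1/(N-1) of its
bandwidth.  A quick sketch:

    def survivor_load(n_disks: int) -> float:
        """Fraction of each surviving disk's bandwidth needed to saturate
        the replacement during a RAID5 rebuild (equal-speed disks assumed)."""
        return 1.0 / (n_disks - 1)

    for n in (3, 5, 11):
        print(f"{n} disks: {survivor_load(n):.0%} of each survivor")
    # 3 disks: 50% of each survivor
    # 5 disks: 25% of each survivor
    # 11 disks: 10% of each survivor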

For a terabyte, 3x500GB SATA drives in a RAID5 group will be blown  
out of the water by 4x500GB SATA drives in a RAID 0+1 configuration  
in terms of performance and (especially) latency, particularly if it  
can do the Solaris trick of not faulting an entire RAID 0 sub-group  
if one spindle fails.  Rebuild still isn't pretty, mind you.

ian
