Implement Cyrus IMAPD in High Load Environment

Vincent Fox vbfox at ucdavis.edu
Tue Sep 29 12:45:59 EDT 2009


Simon Matter wrote:
> What I'm really wondering, what filesystem disasters have others seen? How
> many times was it fsck only, how many times was it really broken. I'm not
> talking about laptop and desktop users but about production systems in a
> production environment with production class hardware and operating
> systems.
>
> Would be really interesting to get some of the good and bad stories even
> if not directly related to Cyrus-IMAP.
>
>   
So we ran UFS (with logging) on multiple UW-IMAP backends
before moving to Cyrus.  I can tell you at LEAST half a dozen
times we would have some hardware or software crash that
would leave someone looking at this:

fsck /var/mail Y/N?

The "correct" answer is Y, but then you have hours and hours
of downtime, so sometimes you say N and cross your fingers.
On one system someone hit N and we left it that way for weeks,
not knowing whether it was going to develop cancer at any moment,
until we could migrate users off it.  It seemed to be working OK,
but we had no way to verify that while "hot" and no downtime
available in the intervening period, so we crossed our fingers.....

Since I started working here at UC Davis in 2005 I've seen
double-disk failures in a RAID-5 set THREE TIMES, when I had
never seen one in the previous 15 years.

I've seen double-controller RAID arrays go into total lockup when
one controller failed and the code that was supposed to switch
smoothly to the other controller didn't work.  What's going on inside
that black-box array controller? Who knows.  The original developer
is long gone and all the replacements who upgraded it over the
years don't really know how it all works.  It's often astonishing to
me that Linux admins will use hardware controllers and even EMC
SANs for quite large datasets and blindly trust the black box.

RAID6?  I am a member of BAARF.  RAID5/6 are not to be trusted.
See http://www.baarf.com/

So yes, I'm the paranoid soul who, if you hand me RAID6 LUNs from
an EMC SAN device, will ZFS mirror them together for additional
safety on top of that, since I know from experience I cannot trust the
black boxes to do what they claim.   Really, I'm not trying to
beat anyone over the head with ZFS particularly, I'm just stating
that currently it's the only filesystem I can use in production for
large datasets that I actually TRUST.  I really like being able to run
"zpool scrub" once in a while, for example after I replace a disk,
even during peak usage hours, and KNOW it's all correct.   When Linux
has something similar I'll use it in a second.  Until then I prefer
Linux for app servers and Solaris for back-end storage.
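
To illustrate why a checksummed mirror on top of black-box LUNs adds
something the array alone does not, here is a toy Python model of the
scrub idea.  It is only a sketch under loud assumptions: ToyMirror, its
dictionaries standing in for two LUNs, and the use of SHA-256 are all
made up for illustration and are not how ZFS is actually implemented.
The point it shows is that a checksum kept apart from the data tells
you WHICH copy is wrong, so the bad side can be rewritten from the
good one while the data stays online.

```python
import hashlib


def checksum(data: bytes) -> str:
    # Assumption: SHA-256 stands in for whatever checksum the filesystem uses.
    return hashlib.sha256(data).hexdigest()


class ToyMirror:
    """Hypothetical two-way mirror with checksums stored apart from the data."""

    def __init__(self):
        self.sides = [{}, {}]  # two pretend LUNs: block id -> bytes
        self.sums = {}         # block id -> checksum recorded at write time

    def write(self, block_id, data):
        # Write the same data to both sides and remember its checksum.
        for side in self.sides:
            side[block_id] = data
        self.sums[block_id] = checksum(data)

    def scrub(self):
        """Verify every copy of every block; repair bad copies from a good one.

        Returns (repaired, unrecoverable) as two lists of block ids.
        """
        repaired, unrecoverable = [], []
        for block_id, expected in self.sums.items():
            good = [s for s in self.sides if checksum(s[block_id]) == expected]
            if not good:
                # Both copies fail verification: nothing to repair from.
                unrecoverable.append(block_id)
            elif len(good) < len(self.sides):
                # At least one copy is bad: overwrite all sides with a good copy.
                for side in self.sides:
                    side[block_id] = good[0][block_id]
                repaired.append(block_id)
        return repaired, unrecoverable


pool = ToyMirror()
pool.write("mbox-1", b"hello")
pool.write("mbox-2", b"world")
pool.sides[0]["mbox-1"] = b"hellp"  # simulate silent corruption on one LUN
print(pool.scrub())
```

A plain mirror without checksums could notice that the two copies
differ, but it would have no way to tell which one to believe.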

YMMV.



