High availability ... again
pegasus at nerv.eu.org
Tue Jun 22 16:54:45 EDT 2004
On Tue, 22 Jun 2004 18:52:09 +0200
Tore Anderson <tore at linpro.no> wrote:
> There's a third option, which is the one I prefer the most: shared
> block device. Connect your two servers to a SAN, and store all of
> Cyrus' data on one LUN, which both servers have access to. Then, set
> your cluster software to automatically mount the file system before
> starting Cyrus. You'll need STONITH or IO-fencing to protect against
> file system corruption in a split-brain scenario, but other than that
> it's a fairly simple solution that's unlikely to break in spectacular
> ways. You could share a SCSI cabinet between the servers instead of
> using a SAN, though I can't say I recommend it - too failure-prone.
It _can_ break in a spectacular way ... easily.
Our batch of disks turned out to have some fubar firmware, which caused them
to randomly fall out of the array under a specific load. That problem went
undetected during the testing phase.
So after you manage to get the array back together, you have a heavily
corrupted filesystem. Journaling and fast recovery? Not this time. It turns
out that a full fsck on half a terabyte of reiserfs takes about 3 days to
finish. (That was more than a year ago; since then reiserfsck has improved.)
Two things to be learned here ...
* use different disks
* the filesystem, too, is a single point of failure
Since then the only type of HA system I trust is a Google-like setup ...
designed to die, designed to corrupt data, designed to do other ugly things,
but the app running on it is designed to handle that. Hint: Cyrus has room
for improvement here ...
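
To make "the app is designed to handle that" a bit more concrete, here is a
minimal sketch of the general idea, not anything Cyrus actually does. It
assumes each object is written to several independent stores together with a
checksum, and that a read skips any copy that is missing or fails
verification. The names put, get and the dict-based "replicas" are
hypothetical stand-ins for real storage nodes.

```python
import hashlib


def put(replicas, key, data):
    """Write data plus its SHA-256 digest to every replica (hypothetical stores)."""
    digest = hashlib.sha256(data).hexdigest()
    for store in replicas:
        store[key] = (digest, data)


def get(replicas, key):
    """Return the first copy whose checksum still matches; skip bad or missing ones."""
    for store in replicas:
        entry = store.get(key)
        if entry is None:
            continue  # this replica lost the object
        digest, data = entry
        if hashlib.sha256(data).hexdigest() == digest:
            return data  # verified copy
        # checksum mismatch: treat this copy as corrupt and try the next replica
    raise IOError("no intact copy of %r on any replica" % (key,))


if __name__ == "__main__":
    replicas = [{}, {}, {}]          # stand-ins for three independent machines
    put(replicas, "msg1", b"hello")
    digest, _ = replicas[0]["msg1"]
    replicas[0]["msg1"] = (digest, b"hellX")  # simulate silent corruption
    del replicas[1]["msg1"]                    # simulate a lost copy
    print(get(replicas, "msg1"))
```

The point of the sketch is only that corruption on one store is detected and
routed around by the application, instead of relying on a single filesystem
to stay consistent.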
Cyrus Home Page: http://asg.web.cmu.edu/cyrus
Cyrus Wiki/FAQ: http://cyruswiki.andrew.cmu.edu
List Archives/Info: http://asg.web.cmu.edu/cyrus/mailing-list.html