High availability ... again

Tue Jun 22 19:02:29 EDT 2004

On Tue, 2004-06-22 at 15:54, Jure Pečar wrote:
> It _can_ break in a spectacular way ... easily.
>
Yep...

> Our batch of disks turned out to have some fubar firmware, which caused them
> to randomly fall out of the array under a specific load. That problem went
> undetected during the testing phase.
> So after you manage to get the array back together, you have a heavily
> corrupted filesystem. Journaling and fast recovery? Not this time. It turned
> out that a full fsck on a half a terabyte of reiserfs takes about 3 days to
> finish. (That was more than a year ago, since then reiserfsck has improved a
> lot).
>
The question I ask clients who are considering something like this is,
"What's the worst single failure you could have, and how long will it
take to recover from it?" The answer, of course, is a failure of the
SAN device itself, and for large ones the recovery, once the SAN is
fixed, can be measured in days.

For a mail system, a Murder with multiple Front End systems and a bunch
of Back End systems is more resilient. Yes, if a Back End fails, some
users will be without mail for a bit. That can be as short as a few
minutes if you don't count restoring INBOX contents from backup, since
you can easily create those users' INBOXes on a different Back End. And
even a full restore of a Back End can be kept down to a few hours by
limiting the storage on each Back End to, say, 60 GB or so.
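
To put rough numbers on that, here is a back-of-the-envelope sketch. The
10 MB/s sustained restore rate is purely an assumption for illustration
(measure your own backup system), and restore_hours is a made-up helper,
not part of Cyrus. It only counts time to copy the data back, not any
fsck or reconstruct afterwards.

```python
def restore_hours(size_gb: float, throughput_mb_s: float) -> float:
    """Hours needed to copy size_gb of mail back at a sustained rate."""
    if throughput_mb_s <= 0:
        raise ValueError("throughput must be positive")
    seconds = size_gb * 1000 / throughput_mb_s  # 1 GB taken as 1000 MB
    return seconds / 3600


# Assumed sustained restore rate; not a measured figure.
ASSUMED_THROUGHPUT_MB_S = 10.0

for label, size_gb in [("60 GB Back End", 60), ("500 GB SAN volume", 500)]:
    hours = restore_hours(size_gb, ASSUMED_THROUGHPUT_MB_S)
    print(f"{label}: {hours:.1f} h")
```

At that assumed rate the 60 GB Back End comes back in under two hours,
while a half-terabyte volume takes most of a day before you even start
checking the filesystem.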
-- 
The instructions said to use Windows 98 or better, so I installed
RedHat.

