robustness ...

Sun Aug 3 12:38:55 EDT 2003

On Fri, 1 Aug 2003 09:59:30 -0400 (EDT)
Rob Siemborski <rjs3 at andrew.cmu.edu> wrote:

> Of course, once you're running on a corrupted file system, all bets are
> off.  Any number of things could be wrong: the binaries could have been
> damaged, files may have been reassembled incorrectly, or even be missing
> entirely.

Corrupted binaries are easy to detect and replace; corrupted data is not.
But let's limit the problem scope to the cases where Cyrus starts processing
a certain file and hits something unexpected. I reported one such case about
two months ago: looping on broken skiplist files.

> Cyrus does go to great lengths to defend itself against crashes during
> transactional operations (so that data isn't partially committed if the
> system crashes), but defending against general filesystem corruption is
> an entirely different animal.
> 
> Given the amount of memory mapping involved in Cyrus, asking it to
> successfully operate given a corrupt filesystem is sort of like telling
> any program to operate in the face of unreliable main memory.  Sure, if
> you're being very very careful you may be able to get some semblance of
> correct behavior, but you'll also take huge performance hit in the common
> (un-corrupted) case, and you still may not be able to survive in the
> corruption case.

I agree. However, let's just try to get the known problems sorted out before
even thinking of a general fit-all-crashes-and-corruptions solution.

> Cyrus does provide tools to help recover from filesystem crashes.  These
> include the database recovery utilities, the chk_cyrus utility (which we
> wrote after a severe filesystem crash of our own!), reconstruct, and so
> on.

chk_cyrus is a nice addition, indeed. I cooked up a bunch of shell scripts
to do something similar, parsing the raw mailboxes dump output and looking
for the corresponding directories in the fs.
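
The kind of check described above might look roughly like the sketch
below. It is not the original scripts (those were shell), and it rests on
two assumptions that may not match a given installation: that each dump
line starts with the mailbox name followed by a tab, and that a name like
user.jure.Sent maps to <spool_root>/user/jure/Sent. The function name
missing_mailbox_dirs and the spool_root parameter are made up for
illustration.

```python
import os


def missing_mailbox_dirs(dump_lines, spool_root):
    """Return mailbox names from the dump that have no directory on disk."""
    missing = []
    for line in dump_lines:
        line = line.strip()
        if not line:
            continue
        # Assumption: the mailbox name is the first tab-separated field.
        name = line.split("\t", 1)[0]
        # Assumption: dots in the name are the hierarchy separator, so
        # "user.jure.Sent" lives under <spool_root>/user/jure/Sent.
        path = os.path.join(spool_root, *name.split("."))
        if not os.path.isdir(path):
            missing.append(name)
    return missing
```

Feeding it the lines of a mailboxes dump and the spool directory gives a
list of mailboxes that the database knows about but the filesystem does
not. Hashed spool layouts or virtual domains would need a different
name-to-path mapping.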

> Should we strive to do better?  Probably, but when faced with a decision
> of whether to track down a problem in Cyrus during normal operation, or
> track down a problem in Cyrus in the face of filesystem corruption, I'm
> going to have to pick the former almost every time, since it has wider
> applicability and there already exist tools to return the Cyrus data store
> back to a consistent state.

Fully agree. Well, the reason I wrote this mail in the first place could
easily be labeled as a problem during normal operation :)

> In any case, if you are so worried about resilience in your software, why
> are you using alpha quality software on your production system?

Mostly for these two reasons:
a) features 
b) testing

Virtdomains are a much needed feature, so I got a CVS snapshot running on my
personal server as soon as Ken committed them into CVS. It worked well enough,
so I went on and got 2.2a in production.  It worked excellently until
hardware bit me, and only then did some problems come up, which Ken fixed
promptly (thanks again, Ken).

You see, I'm still young enough to afford a sleepless week or two every now
and then. And if I can help to spare someone else such an experience, I'd
gladly do so. Since I'm not a programmer myself, just a poor sysadmin, I do
what I can: whine about all the unusual problems I notice :)

-- 

Jure Pecar