High-Availability IMAP server

Wed Sep 28 04:02:43 EDT 2005

On Tue, 27 Sep 2005, Patrick Radtke wrote:

> We made great use of it Monday morning when one of our backend machines 
> failed. Switching to the replica was quite simple and relatively fast 
> (maybe 5 to 10 minutes from deciding to switch to the replica before 
> replica was fully in action)

We use the replication engine all the time to move users back and forth 
between systems so that we can patch and upgrade operating systems and/or 
Cyrus without any user-visible downtime.

There have also been a number of forced failovers because of hardware 
problems, specifically some dodgy RAID controller firmware that we were 
running for a few months until we got a fix. It's worked very nicely for 
us, but it is important that people don't just trust the software blindly. 
We maintain and constantly regenerate a database of MD5 checksums for all 
of the messages and cache entries across the cluster. It's been a long time 
now since this has turned up errors, but I still check it religiously.
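
To illustrate the kind of check I mean, here is a minimal sketch in Python. 
It is not our actual tooling. It assumes that each system's mail spool is 
reachable as a directory tree with one file per message or cache entry, and 
the function names (md5_of_file, checksum_tree, compare_checksums) are 
made up for the example.

```python
import hashlib
from pathlib import Path


def md5_of_file(path, chunk_size=1 << 20):
    """Return the hex MD5 digest of one file, read in chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def checksum_tree(root):
    """Map each file's path (relative to root) to its MD5 digest.

    Assumption: the spool is an ordinary directory tree on local disk.
    """
    root = Path(root)
    return {
        path.relative_to(root).as_posix(): md5_of_file(path)
        for path in sorted(root.rglob("*"))
        if path.is_file()
    }


def compare_checksums(master, replica):
    """Report files missing from, extra on, or different on the replica."""
    master_names = set(master)
    replica_names = set(replica)
    return {
        "missing": sorted(master_names - replica_names),
        "extra": sorted(replica_names - master_names),
        "mismatched": sorted(
            name
            for name in master_names & replica_names
            if master[name] != replica[name]
        ),
    }
```

Running checksum_tree() on each side and passing the two results to 
compare_checksums() gives three lists, all of which should be empty on a 
healthy pair. On a live system messages arrive while the scan runs, so a 
real check has to tolerate or recheck recent differences.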

> I consider the code to stable, though on occasion strange things happen

Which is not really my definition of stable :).

> (e.g. when user renames user.INBOX to user.saved.INBOX) and you have to 
> restart the replication process (no downtime to Cyrus involved).

This one is odd behaviour on the part of mboxlist_renamemailbox(): it does 
special magic when running as a non-admin user. There's actually a more 
serious underlying bug in Cyrus here which I believe Ken is working on.

Again, we don't see this one: partly because our replication engine doesn't 
run as an admin user (I'm afraid you don't have that option), partly because 
of overenthusiastic hacking on my part in other parts of the Cyrus code.

-- 
David Carter                             Email: David.Carter at ucs.cam.ac.uk
University Computing Service,            Phone: (01223) 334502
New Museums Site, Pembroke Street,       Fax:   (01223) 334679
Cambridge UK. CB2 3QH.