Making Replication Robust

Fri Oct 19 05:22:37 EDT 2007

On Tue, 9 Oct 2007, David Carter wrote:

> I've never faced a split brain situation which involved more than two or 
> three messages (the outstanding log on an old master system).

I suppose that it was predictable that a week after writing this I faced my 
first serious split brain (3000 messages "lost" after a hardware fault).

My solution was to write a little script which, given a list of mailboxes 
(the sync_log file on the old master), scanned the cyrus.index files 
looking for messages with an internaldate later than a given cutoff.

These messages were then transferred across to the new master to be 
reinjected. Replication from the new master to the old master then 
resolves the split brain situation (master wins in case of ambiguity), 
which is the way it was designed to work. From memory Fastmail did 
something similar when they faced a split brain situation.
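
The scan described above can be sketched roughly as follows, in Python. 
This is an illustration only, not the script that was actually used. Two 
things are assumptions: that the relevant sync_log lines have the form 
"MAILBOX <name>", and the read_index callable, a hypothetical helper 
that yields (uid, internaldate) records for a mailbox. Parsing the real 
cyrus.index binary format is version specific and is deliberately left out.

```python
from typing import Callable, Iterable, Iterator, NamedTuple


class Record(NamedTuple):
    uid: int
    internaldate: int  # seconds since the epoch


def mailboxes_from_log(lines: Iterable[str]) -> list[str]:
    """Collect unique mailbox names, keeping first-seen order.

    Assumption: relevant lines look like 'MAILBOX <name>'.
    Anything else (other actions, blank lines) is skipped.
    """
    seen: dict[str, None] = {}
    for line in lines:
        parts = line.split(None, 1)
        if len(parts) == 2 and parts[0] == "MAILBOX":
            seen.setdefault(parts[1].strip(), None)
    return list(seen)


def messages_after(
    mailboxes: Iterable[str],
    read_index: Callable[[str], Iterable[Record]],  # hypothetical parser
    cutoff: int,
) -> Iterator[tuple[str, int]]:
    """Yield (mailbox, uid) for messages newer than the cutoff."""
    for mbox in mailboxes:
        for rec in read_index(mbox):
            if rec.internaldate > cutoff:
                yield mbox, rec.uid
```

The (mailbox, uid) pairs it yields are the candidates to copy across to 
the new master. The cutoff would be chosen as a time safely before 
replication stopped, so a few already-replicated messages may be listed too.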

The procedure works well, but I think that it would be useful to have some 
tools in the Cyrus distribution rather than having to knock up one-off 
tools. I'm happy to work on this if we can beat out some requirements.

I'm not keen about trying to fix split brain situations within the 
replication protocol itself: at the moment sync_client doesn't try to mess 
with the data on the master, which is a property I like.

There are also certain situations that replication just can't fix.

Envision a hypothetical replication engine which can cope with GUID 
mismatches, adding messages to both master and replica. Then imagine:

1) Replication dies because of hardware or software fault.

2) Master continues to limp along for a bit before dying. Split brain.

3) Message delivered to a non-user mailbox (Sieve or + addressing).

4) Master dies entirely: failover to replica with missing messages

5) User logs in and deletes the mailbox in question on the new master,
    unaware that they are actually missing a message from that mailbox.

6) Sysadmin starts replication from new master to old master. They hope
    that this will automatically resolve all conflicts without losing
    anything because we promise that replication is magic.

7) Replication engine deletes the entire mailbox (including the message
    that we want to recover), as it doesn't exist on the new master.
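
The sequence above can be modelled with a toy sketch in Python. Nothing 
here is Cyrus code: mailboxes are plain dicts of message id sets, and 
merge_replicate is a hypothetical stand-in for the imagined engine, which 
merges messages on both sides but still treats the master's list of 
mailboxes as authoritative.

```python
Store = dict[str, set[str]]  # mailbox name -> message ids (toy model)


def merge_replicate(master: Store, replica: Store) -> None:
    """Hypothetical engine: merge messages, master wins on mailboxes."""
    # Mailbox existence: anything the master does not know about goes.
    for mbox in list(replica):
        if mbox not in master:
            del replica[mbox]
    # Message contents: take the union of both sides.
    for mbox in list(master):
        merged = master[mbox] | replica.get(mbox, set())
        master[mbox] = set(merged)
        replica[mbox] = set(merged)


def run_scenario() -> tuple[Store, Store]:
    old_master: Store = {"INBOX": {"m1"}, "lists": {"m2"}}
    new_master: Store = {"INBOX": {"m1"}, "lists": {"m2"}}  # replica, in sync
    # Steps 1-3: replication is dead, old master still accepts mail.
    old_master["INBOX"].add("m3")
    old_master["lists"].add("m4")  # the Sieve / + addressing delivery
    # Steps 4-5: failover, then the user deletes the mailbox.
    del new_master["lists"]
    # Final steps: replicate from new master to old master.
    merge_replicate(new_master, old_master)
    return new_master, old_master
```

In this model m3, delivered to INBOX, is recovered by the merge, while m4 
disappears along with the deleted mailbox on both machines.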

/* ================================================================== */

Just for everyone's amusement: what happened to us on Tuesday evening
=====================================================================

This isn't good:

   Oct 16 20:56:21 cyrus-24 kernel:
     Uhhuh. NMI received for unknown reason 21 on CPU 0.
   Oct 16 20:56:21 cyrus-24 kernel:
     Dazed and confused, but trying to continue
   Oct 16 20:56:21 cyrus-24 kernel:
     Do you have a strange power saving mode enabled?

But it is nowhere near as bad as:

   Oct 16 20:56:31
     cyrus-24 sync_client[11985]: Unknown system flag: \snswered
                                                        ^ Oops

You know that a machine is unhappy when sync_client -u on a given
account randomly:

   1) Works without problems
   2) segfaults
   3) Attempts to reserve every message on the account on the server,
      presumably as a prelude to a mass UPLOAD.

I infer that the machine has a motherboard fault which caused kernel 
memory corruption in some small lump of buffer cache. I am amazed that the 
filesystems passed fsck when I attached the disks to a new machine. The 
original machine refused to reboot cleanly because umount segfaulted. It 
also failed two DIMMs on each POST until the machine ran out of memory.

-- 
David Carter                             Email: David.Carter at ucs.cam.ac.uk
University Computing Service,            Phone: (01223) 334502
New Museums Site, Pembroke Street,       Fax:   (01223) 334679
Cambridge UK. CB2 3QH.