Making Replication Robust
brong at fastmail.fm
Wed Oct 3 22:36:57 EDT 2007
As I've mentioned on the mailing list, we have had to put
quite a lot of infrastructure around Cyrus to make
replication robust in all cases.
While the core replication protocol seems pretty stable now,
and with GUID stuff it will be easier to do integrity checks,
it's still very much not a turn-key solution. Every site
will have to put a lot of effort into understanding how
everything works and building their own systems for keeping
... while there is, in theory, some commercial advantage in
us knowing how to do stuff and not telling anyone else, we
think that's outweighed by having replication so good that
lots of people are using it and helping improve Cyrus ...
So I'd like to start a dialogue on the topic of making Cyrus
replication robust across failures with the following goals:
a) MUST never lose a message that's been accepted for
delivery except in the case of total drive failure.
b) MUST have a standard way to integrity check and
repair a replica-pair after a system crash.
c) MUST have a clean process to "soft-failover" to the
replica machine, making sure that all replication
events from the ex-master have been synchronised.
d) MUST have replication start/restart automatically when
the replica is available rather than requiring it be
online at master start time.
e) SHOULD be able to copy back messages which only exist
on the replica due to a hard-failover, handling UIDs
gracefully (more on this later), alternatively at least
MUST (to satisfy point 'a') notify the administrator
that the message has different GUIDs on the two copies
and something will need to be done about it (to satisfy
point 'd' this must be done without bailing out
replication for the remaining messages in the folder)
f) SHOULD keep replicating in the face of an error which
affects a single mailbox, keeping track of that mailbox
so that a sysadmin can fix the issue and then replicate
that mailbox by hand.
g) MAY have a method to replicate to two different replicas
concurrently (replay the same sync_log messages twice)
allowing one replica to be taken out of service and
a new one created while having no "gaps" in which there
is no second copy alive (we use rsync, rsync again,
stop replication, rsync a third time, start replication
to the new site - but it's messy and gappy)
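
To make goal (f) a bit more concrete, here is a minimal
sketch of the control flow I have in mind. It is not Cyrus
code: replicate_all and the sync_one callback are
hypothetical names standing in for whatever actually
replicates one mailbox. The only point is that a failure in
one mailbox gets recorded and the loop carries on.

```python
def replicate_all(mailboxes, sync_one):
    """Replicate each mailbox, isolating per-mailbox failures.

    `sync_one` is a hypothetical callable that replicates a single
    mailbox and raises an exception on failure. Returns a dict mapping
    each failed mailbox name to its error text, so a sysadmin can fix
    the problem and re-run those mailboxes by hand.
    """
    failed = {}
    for name in mailboxes:
        try:
            sync_one(name)
        except Exception as exc:
            # Record the failure and keep going with the other mailboxes.
            failed[name] = str(exc)
    return failed
```

The returned dict would be the "keeping track" part: log it,
or write it somewhere an administrator will see it.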
Does that sound like a reasonable set of goals to everyone?
Have I missed anything important? I'd like to see a
situation where everyone who wants to run replication can
turn it on and pretty much forget about it, trusting that
it will just_work[tm] (though reading the log files just in
COPYING BACK and UIDs:
The easy approach:
* give both messages a new UID after uid_last and delete the
old UIDs on both machines
The tricky approach:
* track the highest UID that has ever been presented to an
IMAP client. UIDs above that point are "soft" UIDs, and
you don't need to worry about changing them, so select
whichever end has actually had the UID seen and keep the
message there with the same UID, injecting only the other
message with a new UID.
* if both ends have been viewed via IMAP then your client is
probably already confused. Fall back to nuking that UID
and injecting both messages again with higher UIDs.
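
The decision logic above could be sketched roughly as
follows. This is only an illustration, not existing Cyrus
code: the function name and its arguments are hypothetical,
and it assumes each end records the highest UID it has ever
shown to an IMAP client (master_seen, replica_seen). The
case where neither end has shown the UID is not covered
above; the sketch arbitrarily keeps the master's copy there.

```python
def resolve_uid_conflict(uid, master_guid, replica_guid,
                         master_seen, replica_seen, uid_last):
    """Decide new UIDs for two different messages sharing one UID.

    master_seen / replica_seen: highest UID ever presented to an IMAP
    client on that end (an assumed piece of per-mailbox state).
    Returns ({guid: assigned_uid}, new_uid_last).
    """
    if master_guid == replica_guid:
        raise ValueError("same GUID on both ends: nothing to resolve")

    master_exposed = uid <= master_seen
    replica_exposed = uid <= replica_seen

    if master_exposed and replica_exposed:
        # Both ends showed this UID to a client: nuke it and
        # inject both messages with fresh, higher UIDs.
        return ({master_guid: uid_last + 1,
                 replica_guid: uid_last + 2}, uid_last + 2)

    if replica_exposed:
        keep, move = replica_guid, master_guid
    else:
        # Master exposed it, or neither did (assumption: prefer master).
        keep, move = master_guid, replica_guid

    # The seen message keeps its UID; the other gets a new one.
    return ({keep: uid, move: uid_last + 1}, uid_last + 1)
```

The "easy approach" is just the first branch applied
unconditionally.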