Making Replication Robust

Mon Oct 8 04:21:13 EDT 2007

On Mon, Oct 08, 2007 at 10:03:31AM +0200, Rudy Gevaert wrote:
> For me points a, e and f are most important, but the others are also 
> important.
>
> Bron Gondwana wrote:
>
>> So I'd like to start a dialogue on the topic of making Cyrus
>> replication robust across failures with the following goals:
>> a) MUST never lose a message that's been accepted for delivery,
>>    except in the case of total drive failure.
>> b) MUST have a standard way to integrity check and repair a
>>    replica-pair after a system crash.
>
> Do you mean that if the replica crashes it should be able to catch up 
> again?

No, I mean the case where a master fails while replication isn't
100% up to date, you decide to bring the replica online, and you
later switch back to the original master.  That must not overwrite
messages.

>> c) MUST have a clean process to "soft-failover" to the replica
>>    machine, making sure that all replication events from the
>>    ex-master have been synchronised.
>
> Indeed this is nice, but it would still need a lot of site-specific
> tools.  E.g. I know (I think I do) that Fastmail runs master/replica
> in the same subnet.  We don't.  So soft-failover isn't that easy.

True - it's easy for us because we have different configs that bind
to the same IP address and use ARP broadcasts, so nothing else needs
to change.

The bit I care more about is that you can shut a master down cleanly
and guarantee that all replication events finish sending as part of
the shutdown process.  We already do this with an external init script
(written in Perl) but would prefer that it's a general option available
to everyone and supported upstream.
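
A minimal sketch of that "drain before stopping" idea, in Python rather
than the Perl of the init script mentioned above.  Everything here is an
assumption for illustration: pending() stands in for however a site counts
replication events not yet sent, and stop() for whatever finally stops
replication.  It is not Cyrus's shutdown code.

```python
import time
from typing import Callable


def drain_then_stop(
    pending: Callable[[], int],       # hypothetical: count of unsent replication events
    stop: Callable[[], None],         # hypothetical: stops replication once drained
    timeout: float = 30.0,
    poll: float = 0.1,
    clock: Callable[[], float] = time.monotonic,
    sleep: Callable[[float], None] = time.sleep,
) -> bool:
    """Wait until no replication events are pending, then call stop().

    Returns True if the queue drained and stop() was called, False if the
    timeout expired first (in which case stop() is NOT called, so that an
    administrator can decide what to do with the unsent events).
    """
    deadline = clock() + timeout
    while pending() > 0:
        if clock() >= deadline:
            return False
        sleep(poll)
    stop()
    return True
```

Refusing to stop on timeout is a design choice of this sketch; a real
script might prefer to stop anyway and log what was left behind.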

> For us it's more important that all mail that isn't delivered gets queued 
> at the MTA (it's not on the same machine as cyrus).  All delivered mails 
> are replicated. We then still need to update the DNS or /etc/hosts file.

We have that too, of course.  It's more the messages that are delivered
but not yet replicated when we call shutdown that matter (see also APPEND).

>> d) MUST have replication start/restart automatically when
>>    the replica is available rather than requiring it be
>>    online at master start time.
>
> This would be great if there are some tools available for doing automatic 
> failover, recovery, ...

Yeah, we get this with the '-o' option to sync_client, which means it
just doesn't start replicating.  monitorsync.pl then runs every
10 minutes from cron, checks that there are running sync_client
processes for each master, and attempts to start them if the replica
is marked as "up" in the database.  It also deals with old log files
left lying around.
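
The check described above can be sketched roughly as follows.  This is
Python, not the actual monitorsync.pl, and the names (Master, is_running,
start_client) are hypothetical stand-ins: how a site detects a running
sync_client and how it reads the "up" flag from its database are left to
the caller.  Handling of old log files is not shown.

```python
from dataclasses import dataclass
from typing import Callable, Iterable, List


@dataclass
class Master:
    name: str          # hypothetical identifier for one master instance
    replica_up: bool   # stand-in for the "up" flag kept in the database


def check_replication(
    masters: Iterable[Master],
    is_running: Callable[[str], bool],     # hypothetical: is a sync_client running?
    start_client: Callable[[str], None],   # hypothetical: launch a sync_client
) -> List[str]:
    """Start a replication client for each master that lacks one.

    Returns the names of the masters for which a start was attempted.
    """
    started = []
    for m in masters:
        if is_running(m.name):
            continue   # already replicating, nothing to do
        if not m.replica_up:
            continue   # replica marked down; leave it for a later run
        start_client(m.name)
        started.append(m.name)
    return started
```

Run from cron, a function like this is idempotent: a master that is
already replicating, or whose replica is down, is simply skipped until the
next pass.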

>> e) SHOULD be able to copy back messages which only exist
>>    on the replica due to a hard-failover, handling UIDs
>>    gracefully (more on this later), alternatively at least
>>    MUST (to satisfy point 'a') notify the administrator
>>    that the message has different GUIDs on the two copies
>>    and something will need to be done about it (to satisfy
>>    point 'd' this must be done without bailing out
>>    replication for the remaining messages in the folder)
>> f) SHOULD keep replicating in the face of an error which
>>    affects a single mailbox, keeping track of that mailbox
>>    so that a sysadmin can fix the issue and then replicate
>>    that mailbox by hand.
>> g) MAY have a method to replicate to two different replicas
>>    concurrently (replay the same sync_log messages twice)
>>    allowing one replica to be taken out of service and
>>    a new one created while having no "gaps" in which there
>>    is no second copy alive (we use rsync, rsync again,
>>    stop replication, rsync a third time, start replication
>>    to the new site - but it's messy and gappy)
>
> This is again a good idea, and would be very usable.  But it depends on
> what you will be doing with the second replica.  If it were possible to
> take out the second replica, make it consistent and back it up, and then
> bring it up to date, it would be a neat way to have consistent backups.

Yeah, that's a point.  That would be very nice :)  We're generally
doing it because we want to take a drive unit out of service, or
even a whole machine, and we'd rather not have a gap where there's
only one live copy of data.

I've been thinking evil thoughts about writing a sync_server protocol
compatibility library and poking cyrus through it.  We already run a
sync_server on our masters as well because we use it for user moves:

*) create custom config file and mailboxes.db snippet
*) sync user to new store using custom config
*) lock the user against lmtp/pop/imap at the proxy level and
   kill off all current connections (scans $confdir/proc)
*) sync user again
*) run "checkreplication.pl" in paranoid mode to make sure
   everything actually matches
*) update database field for store name and broadcast a cache
   invalidation packet to all the apps that cache user data
   (again, one subnet makes broadcast cache management reasonable)
*) re-enable delivery and logins.

The "critical path" bit generally takes about 15 seconds, and it
doesn't matter how long the initial sync takes.
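
The ordering of the user-move steps above can be sketched as below.  This
is an illustration in Python, not the real tooling: every callable is a
hypothetical stand-in for a site-specific step (verify plays the role of
the "checkreplication.pl" paranoid check), and the first step, creating
the custom config and mailboxes.db snippet, is assumed to have happened
already.

```python
from typing import Callable


def move_user(
    user: str,
    sync: Callable[[str], None],          # hypothetical: sync user to the new store
    lock: Callable[[str], None],          # hypothetical: block lmtp/pop/imap, kill connections
    verify: Callable[[str], bool],        # hypothetical: paranoid comparison of both copies
    switch_store: Callable[[str], None],  # hypothetical: update store name, invalidate caches
    unlock: Callable[[str], None],        # hypothetical: re-enable delivery and logins
) -> bool:
    """Move one user to a new store; return True if the switch happened."""
    sync(user)          # initial sync while the user is still live; may be slow
    lock(user)          # the "critical path" starts here
    try:
        sync(user)      # pick up whatever changed since the first sync
        if not verify(user):
            return False    # copies differ: leave the user on the old store
        switch_store(user)
        return True
    finally:
        unlock(user)    # assumption of this sketch: always unlock, even on failure
```

Only the second sync, the verification and the switch happen while the
user is locked out, which is why the duration of the first sync does not
affect the downtime.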

But yeah - you could do fun things if you knew how to talk sync
protocol!

Bron.