Making Replication Robust

Rudy Gevaert Rudy.Gevaert at UGent.be
Mon Oct 8 04:03:31 EDT 2007


Hello,

I agree with Bron.  However, I do think some points are more important
than others.  I'll try to explain my point of view.

Note: we are running 2.3.7 and I'm going to upgrade when 2.3.10 is out.
We have replication in place, but daren't use it.  If I had a way to
check whether the replica is in sync, I would dare to do a failover.
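
A first-order "is the replica in sync?" check could compare a summary of
every mailbox on both machines.  The sketch below assumes those summaries
(UIDVALIDITY, highest UID, message count) have already been collected from
each server; MailboxSummary and out_of_sync are hypothetical names, not
Cyrus APIs.

```python
# Hypothetical sketch: decide whether a replica looks in sync by comparing
# per-mailbox summaries from both servers. How the summaries are collected is
# left out; MailboxSummary and out_of_sync are illustrative names, not Cyrus APIs.
from dataclasses import dataclass


@dataclass(frozen=True)
class MailboxSummary:
    uidvalidity: int  # UIDVALIDITY of the mailbox
    last_uid: int     # highest UID assigned so far
    messages: int     # number of messages currently in the mailbox


def out_of_sync(master: dict[str, MailboxSummary],
                replica: dict[str, MailboxSummary]) -> list[str]:
    """Return the sorted names of mailboxes that are missing or differ on either side."""
    return [name for name in sorted(master.keys() | replica.keys())
            if master.get(name) != replica.get(name)]


master = {
    "user.alice": MailboxSummary(1191830000, 120, 118),
    "user.bob": MailboxSummary(1191830001, 45, 45),
}
replica = {
    "user.alice": MailboxSummary(1191830000, 119, 117),  # one message behind
    "user.bob": MailboxSummary(1191830001, 45, 45),
}
print(out_of_sync(master, replica))  # ['user.alice']
```

Equal counters are necessary but not sufficient: two mailboxes can agree on
all three numbers and still hold different messages, so a stricter check
would also compare per-message GUIDs.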

For me, points a, e and f are the most important, but the others matter
too.

Bron Gondwana wrote:

> So I'd like to start a dialogue on the topic of making Cyrus
> replication robust across failures with the following goals:
> 
> a) MUST never lose a message that's been accepted for 
>    delivery except in the case of total drive failure.
> 
> b) MUST have a standard way to integrity check and 
>    repair a replica-pair after a system crash.

Do you mean that if the replica crashes it should be able to catch up again?

> 
> c) MUST have a clean process to "soft-failover" to the 
>    replica machine, making sure that all replication
>    events from the ex-master have been synchronised.

Indeed, this is nice, but it would still need a lot of site-specific
tools.  For example, I believe Fastmail runs the master and replica in
the same subnet.  We don't, so soft failover isn't that easy for us.

For us it's more important that all mail that hasn't been delivered yet
stays queued at the MTA (which is not on the same machine as Cyrus),
while all delivered mail has been replicated.  We then still need to
update DNS or the /etc/hosts file.
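
For the /etc/hosts variant, the switch itself is small.  A minimal sketch,
assuming the service is reached through a single name (imap.example.org and
the addresses are placeholders from the documentation ranges) and that
clients are only repointed once the master is stopped and replication has
drained.  Writing the result back atomically (temporary file, then rename)
is left out.

```python
# Hypothetical sketch: after failover, point the IMAP service name at the
# replica by rewriting its /etc/hosts entry. The host name and addresses
# below are placeholders (documentation ranges), not a real setup.
def repoint_hosts(hosts_text: str, service_name: str, new_ip: str) -> str:
    """Return hosts_text with every entry that lists service_name mapped to new_ip."""
    out = []
    for line in hosts_text.splitlines():
        fields = line.split("#", 1)[0].split()  # ignore comments when matching
        if len(fields) >= 2 and service_name in fields[1:]:
            # Keep all names on the line; replace only the address.
            # Note: a trailing comment on a rewritten line is dropped.
            line = "\t".join([new_ip] + fields[1:])
        out.append(line)
    return "\n".join(out) + "\n"


hosts = "127.0.0.1\tlocalhost\n192.0.2.10\timap.example.org mail\n"
print(repoint_hosts(hosts, "imap.example.org", "198.51.100.20"), end="")
```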

> d) MUST have replication start/restart automatically when
>    the replica is available rather than requiring it be 
>    online at master start time.

This would be great, provided there are also tools available for
automatic failover, recovery, and so on.
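
Point d essentially means the master keeps retrying instead of requiring the
replica at start time.  A sketch using capped exponential backoff; connect()
is a hypothetical stand-in for whatever opens the replication connection,
and the default delays are arbitrary.

```python
# Hypothetical sketch: keep trying to reach the replica with capped exponential
# backoff instead of requiring it to be up when the master starts. connect()
# stands in for whatever opens the replication connection.
import time
from typing import Callable


def backoff_delays(base: float, cap: float, count: int) -> list[float]:
    """Delays between attempts: base, 2*base, 4*base, ..., never above cap."""
    return [min(cap, base * 2 ** i) for i in range(count)]


def wait_for_replica(connect: Callable[[], bool], base: float = 1.0,
                     cap: float = 300.0, max_attempts: int = 20,
                     sleep: Callable[[float], None] = time.sleep) -> bool:
    """Call connect() until it succeeds or max_attempts is used up."""
    delays = backoff_delays(base, cap, max_attempts - 1)
    for attempt in range(max_attempts):
        if connect():
            return True
        if attempt < len(delays):
            sleep(delays[attempt])  # no sleep after the final failed attempt
    return False


outcomes = iter([False, False, True])  # replica comes up on the third try
slept: list[float] = []
print(wait_for_replica(lambda: next(outcomes), sleep=slept.append), slept)
# True [1.0, 2.0]
```

A real daemon would probably retry indefinitely rather than give up after
max_attempts; the injectable sleep only exists so the behaviour can be
checked without waiting.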

> e) SHOULD be able to copy back messages which only exist
>    on the replica due to a hard-failover, handling UIDs 
>    gracefully (more on this later), alternatively at least
>    MUST (to satisfy point 'a') notify the administrator
>    that the message has different GUIDs on the two copies
>    and something will need to be done about it (to satisfy
>    point 'd' this must be done without bailing out 
>    replication for the remaining messages in the folder)
> 
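
Point e boils down to a three-way classification per folder.  A sketch,
assuming each side can produce a map from UID to message GUID (the GUID
strings below are placeholders); FolderDiff and diff_folder are hypothetical
names.

```python
# Hypothetical sketch: after a hard failover, compare one folder's UID -> GUID
# maps from the old master and the replica. Every kind of difference is
# collected and reported instead of aborting replication for the folder.
from typing import NamedTuple


class FolderDiff(NamedTuple):
    only_on_replica: list[int]  # arrived after failover: candidates to copy back under new UIDs
    only_on_master: list[int]   # never reached the replica
    guid_conflicts: list[int]   # same UID, different message: needs the administrator


def diff_folder(master: dict[int, str], replica: dict[int, str]) -> FolderDiff:
    shared = master.keys() & replica.keys()
    return FolderDiff(
        only_on_replica=sorted(replica.keys() - master.keys()),
        only_on_master=sorted(master.keys() - replica.keys()),
        guid_conflicts=sorted(uid for uid in shared if master[uid] != replica[uid]),
    )


master = {1: "guid-a", 2: "guid-b", 3: "guid-c"}
replica = {1: "guid-a", 2: "guid-b", 3: "guid-x", 4: "guid-d"}
print(diff_folder(master, replica))
# FolderDiff(only_on_replica=[4], only_on_master=[], guid_conflicts=[3])
```
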
> f) SHOULD keep replicating in the face of an error which
>    affects a single mailbox, keeping track of that mailbox
>    so that a sysadmin can fix the issue and then replicate
>    that mailbox by hand.
> 
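
Point f amounts to isolating failures per mailbox.  A minimal sketch, where
replicate_one is a hypothetical stand-in for the code that syncs a single
mailbox, and the returned dict is the list a sysadmin would work through by
hand.

```python
# Hypothetical sketch: replicate each mailbox independently so that one failure
# does not stop the rest; failed mailboxes are remembered for a later manual
# retry. replicate_one stands in for whatever actually syncs a mailbox.
from typing import Callable


def replicate_all(mailboxes: list[str],
                  replicate_one: Callable[[str], None]) -> dict[str, str]:
    """Try every mailbox; return {mailbox: error message} for the ones that failed."""
    failed: dict[str, str] = {}
    for name in mailboxes:
        try:
            replicate_one(name)
        except Exception as exc:  # one bad mailbox must not stop the run
            failed[name] = str(exc)
    return failed


def fake_sync(name: str) -> None:
    if name == "user.bob":
        raise RuntimeError("UID mismatch")


print(replicate_all(["user.alice", "user.bob", "user.carol"], fake_sync))
# {'user.bob': 'UID mismatch'}
```
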
> g) MAY have a method to replicate to two different replicas
>    concurrently (replay the same sync_log messages twice)
>    allowing one replica to be taken out of service and
>    a new one created while having no "gaps" in which there
>    is no second copy alive (we use rsync, rsync again,
>    stop replication, rsync a third time, start replication
>    to the new site - but it's messy and gappy)

This is again a good idea, and would be very useful.  But it depends on
what you would be doing with the second replica.  If it were possible
to take the second replica out of service, make it consistent, back it
up, and then bring it up to date again, it would be a neat way to have
a consistent backup.
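
The backup idea above works if each replica has its own position in the
replication log.  A sketch of that bookkeeping, assuming events are opaque
strings (placeholders for sync_log lines); FanOutLog and its methods are
hypothetical names.

```python
# Hypothetical sketch: replay one log of sync events to several replicas, each
# with its own read position. A paused replica (e.g. one being made consistent
# and backed up) simply falls behind and catches up on resume; the others keep
# receiving events, so there is never a gap with no second copy.
class FanOutLog:
    def __init__(self, replicas: list[str]) -> None:
        self.events: list[str] = []                # opaque log lines
        self.position = {r: 0 for r in replicas}   # next event to send
        self.paused: set[str] = set()
        self.sent: dict[str, list[str]] = {r: [] for r in replicas}

    def append(self, event: str) -> None:
        self.events.append(event)
        self.flush()

    def pause(self, replica: str) -> None:
        self.paused.add(replica)

    def resume(self, replica: str) -> None:
        self.paused.discard(replica)
        self.flush()

    def flush(self) -> None:
        for replica in list(self.position):
            if replica in self.paused:
                continue
            # Recording into self.sent stands in for actually replaying the events.
            self.sent[replica].extend(self.events[self.position[replica]:])
            self.position[replica] = len(self.events)


log = FanOutLog(["replica-a", "replica-b"])
log.append("user.alice")
log.pause("replica-b")          # take replica-b out to back it up
log.append("user.bob")
print(log.sent["replica-b"])    # ['user.alice']
log.resume("replica-b")         # bring it up to date again
print(log.sent["replica-b"])    # ['user.alice', 'user.bob']
```

Events can only be discarded once every replica's position has passed them;
that trimming, and persisting the positions across restarts, is left out.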

Kind regards,

Rudy


-- 
-- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- --
Rudy Gevaert          Rudy.Gevaert at UGent.be          tel:+32 9 264 4734
Directie ICT, afd. Infrastructuur ICT Department, Infrastructure office
Groep Systemen                    Systems group
Universiteit Gent                 Ghent University
Krijgslaan 281, gebouw S9, 9000 Gent, Belgie               www.UGent.be
-- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- --


More information about the Cyrus-devel mailing list