Making Replication Robust

Tue Oct 9 19:50:14 EDT 2007

>> c) MUST have a clean process to "soft-failover" to the
>>   replica machine, making sure that all replication
>>   events from the ex-master have been synchronised.
>
> Something more than sync_shutdown_file plus automatic retries on
> recent work files?

I think the problem at the moment is that the process you really want is:

1. Stop new imap/pop/lmtp/sieve/etc connections
2. Finish and close existing connections cleanly but as quickly as possible
3. Finish running any sync log files
4. Fully shut down

There's currently no clean way to do this. You have to SIGTERM master,
which hard-kills it and all its children, then manually run
sync_client -f on any remaining log files.

We've got a patch which makes master handle SIGQUIT much more nicely.
There appears to be some existing infrastructure that was designed to
handle a cleaner shutdown: look at all the places in the code that call
signals_poll(). It looks like the idea was that you could send child
processes SIGQUIT, and they would finish their current action, get back to
their "main loop", check whether they'd been sent a QUIT, and then exit
cleanly. Unfortunately, if you sent SIGQUIT to master, it would just SIGTERM
all children, not SIGQUIT them.

This patch attempts to fix this, so that sending SIGQUIT to master sends
SIGQUIT to all children, and master then waits for them all to exit cleanly.
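
To make the child-side idea concrete, here's a minimal Python sketch of the
"finish the current action, then check for a pending QUIT at the main loop"
pattern. This is not Cyrus code: Worker, request_quit and install are
hypothetical names, and the flag check just stands in for the kind of poll
point that signals_poll() appears to provide.

```python
import signal


class Worker:
    """Toy service loop that only exits between units of work."""

    def __init__(self):
        self.quit_requested = False

    def request_quit(self, signum=None, frame=None):
        # Signal handler: only record the request, never exit from here.
        self.quit_requested = True

    def run(self, tasks, handle):
        done = []
        for task in tasks:
            handle(task)             # the current action always completes
            done.append(task)
            if self.quit_requested:  # poll point at the "main loop"
                break
        return done


def install(worker):
    # Hypothetical wiring: route SIGQUIT to the flag-setting handler.
    signal.signal(signal.SIGQUIT, worker.request_quit)
```

The point of the design is that the handler does nothing but set a flag, so
a QUIT can never interrupt a half-finished action.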

http://cyrus.brong.fastmail.fm/#cyrus-clean-shutdown-2.3.8.diff

This solves steps 1 and 2 above, though it doesn't deal with the case of a
"crazy child" that doesn't respond to SIGQUIT. Our init script sends
SIGQUIT, and if the master process is still there after 10 seconds, it
sends SIGTERM to force an exit. In general we find that everything exits
within a couple of seconds of the SIGQUIT.
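
The init script logic above amounts to roughly the following. This is a
Python sketch rather than our actual script: soft_stop and is_alive are
made-up names, and the kill/alive parameters exist only so the logic can be
exercised without a real process.

```python
import os
import signal
import time


def is_alive(pid):
    """Return True if a process with this pid still exists."""
    try:
        os.kill(pid, 0)  # signal 0 checks existence without signalling
    except ProcessLookupError:
        return False
    except PermissionError:
        return True      # exists, but is owned by another user
    return True


def soft_stop(pid, grace=10.0, poll=0.2, kill=os.kill, alive=is_alive):
    """Send SIGQUIT, wait up to `grace` seconds, then fall back to SIGTERM.

    Returns "quit" if the process went away by itself, "term" if it had
    to be forced.
    """
    kill(pid, signal.SIGQUIT)
    deadline = time.monotonic() + grace
    while time.monotonic() < deadline:
        if not alive(pid):
            return "quit"
        time.sleep(poll)
    if not alive(pid):
        return "quit"
    kill(pid, signal.SIGTERM)  # the "crazy child" case: force an exit
    return "term"
```

In real use you'd call something like soft_stop(master_pid), with the pid
read from wherever your installation keeps master's pid file.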

To do step 3, I think the best approach might be a new cyrus.conf section, a
SHUTDOWN section, which lists commands to run on shutdown. After all
children have accepted a SIGQUIT and exited, master would run the SHUTDOWN
section, which would run a final sync_client -r on the sync dir to
finish up any remaining log files.

With all of that in place, you could send a SIGQUIT to the cyrus
master process on a master server, and it would cleanly shut down all
children and ensure that all replication events have been correctly played
to the replica. You could then do the same to the replica, reverse
their roles, and bring them both back up, and you've got a safe soft
failover.

> At the moment we replace messages (on the "master knows best" principle).
>
> It would be easy enough to leave message in place and generate warnings 
> instead, although this would generate a lot of warnings, one for every bad 
> message every time that a given mailbox is updated.

That's what this patch does.

http://cyrus.brong.fastmail.fm/#cyrus-warnmismatcheduuids-2.3.8.diff

In theory, with clean soft failovers, you should NEVER have UIDs with
mismatched UUIDs. After a hard failover you obviously might, but in those
cases just replacing the message means we're almost certainly overwriting a
delivered message and losing it, which is bad. At the very least, I think
making it an option to either overwrite or log is a sane idea.

> My nightmare scenario is a replication engine which carries on running in 
> the face of mboxlist corruption on the master: you could lose a lot of 
> mailboxes on the replica that way.

That would be bad, though hard to detect and stop. I guess that's what 
backups are for...

> It would be easy enough to generate multiple replication log files.
>
> MySQL keeps a single transaction log for multiple replicas, but that file 
> contains quite a lot of information about each transaction. In contrast 
> the Cyrus sync log is just a list of objects we need to pay attention to: 
> the files have much less state, particularly without duplicates.

The other option is that, rather than using the "rotate log, play it, delete
it" system, you generate one log file and keep track of "offsets" within the
file that tell you where each replica is up to. That's what MySQL does, so you
can have multiple replicas: each replica is "playing" off the same
log files, they're just up to different offsets at any point in time.
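
As a rough illustration of the offsets idea, here's a small Python sketch.
Nothing here is Cyrus or MySQL code: SharedSyncLog, the JSON offsets file
and the event strings are all assumptions made up for the example.

```python
import json
import os
import tempfile


class SharedSyncLog:
    """One append-only log, plus a byte offset per replica."""

    def __init__(self, log_path, offsets_path):
        self.log_path = log_path
        self.offsets_path = offsets_path
        open(self.log_path, "ab").close()  # make sure the log exists

    def append(self, event):
        with open(self.log_path, "ab") as f:
            f.write(event.encode("utf-8") + b"\n")

    def _load_offsets(self):
        try:
            with open(self.offsets_path) as f:
                return json.load(f)
        except FileNotFoundError:
            return {}

    def pending(self, replica):
        """Return (events, new_offset) for events not yet played."""
        start = self._load_offsets().get(replica, 0)
        with open(self.log_path, "rb") as f:
            f.seek(start)
            data = f.read()
        end = data.rfind(b"\n") + 1  # ignore a partially written last line
        return data[:end].decode("utf-8").splitlines(), start + end

    def commit(self, replica, offset):
        """Record that `replica` has played everything before `offset`."""
        offsets = self._load_offsets()
        offsets[replica] = offset
        tmp = self.offsets_path + ".tmp"
        with open(tmp, "w") as f:
            json.dump(offsets, f)
        os.replace(tmp, self.offsets_path)  # atomic swap of the offsets file


def demo():
    d = tempfile.mkdtemp()
    log = SharedSyncLog(os.path.join(d, "sync.log"),
                        os.path.join(d, "offsets.json"))
    log.append("MAILBOX user.rob")
    log.append("MAILBOX user.bron")
    events, offset = log.pending("replica1")
    log.commit("replica1", offset)  # replica1 has now played two events
    log.append("USER rob")
    return log.pending("replica1")[0], log.pending("replica2")[0]
```

The catch with this layout is that the log can only be trimmed up to the
smallest offset across all replicas, so one stalled replica keeps the file
growing.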

Rob