Replication documentation anywhere?

Bron Gondwana brong at fastmail.fm
Thu May 23 21:10:16 EDT 2013


On Fri, May 24, 2013, at 01:34 AM, Karl Pielorz wrote:
> We're running cyrus-imapd-2.4.17 on FreeBSD. I've been looking at the
> replication built into Cyrus, but can't find much (if any)
> documentation on it.
>
> e.g. The shipped 'install-replication.html' file ends at:
>
> " ... You can also run cyr_synclog(8) instead, which will insert the
> record into the rolling replication log.
>
> Failover "
>
> And that's it. Is there anywhere I can find more info on replication /
> failover / setup, a 'howto' - or anything?

I'm afraid there isn't much :(  Feel free to ask questions about
specific things you run into, and we can use those as a basis for
putting together more detailed documentation.

Failover is kind of messy at the moment, because how you want to manage
failover is so site-dependent.  Our process at FastMail looks like
this:

1) update the database to mark the server as "moving", so new
   connections get paused at the nginx/web server level, then wait 2
   seconds for the config to be updated.
2) send a signal to the 'master' process to shut the server down.
3) wait for up to 10 seconds while grepping the process list every
   second for ongoing processes related to the same instance (we use -C
   $imapd_conf because we run many instances of Cyrus per server)
4) if the processes aren't dead after 10 seconds, kill them
   individually.
5) if THAT fails, kill -9.  (yeah, I know - evil!)
6) check the $confdir/sync directory for log files, and run them with
   sync_client to ensure all replication is up to date.

If anything before this failed, we bring this master back up and report
that the failover didn't succeed.
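
Steps 2 to 5 amount to an escalating shutdown, which could be sketched
roughly as below.  This is only an illustration, not the script FastMail
runs: stop_instance, list_pids and the injected kill/sleep parameters
are hypothetical names, and SIGTERM is an assumption, since the steps
above don't say which signal is sent to the master.

```python
import os
import signal
import time


def _send(kill, pid, sig):
    """Send a signal, ignoring processes that have already exited."""
    try:
        kill(pid, sig)
    except ProcessLookupError:
        pass


def _wait_gone(list_pids, sleep, seconds):
    """Poll once a second; return True once no processes remain."""
    for _ in range(seconds):
        if not list_pids():
            return True
        sleep(1)
    return not list_pids()


def stop_instance(master_pid, list_pids, kill=os.kill, sleep=time.sleep,
                  grace_seconds=10):
    """Stop one Cyrus instance, escalating only as far as needed.

    list_pids is a caller-supplied function returning the PIDs still
    running for this instance (for example, by matching "-C $imapd_conf"
    in the process list).  Returns the stage that finished the job.
    """
    # Step 2: signal the master process (SIGTERM is an assumption).
    _send(kill, master_pid, signal.SIGTERM)

    # Step 3: wait up to grace_seconds, checking every second.
    if _wait_gone(list_pids, sleep, grace_seconds):
        return "clean"

    # Step 4: signal the remaining processes individually.
    for pid in list_pids():
        _send(kill, pid, signal.SIGTERM)
    if _wait_gone(list_pids, sleep, 1):
        return "terminated"

    # Step 5: last resort, kill -9.
    for pid in list_pids():
        _send(kill, pid, signal.SIGKILL)
    return "killed" if _wait_gone(list_pids, sleep, 1) else "failed"
```

Passing kill and sleep in as parameters is just there so the escalation
logic can be exercised without touching real processes.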

7) shut down the replica
8) restart this side with the replica configuration
9) restart the other side with the master configuration
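
Step 6 and the "bring this master back up" rule can be sketched along
these lines.  Treat it as a rough outline under stated assumptions:
matching sync log files by a "log" filename prefix is a guess, the
-r and -f options are taken from sync_client(8) and should be checked
against your version, and the four callables passed to failover() are
hypothetical hooks each site would have to write for itself.

```python
import os


def pending_sync_logs(confdir):
    """Return paths of leftover replication logs under confdir/sync.

    Matching on a "log" filename prefix is an assumption; adjust it to
    whatever your sync directory actually contains.
    """
    sync_dir = os.path.join(confdir, "sync")
    if not os.path.isdir(sync_dir):
        return []
    return sorted(
        os.path.join(sync_dir, name)
        for name in os.listdir(sync_dir)
        if name.startswith("log")
        and os.path.isfile(os.path.join(sync_dir, name))
    )


def sync_client_command(imapd_conf, log_path):
    """Build the command line to replay one log file (step 6)."""
    # -r (rolling mode) and -f (alternate log file) are taken from
    # sync_client(8); verify them against your installed version.
    return ["sync_client", "-C", imapd_conf, "-r", "-f", log_path]


def failover(stop_master, flush_logs, restart_master, swap_roles):
    """Decide between rolling back and swapping roles.

    All four arguments are hypothetical site-supplied callables.
    stop_master and flush_logs return True on success.
    """
    if not (stop_master() and flush_logs()):
        # Something before step 7 failed: bring the old master back up.
        restart_master()
        return False
    # Steps 7 to 9: stop the replica and restart both sides swapped.
    swap_roles()
    return True
```

A flush_logs hook would typically loop over pending_sync_logs() and run
each sync_client_command() result, returning False on the first
non-zero exit status.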

At the moment, we still move a master IP address to the instance which
is running as master, meaning clients can reconnect to the same IP
address after the failover.  This is on its way out - we're now at the
point where almost everything can read configuration from our
"fmstatus.json" file which is updated every second on every host, so
they know where the master is actually located.
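
To give a feel for what reading the master location from a status file
might look like, here is a small sketch.  The JSON layout is invented
purely for illustration (the real fmstatus.json format isn't described
here), and master_host is a hypothetical helper name.

```python
import json


def master_host(status_text, instance):
    """Return the host currently marked as master for an instance.

    Assumes a made-up layout such as:
      {"instances": {"store1": {"master": "hostA", "moving": false}}}
    Returns None while the instance is marked as moving, so callers
    can pause instead of connecting to a server mid-failover.
    """
    status = json.loads(status_text)
    entry = status["instances"][instance]
    if entry.get("moving"):
        return None
    return entry["master"]


# Example with the assumed layout.
example = '{"instances": {"store1": {"master": "hostA", "moving": false}}}'
print(master_host(example, "store1"))
```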

Obviously, a ton of this is really site-specific to us.

Soon (yes, soon!) we will be shifting to a full multi-master setup,
where failover is as simple as pointing clients to the other end of
the replication pair and killing off existing connections so they
reconnect (with some sync_log checking and force-running).  That should
shave quite a few seconds off the sync time, and mean that long-running
squatter jobs and other things don't get nuked at the same time.

But yeah, it's not quite a turn-key system :(

Bron.

-- 
  Bron Gondwana
  brong at fastmail.fm

