Today's pop quiz: replication

Thu Jul 23 02:14:38 EDT 2015

Here's my understanding, and my understanding is limited and probably
incorrect, so I'd appreciate corrections from anyone who actually knows
this stuff.

> Do we have multiple sync_clients because a new one is spawned by a
> master for each change (and then the process finishes), or is there an
> actual pool of sync clients which handle each change and persist idle.

> I failed to note where the sync_server fits in. Is it a separate process
> that lives alongside the master imap server and sits there constantly
> checking the log files generated by sync_client to be handled?

I'm going to make up some terminology here because there's lots of
reuses of the word "master" everywhere:

- the "primary" server for a given user/mailbox is the server which that
user actually interacts with
- a "replica" server for a given user/mailbox is another server which
contains a copy of their data
- there may be multiple replicas for a given user/mailbox, but only one
primary
- a single cyrus instance may be the primary server for some users but a
replica server for other users
- there may even be multiple cyrus instances running on the same
physical machine (let's ignore this though)

A cyrus instance that is to be a replica server for some users needs to
run the sync_server program*.  This listens for replication attempts
from a primary server.

A cyrus instance that is to be a primary server for some users needs to
run the sync_client program.  This generally runs in "rolling" mode,
whereby it continuously processes the sync log** for changes to
mailboxes and sends them to the sync_server on the replica.  If a
primary is replicating to multiple replicas, it will generally run multiple
sync_clients, one for each.  It's also possible to chain replication
(like primary -> replica_a -> replica_b) but let's ignore that too.

An administrator can also run sync_client manually -- e.g. for
pre-populating a replica prior to starting rolling replication.

The sync_server program will periodically be shut down and restarted by
master (i.e. the process called master) -- I think there's some config
for specifying how long one should hang around for before restarting.  I
guess this protects a production service against possible memory leaks.
I'm not sure if this applies to sync_client too.

[* or have imapd configured to provide replication services, but let's
ignore this too]
[** there may be multiple of these depending on sync channel
configurations, but let's ignore this too]

> Is the file format of the sync log defined anywhere? I assume it
> correlates with a set of commands. (Not that this is important to a
> user: it may as well be opaque, but it made me wonder!)

I'm a bit confused about this myself.  Each time I go digging into the
code my understanding flips back the opposite way.

I think, either:

* the sync log contains all the information needed to reproduce what's
happened (e.g. if a message has arrived, the sync log will contain the
message itself); OR
* the sync log contains just enough to identify things that have changed
(e.g. if a message has arrived, the sync log contains a message id of
some sort), and the sync_client processing the log just uses the log to
discover which things to sync, but then uses the actual mailbox to
construct the changes to send to the replica.

Either way I haven't seen any documentation on the sync log format.  I
suspect it's either the raw sync protocol or some subset thereof?
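
To make the second possibility concrete, here's a toy Python sketch.  It
is not taken from the Cyrus source: the "MAILBOX <name>" line format, the
function names and the dict standing in for the mail store are all made
up for illustration.  The point is only that the log says which mailboxes
changed, and the data to send comes from the mailbox itself.

```python
# Toy model: the log only names what changed; the mailbox holds the data.
# The log format and all names here are hypothetical, not Cyrus's own.

def changed_mailboxes(log_lines):
    """Return mailbox names mentioned in the log, in order, without repeats."""
    seen = []
    for line in log_lines:
        kind, _, name = line.strip().partition(" ")
        if kind == "MAILBOX" and name and name not in seen:
            seen.append(name)
    return seen


def build_updates(log_lines, mailboxes):
    """Pair each changed mailbox with its current contents on the primary."""
    return [
        (name, list(mailboxes.get(name, [])))
        for name in changed_mailboxes(log_lines)
    ]


# A dict standing in for the primary's mail store.
primary = {"user.ellie": ["msg1", "msg2"], "user.nicola": ["msg1"]}
log = ["MAILBOX user.ellie", "MAILBOX user.nicola", "MAILBOX user.ellie"]
updates = build_updates(log, primary)
```

In this model a mailbox that is logged twice is only sent once, because
the update is built from the mailbox's current state rather than from
the log entries themselves.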

> I also have in my wonderful drawing a picture of a number of channels. I
> assume these are (part of) a config given to the sync_server so it knows
> where to broadcast all the changes defined in the log files to? Or have
> I misunderstood what a channel is?

All of the above assumes a single default channel.  You can configure as
many channels as you want.

Each channel has a sync log.  When actions occur on mailboxes (e.g. via
imapd, popd, lmtpd, sync_server, etc) the actions are logged to the sync
log for all the channels.

A single sync_client processes the sync log for a single channel.

If you wanted primary to replicate simultaneously to replica_a and
replica_b, you might set up a channel and corresponding sync_client for
each replica.  (The other way to do it would be with chained replication
i.e. where primary replicates to replica_a, and replica_a replicates to
replica_b -- but in this case if something went wrong with replica_a,
replica_b would get stale, which seems unideal.)
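
Here's a toy Python model of the one-channel-per-replica setup, again
just a sketch of my understanding.  The class and method names are
invented, and a real sync log is a file on disk rather than a list in
memory.  It shows each action being logged to every channel, and each
channel's log being consumed independently by its own sync_client.

```python
# Toy model of channels; class and method names are hypothetical.

class Primary:
    def __init__(self, channels):
        # One sync log per configured channel.
        self.logs = {name: [] for name in channels}

    def record(self, mailbox):
        # Every action is logged to the sync log of every channel.
        for log in self.logs.values():
            log.append(mailbox)

    def drain(self, channel):
        # What the sync_client for this channel would do: take the
        # pending entries for its own channel and leave the rest alone.
        pending, self.logs[channel] = self.logs[channel], []
        return pending


primary = Primary(["replica_a", "replica_b"])
primary.record("user.ellie")
primary.record("user.nicola")
sent_to_a = primary.drain("replica_a")   # replica_a's client runs first
primary.record("user.ellie")
sent_to_b = primary.drain("replica_b")   # replica_b's client runs later
```

If the client for replica_b falls behind, its log just keeps growing
until it catches up, and replica_a is unaffected.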

There's also a program called "squatter", which is used for updating
search indexes.  It monitors a channel by the same name, and updates the
search index for things that it sees change.

I've found the doc/install-replication.html document in the repo helpful
for understanding how this stuff fits together, though it's lacking the
deeper detail I actually need.

I started making a wiki page about this a few days ago but haven't
updated it since, and I've come to understand most of this since then.  So
maybe this email plus whatever corrections arrive from it would make
good content for it: https://git.cyrus.foundation/w/replication/

Hoping for confirmations/corrections,

ellie

On Thu, Jul 23, 2015, at 03:18 PM, Nicola Nye wrote:
> Hi Cyrus,
> 
> I'm currently working on some basic architecture diagrams that will
> complement the documentation to show the moving parts. Today's topic:
> replication.
> 
> Based on a delightfully drawn whiteboard session with Bron, I am left
> with a couple of queries:
> 
> Do we have multiple sync_clients because a new one is spawned by a
> master for each change (and then the process finishes), or is there an
> actual pool of sync clients which handle each change and persist idle.
> 
> Is the file format of the sync log defined anywhere? I assume it
> correlates with a set of commands. (Not that this is important to a
> user: it may as well be opaque, but it made me wonder!)
> 
> I failed to note where the sync_server fits in. Is it a separate process
> that lives alongside the master imap server and sits there constantly
> checking the log files generated by sync_client to be handled?
> 
> I also have in my wonderful drawing a picture of a number of channels. I
> assume these are (part of) a config given to the sync_server so it knows
> where to broadcast all the changes defined in the log files to? Or have
> I misunderstood what a channel is?
> 
>    Nicola