synchronous replication and replication cache

Bron Gondwana brong at fastmailteam.com
Sun Jun 16 08:35:11 EDT 2019


(this was originally written for a FastMail internal mailing list, but it has more general interest, so I'm posting it here too)

I figured I should write some of this up so we have a design we can all talk about, and so we know what's roughly involved in getting there.

*The goal*

We'd love to be able to run no-RAID IMAP servers, but that means that if we lose a drive, we lose data unless we have a real-time replica. There are basically two ways to do this:

1) drbd (aka: cross-machine RAID!)
2) synchronous replication at the Cyrus level

We expect that drbd's performance would be too poor (though this isn't tested), and it would mean tricky things for failover because our tooling isn't designed for it, so we're looking at synchronous replication in Cyrus.

*Making replication more efficient*

Here's the current network traffic for 3 common scenarios: C is the client (aka master) and S is the server (replica).

1) delivery of a new message to mailbox A:

C: GET MAILBOXES (A)
S: * MAILBOX (A) { header details }
S: OK
C: APPLY RESERVE (A) (GUID)
S: OK RESERVED ()
C: APPLY MESSAGE (GUID) {n+}
data
S: OK
C: APPLY MAILBOX A { header details } RECORD (%(UID GUID ...))
S: OK

2) move of a message from mailbox A to mailbox B

C: GET MAILBOXES (A B)
S: * MAILBOX (A) { header details }
S: * MAILBOX (B) { header details }
S: OK
C: APPLY RESERVE (A B) (GUID)
S: OK RESERVED (GUID)
C: APPLY MAILBOX B { header details } RECORD (%(UID GUID ...))
S: OK

3) touch a flag on a record (e.g. mark it seen as the owner - seen as non-owner is more complex and involves an APPLY META)

C: GET MAILBOXES (A)
S: * MAILBOX (A) { header details }
S: OK
C: APPLY MAILBOX A { header details } RECORD (%(UID GUID FLAGS...))
S: OK

In order to speed all these up, we have proposed a replication cache. This would know, for each mailbox, the last known state at the replica end (aka: the GET MAILBOXES response).

so:

1) becomes
C: APPLY MESSAGE (GUID) {n+}
data
S: OK
C: APPLY MAILBOX A { header details } RECORD (%(UID GUID ...))
S: OK

2) becomes
C: APPLY MAILBOX B { header details } RECORD (%(UID GUID ...))
S: OK

and 3) becomes
C: APPLY MAILBOX A { header details } RECORD (%(UID GUID FLAGS...))
S: OK

What happened to the "APPLY RESERVE" in (2), you ask? We do an on-the-fly reserve on the server by scanning through any new appends and looking up each record's GUID in the conversations DB. Yep. And the client knows the GUID is likely already there by checking createdmodseqs on existing records in ITS OWN conversations DB, because we can do that.
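
To make that client-side check concrete, here's a tiny sketch (standalone C; the function and its arguments are invented, this is not Cyrus code) of the reasoning: if a record carrying this GUID was created at or below a modseq we know the replica has already seen, the file is almost certainly there already.

#include <stdbool.h>
#include <stdint.h>
#include <stdio.h>

/* Toy illustration only: a GUID's file is probably already on the replica
 * if some existing record carrying it was created at or before the highest
 * modseq we know has been replicated. */
static bool guid_probably_on_replica(uint64_t createdmodseq,
                                     uint64_t replica_seen_modseq)
{
    return createdmodseq <= replica_seen_modseq;
}

int main(void)
{
    /* created at modseq 100, replica known to be at 250: safe to skip RESERVE */
    printf("skip reserve: %s\n",
           guid_probably_on_replica(100, 250) ? "yes" : "no");
    return 0;
}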

*Sanity checking*

If we're going to just do a direct APPLY command based on a cached state, then the existing race window between the GET MAILBOXES and the APPLY MAILBOX becomes much wider. This is fine if nothing else ever modifies your replica, but sometimes things change. I already have a branch to address this with the ifInState equivalent for the replication protocol: three new keys in the APPLY MAILBOX (SINCE_MODSEQ, SINCE_CRC and SINCE_CRC_ANNOT), plus a new error code. That branch is done and passes tests.
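
For illustration only (the final wire syntax may differ), scenario 3 driven straight from the cache would look something like this, with the replica refusing the APPLY if its current state no longer matches what we cached:

C: APPLY MAILBOX A { header details SINCE_MODSEQ ... SINCE_CRC ... SINCE_CRC_ANNOT ... } RECORD (%(UID GUID FLAGS...))
S: OK

or, if something else has changed the replica since we cached its state:

S: NO (the new error code)

at which point the client falls back to a full GET MAILBOXES and retries.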

*The cache*

This would probably be a per-channel twoskip file on tmpfs, storing one dlist per mailbox.

The place to hook writing this would be after the OK response, just before 'done:' in update_mailbox_once(). We'd also hook wiping it into any failure of update_mailbox_once() so that retries can fix things up.

The place to hook reading would probably just be sync_do_mailboxes(), where the loop copying the sync_folder_list data into the MAILBOXES dlist could check the cache for each name, transcribe any hit straight into the replica_folders list, and obviously not send a backend request at all if the entire list was satisfied from cache. We could also write the cache with the result of the MAILBOXES call, in case the mailbox is already up to date and update_mailbox_once() doesn't need to be called at all!
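
As a toy illustration of the shape of the thing (standalone C, not the real twoskip or dlist APIs; all names invented): the cache is just a map from mailbox name to the serialised last-known replica state, consulted before deciding whether a GET MAILBOXES round trip is needed, written after a successful update, and wiped on any failure.

/* Toy sketch of the replication cache idea (standalone C, not Cyrus code).
 * The real version would be a per-channel twoskip file on tmpfs storing a
 * dlist per mailbox; this just shows the lookup / store / invalidate flow
 * that sync_do_mailboxes() and update_mailbox_once() would hook into. */
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

struct cache_entry {
    char *mboxname;
    char *state;                 /* serialised "* MAILBOX (...)" response */
    struct cache_entry *next;
};

static struct cache_entry *cache;

static struct cache_entry *cache_find(const char *mboxname)
{
    for (struct cache_entry *e = cache; e; e = e->next)
        if (!strcmp(e->mboxname, mboxname)) return e;
    return NULL;
}

/* after a successful update_mailbox_once(): remember the replica state */
static void cache_store(const char *mboxname, const char *state)
{
    struct cache_entry *e = cache_find(mboxname);
    if (!e) {
        e = calloc(1, sizeof(*e));
        e->mboxname = strdup(mboxname);
        e->next = cache;
        cache = e;
    }
    free(e->state);
    e->state = strdup(state);
}

/* on any update_mailbox_once() failure: forget, so the retry does a full GET */
static void cache_invalidate(const char *mboxname)
{
    struct cache_entry *e = cache_find(mboxname);
    if (e) { free(e->state); e->state = NULL; }
}

int main(void)
{
    const char *name = "user.brong";

    /* sync_do_mailboxes() equivalent: check the cache before asking the replica */
    struct cache_entry *e = cache_find(name);
    if (!e || !e->state)
        puts("cache miss: send GET MAILBOXES to the replica");

    cache_store(name, "* MAILBOX (user.brong) { header details }");

    e = cache_find(name);
    if (e && e->state)
        printf("cache hit: skip GET MAILBOXES, assume: %s\n", e->state);

    cache_invalidate(name);   /* e.g. the APPLY failed; the retry will refetch */
    return 0;
}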

*The new reserve*

Skipping the current reserve call requires making the sync_server able to resolve GUIDs as needed from its local conversations.db and copy the files over itself. This should be viable using logic from the JMAP bulk-update, but it's going to need to be copied, because the JMAP code depends on lots of logic buried deep inside the jmap modules, so we can't just call it from the replication subsystem.

This saves roundtrips for the reserve call. It does depend on delayed expunge to some extent, but that's OK so long as it can recover from a "missing GUID on the replica" failure when its estimate of what was in other folders turns out to be wrong!
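
One plausible shape for the replica side (standalone C, not Cyrus code; the GUID-to-path lookup is a made-up stand-in for the conversations DB, the paths are invented, and whether it's a hard link or a copy is a detail):

/* Toy sketch of the replica-side on-the-fly reserve: given a GUID the
 * client is about to reference, try to satisfy it from a copy we already
 * have locally instead of making the client upload it.  A missing GUID is
 * reported so the client can fall back to a full APPLY MESSAGE. */
#include <errno.h>
#include <stdio.h>
#include <string.h>
#include <unistd.h>

/* hypothetical stand-in for a conversations.db GUID lookup */
static const char *local_path_for_guid(const char *guid)
{
    if (!strcmp(guid, "deadbeef")) return "/var/spool/imap/a/user/brong/1.";
    return NULL;
}

static int reserve_guid(const char *guid, const char *destpath)
{
    const char *src = local_path_for_guid(guid);
    if (!src) return -1;                   /* "missing GUID on the replica" */
    if (link(src, destpath) && errno != EEXIST) return -1;
    return 0;                              /* file now present, no upload needed */
}

int main(void)
{
    if (reserve_guid("deadbeef", "/var/spool/imap/b/user/brong/1.") != 0)
        puts("NO missing GUID: client must APPLY MESSAGE the data");
    else
        puts("OK reserved from local copy");
    return 0;
}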

*A local lock*

While any sync_client is replicating a mailbox, it will need to take a local lock on $SYNC-$channel-$mboxname. This is so that we don't have two processes trying to sync the same mailbox on the same channel at the same time! Rolling sync could try a non-blocking lock on that name and just punt it (aka: sync_log it to itself again) if the lock is busy.
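
A minimal sketch of what that could look like (standalone POSIX flock, not Cyrus's real name-lock code; the /tmp path is just for illustration): blocking callers wait, and the rolling sync_client uses the non-blocking variant and punts when it's busy.

#include <fcntl.h>
#include <stdio.h>
#include <sys/file.h>
#include <unistd.h>

/* returns an fd holding the lock, or -1 (lock busy or open failure) */
static int sync_local_lock(const char *channel, const char *mboxname,
                           int nonblock)
{
    char path[1024];
    snprintf(path, sizeof(path), "/tmp/$SYNC-%s-%s", channel, mboxname);
    int fd = open(path, O_CREAT | O_RDWR, 0600);
    if (fd < 0) return -1;
    if (flock(fd, LOCK_EX | (nonblock ? LOCK_NB : 0))) {
        close(fd);
        return -1;
    }
    return fd;
}

int main(void)
{
    int fd = sync_local_lock("buddy", "user.brong", /*nonblock*/ 1);
    if (fd < 0) {
        puts("busy: sync_log it to ourselves again and move on");
        return 0;
    }
    /* ... replicate the mailbox ... */
    close(fd);    /* releases the flock */
    return 0;
}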

*The realtime sync*

And now we get to the meat of this! We want to allow multiple parallel real-time sync calls, but probably not hundreds. I suggest a sync_client-style tool which runs as a daemon within Cyrus, probably listening on a UNIX socket and keeping a long-running backend connection open, so there may be a handful of syncd / sync_server pairings running at any one time, servicing random requests.

The config would be something like

sync_realtime_socket: [% confdir %]/socket/syncd

and in cyrus.conf

 syncd cmd="syncd [% CONF %] -n [% buddyslotchannel %]" listen="[% confdir %]/socket/syncd"

(aka: use the named channel to choose the backend for realtime sync)

The client code would keep EXACTLY the current sync_log support for all channels, but with the following very small modification: at the end of mailbox_unlock_index() or mailbox_close() or wherever seems the most sane, we add a synchronous callout. It might even go up a layer, though there are many more places to think about there. Anyway, this callout connects to the local syncd on the specified socket, asks it to replicate just the one mailbox name to the backend, and waits for the reply. If the reply is a failure, that gets syslogged, but the return to the rest of the code is still success either way.
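
As a sketch of that callout (plain POSIX sockets; the socket path, wire format and function name are all invented for illustration): connect to the configured socket, send the mailbox name, wait for the one-line verdict, syslog a failure, and carry on regardless.

/* Toy sketch of the synchronous callout to the local syncd (standalone C). */
#include <stdio.h>
#include <string.h>
#include <sys/socket.h>
#include <sys/un.h>
#include <syslog.h>
#include <unistd.h>

static void sync_realtime_callout(const char *socketpath, const char *mboxname)
{
    struct sockaddr_un addr = { .sun_family = AF_UNIX };
    strncpy(addr.sun_path, socketpath, sizeof(addr.sun_path) - 1);

    int fd = socket(AF_UNIX, SOCK_STREAM, 0);
    if (fd < 0 || connect(fd, (struct sockaddr *)&addr, sizeof(addr)) < 0) {
        syslog(LOG_ERR, "realtime sync: can't reach syncd for %s", mboxname);
        if (fd >= 0) close(fd);
        return;                        /* still success to the caller */
    }

    /* ask syncd to replicate just this one mailbox, then wait for the verdict */
    dprintf(fd, "MAILBOX %s\r\n", mboxname);
    char reply[256] = "";
    ssize_t n = read(fd, reply, sizeof(reply) - 1);
    if (n <= 0 || strncmp(reply, "OK", 2) != 0)
        syslog(LOG_ERR, "realtime sync failed for %s: %s", mboxname, reply);
    close(fd);
    /* either way the original protocol operation reports success;
     * the rolling sync_client will catch up later */
}

int main(void)
{
    /* path is illustrative; in practice it comes from sync_realtime_socket */
    sync_realtime_callout("/var/imap/socket/syncd", "user.brong");
    return 0;
}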

The end result of hooking at this point is that EVERY protocol which replies success after committing the changes will not reply back to the client until the data is replicated to the buddy machine (the buddy naming harks back to the short-lived career of Max at FastMail and his buddyslot idea!).

The rolling sync_client will still pick up the logged name, but the good news is that either the mailbox will still be locked (the local lock) and it will punt and retry later, or it will already be up to date and the local state will match the cached state, so there's no network IO or heavy calculation at all!

This complete lack of network IO for the follow-up case is why I think the cache and the remote conversations-based GUID lookup (plus the local validation that it's worth trying) are worth doing first, because they mean that synchronous replication stands a chance of being efficient enough to be sane!

I _think_ this design is good. Please let me know if anything about how it's supposed to work is unclear, or if you have a better idea.

In terms of the split of work: I envisage Ken or ellie writing the syncd stuff. I've already written the ifInState stuff, so I'll just ship that. The caching could be done by anybody, it's pretty easy, and the GUID lookup for master/replica could be me, or could be someone else if they're keen to have a look at it - I have a very clear idea of how that would look.

Cheers,

Bron.

--
 Bron Gondwana, CEO, FastMail Pty Ltd
 brong at fastmailteam.com
