<!DOCTYPE html><html><head><title></title><style type="text/css">
p.MsoNormal,p.MsoNoSpacing{margin:0}</style></head><body><div style="font-family:Arial;">(this was originally written for a FastMail internal mailing list, but it has more general interest, so I'm posting it here too)<br></div><div style="font-family:Arial;"><br></div><div style="font-family:Arial;">I figured I should write some of this up so we have a design we can all talk about, and so we know what's roughly involved in getting there.<br></div><div style="font-family:Arial;"><br></div><div style="font-family:Arial;"><b>The goal</b><br></div><div style="font-family:Arial;"><br></div><div style="font-family:Arial;">We'd love to be able to run no-RAID IMAP servers, but that means that if we lose a drive, we lose data unless we have a real-time replica. There are basically two ways to do this:<br></div><div style="font-family:Arial;"><br></div><div style="font-family:Arial;">1) drbd (aka: cross-machine RAID!)<br></div><div style="font-family:Arial;">2) synchronous replication at the Cyrus level<br></div><div style="font-family:Arial;"><br></div><div style="font-family:Arial;">We expect drbd performance would be too poor (though we haven't tested this), and it would make failover tricky because our tooling isn't designed for it, so we're looking at synchronous replication in Cyrus.<br></div><div style="font-family:Arial;"><br></div><div style="font-family:Arial;"><b>Making replication more efficient</b><br></div><div style="font-family:Arial;"><br></div><div style="font-family:Arial;">Here's the current network traffic for three common scenarios: C is the client (aka master) and S is the server (replica).<br></div><div style="font-family:Arial;"><br></div><div style="font-family:Arial;">1) delivery of a new message to mailbox A:<br></div><div style="font-family:Arial;"><br></div><div style="font-family:Arial;">C: GET MAILBOXES (A)<br></div><div style="font-family:Arial;">S: * MAILBOX (A) { header details }<br></div><div 
style="font-family:Arial;">S: OK<br></div><div style="font-family:Arial;">C: APPLY RESERVE (A) (GUID)<br></div><div style="font-family:Arial;">S: OK RESERVED ()<br></div><div style="font-family:Arial;">C: APPLY MESSAGE (GUID) {n+}<br></div><div style="font-family:Arial;">data<br></div><div style="font-family:Arial;">S: OK<br></div><div style="font-family:Arial;">C: APPLY MAILBOX A { header details } RECORD (%(UID GUID ...))<br></div><div style="font-family:Arial;">S: OK<br></div><div style="font-family:Arial;"><br></div><div style="font-family:Arial;">2) move of a message from mailbox A to mailbox B<br></div><div style="font-family:Arial;"><br></div><div style="font-family:Arial;">C: GET MAILBOXES (A B)<br></div><div style="font-family:Arial;">S: * MAILBOX (A) { header details }<br></div><div style="font-family:Arial;">S: * MAILBOX (B) { header details }<br></div><div style="font-family:Arial;">S: OK<br></div><div style="font-family:Arial;">C: APPLY RESERVE (A B) (GUID)<br></div><div style="font-family:Arial;">S: OK RESERVED (GUID)<br></div><div style="font-family:Arial;">C: APPLY MAILBOX B { header details } RECORD (%(UID GUID ...))<br></div><div style="font-family:Arial;">S: OK<br></div><div style="font-family:Arial;"><div style="font-family:Arial;"><br></div><div style="font-family:Arial;">3) touch a flag on a record (e.g. 
mark it seen as the owner - seen as non-owner is more complex and involves an APPLY META)<br></div><div style="font-family:Arial;"><br></div><div style="font-family:Arial;">C: GET MAILBOXES (A)<br></div></div><div style="font-family:Arial;">S: * MAILBOX (A) { header details }<br></div><div style="font-family:Arial;">S: OK<br></div><div style="font-family:Arial;">C: APPLY MAILBOX A { header details } RECORD (%(UID GUID FLAGS...))<br></div><div style="font-family:Arial;">S: OK<br></div><div style="font-family:Arial;"><div style="font-family:Arial;"><div style="font-family:Arial;"><br></div><div style="font-family:Arial;">In order to speed all these up, we have proposed a replication cache. This would know, for each mailbox, the last known state at the replica end (aka: the GET MAILBOXES response).<br></div><div style="font-family:Arial;"><br></div><div style="font-family:Arial;">so:<br></div><div style="font-family:Arial;"><br></div><div style="font-family:Arial;">1) becomes<br></div></div></div><div style="font-family:Arial;">C: APPLY MESSAGE (GUID) {n+}<br></div><div style="font-family:Arial;">data<br></div><div style="font-family:Arial;">S: OK<br></div><div style="font-family:Arial;">C: APPLY MAILBOX A { header details } RECORD (%(UID GUID ...))<br></div><div style="font-family:Arial;">S: OK<br></div><div style="font-family:Arial;"><div style="font-family:Arial;"><br></div><div style="font-family:Arial;">2) becomes<br></div></div><div style="font-family:Arial;">C: APPLY MAILBOX B { header details } RECORD (%(UID GUID ...))<br></div><div style="font-family:Arial;">S: OK<br></div><div style="font-family:Arial;"><div style="font-family:Arial;"><div style="font-family:Arial;"><br></div><div style="font-family:Arial;">and 3) becomes<br></div><div style="font-family:Arial;">C: APPLY MAILBOX A { header details } RECORD (%(UID GUID FLAGS...))<br></div></div></div><div style="font-family:Arial;">S: OK<br></div><div style="font-family:Arial;"><div 
style="font-family:Arial;"><div style="font-family:Arial;"><div style="font-family:Arial;"><br></div><div style="font-family:Arial;">What happened to the "APPLY RESERVE" in (2) you ask? We do an on-the-fly reserve on the server by scanning through any new appends and looking up the record using the conversations DB. And the client knows the message is likely to be there by checking createdmodseqs on existing records in <i>its own</i> conversations DB.<br></div><div style="font-family:Arial;"><br></div><div style="font-family:Arial;"><b>Sanity checking</b><br></div><div style="font-family:Arial;"><br></div><div style="font-family:Arial;">If we're going to just be doing a direct APPLY command with a cached state, then the existing race condition between the GET MAILBOXES and the APPLY MAILBOX becomes much wider. This is fine if nothing else ever modifies your replica, but sometimes things change. I already have a branch to address this with the ifInState equivalent for the replication protocol, which is three new keys in the APPLY MAILBOX: SINCE_MODSEQ, SINCE_CRC and SINCE_CRC_ANNOT, plus a new error code. This branch is already done and passes tests.<br></div><div style="font-family:Arial;"><br></div><div style="font-family:Arial;"><b>The cache</b><br></div><div style="font-family:Arial;"><br></div><div style="font-family:Arial;">This would probably be a per-channel twoskip file on tmpfs, into which a dlist per mailbox would be stored.<br></div><div style="font-family:Arial;"><br></div><div style="font-family:Arial;">The place to hook writing this would be after the OK response just before 'done:' in update_mailbox_once(). 
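</div><div style="font-family:Arial;"><br></div><div style="font-family:Arial;">To make the read/write/invalidate discipline concrete, here's an illustrative sketch in Python (the real thing would be C against a twoskip file on tmpfs; the ReplicaCache class and the callback names are invented for the example):</div><div style="font-family:Arial;"><br></div>

```python
# Hypothetical model of the proposed per-channel replication cache.
# The real implementation would store a dlist per mailbox in a twoskip
# file; this only demonstrates the read / write / invalidate hooks.

class ReplicaCache:
    def __init__(self):
        self._state = {}  # mailbox name -> last known replica state

    def lookup(self, mboxname):
        """Return the cached replica state, or None (forces a GET MAILBOXES)."""
        return self._state.get(mboxname)

    def store(self, mboxname, state):
        """Hook point: after the OK response in update_mailbox_once()."""
        self._state[mboxname] = state

    def invalidate(self, mboxname):
        """Hook point: any failure of update_mailbox_once()."""
        self._state.pop(mboxname, None)


def sync_mailbox(cache, mboxname, fetch_remote_state, apply_mailbox):
    """Use cached replica state if present; otherwise ask the replica first."""
    state = cache.lookup(mboxname)
    if state is None:
        state = fetch_remote_state(mboxname)        # GET MAILBOXES round trip
    try:
        new_state = apply_mailbox(mboxname, state)  # APPLY MAILBOX
        cache.store(mboxname, new_state)
    except Exception:
        cache.invalidate(mboxname)                  # retries fall back to a full fetch
        raise
```

<div style="font-family:Arial;"><br></div><div style="font-family:Arial;">The key property is that a second sync of an unchanged mailbox needs no GET MAILBOXES round trip at all, and any failure wipes the entry so the retry path is safe.</div><div style="font-family:Arial;"><br></div><div style="font-family:Arial;">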
We would also hook wiping it into any failure of update_mailbox_once() so that the retries could fix things up.<br></div><div style="font-family:Arial;"><br></div><div style="font-family:Arial;">The place to hook reading would probably just be sync_do_mailboxes, where the loop copying the sync_folder_list data into the MAILBOXES dlist could check the cache for each name and just transcribe that into the replica_folders list, and obviously not send a backend request if the entire list was satisfied from cache. We could also write the cache with the result of the MAILBOXES call, just in case it's already up to date and hence update_mailbox_once() doesn't need to be called!<br></div><div style="font-family:Arial;"><br></div><div style="font-family:Arial;"><b>The new reserve</b><br></div><div style="font-family:Arial;"><br></div><div style="font-family:Arial;">Skipping the current reserve call requires making the sync_server able to use its local conversations DB to resolve GUIDs as needed and copy the files over. This should be viable using logic from the JMAP bulk-update, but it's going to need to be copied, because the JMAP code requires lots of logic which is deep inside the jmap modules, so we can't just call it from the replication subsystem.<br></div><div style="font-family:Arial;"><br></div><div style="font-family:Arial;">This saves roundtrips for the reserve call. It does depend on delayed expunge to some degree, but that's OK so long as it can recover from a "missing GUID on the replica" failure because its estimate of what was in other folders was wrong!<br></div><div style="font-family:Arial;"><br></div><div style="font-family:Arial;"><b>A local lock</b><br></div><div style="font-family:Arial;"><br></div><div style="font-family:Arial;">While any sync_client is replicating a mailbox, it will need to take a local lock on $SYNC-$channel-$mboxname! 
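</div><div style="font-family:Arial;"><br></div><div style="font-family:Arial;">As a hedged illustration of the locking discipline (Python with flock standing in for the actual Cyrus C lock code; the lock directory is invented), a non-blocking attempt either gets the lock or tells the caller to punt:</div><div style="font-family:Arial;"><br></div>

```python
# Hypothetical sketch of the per-channel, per-mailbox local lock.
# Rolling sync tries non-blocking and punts (re-sync_logs the name)
# if the synchronous path currently holds the lock.

import fcntl
import os

def try_mailbox_lock(channel, mboxname, lockdir="/run/cyrus/sync-locks"):
    """Try to take $SYNC-$channel-$mboxname non-blockingly.
    Returns an open fd on success, or None if another holder has it."""
    os.makedirs(lockdir, exist_ok=True)
    path = os.path.join(lockdir, "$SYNC-%s-%s" % (channel, mboxname))
    fd = os.open(path, os.O_CREAT | os.O_RDWR, 0o600)
    try:
        fcntl.flock(fd, fcntl.LOCK_EX | fcntl.LOCK_NB)
        return fd
    except BlockingIOError:
        os.close(fd)
        return None  # busy: caller should sync_log the name to itself again
```

<div style="font-family:Arial;"><br></div><div style="font-family:Arial;">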
This is so that we don't have two processes trying to sync the same mailbox on the same channel at the same time! Rolling sync could try a non-blocking lock on that name and just punt it (aka: sync_log it to itself again) if the lock is busy.<br></div><div style="font-family:Arial;"><br></div><div style="font-family:Arial;"><b>The realtime sync</b><br></div><div style="font-family:Arial;"><br></div><div style="font-family:Arial;">And now we get to the meat of this! We want to allow multiple parallel real-time sync calls, but probably not hundreds. I suggest that we use a sync_client-style tool which runs as a daemon within Cyrus, listening probably on a UNIX socket and keeping a long-running backend connection open, so there may be a handful of syncd / sync_server pairings running at any one time, servicing random requests.<br></div><div style="font-family:Arial;"><br></div><div style="font-family:Arial;">The config would be something like<br></div><div style="font-family:Arial;"><br></div><div style="font-family:Arial;">sync_realtime_socket: [% confdir %]/socket/syncd<br></div><div style="font-family:Arial;"><br></div><div style="font-family:Arial;">and in cyrus.conf<br></div><div style="font-family:Arial;"><br></div></div></div></div><div style="font-family:Arial;"> syncd cmd="syncd [% CONF %] -n [% buddyslotchannel %]" listen="[% confdir %]/socket/syncd"<br></div><div style="font-family:Arial;"><br></div><div style="font-family:Arial;">(aka: use the named channel to choose the backend for realtime sync)<br></div><div style="font-family:Arial;"><div style="font-family:Arial;"><div style="font-family:Arial;"><div style="font-family:Arial;"><br></div><div style="font-family:Arial;">The client code would keep EXACTLY the current sync_log support to all channels, but have the following very small modification added: at the end of mailbox_unlock_index or mailbox_close or wherever seems the most sane, we add a synchronous callout. 
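</div><div style="font-family:Arial;"><br></div><div style="font-family:Arial;">A rough sketch of what that callout might look like (Python for illustration only; the one-line request format and the socket path are assumptions, not the real syncd wire protocol):</div><div style="font-family:Arial;"><br></div>

```python
# Hypothetical synchronous replication callout.  Connects to the local
# syncd UNIX socket, asks it to replicate one mailbox, and waits for the
# reply.  A failure is syslogged, but the caller still reports success.

import socket
import syslog

def sync_callout(mboxname, sockpath="/var/run/cyrus/socket/syncd",
                 timeout=30.0):
    """Block until the buddy replica has applied mboxname (or give up)."""
    try:
        with socket.socket(socket.AF_UNIX, socket.SOCK_STREAM) as s:
            s.settimeout(timeout)
            s.connect(sockpath)
            # Invented request format: one line naming the mailbox.
            s.sendall(("MAILBOX %s\r\n" % mboxname).encode())
            reply = s.makefile().readline().strip()
            if not reply.startswith("OK"):
                raise RuntimeError(reply)
    except Exception as e:
        # Replication failure must never fail the original IMAP command.
        syslog.syslog(syslog.LOG_ERR,
                      "sync callout failed for %s: %s" % (mboxname, e))
    return True  # the rest of the code still sees success
```

<div style="font-family:Arial;"><br></div><div style="font-family:Arial;">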
It might even go up a layer, though there are many more places to think about. Anyway, this callout connects to the local syncd on its socket and asks it to replicate just the one mailbox name to the backend, and waits for the reply. If the reply is a failure, then that gets syslogged, but the return to the rest of the code is still success.<br></div><div style="font-family:Arial;"><br></div><div style="font-family:Arial;">The end result of hooking at this point is that EVERY protocol which replies success after committing the changes will not reply back to the client until the data is replicated to the buddy machine (the buddy naming hearkens back to the short-lived career of Max at FastMail and his buddyslot idea!)<br></div><div style="font-family:Arial;"><br></div><div style="font-family:Arial;">The rolling sync_client will still pick up the logged name, but the good news is, it will either still be locked (local lock) and hence try later, or it will already be up to date, and the local state will match the cached state, so there's no network IO or heavy calculation at all!<br></div><div style="font-family:Arial;"><br></div><div style="font-family:Arial;">This total lack of network IO for the follow-up case is why I think the cache and remote conversations-based GUID lookup (and local validation that it's worth trying) is worth doing first, because that means that synchronous replication stands a chance of being efficient enough to be sane!<br></div><div style="font-family:Arial;"><br></div><div style="font-family:Arial;">I _think_ this design is good. Please let me know if anything about how it's supposed to work is unclear, or if you have a better idea.<br></div><div style="font-family:Arial;"><br></div><div style="font-family:Arial;">In terms of the split of work: I envisage Ken or ellie writing the syncd stuff. I've already written the ifInState stuff, so I'll just ship that. 
The caching could be done by anybody - it's pretty easy - and the GUID lookup for master/replica could be me, or someone else if they're keen to have a look at it; I have a very clear idea of how that would look.<br></div><div style="font-family:Arial;"><br></div><div style="font-family:Arial;">Cheers,<br></div><div style="font-family:Arial;"><br></div><div style="font-family:Arial;">Bron.<br></div><div style="font-family:Arial;"><br></div></div></div></div><div id="sig56629417"><div style="font-family:Arial;">--<br></div><div class="signature"> Bron Gondwana, CEO, FastMail Pty Ltd<br></div><div class="signature"> brong@fastmailteam.com<br></div><div class="signature"><br></div></div><div style="font-family:Arial;"><br></div></body></html>