sync_client behaviour improvement planning

Bron Gondwana brong at
Tue Feb 27 22:35:52 EST 2018

Followup on this - I pushed a commit to master which included a few
comments highlighting places in the code where changes would help.
ellie is going to work on the changes, to be reviewed by Partha.   None
of these changes alter the replication protocol; all they change is:
1) which mailboxes are batched together when multiple mailboxes are
   being checked at once (SYNC GET MAILBOXES)
2) what gets done in interesting cases (e.g. when a folder named in the
   sync log is present on the replica but not on the master)
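
A rough sketch of (1), not the actual sync_client code.  It assumes
internal mailbox names of the form user.<userid>.<folder> (no domains,
no escaped dots in userids); owner() and batch_by_user() are
hypothetical names, and the limit of 1000 is the figure from the
original mail below.  The only point is that a batch boundary never
falls inside one user's folders:

```python
from itertools import groupby


def owner(mailbox):
    """Return the userid for names like 'user.alice.Sent', else None."""
    parts = mailbox.split(".")
    if len(parts) >= 2 and parts[0] == "user":
        return parts[1]
    return None


def batch_by_user(mailboxes, limit=1000):
    """Deduplicate mailbox names and batch them without splitting a user.

    A user with more than `limit` folders still gets a single
    (oversized) batch; that is an assumption of this sketch.
    """
    unique = sorted(set(mailboxes), key=lambda m: (owner(m) or "", m))
    batches, current = [], []
    for user, group in groupby(unique, key=owner):
        group = list(group)
        # Shared (non-user) mailboxes may be split freely; a user's may not.
        chunks = [[m] for m in group] if user is None else [group]
        for chunk in chunks:
            if current and len(current) + len(chunk) > limit:
                batches.append(current)
                current = []
            current.extend(chunk)
    if current:
        batches.append(current)
    return batches
```

With a limit of 3 and two users owning two folders each, this yields
two batches of two rather than a batch of three and a batch of one.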

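For (2), the heuristic floated in the original mail below ("if the
folder exists at one end, but not at the other (either way) do a full
user sync") could be sketched like this.  Action and choose_action()
are made-up names for illustration, and nothing here says this is how
the final changes will look:

```python
from enum import Enum


class Action(Enum):
    SYNC_MAILBOX = "sync just this mailbox"
    FULL_USER_SYNC = "sync every mailbox of the owning user"
    SKIP = "nothing to do"


def choose_action(on_master, on_replica):
    """Pick what to do for one mailbox named in the sync log."""
    if on_master and on_replica:
        # Normal case: both ends have the folder, sync it directly.
        return Action.SYNC_MAILBOX
    if on_master or on_replica:
        # Present at only one end (either way): possibly a rename or
        # delete seen in isolation, so look at the whole user instead.
        return Action.FULL_USER_SYNC
    # Gone from both ends already.
    return Action.SKIP
```

In case (a) below, the first log file would then trigger a full user
sync instead of an immediate UNMAILBOX A.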

On Mon, 26 Feb 2018, at 18:28, Bron Gondwana wrote:
> Hey - here's me posting something to the public list instead of
> internal FastMail slack.  We've been really bad at making our random
> ruminations public, sorry.
>
> Tomorrow (Tues 27th) 2pm Melbourne time, I'm going to be meeting with
> ellie and maybe Partha in the Melbourne office with a whiteboard and a
> screen to flesh out some ideas for things we can do to fix some of the
> issues that came up after a recent machine failure event at FastMail.
>
> In particular, sync sheer is a very real problem.  The core issue is
> something like this, either:
>
> a)
> sync_log MAILBOX A
> sync_log MAILBOX A
> sync_log MAILBOX B
> Underlying cause - something happened on mailbox A, then mailbox A was
> renamed to B.
>
> Result - if there is a log split between those two lines, the
> sync_client first sees just MAILBOX A, and so it just processes that
> one mailbox.  It sees:
>
> remote: MAILBOX A exists
> so it issues an UNMAILBOX A, then processes the second file.
> In the second file, it gets:
> local: MAILBOX B exists
> So it creates B and copies all the messages again.  This is correct,
> but it's both inefficient and creates a gap where the replica doesn't
> have the messages at all.
>
> b) there are over 1000 mailboxes, and the log file got deduplicated
>    and then run in sets, and we had this:
>
> sync_log MAILBOX Z
> sync_log MAILBOX B
> (for a rename of Z to B)
> local: MAILBOX B exists
> We upload the entire mailbox.  Later we see both mailbox Z and mailbox
> B, and due to uniqueid duplication and the existence of mailbox B, we
> forget about mailbox Z entirely - leaving a duplicate on the server.
> Until a recent change, this led to a real mess when running reconstruct
> caused mailbox Z to get a new uniqueid, just on the end where the
> reconstruct was run.  Run it on both ends later, and you could wind up
> with different uniqueids, and replication bails on that because it's
> confused!
>
> The long term solution to all this is to replicate by uniqueid, and
> replicate the name history entirely for each folder such that you can
> calculate the delta and converge on the latest name for the folder in
> split brain.  But for now, maybe we can make this safer.
>
> My initial thought is something like: if the folder exists at one end,
> but not at the other (either way) do a full user sync.
>
> Also, if splitting > 1000 folders in sync_client, make sure we keep
> all the folders for a user in a single batch, i.e. don't split batches
> inside a user.
>
> We may be able to use the tombstone records we've been storing for a
> while to decide whether the lack of a folder is "it used to exist, but
> it doesn't any more" or "it never existed here" - handy for figuring
> out split brain recovery.
>
> Added complications: what about cross-user renames?  What about
> renaming users entirely?
>
> I know some of what Ken has been working on will also possibly
> interact with this, so we're looking for some simple heuristic changes
> that can make everyday situations safer while we wait for the real
> solution.
>
> Bron.
> --
>   Bron Gondwana, CEO, FastMail Pty Ltd
>   brong at

  Bron Gondwana, CEO, FastMail Pty Ltd
  brong at

More information about the Cyrus-devel mailing list