sync_client behaviour improvement planning

Mon Feb 26 02:28:39 EST 2018

Hey - here's me posting something to the public list instead of internal
FastMail slack.  We've been really bad at making our random ruminations
public, sorry.

Tomorrow (Tues 27th) 2pm Melbourne time, I'm going to be meeting with
ellie and maybe Partha in the Melbourne office with a whiteboard and a
screen to flesh out some ideas for things we can do to fix some of the
issues that came up after a recent machine failure event at FastMail.

In particular, sync shear is a very real problem.  The core issue is
something like this, either:

a)

sync_log MAILBOX A
sync_log MAILBOX A
sync_log MAILBOX B

Underlying cause - something happened on mailbox A, then mailbox A was
renamed to B.

Result - if the log is split before the MAILBOX B line, the sync_client
first sees just MAILBOX A, and so it processes only that one mailbox.
It sees:

local: MAILBOX A IMAP_MAILBOX_NONEXISTENT
remote: MAILBOX A exists

so it issues an UNMAILBOX A, then processes the second file.

In the second file, it gets:

local: MAILBOX A IMAP_MAILBOX_NONEXISTENT
local: MAILBOX B exists

remote: MAILBOX A IMAP_MAILBOX_NONEXISTENT
remote: MAILBOX B IMAP_MAILBOX_NONEXISTENT

So it creates B and copies all the messages again.  This is correct, but
it's both inefficient and creates a gap where the replica doesn't have
the messages at all.
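
To make the sequence in (a) concrete, here is a toy replay of it.  This
is only a sketch: mailbox state is modelled as a dict of name to
uniqueid, and sync_batch() is a made-up stand-in for one sync_client
run over one log file, not the real code path.

```python
# Toy model only: mailbox state is a dict of name -> uniqueid, and
# sync_batch() is a hypothetical stand-in for one sync_client run.

def sync_batch(names, local, remote):
    """Reconcile only the mailboxes named in this batch; return ops issued."""
    ops = []
    for name in names:
        if name in local and name not in remote:
            # Exists locally only: create it remotely and copy everything.
            remote[name] = local[name]
            ops.append(("MAILBOX", name, "full copy"))
        elif name not in local and name in remote:
            # Exists remotely only: remove it from the replica.
            del remote[name]
            ops.append(("UNMAILBOX", name))
    return ops

# State once the rename of A to B has already happened on the master.
local = {"B": "uid-1"}
remote = {"A": "uid-1"}

first = sync_batch(["A"], local, remote)        # first log file: only A
gap = dict(remote)                              # replica between the two files
second = sync_batch(["A", "B"], local, remote)  # second log file: A and B

print(first)   # [('UNMAILBOX', 'A')]
print(gap)     # {}
print(second)  # [('MAILBOX', 'B', 'full copy')]
```

The empty dict in the middle is the window where the replica has no
copy of the messages, and the last line is the full re-upload.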

b) there are over 1000 mailboxes, and the log file got deduplicated and
   then run in sets, and we had this:
sync_log MAILBOX Z
sync_log MAILBOX B

(for a rename of Z to B)

local: MAILBOX B exists
remote: MAILBOX B IMAP_MAILBOX_NONEXISTENT

We upload the entire mailbox.  Later we see both mailbox Z and
mailbox B, and due to uniqueid duplication and the existence of
mailbox B, we forget about mailbox Z entirely - leaving a duplicate
on the server.  Until a recent change, this led to a real mess when
running reconstruct caused mailbox Z to get a new uniqueid, but only
on the end where the reconstruct was run.  If you later ran it on both
ends, you could wind up with different uniqueids, and replication
bails on that because it's confused!

The long term solution to all this is to replicate by uniqueid, and
replicate the name history entirely for each folder such that you can
calculate the delta and converge on the latest name for the folder in
split brain.  But for now, maybe we can make this safer.

My initial thought is something like: if the folder exists at one end
but not at the other (either way), do a full user sync.

Also, if splitting > 1000 folders in sync_client, make sure we keep
all the folders for a user in a single batch, i.e. don't split batches
inside a user.
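
A rough sketch of the batching idea, in Python rather than the C that
sync_client is actually written in.  Everything here is an assumption
for illustration: owner_key() and batch_by_user() are made-up names,
and the "user.<name>.<folder>" parsing ignores domains and alternate
namespaces.

```python
from itertools import groupby

BATCH_LIMIT = 1000  # the "> 1000 folders" threshold mentioned above

def owner_key(mboxname):
    """Hypothetical helper: group key for an internal name.

    'user.fred' and 'user.fred.Sent' both belong to user 'fred'.
    Anything else is treated as its own group, so shared folders
    can still be split freely.
    """
    parts = mboxname.split(".")
    if len(parts) >= 2 and parts[0] == "user":
        return ("user", parts[1])
    return ("shared", mboxname)

def batch_by_user(mboxnames, limit=BATCH_LIMIT):
    """Yield batches of about `limit` names without splitting a user.

    A user with more than `limit` folders gets one oversized batch
    rather than being split.
    """
    batch = []
    ordered = sorted(set(mboxnames), key=lambda n: (owner_key(n), n))
    for _key, group in groupby(ordered, key=owner_key):
        names = list(group)
        # Start a new batch if this user would push us over the limit.
        if batch and len(batch) + len(names) > limit:
            yield batch
            batch = []
        batch.extend(names)
    if batch:
        yield batch

names = ["user.a.0", "user.a.1", "user.a.2",
         "user.b", "user.b.x", "user.b.y", "user.c"]
for batch in batch_by_user(names, limit=4):
    print(batch)
```

With limit=4 that gives two batches: the three user.a folders, then
user.b, its two subfolders and user.c together.  The trade-off is that
batches are no longer a fixed size.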

We may be able to use the tombstone records we've been storing for a
while to decide whether the lack of a folder is "it used to exist, but
it doesn't any more" or "it never existed here" - handy for figuring out
split brain recovery.
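
Putting the "exists at one end only" rule together with the tombstone
idea, the decision could look roughly like this.  Again a sketch under
assumptions: the dict of name to 'active' or 'tombstone' is a stand-in
for whatever the real mailbox list lookup returns, and State, lookup()
and plan() are names invented for this example.

```python
from enum import Enum

class State(Enum):
    EXISTS = "exists"
    DELETED = "used to exist here"   # a tombstone record is present
    NEVER = "never existed here"     # no record at all

def lookup(mailboxes, name):
    """mailboxes: hypothetical dict of name -> 'active' | 'tombstone'."""
    kind = mailboxes.get(name)
    if kind == "active":
        return State.EXISTS
    if kind == "tombstone":
        return State.DELETED
    return State.NEVER

def plan(name, local, remote):
    """Return (action, reason) for one mailbox name from the sync log."""
    here, there = lookup(local, name), lookup(remote, name)
    if here is State.EXISTS and there is State.EXISTS:
        return ("mailbox sync", "exists at both ends")
    if here is not State.EXISTS and there is not State.EXISTS:
        return ("nothing to do", "exists at neither end")
    # Exists at exactly one end: escalate rather than act on this name alone.
    missing_side, missing = ("local", here) if there is State.EXISTS else ("remote", there)
    return ("full user sync", f"{missing_side}: {missing.value}")

local = {"user.fred.A": "tombstone", "user.fred.B": "active"}
remote = {"user.fred.A": "active"}

print(plan("user.fred.A", local, remote))
print(plan("user.fred.B", local, remote))
```

In the rename example, A comes back as "full user sync" with the
reason "local: used to exist here", and B as "full user sync" with
"remote: never existed here", so neither name is acted on in isolation.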

Added complications: what about cross-user renames?  What about renaming
users entirely?

I know some of what Ken has been working on will also possibly interact
with this, so we're looking for some simple heuristic changes that can
make everyday situations safer while we wait for the real solution.

Bron.

--
  Bron Gondwana, CEO, FastMail Pty Ltd
  brong at fastmailteam.com
