Some thoughts on low bandwidth replication for Cyrus

Tue Jan 5 05:40:49 EST 2010

(I wrote this on the train on the way home after chatting to Rob
about it.  It suffers a little from the lack of the pictures on
the whiteboard in our office, but not too much.  Sorry for the
length, and I hope it's clear what I'm getting at...)

Thoughts on low bandwidth replication for Cyrus.

At the moment Cyrus replication isn't suitable for slow
links because any flags change or message copy triggers
a MAILBOXES event, which has a bandwidth cost proportional
to EXISTS thanks to sending a full line of data per
message in the mailbox.

This document describes a technique for avoiding the high
bandwidth cost without losing the ability to detect
unexpected corruption on the remote side.

PRECONDITIONS:

* cyrus.index and cyrus.expunge merge per my earlier emails
  to the list.  This means that "expunged" records still
  exist in the cyrus.index file, and have their modseq
  updated.  Bonus - expunged records are still replicated
  right up until cyr_expire cleans them up.
* modseq calculation always on, i.e. condstore always
  supported.

This document will also detail the operation of the checksum
patch already in operation at FastMail, and the extensions
required.

CHANGES:

cyrus.index header additional fields:
=====================================

CYRUS_HEADER_CRC: uint32_t
RECORD_CRC_XOR: uint32_t
LAST_EXPIRE: time_t
HEADER_CRC: uint32_t

HEADER_CRC is already present at FastMail - it is always
the final field of the index header, and is the CRC32 of
the rest of the header up to that offset.

CYRUS_HEADER_CRC is the CRC32 of the entire cyrus.header
file (note: this includes user flag names)

LAST_EXPIRE is the time that cyr_expire was last run on
this mailbox, removing any expired DELETED records.

RECORD_CRC_XOR will be explained in a second :)

=====================================
cyrus.index record additional fields:
=====================================

CACHE_CRC: uint32_t
RECORD_CRC: uint32_t

Both already present at FastMail.  CACHE_CRC is the CRC32
of the entire cyrus.cache record for this message.  It can
be used to check the integrity of the cache file.

RECORD_CRC is calculated the same way HEADER_CRC is.  It's
a CRC32 of the buffer containing the rest of the index record,
and is always the final field.
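
A minimal sketch of the "CRC32 of everything before the final
field" idea, as used for both HEADER_CRC and RECORD_CRC.  The
function names are made up for illustration, and the big-endian
(network byte order) uint32 encoding is an assumption, not
something stated above:

```python
import struct
import zlib


def append_crc(body: bytes) -> bytes:
    """Return body followed by its CRC32 as a 4-byte trailing field.

    Assumption: the CRC is stored big-endian (network byte order).
    """
    return body + struct.pack(">I", zlib.crc32(body) & 0xFFFFFFFF)


def check_crc(buf: bytes) -> bool:
    """True if the last 4 bytes are the CRC32 of everything before them."""
    if len(buf) < 4:
        return False
    body = buf[:-4]
    (stored,) = struct.unpack(">I", buf[-4:])
    return (zlib.crc32(body) & 0xFFFFFFFF) == stored
```

Any single flipped byte in the header or record body then shows
up as a mismatch the next time the buffer is read.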

RECORD_CRC_XOR (the header field from above) is just the XOR of
the RECORD_CRCs of all the records in the cyrus.index.  Because
it lives in the header, it is covered by HEADER_CRC, so an
identical HEADER_CRC on a mailbox is a very good indication that
there are no discrepancies between records.  It's not a 100%
guarantee, but with the other items shown below, it is sufficient.
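
Reading RECORD_CRC_XOR as the XOR of the per-record RECORD_CRC
values (my reading of the name), a rough sketch of how it could be
maintained follows.  The helper names are hypothetical.  The nice
property of XOR is that rewriting one record only needs the old
and new CRC of that record, not a pass over the whole file:

```python
from functools import reduce


def record_crc_xor(record_crcs):
    """XOR together the RECORD_CRC of every record (0 for an empty index)."""
    return reduce(lambda a, b: a ^ b, record_crcs, 0)


def update_xor(current_xor, old_crc, new_crc):
    """Incrementally update the XOR when one record is rewritten.

    XORing the old CRC out and the new CRC in gives the same result
    as recomputing over all records.
    """
    return current_xor ^ old_crc ^ new_crc
```

XOR is also order-independent, so the value does not depend on
the order in which the records are visited.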

=======================
sync protocol workings:
=======================

> MAILBOXES shared.notices user.foo.bar user.xyz user.xyz.Trash
< * shared.notices <uniqueid> <highestmodseq> <lastuid> <header_crc> <last_expire>
< * user.foo.bar <uniqueid> <highestmodseq> <lastuid> <header_crc> <last_expire>
< * user.xyz <uniqueid> <highestmodseq> <lastuid> <header_crc> <last_expire>
< * user.xyz.Trash <uniqueid> <highestmodseq> <lastuid> <header_crc> <last_expire>
< OK MAILBOXES completed

From this we determine that user.xyz was unmodified,
user.foo.bar has the same lastuid, but a lower highestmodseq
on the server, and user.xyz.Trash has had a message appended 
(highestmodseq is lower and lastuid is lower).  Finally,
shared.notices has a different last_expire time.  

Given that highestmodseq, lastuid and header_crc are IDENTICAL,
we are safe in assuming that no changes need to be made to
user.xyz.  Even if by some rare clash there had been conflicting
changes made at both ends and they had both caused identical
CRCs right through the chain, they are very likely to cause
a clash on the next change when the CRC calculations have to be
redone, at which point a full sync will be done on that mailbox.
Again, this only occurs after a split brain.
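
The decision made from each MAILBOXES response line can be
sketched roughly as below.  The MailboxState type, the action
names, and the choice to fall back to a full sync whenever the
server looks ahead of the client are all assumptions for
illustration, not part of the protocol:

```python
from typing import NamedTuple


class MailboxState(NamedTuple):
    uniqueid: str
    highestmodseq: int
    lastuid: int
    header_crc: int
    last_expire: int


def classify(client: MailboxState, server: MailboxState) -> str:
    """Pick the cheapest action that could bring the server in line."""
    if client == server:
        return "unchanged"
    if client.uniqueid != server.uniqueid:
        return "full"        # not the same mailbox at all (assumption)
    if (server.highestmodseq > client.highestmodseq
            or server.lastuid > client.lastuid):
        return "full"        # server is ahead: looks like split brain
    if client.last_expire != server.last_expire:
        return "expire"      # fix expiry first, then compare again
    if server.lastuid < client.lastuid:
        return "append"      # RESERVE/UPLOAD, then RECORDS
    if server.highestmodseq < client.highestmodseq:
        return "records"     # changed records only
    return "full"            # counters match but header_crc differs
```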

So - user.foo.bar: we read the client cyrus.index and determine
which records have a higher modseq than server->highestmodseq.

> RECORDS user.foo.bar <offset> <index record> 0 <offset> <index record> 0 [...]
< * user.foo.bar <uniqueid> <highestmodseq> <lastuid> <header_crc> <last_expire>
< OK RECORDS completed

At this point, highestmodseq, lastuid, header_crc and last_expire
ALL MATCH.  We have successfully updated user.foo.bar with 
bandwidth use proportional to the number of _changed_ records,
not the total number.

The '0' means "don't need to copy the message file from stage".
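
A small sketch of the client side of this step: pick the records
with a modseq above the server's highestmodseq and build the
RECORDS line.  The IndexRecord type is hypothetical, and the
index record is carried as an opaque, already-encoded string
because the wire encoding isn't pinned down here:

```python
from typing import NamedTuple


class IndexRecord(NamedTuple):
    offset: int     # position of the record in cyrus.index
    modseq: int
    encoded: str    # opaque wire form of the record (assumption)


def changed_records(records, server_highestmodseq):
    """Records the server has not seen yet."""
    return [r for r in records if r.modseq > server_highestmodseq]


def records_command(mailbox, records, from_stage=False):
    """Build a RECORDS line; the trailing 0/1 is the copy-from-stage flag."""
    flag = "1" if from_stage else "0"
    parts = ["RECORDS", mailbox]
    for r in records:
        parts += [str(r.offset), r.encoded, flag]
    return " ".join(parts)
```

The length of the command grows with the number of changed
records only, which is the whole point.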

What's left?  user.xyz.Trash.  lastuid is lower, so there are
appends, as well as potential flag changes.

> RESERVE <guid1> <guid2> [...]
< * RESERVE <guid1>
< OK RESERVE completed

guid1 was found on the server in one of the mailboxes mentioned
during this sync run.  guid2 wasn't found, hence no RESERVE record.

> UPLOAD <guid2> {size+}
> ...
< * RESERVE <guid2>
< OK UPLOAD completed

Now the server has copies of both messages staged.
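
Working out which messages still need an UPLOAD is just "what we
asked to RESERVE, minus what came back".  A minimal sketch, with
hypothetical helper names and assuming the untagged responses
look exactly like the lines shown above:

```python
def parse_reserved(response_lines):
    """Collect the guids from '* RESERVE <guid>' response lines."""
    reserved = set()
    for line in response_lines:
        parts = line.split()
        if len(parts) == 3 and parts[0] == "*" and parts[1] == "RESERVE":
            reserved.add(parts[2])
    return reserved


def guids_to_upload(wanted, response_lines):
    """Guids we asked for that the server could not find, in order."""
    reserved = parse_reserved(response_lines)
    return [g for g in wanted if g not in reserved]
```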

> RECORDS user.xyz.Trash <offset> <index record> 1 <offset> <index record> 1 [...]
< * user.xyz.Trash <uniqueid> <highestmodseq> <lastuid> <header_crc> <last_expire>
< OK RECORDS completed

Just like the earlier RECORDS command, but with the added proviso
that the server knows to copy the message files from the stage
because we passed a '1'.  (Exact implementation details might
change of course... the server could even determine this itself
based on <offset> being past mailbox->num_records.)

Note the response was a mailbox state statement, which again
matches.  Fantastic, we know that was everything.  Of course, we 
would have issued RECORDS lines for any non-append record with a
higher modseq as well if necessary.

Note that I missed one.  shared.notices.  The last_expire 
timestamp didn't match.  We read the client's last_expire 
and issue:

> EXPIRE shared.notices
< * shared.notices <uniqueid> <highestmodseq> <lastuid> <header_crc> <last_expire>
< OK EXPIRE completed

If they now match, good.  Otherwise, steps as above.  In a
pathological case, the last_expire on the server is NEWER.  You
need to run the expire on the client with that timestamp in this
case to get them in sync.  And yes, both ends need the same expire
policy.  The easy way is to extend the EXPIRE command above to
carry the master's config variables.
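
The direction of the expire fix-up can be sketched as a simple
comparison of the two timestamps.  The function and the return
values are made-up names for illustration:

```python
def expire_direction(client_last_expire, server_last_expire):
    """Decide which end needs to run the expire to catch up."""
    if client_last_expire == server_last_expire:
        return "in-sync"
    if client_last_expire > server_last_expire:
        return "expire-server"   # normal case: issue EXPIRE to the server
    return "expire-client"       # pathological case: server is NEWER
```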

I think that's everything.  Oh, I didn't explain changes in
cyrus.header.  Basically, you would have to fetch the userflags
string from the server and compare it to the client's.  If there are
no clashing names, make them identical; otherwise you'd need to do a full
mailbox sync to find out what the flags are on the server, and
lock while you renumbered the server end.  A big pain, but very
rare.  It's essential that the ordering of user flags be exactly
the same at both ends for the checksumming to be efficient, otherwise
you'd have to calculate an equivalent checksum at each end for the
comparison to work.  Possible, but much more expensive just to get a
yes or no to the "do I need to sync" question.
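
A rough sketch of the "no clashing names" check, treating the
userflags as a list of names indexed by flag slot, with an empty
string for an unused slot.  That representation, the function
name, and the exact-string comparison are assumptions:

```python
def merge_userflags(client, server):
    """Merge two userflag slot lists, or return None on a clash.

    A clash is either two different names in the same slot, or the
    same name ending up in two different slots.
    """
    size = max(len(client), len(server))
    merged = []
    for i in range(size):
        c = client[i] if i < len(client) else ""
        s = server[i] if i < len(server) else ""
        if c and s and c != s:
            return None          # same slot, different names
        merged.append(c or s)
    names = [n for n in merged if n]
    if len(names) != len(set(names)):
        return None              # same name in two slots
    return merged
```

A None result corresponds to the painful case above: full
mailbox sync, then renumber the server end under a lock.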

Regards,

Bron.