The great TODO

Wed Mar 11 03:25:56 EDT 2015

First of all, before getting into the "what we need to do for 3.0" I want to wax philosophical for a moment...

Shit goes wrong.  All sorts of amazing things.

* The computer can crash at literally any moment, during any action, in any codepath
* The OS can re-order writes to disk in just about any way
* fsync can lie (* we can't do anything about this one)
* disks can fill up
* a partition can have wrong permissions on it, both at startup and randomly while things are running
* a partition can go missing / randomly be unmounted.
* the OS can randomly return a few bytes of zeros in the middle of your mmapped file:
  https://lkml.org/lkml/2008/6/17/9
* a multi-disk corruption can cause a random block of rubbish to appear within a file

Run a big enough set of servers for long enough, and you'll see all these things, whether due to admin error, or hardware failure...

Our job as developers of Cyrus IMAPd is to make sure that we cope with what we can, don't fail catastrophically, and make recovery as good as possible.

On the flip side, we don't want the admin to have to micro-manage everything.  As much as possible, we don't want the abstraction of a reliable mail store to leak:

http://www.joelonsoftware.com/articles/LeakyAbstractions.html

So what we want to do for Cyrus 3.0 falls into three main buckets:

1) make things more robust/scalable.  That's all the things above: handle them cleanly or provide the best possible recovery path.
2) make Cyrus easier to run/administer.  Things in this bucket include the authentication system, backups, moving users between servers, replication, etc.
3) new features and standards support.  Things like object storage, external search engines, JMAP, sieve variables/date/etc.

So if we are proposing something which takes away an existing repair mechanism (for example, right now you can rebuild mailboxes.db by walking the tree of directories), we'd better be proposing something just as recoverable, but better in some way as well.  For example: add the mailbox name (and past mailbox names...) to cyrus.header, and then store all the files with paths based on the UNIQUEID, which is a UUID, doesn't contain weird characters, and has a fixed length.  Then you don't have stupid things like mailbox names being constrained by the characters supported by your filesystem, or by case significance, and you get fast renames... but you don't lose the ability to recover.
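
To make the UNIQUEID idea concrete, here is a minimal sketch of deriving a spool path from a UNIQUEID.  The function name, the "uuid" subdirectory and the two-level fan-out on the first two characters are all assumptions made up for illustration, not an agreed layout:

```c
#include <assert.h>
#include <ctype.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical helper: build "<base>/uuid/<c0>/<c1>/<uniqueid>" into out.
 * Returns 0 on success, -1 if the id contains anything other than hex
 * digits and dashes, or if the buffer is too small.
 * The directory layout is an assumption for illustration only. */
int uniqueid_to_path(const char *base, const char *uniqueid,
                     char *out, size_t outlen)
{
    assert(base != NULL && uniqueid != NULL && out != NULL);

    size_t len = strlen(uniqueid);
    if (len < 2) return -1;

    /* Refuse anything that could surprise a filesystem ("/", "..", etc). */
    for (size_t i = 0; i < len; i++) {
        unsigned char c = (unsigned char)uniqueid[i];
        if (!isxdigit(c) && c != '-') return -1;
    }

    int n = snprintf(out, outlen, "%s/uuid/%c/%c/%s",
                     base, uniqueid[0], uniqueid[1], uniqueid);
    return (n < 0 || (size_t)n >= outlen) ? -1 : 0;
}
```

Because the path never depends on the mailbox name, a rename only has to touch mailboxes.db and cyrus.header, and the name can still be recovered from cyrus.header if mailboxes.db is lost.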

Checksums.  We sanity check almost everywhere, because you can't do a full system scan at startup, checking the sha1 of every single file, to make sure there has been no corruption.

We scan files at backup time.  We scan them during replication.  We need a tool which scans them from a cron job, for people who want to check that.  Maybe reconstruct needs a flag to say "check but don't change things", so you can run it from cron without being afraid that it will run when your data drive has been unmounted by accident, and wipe out your entire cyrus.index because it can't find the spool files.
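
A rough sketch of what that guard could look like.  Everything here is hypothetical (the function names, the idea of comparing against an expected record count from cyrus.index); it is not how reconstruct works today, just the shape of the check:

```c
#include <assert.h>
#include <dirent.h>
#include <stdio.h>
#include <string.h>
#include <sys/stat.h>

enum spool_state { SPOOL_OK = 0, SPOOL_MISSING = 1, SPOOL_EMPTY = 2 };

/* Hypothetical pre-flight check: is the spool directory there at all,
 * and does it contain anything besides "." and ".."? */
enum spool_state spool_sanity_check(const char *dir)
{
    assert(dir != NULL);

    struct stat sb;
    if (stat(dir, &sb) != 0 || !S_ISDIR(sb.st_mode))
        return SPOOL_MISSING;

    DIR *d = opendir(dir);
    if (d == NULL) return SPOOL_MISSING;

    enum spool_state state = SPOOL_EMPTY;
    struct dirent *de;
    while ((de = readdir(d)) != NULL) {
        if (strcmp(de->d_name, ".") != 0 && strcmp(de->d_name, "..") != 0) {
            state = SPOOL_OK;
            break;
        }
    }
    closedir(d);
    return state;
}

/* Returns 1 if a repair pass may modify the index, 0 if it must only
 * report.  expected_records is assumed to come from cyrus.index: an
 * empty directory is only suspicious if the index expects messages. */
int repair_allowed(const char *dir, unsigned long expected_records,
                   int check_only)
{
    enum spool_state s = spool_sanity_check(dir);
    if (s == SPOOL_MISSING || (s == SPOOL_EMPTY && expected_records > 0)) {
        fprintf(stderr, "%s: spool looks %s, refusing to modify anything\n",
                dir, s == SPOOL_MISSING ? "missing" : "empty");
        return 0;
    }
    return check_only ? 0 : 1;
}
```

The point is that an unmounted partition typically shows up as a missing or empty directory, and that case gets treated as "stop and tell the admin" rather than "every message has been expunged".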

At FastMail we have a tool that can fetch a damaged file from its replica.  We need that in Cyrus - either the magic perl script, or better - something built in to a tool in C.  Ditto for many other FastMail specific external Perl utilities.

-----

So now we know what we're doing and why... here's my rough list of things that need doing:

* Mailbox transactions: avoid failures leaving mailboxes in corrupt state (might require 3-fsync commit, so we at least know if it's unfinished)
* UniqueId paths (described above)
* robust backup and restore tooling
* Replication based repair:
  a) replication and existing replica awareness in code
  b) replication based XFER (falls in with this)
  c) reconstruct support for checking replicas for files
  d) reconstruct sanity checking - if the spools are broken, don't keep working
* files by sha1 rather than UID in mailboxes?  Means you can't rebuild in exactly the same order without cyrus.index, but if you've lost cyrus.index you may as well just sort them by date and then give the mailbox a new UIDVALIDITY anyway.
* mailboxes.db new key format - better sorting
* For performance at scale: reverse ACL map.
* For real reliability - synchronous replicas (falls out of awareness above)
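
On the "3-fsync commit" in the mailbox transactions item above, one possible reading is: mark the file dirty and fsync, write the data and fsync, mark it clean and fsync.  The sketch below assumes exactly that, on a toy file whose first byte is a state flag.  The file format and all names are invented for illustration and have nothing to do with the real cyrus.index layout:

```c
#define _XOPEN_SOURCE 700
#include <assert.h>
#include <fcntl.h>
#include <unistd.h>

#define STATE_CLEAN 'C'
#define STATE_DIRTY 'D'

/* Open (or create) the toy store.  Byte 0 is reserved for the state flag. */
int store_open(const char *path)
{
    return open(path, O_RDWR | O_CREAT, 0600);
}

/* Write the one-byte state flag at offset 0 and force it to disk. */
static int set_state(int fd, char state)
{
    if (pwrite(fd, &state, 1, 0) != 1) return -1;
    return fsync(fd);
}

/* Hypothetical three-fsync append.  Returns 0 on success, -1 on error. */
int committed_append(int fd, const void *buf, size_t len)
{
    assert(buf != NULL);

    off_t end = lseek(fd, 0, SEEK_END);
    if (end < 0) return -1;
    if (end == 0) end = 1;               /* new file: skip the state byte */

    if (set_state(fd, STATE_DIRTY) != 0) return -1;       /* fsync 1: intent */
    if (pwrite(fd, buf, len, end) != (ssize_t)len) return -1;
    if (fsync(fd) != 0) return -1;                        /* fsync 2: data   */
    return set_state(fd, STATE_CLEAN);                    /* fsync 3: commit */
}

/* Returns 1 if the last transaction never reached its final fsync. */
int needs_recovery(int fd)
{
    char state;
    if (pread(fd, &state, 1, 0) != 1) return 0;  /* empty file: nothing to do */
    return state != STATE_CLEAN;
}
```

This doesn't make the write atomic.  It only guarantees that after a crash we can tell the mailbox is unfinished, which is the "at least know" part.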

* For general speed and also safety - central cleanup daemon: use the same logic we use for sync_client and (at FastMail) squatter indexing.  Changes to a mailbox cause a log entry.  A daemon processes those logs and does cleanup tasks in the background.  During startup the log file can be resolved, so half-finished renames can be found and finished or reverted, so long as we log intent before making changes.  Actually, I really like this:

lock(mailbox);
sync_log(mailbox->name);
/* do stuff */
unlock(mailbox);

rather than the current:

lock(mailbox);
/* do stuff */
sync_log(mailbox->name);
unlock(mailbox);

And then all the task things do a trylock, and if it fails, they just insert the record into their source log file again.  That way, they retry it in a moment (to avoid busywait, add a pause if you didn't process ANY changes this time around).  This means sync doesn't wait on tasks, yet intent gets logged early, before changes are made, so we can never miss something because there was a crash before the commit finished and the event was logged.
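
Roughly, one pass of such a task could look like the sketch below.  The in-memory log, the callback types and the demo_* stand-ins are all made up so the example is self-contained; a real daemon would read sync-log-style files and take real mailbox locks:

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define MAX_LOG 64

/* Toy in-memory stand-in for a log file of mailbox names. */
struct task_log {
    const char *names[MAX_LOG];
    size_t count;
};

typedef int (*trylock_fn)(const char *name);  /* 1 = got the lock, 0 = busy */
typedef void (*work_fn)(const char *name);    /* does the task, then unlocks */

int log_append(struct task_log *log, const char *name)
{
    if (log->count >= MAX_LOG) return -1;
    log->names[log->count++] = name;
    return 0;
}

/* One pass over the log.  Busy mailboxes are re-queued into `retry`
 * (which must start empty) instead of being waited on.  Returns the
 * number of entries actually processed; if that is 0, the caller
 * should pause before the next pass to avoid busywaiting. */
size_t process_log(const struct task_log *in, struct task_log *retry,
                   trylock_fn trylock, work_fn work)
{
    size_t done = 0;
    for (size_t i = 0; i < in->count; i++) {
        const char *name = in->names[i];
        if (trylock(name)) {
            work(name);
            done++;
        } else {
            int rc = log_append(retry, name);
            assert(rc == 0);   /* retry has the same capacity as in */
            (void)rc;
        }
    }
    return done;
}

/* Stand-ins for real locking and cleanup, for demonstration only. */
const char *busy_mailbox = NULL;  /* pretend another process holds this */
size_t cleaned = 0;

int demo_trylock(const char *name)
{
    return !(busy_mailbox != NULL && strcmp(name, busy_mailbox) == 0);
}

void demo_cleanup(const char *name)
{
    (void)name;
    cleaned++;
}
```

The caller loops: process the log, swap in the retry log as the next input, and sleep briefly whenever a pass returns 0.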

* External system integration points
* OS packages
* Docker images / VMs (for production use)

I'll try to get this into Phab tickets tonight - just about to leave work now.

Bron.

-- 
  Bron Gondwana
  brong at fastmail.fm