Cyrus message file race conditions and failure modes

Sun Jun 21 03:43:23 EDT 2015

Obviously with computers, there can be a failure at any point.  Cyrus is very good about not returning a success code (LMTP or IMAP) until the mailbox is entirely written into a consistent state... but there are potential failures along the way.

The cyrus.index file in particular has no multi-stage commit, so it's possible for it to be corrupted, but that's an incredibly small window these days, because we store up the entire transaction's worth of changes in memory before writing anything, and then sort it by UID and stream it to disk before fsyncing the file.  This is _commit_changes and _commit_one in imap/mailbox.c.
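
To make the ordering above concrete, here is a toy sketch in Python (not the real C, and not the real cyrus.index format).  The record layout, the append-only file and the names commit_changes and read_records are all assumptions for illustration; the only point is "buffer everything, sort by UID, write in one pass, fsync once".

```python
import os
import struct

# Hypothetical fixed-size record: (uid, modseq) as two big-endian 32-bit ints.
# The real cyrus.index record is much larger; this is only a stand-in.
RECORD = struct.Struct(">II")

def commit_changes(path, pending):
    """Write every buffered record in UID order, then fsync the file once."""
    ordered = sorted(pending, key=lambda rec: rec[0])
    with open(path, "ab") as f:
        for uid, modseq in ordered:
            f.write(RECORD.pack(uid, modseq))
        f.flush()
        os.fsync(f.fileno())  # single fsync at the end of the transaction
    return len(ordered)

def read_records(path):
    """Read the toy file back as a list of (uid, modseq) tuples."""
    with open(path, "rb") as f:
        data = f.read()
    return [RECORD.unpack_from(data, off) for off in range(0, len(data), RECORD.size)]
```

Nothing touches the disk until the whole transaction is known, so the window for a half-written file is limited to the single write-and-fsync pass.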

You can read a bit more about how we avoid that hole with future pointers and three-fsync transactions in twoskip at http://opera.brong.fastmail.fm.user.fm/talks/twoskip/ - the ODP file has both the slides and the speaker's notes.  Ideally any rewrite of cyrus.index would include a similar journal-based method of protecting the file from corruption.

Anyway, let's talk about message delivery.

The first stop for any message being delivered to a mailbox, whether via IMAP append or LMTP delivery, is the "stage." directory.  This is in the data partition (never yet the archive partition, but only because I haven't written that - ideally it would spool the first megabyte to $spool/stage. and then copy it to $archive/stage. when it gets too big, or even just spool to memory for the first megabyte and then down to archive... hmm), but I'm rambling.

The point is, it's always "$spool/stage./$pid-$internaldate-$messagenum" where message files are written right now.  This is created by append_newstage in imap/append.c.  You'll note when reading it that the specific spool partition is chosen based on the mailbox name by mboxlist_findstage.  The mailbox itself was given to the append object in append_setup or append_setup_mbox.
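
As a small illustration of that naming scheme (a sketch only, with a made-up function name and example spool root; the real logic lives in append_newstage):

```python
import posixpath

def stage_path(spool, pid, internaldate, messagenum):
    """Build "$spool/stage./$pid-$internaldate-$messagenum".

    `spool` is the root of the data partition chosen for the mailbox;
    the argument names here are illustrative, not Cyrus identifiers.
    """
    filename = f"{pid}-{internaldate}-{messagenum}"
    return posixpath.join(spool, "stage.", filename)
```

For example, stage_path("/var/spool/imap", 1234, 1434872603, 0) gives "/var/spool/imap/stage./1234-1434872603-0".  The pid and message number keep concurrent uploads from different processes from colliding.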

stage->parts allows the same file to be copied to multiple separate partitions inside their own stage. directory later, which is used for single-instance-store when the same message is delivered to hundreds or even thousands of users on a single server with potentially many partitions.  Each partition has only a single copy of the message which is then hardlinked.
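
A rough sketch of the single-instance-store idea described above, in Python with hypothetical names (deliver_to_all, stage_dirs and the (partition, path) tuples are all made up for the example): one staged copy per partition, created lazily, and every mailbox on that partition gets a hard link to it.

```python
import os
import shutil

def deliver_to_all(stage_file, targets, stage_dirs):
    """Link one staged message into many mailboxes.

    targets    : list of (partition_name, destination_path)
    stage_dirs : partition_name -> that partition's stage. directory
    Returns a dict partition_name -> staged copy used on that partition
    (playing the role of stage->parts).
    """
    parts = {}
    for partition, dest in targets:
        if partition not in parts:
            staged = os.path.join(stage_dirs[partition], os.path.basename(stage_file))
            if not os.path.exists(staged):
                # First delivery to this partition: one real copy of the bytes.
                shutil.copyfile(stage_file, staged)
            parts[partition] = staged
        # Every delivery after that is just a hard link on the same filesystem.
        os.link(parts[partition], dest)
    return parts
```

With thousands of recipients spread over a handful of partitions, the message body is copied once per partition rather than once per recipient.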

So any crash during the upload stage will leave a file behind in $spool/stage.  At FastMail we wipe that directory on server start; otherwise we don't touch it.  That's fine, because crashes are rare.

ASIDE: at the moment, replication puts messages into $spool/sync. instead, with a different naming scheme.  Eventually I want to merge the two upload methods to create a single location where everything goes, no matter which source - and have it parse the message once and then move it to a filename based on the GUID, and build the cache structure in memory just once by parsing it.  Yet another combining of codepaths eventually :)

Next is append_fromstage, which does the actual work of adding the message to a folder.  Both this and append_copy link messages directly into the mailbox directories with "$spool/u/user/$uid." filenames.  This is done with mailbox_copyfile, which makes hard links if it can, and falls back to a streaming write otherwise.
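
The "hard link if possible, otherwise stream the bytes" behaviour can be sketched like this.  It is a Python illustration under assumptions, not a port of mailbox_copyfile: the name copy_or_link, the return values, and the choice to remove an existing destination first are all mine.

```python
import os
import shutil

def copy_or_link(src, dst):
    """Place src at dst, preferring a hard link over a byte copy."""
    if os.path.lexists(dst):
        # Assumption: a leftover file at this name is stale and gets replaced.
        os.unlink(dst)
    try:
        os.link(src, dst)
        return "link"
    except OSError:
        # e.g. src and dst are on different filesystems, or links are unsupported
        with open(src, "rb") as fin, open(dst, "wb") as fout:
            shutil.copyfileobj(fin, fout)
            fout.flush()
            os.fsync(fout.fileno())
        return "copy"
```

Removing the destination before linking also avoids the nasty case where dst is already a link to src and a truncating open would destroy both.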

So the copies of the files are always done first, then the cache record is created second, either with mailbox_append_cache or during mailbox_append_index_record, which will write cache before writing the record to cyrus.index.

Finally, append_commit will call mailbox_commit and clean up.

append_abort is supposed to roll back, but at the moment it can't always do so cleanly.

Anyway, a failure can cause $uid. files to exist that are "in the future", i.e. with UIDs the index hasn't recorded yet.  A reconstruct will discover these, and used to always add them to the mailbox.  I think it can delete them instead now, since chances are that the message will be delivered again in a second anyway.

If you don't run reconstruct, then they will be deleted and replaced when another message gets that UID.

Next up, the cache file gets written.  If there's a failure after this, it's fine - it just means there is stale data in the cache with nothing pointing to it.  Next time the cache file gets rewritten (mailbox_rewrite_index) then those bytes won't be copied, and will disappear.
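
Why the stale bytes vanish on rewrite is easy to see in a toy model.  In this sketch the cache is a plain bytes object and each index record is a dict with a hypothetical offset/length pair; the real cyrus.cache layout is different, but the principle (only copy what a record points at) is the same.

```python
def rewrite_cache(cache, index):
    """Copy only the cache bytes that some index record references.

    cache : bytes, the old cache contents
    index : list of dicts with 'uid', 'offset' and 'length' keys (illustrative)
    Returns (new_cache, new_index) with offsets adjusted to the new file.
    """
    new_cache = bytearray()
    new_index = []
    for rec in index:
        blob = cache[rec["offset"]:rec["offset"] + rec["length"]]
        new_index.append({"uid": rec["uid"],
                          "offset": len(new_cache),
                          "length": len(blob)})
        new_cache += blob
    return bytes(new_cache), new_index
```

Anything written to the cache by a failed append has no record pointing at it, so it is simply never copied across.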

So those are the possible failure modes during append.
* spool file only
* spool file and cyrus.cache entry
* spool file, cyrus.cache entry, cyrus.index record, but the header wasn't updated correctly (invalid modseqs, uidnext not updated, messy) - or indeed the other way around here, since there's no intermediate fsync - data could be written in either order.
* completely correct append, but the client was never told.

In the last case the record exists, and the client will just try to upload it again - either an IMAP retry or an LMTP delivery retry.

...

So that's arrival.  There's also cleanup.  unlink() is very expensive, so we do it outside the lock, during mailbox_index_unlock, which is where most of the interesting stuff (updating statuscache, writing data, etc) happens.  During a repack or other unlink event we keep a list of the messages to unlink, and delete them later.

As above, the file on disk exists until AFTER the reference has been safely removed and the result of the removal fsynced to disk.  A crash or even just forced shutdown during this can leave messages lying around in the spool.
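
A minimal sketch of the "remember what to unlink, do it after the lock is dropped" pattern.  The Mailbox class, its method names and the in-memory records dict are all hypothetical; a real implementation would also commit and fsync the index before releasing the lock.

```python
import os
import threading

class Mailbox:
    """Toy mailbox: records maps uid -> path of the message file."""

    def __init__(self):
        self._lock = threading.Lock()
        self.records = {}
        self._to_unlink = []

    def lock(self):
        self._lock.acquire()

    def remove_record(self, uid):
        # Caller holds the lock.  Only the reference goes away here;
        # the file itself is queued for later.
        self._to_unlink.append(self.records.pop(uid))

    def unlock(self):
        pending, self._to_unlink = self._to_unlink, []
        self._lock.release()
        # The slow unlink() calls happen after the lock is released.
        for path in pending:
            try:
                os.unlink(path)
            except FileNotFoundError:
                pass  # already gone, nothing to do
        return pending
```

A crash between the release and the last unlink leaves orphan files but never a record pointing at a missing file, which is the safe direction to fail in.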

Again, reconstruct should be able to clean these up.  Another option would be to have a two-phase cleanup, which is safer.  Update the record to say "safe to unlink" but don't remove the record from cyrus.index, then unlink it, then NEXT pass, if the file doesn't exist, it's safe to remove from the cyrus.index.  (NOTE: if you don't understand about FLAG_EXPUNGED and how it works as a tombstone record for QRESYNC, it might be worth grepping around for it and seeing how it works - leaving a FLAG_EXPUNGED record in cyrus.index a little longer doesn't change what's visible to IMAP, only slows things slightly)
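
The two-phase idea could look roughly like this.  It is a sketch under assumptions: records are plain dicts, FLAG_SAFE_TO_UNLINK is a made-up name for the "safe to unlink" marker, and the bit values are placeholders rather than anything taken from the Cyrus headers.

```python
import os

FLAG_EXPUNGED = 1 << 0          # placeholder bit value
FLAG_SAFE_TO_UNLINK = 1 << 1    # hypothetical marker, placeholder bit value

def _unlink_quietly(path):
    try:
        os.unlink(path)
    except FileNotFoundError:
        pass

def cleanup_pass(records):
    """One cleanup pass over records (dicts with 'uid', 'path', 'flags')."""
    kept = []
    for rec in records:
        if rec["flags"] & FLAG_SAFE_TO_UNLINK:
            if not os.path.exists(rec["path"]):
                continue  # phase 2: file is gone, so the record can go too
            # An earlier pass was interrupted: retry, keep the record for now.
            _unlink_quietly(rec["path"])
            kept.append(rec)
        else:
            if rec["flags"] & FLAG_EXPUNGED:
                # Phase 1: mark first (a real system would commit and fsync
                # the index at this point), then remove the file.
                rec["flags"] |= FLAG_SAFE_TO_UNLINK
                _unlink_quietly(rec["path"])
            kept.append(rec)
    return kept
```

The record only disappears on a pass that has seen the file missing, so a crash at any point leaves either a marked record (retried next pass) or nothing at all.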

...

So in terms of hooking a smarter object store in here, we need a way to replace mailbox_copyfile with a record-aware copyfile, that can be used both for new messages from spool, and for existing messages being linked by append_copy.  The append_copy case will be either a noop or a refcount operation in a smart system, and the mailbox_copyfile one will know which partition to store the data on.

I'm actually thinking here that the sensible approach to single-instance store, rather than one copy per partition, is to have a list of known partitions somewhere mapping name to integer ID, and store the partition ID with EVERY SINGLE cyrus.index record.  That way partitions are per-record not per-mailbox, and you can easily implement archive partition on top of it :)  OK, that's a brand new idea, so I'll let it settle for a bit.
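
To give that idea something to settle against, here is a very rough sketch of per-record partitions.  Everything in it is hypothetical: the partition names, IDs and roots, the record dicts, and the helper names.

```python
import posixpath

# Hypothetical global table: partition name -> small integer ID, ID -> root path.
PARTITION_IDS = {"default": 1, "archive": 2}
PARTITION_ROOTS = {1: "/var/spool/imap", 2: "/var/spool/archive"}

def message_path(record, mailbox_dir):
    """Locate a message file from the partition ID stored in its own record."""
    root = PARTITION_ROOTS[record["partition_id"]]
    return posixpath.join(root, mailbox_dir, f"{record['uid']}.")

def move_to_archive(record):
    """Return a copy of the record that points at the archive partition."""
    return dict(record, partition_id=PARTITION_IDS["archive"])
```

With the ID stored per record, archiving one message is a file move plus a one-field record update, and the rest of the mailbox stays where it is.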

Apologies for rambling.  Hopefully this is useful for code navigation.  Enjoy,

Bron.

-- 
  Bron Gondwana
  brong at fastmail.fm