Message UUIDs for replication

Fri Jun 15 09:30:37 EDT 2007

On Fri, Jun 15, 2007 at 09:32:04AM +0100, David Carter wrote:
> The good people at Fastmail have a patch to use the first 11 bytes from the 
> message MD5 as the message UUID. I would like to second this proposal.
>
> The existing UUID scheme is something of a botch job after my first plan 
> (Mailbox UniqueID + Message UID) fell through. There is also some potential 
> for people to shoot themselves in the foot if they are not obsessively 
> paranoid about the /var/imap/master_uuid files.
>
> In the long term I think that it would make sense to reserve some space in 
> cyrus.index the next time that the index format is changed so that UUIDs 
> can be expanded to be (at least) 16 bytes. However I agree with Rob that 
> the chances of a birthday paradox collision with 11 bytes are tiny.

Yeah, though I'd be a lot more comfortable if we had space for a sha1 in
there (plus we use sha1s in all the rest of our infrastructure).  I'd
also like to extend IMAP to allow you to trivially fetch these things
(OK, I already did, FETCH UUID - as well as FETCH RFC822.FILESIZE and
FETCH RFC822.MD5 - both of which do stuff to the actual file on disk to
ensure it matches the index!) - but that's totally non-standard right
now.

I'm already using the property in our backup server to ensure that the
index files and data files we back up match:

die "File $CyrusName $FolderName $uid ($uuid) changed underfoot $md5" 
  unless substr($uuid, 2) eq substr($md5, 0, 22);
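
For anyone who wants to try the same check outside Perl, here is a minimal Python sketch of the comparison above. It assumes, as the substr() offsets imply, that the UUID is handed around as 24 hex digits with a one-byte prefix followed by the first 11 bytes of the message MD5. The function name is made up for illustration.

```python
import hashlib


def uuid_matches_file(uuid_hex: str, data: bytes) -> bool:
    """True if the UUID (minus its 2-hex-digit prefix) equals the
    first 22 hex digits (11 bytes) of the MD5 of the message data."""
    md5_hex = hashlib.md5(data).hexdigest()
    # Same comparison as: substr($uuid, 2) eq substr($md5, 0, 22)
    return uuid_hex[2:].lower() == md5_hex[:22]
```

A mismatch means either the index record or the file on disk has changed, which is exactly the case the die above is meant to catch.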

I really, _really_ love integrity checking like this, because it's a
big fat canary in my cron emails if anything goes wrong.  And because
we index the entire per-user backup by message UUID, and parse the full
index file and expunge file if either has changed since the last backup
run, any uuid mismatch will cause the underlying uid file to be
fetched, md5ed and compared.

This backup system is just about to go into production, by the way (as
in it's about 60% of the way through backing up our entire population,
running concurrently with our old backup system onto a new Sun "Thumper"
server).

It looks something like this:

backupstate.sqlite3
backupdata-$unixtime.tar.gz

The backupstate file can be entirely re-generated from the contents of
the .tar.gz and only exists for fast lookups of what's in there and
dirty-percentage calculations.  It averages between 1 and 5 percent of
the .tar.gz file size depending on average message sizes for the user.

The tar file format is:

meta/seen
meta/sub
sieve/websieve
files/$uuid
files/$uuid2
files/$uuid3
folders/$uniqueid/cyrus.index
folders/$uniqueid/cyrus.expunge
folders/$uniqueid/cyrus.header
folders/$uniqueid2/...
folders/$uniqueid3/...
imap/user.brong/ => folders/$uniqueid
imap/user.brong.Trash/ => folders/$uniqueid2

Mostly the tar fields are used for exactly what you would expect, but
I am storing the master server's inode number in devminor because 
there's nowhere else to stash it that's rebuild-safe.  Yes, this does
mean I'll wind up fetching all indexes again if I fail over to the
replica, but I guess I can handle that cost :)

  (an aside: to guarantee consistent reads of meta data we do a 
   two-pass run:

   a)
     * fcntl the header file,
     * parse the unique id, 
     * stat all cyrus.* files,
     * unlock.

   b) IF (uniqueid or ANY stat has changed since last run):
     * fcntl the header file, 
     * parse the unique id, 
     * stat all cyrus.* files, 
     * copy contents of all cyrus.* files,
     * unlock.

   This guarantees that the cyrus.index and cyrus.expunge are
   consistent with each other, and are for the uniqueid in the
   header file (no move and recreate under us mister user).
  )
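
The two-pass read in the aside could be sketched roughly like this in Python (the real thing is Perl). This is only an illustration under assumptions: parse_uniqueid is a hypothetical callable that pulls the unique id out of the raw header bytes, the "stat" is reduced to inode, size and mtime, and a shared fcntl lock on cyrus.header stands in for whatever lock mode the real code takes.

```python
import fcntl
import os
from pathlib import Path


def _stat_key(path):
    st = os.stat(path)
    return (st.st_ino, st.st_size, st.st_mtime_ns)


def _locked_view(folder, parse_uniqueid, copy):
    """Lock the header, parse the unique id, stat (and optionally copy)
    every cyrus.* file, then unlock."""
    with open(folder / "cyrus.header", "rb") as fh:
        fcntl.lockf(fh, fcntl.LOCK_SH)
        try:
            header_bytes = fh.read()
            uniqueid = parse_uniqueid(header_bytes)
            files = sorted(folder.glob("cyrus.*"))
            stats = {p.name: _stat_key(p) for p in files}
            contents = None
            if copy:
                # Re-opening and closing cyrus.header would drop the POSIX
                # lock, so reuse the bytes already read through fh.
                contents = {
                    p.name: header_bytes if p.name == "cyrus.header"
                    else p.read_bytes()
                    for p in files
                }
        finally:
            fcntl.lockf(fh, fcntl.LOCK_UN)
    return (uniqueid, stats), contents


def snapshot_folder(folder, last_state, parse_uniqueid):
    """Return (state, contents); contents is None if nothing changed."""
    folder = Path(folder)
    state, _ = _locked_view(folder, parse_uniqueid, copy=False)  # pass (a)
    if state == last_state:
        return last_state, None
    return _locked_view(folder, parse_uniqueid, copy=True)       # pass (b)
```

The state returned from pass (b) is what you would store and hand back as last_state on the next run, so the copied files and the recorded stats come from the same locked window.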

The database format is a bunch of tables built from parsing these names,
as well as "indexed" and "expired" tables, which look like this:

CREATE TABLE indexed (
  folderpath text,    -- folders/$uniqueid
  uid int,
  uuid text,
  PRIMARY KEY (folderpath, uid)
);
CREATE INDEX indexed_uuid ON indexed (uuid);

These tables are handy because their insert and delete triggers update
a 'refcount' value for the uuid files (saves expensive joins is my
dodgy theory!)
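
As a rough illustration of the refcount triggers, here is a self-contained sketch using Python's sqlite3 module. The "files" table, the trigger names and the index name are all invented for the example; only the shape of "indexed" comes from the schema above.

```python
import sqlite3

SCHEMA = """
CREATE TABLE files (               -- hypothetical: one row per files/$uuid
  uuid text PRIMARY KEY,
  refcount int NOT NULL DEFAULT 0
);
CREATE TABLE indexed (
  folderpath text,
  uid int,
  uuid text,
  PRIMARY KEY (folderpath, uid)
);
CREATE INDEX indexed_uuid ON indexed (uuid);

CREATE TRIGGER indexed_ins AFTER INSERT ON indexed BEGIN
  INSERT OR IGNORE INTO files (uuid) VALUES (NEW.uuid);
  UPDATE files SET refcount = refcount + 1 WHERE uuid = NEW.uuid;
END;
CREATE TRIGGER indexed_del AFTER DELETE ON indexed BEGIN
  UPDATE files SET refcount = refcount - 1 WHERE uuid = OLD.uuid;
END;
"""


def open_state(path=":memory:"):
    """Create a fresh state database with the schema and triggers."""
    db = sqlite3.connect(path)
    db.executescript(SCHEMA)
    return db


def refcount(db, uuid):
    """Number of index records currently pointing at this uuid."""
    row = db.execute(
        "SELECT refcount FROM files WHERE uuid = ?", (uuid,)
    ).fetchone()
    return row[0] if row else 0
```

With something like this, a message file whose refcount has dropped to zero can be found with a single-table query rather than a join against every index record.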

It means folder moves are really cheap, each message is only stored
once, etc.

Oh, and we keep an "offset" into the tar file for each entry in the
database, so when an index has been fetched lots of times only the
most recent copy of the index file actually has a valid offset.

To compress the backups, we basically do:

zcat $datafile | decide oldstate.sqlite newstate.sqlite | gzip > $newdatafile

Where 'decide' is a funky piece of logic that can parse a tar stream,
use oldstate to choose files to keep and pipe a tar stream back out
while also generating newstate.sqlite.
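
To give a feel for the streaming approach, here is a minimal Python sketch of a tar-to-tar filter. It is not TarStream.pm and it skips the state databases entirely: the keep callable is a hypothetical stand-in for the oldstate lookup, and the list it returns stands in for what would go into newstate.

```python
import tarfile


def repack(src, dst, keep):
    """Copy a gzipped tar stream from src to dst, keeping only members
    for which keep(name) is true.  Nothing is unpacked to disk."""
    kept = []
    with tarfile.open(fileobj=src, mode="r|gz") as tin, \
         tarfile.open(fileobj=dst, mode="w|gz") as tout:
        for member in tin:
            if not keep(member.name):
                continue  # skipped members are simply read past
            body = tin.extractfile(member) if member.isfile() else None
            tout.addfile(member, body)
            kept.append(member.name)
    return kept
```

Because both ends are opened in stream mode, the input is read strictly front to back and each kept member is copied straight into the output, which is the property that avoids the unpack, delete and re-tar cycle.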

Unsurprisingly, this is all encapsulated in TarStream.pm, which I'm
hoping to polish slightly and push out to CPAN.  I think it's valuable
enough for that.  Our older backup system unpacked entire tar files
onto the filesystem, deleted the files it didn't want, re-tarred the
filesystem and then deleted the rest.  This was painfully IO heavy
and/or memory harassing for no real benefit.  Yay TarStream.

I also have two perl modules: Cyrus::IndexFile and Cyrus::HeaderFile
which can read and write said formats into perl data structures.
The IndexFile one is a bit funky: if it's not a version 9 file it
will barf, but it's pretty extensible just by adding a format
description to a hash at the top - in particular I could easily go
back and support each previous revision (where by easily I mean with
some CVS foo and copious free care factor)

Now I just wish Cyrus stored messages and folders more like this and
either ditched the crazy folderid thing altogether (would mean a
real database for SEEN though) or at least stuck it in mailboxes.db
and asserted some sort of uniqueness constraint on it.  Would make
obtaining consistent backups a lot easier.

Also, we still have the annoying "gap" between a "delete folder" and
the backup run in which new messages are lost forever.  More on that
in a second.

Bron.