Robust atomic multiple folder rename
    Bron Gondwana 
    brong at fastmail.fm
       
    Wed Jan  1 17:52:13 EST 2014
    
    
  
So I have a problem.  User renames are
a) not atomic (in fact, any rename of a folder with subfolders is not atomic)
b) bandwidth wasteful with replication if a sync_client picks up the wrong set of folder names too early (it winds up deleting the old user, then having to copy all the messages again)
(a) is a general problem.  There are failure modes which can leave manual cleanup required.  I hate that.
(b) is what's causing me issues RIGHT NOW.
I sat down with Rob Mueller last Friday to talk this through, and we've come up with what I believe is a good solution to both these problems.  It's crash-safe, atomic, and cleans up after itself automatically (immediately if a folder is requested, otherwise on the next cyr_expire run).
This change requires the flexible mboxlist format now present in the master tree.  This format is a key-value format allowing arbitrary items to be stored in the mboxlist file.
Consider the following rename:
user.foo.sub         => user.foo.new
user.foo.sub.A       => user.foo.new.A
user.foo.sub.B       => user.foo.new.B
(it will work for all other cases as well)
First we take an exclusive namelock on user.foo.sub.  This namelock will be held for the entire duration of the rename.
Next we take an exclusive lock on mailboxes.db, and insert/replace the following records:
user.foo.sub         %(NAMELOCK user.foo.sub TYPE RENAMELOCKED)
user.foo.sub.A       %(NAMELOCK user.foo.sub TYPE RENAMELOCKED)
user.foo.sub.B       %(NAMELOCK user.foo.sub TYPE RENAMELOCKED)
user.foo.new         %(NAMELOCK user.foo.sub TYPE RENAMETEMP)
user.foo.new.A       %(NAMELOCK user.foo.sub TYPE RENAMETEMP)
user.foo.new.B       %(NAMELOCK user.foo.sub TYPE RENAMETEMP)
And we release the mailboxes.db lock, allowing the rest of the server to run happily.
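As a rough illustration of that first step, here is a minimal sketch in Python (not Cyrus code).  It models mailboxes.db as a plain dict and each %(...) record as a dict of fields; the function name plan_rename and its arguments are made up for the example.

```python
def plan_rename(mailboxes, src_root, dst_root):
    """Build the records to insert/replace under the mailboxes.db lock.

    mailboxes: iterable of existing mailbox names (hypothetical input).
    Every record points its NAMELOCK field at the source root.
    """
    records = {}
    for name in mailboxes:
        # Match the root itself and its subfolders, but not siblings
        # that merely share a prefix (user.foo.subway is not a child).
        if name == src_root or name.startswith(src_root + "."):
            new_name = dst_root + name[len(src_root):]
            records[name] = {"NAMELOCK": src_root, "TYPE": "RENAMELOCKED"}
            records[new_name] = {"NAMELOCK": src_root, "TYPE": "RENAMETEMP"}
    return records


if __name__ == "__main__":
    existing = ["user.foo.sub", "user.foo.sub.A", "user.foo.sub.B"]
    for name, rec in sorted(plan_rename(existing, "user.foo.sub", "user.foo.new").items()):
        print(name, rec)
```

In the real thing all six records would be written in one transaction while the exclusive mailboxes.db lock is held.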
We then create the on-disk directories for user.foo.new and friends, and we copy EVERYTHING
into the new locations, including building the cyrus.index files, linking the spool, etc.
Any other process which tries to open any of these mailboxes will see the 'TYPE RENAMELOCKED' field and block on the lock file named by the NAMELOCK field until the rename is either finished or aborted.  Because they block, there are no spurious errors returned to clients during the rename.
A successful rename - after all the files are in place, we take an exclusive lock on the mailboxes.db again and in an atomic transaction we make the following changes:
user.foo.sub         %(NAMELOCK user.foo.sub TYPE RENAMETEMP)
user.foo.sub.A       %(NAMELOCK user.foo.sub TYPE RENAMETEMP)
user.foo.sub.B       %(NAMELOCK user.foo.sub TYPE RENAMETEMP)
user.foo.new         %()
user.foo.new.A       %()
user.foo.new.B       %()
So the new folders are now ready to use, but this process is still holding the old folders, and still holding the old namelock.
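The commit step above can be sketched as a single pass over the records.  Again this is only an illustration in Python with mailboxes.db modelled as a dict; commit_rename is a hypothetical name, and a real implementation would do this inside one database transaction.

```python
def commit_rename(db, src_root):
    """Return a new mapping with the atomic 'success' flip applied.

    Old folders (RENAMELOCKED) become RENAMETEMP, so they are the ones
    cleaned up from here on; new folders (RENAMETEMP) become ordinary
    records, shown as %() in the listing above.
    """
    updated = {}
    for name, rec in db.items():
        ours = rec.get("NAMELOCK") == src_root
        if ours and rec.get("TYPE") == "RENAMELOCKED":
            updated[name] = {"NAMELOCK": src_root, "TYPE": "RENAMETEMP"}
        elif ours and rec.get("TYPE") == "RENAMETEMP":
            updated[name] = {}
        else:
            # Unrelated mailboxes are copied through untouched.
            updated[name] = dict(rec)
    return updated
```

Building a fresh mapping rather than editing in place mirrors the all-or-nothing nature of the transaction: readers see either the old set of records or the new one.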
At this point, we go through all the old folders and delete the on-disk files.  Once that's done, we can do a single update in the mailboxes.db:
user.foo.sub         %(TYPE DELETED)
(we keep DELETED tombstones in mailboxes.db now to ensure that UIDVALIDITY never gets reused, but also to detect the difference between "folder created on A" and "folder deleted on B" in a multi-master replication setup)
-------------
A failed rename - if, at any point before the atomic rename updates happen, there is an error or the process doing the rename crashes, there will potentially be files on disk in the destination folders, and the initial records will still be in the mailboxes.db.
If any other process tries to open one of those folders (including cyr_expire, which visits every folder) - it will attempt to get the namelock on the source root as per the NAMELOCK field in mailboxes.db.  When it obtains that lock, it will check to see if another process finished the cleanup first.  If not, it will either:
a) for RENAMELOCKED - just update the mailboxes.db to say that the folder is no longer renamelocked, then go about its business.
b) for RENAMETEMP - delete all the files on disk, and remove the record from the mailboxes.db (or restore the old DELETED record if there was one)
At which point, cleanup is completed.
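A minimal sketch of that cleanup decision, in Python, for a process that has already obtained the namelock.  The names here are invented for the example: recover_record, the remove_files callback, and the tombstones set (standing in for "there was a DELETED record before the rename"; how that is actually remembered is not settled above).

```python
def recover_record(db, name, remove_files, tombstones=frozenset()):
    """Clean up one stale record.  The caller must hold the namelock."""
    rec = db.get(name)
    if rec is None or rec.get("TYPE") not in ("RENAMELOCKED", "RENAMETEMP"):
        # Another process finished the cleanup before we got the lock.
        return "nothing-to-do"
    if rec["TYPE"] == "RENAMELOCKED":
        # Case (a): the folder simply stops being renamelocked.
        db[name] = {}
        return "unlocked"
    # Case (b): temporary folder, so drop its files and its record.
    remove_files(name)
    if name in tombstones:
        db[name] = {"TYPE": "DELETED"}
        return "tombstone-restored"
    del db[name]
    return "removed"
```

Re-checking the record after taking the lock is what makes it safe for several processes to race to the cleanup: the losers just see an ordinary record and carry on.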
This will work no matter what the rename (though there might need to be some extra magic added for cross-partition renames to ensure we can clean them up safely too, since they don't have a second mailboxes.db entry).
In the interests of replication speed, we MAY convert sync_client to trylock rather than locking in these cases, and if it fails it will just insert the mailbox into the next synclog and then continue with other mailboxes.  I definitely need to make it run the 'USER' sync earlier and add all the mailboxes found in that into the general pool of mailboxes so that it detects user renames better anyway.
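To make the trylock idea concrete, here is a small Python sketch.  It uses flock() purely for illustration and is not a statement about how Cyrus takes its namelocks; try_namelock, sync_mailboxes, lock_path_for and sync_one are all hypothetical names, and the returned list stands in for "insert into the next synclog".

```python
import fcntl
import os


def try_namelock(path):
    """Try to take an exclusive lock without blocking.

    Returns an open file descriptor on success, or None if the lock
    is currently held by someone else.
    """
    fd = os.open(path, os.O_RDWR | os.O_CREAT, 0o600)
    try:
        fcntl.flock(fd, fcntl.LOCK_EX | fcntl.LOCK_NB)
    except BlockingIOError:
        os.close(fd)
        return None
    return fd


def sync_mailboxes(names, lock_path_for, sync_one):
    """Sync what we can right now; return the names to retry later."""
    deferred = []
    for name in names:
        fd = try_namelock(lock_path_for(name))
        if fd is None:
            # A rename is in progress: defer rather than stall the run.
            deferred.append(name)
            continue
        try:
            sync_one(name)
        finally:
            fcntl.flock(fd, fcntl.LOCK_UN)
            os.close(fd)
    return deferred
```

The trade-off is that a deferred mailbox replicates one synclog cycle later, in exchange for the rest of the run not waiting behind a long rename.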
Bron.
-- 
  Bron Gondwana
  brong at fastmail.fm
    
    