Backup strategy for large mailbox stores

Bron Gondwana brong at fastmail.fm
Mon Feb 15 06:28:44 EST 2010


On Mon, Feb 15, 2010 at 10:28:13AM +0000, Gavin McCullagh wrote:
> Hi,
> 
> I'm a relative newbie with cyrus, but I'm interested in this discussion...

Hehe - you should read through the mailing list archives for how FastMail
does backups for a really complex but _FAST_ solution :)
 
> > Things have been working fine, but of late we find that email usage has
> > grown and so our backups take too long to complete.  We use dar to take
> > differential backups every night and transfer the backup files to a
> > remote server.
> 
> Have you identified the bottleneck?  Is it disk access on the mail server
> itself, bandwidth to your remote server, something else?

I can tell you the major cost we identified - stat on every single file
in the spool.
 
> > If the backup is still running in the morning people notice a
> > considerable degradation of the server performance
> 
> Is this a recent linux server?  In principle, you could use ionice to class
> your dar process "idle" which should mean that users will get a better
> share of disk access.  However, that will also mean your backup takes even
> longer.  Probably not ideal.

Not really, no - and I suspect it still causes stat overload and inode
cache flushing and all that bad stuff.

> > Is there a better strategy, probably within the cyrus framework, to
> > take backups efficiently?
> 
> I've wondered about the best means of backup myself.  We've been doing
> something similar using rsync to sync the mail spools and other associated
> data to a remote server.  This works, but I'm slightly worried that we
> continue delivering mail throughout the process.  So our mail spool is
> changing as we back it up.  I've considered the possibility of stopping all
> daemons, taking an LVM snapshot, restarting and backing up the snapshot.
> That way you get a consistent spool where everything was backed up at the
> same moment.  On the other hand, it appears that you can generally
> reconstruct mailboxes, so perhaps I just don't need to worry about that.
> I'd prefer the cosy feeling of knowing the data is in a consistent state
> though.

Our backups are consistent per mailbox - not even per user.  I considered
per-user consistency, but the deadlock risk is too high.

> If you simply can't run an incremental or differential backup in the
> "quiet" time, perhaps it would make more sense to do rolling replication to
> another server.  Then, your backup can stop the replication temporarily,
> backup the replica and start the replication back up -- leaving the live
> server alone.  I imagine this does add load to the main server, but
> distributes it over the whole day.
> 
> http://cyrusimap.web.cmu.edu/imapd/install-replication.html

Yes - that's certainly a solution!  I prefer to back up the master
rather than the replica in our particular setup, because it's more likely
that files will be "hot" on the master.  Not that much more likely,
but hey.  Backing up the master also avoids staleness if replication is
running behind, and means we don't have to keep track of which replicas
might be "down" for some reason.

We run daily backups during the quiet period - they complete for all
users in a little under 5 hours at the moment, with the bottleneck
actually being CPU on the backup server (we gzip everything on the backup
server and it's a single-CPU Sun x4500 - we could buy more CPU if it
became an issue).


REALLY BRIEF OVERVIEW: (I don't mind re-writing this because it keeps
it fresh in my mind!)

* every cyrus server runs a backupd which speaks a very simple protocol
* there are 8 "backup threads" running on the backup server.
* there is one "feeder thread" per Cyrus drive unit RAIDset, with a
  list of all users on partitions on that set of drives, meaning we
  never hit a single set of physical drives with multiple concurrent
  backup requests.  Just to keep load reasonable.  This list gets pulled
  from the list of active users in the database and re-filled once per
  day.
* each backup thread randomly pulls a set of 50 users off a feeder thread
  and backs them up.  50 is a nice balance between thrashing around too
  much and providing easy-to-read feedback :)
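
For the curious, here's a very rough Perl sketch of the feeder/worker
shape.  It's not our actual code - queue refilling, error handling and
backup_user() are all hand-waved - just the idea, using ithreads and
Thread::Queue:

use strict;
use warnings;
use threads;
use threads::shared;
use Thread::Queue;

my @raidsets = qw(set1 set2 set3);                # placeholder names
my %queue    = map { $_ => Thread::Queue->new } @raidsets;
my %busy :shared;            # only one worker per RAIDset at any moment

# once a day the queues get refilled from the active-user database;
# faked here with a few placeholder users
$queue{ $raidsets[ $_ % @raidsets ] }->enqueue("user$_") for 1 .. 200;

sub backup_user { warn "would back up $_[0]\n" }  # stand-in for real work

sub claim_raidset {
    lock %busy;
    my @free = grep { !$busy{$_} && $queue{$_}->pending } keys %queue;
    return unless @free;
    my $set = $free[ rand @free ];
    $busy{$set} = 1;
    return $set;
}

sub release_raidset { lock %busy; $busy{ $_[0] } = 0 }

my @workers = map {
    threads->create(sub {
        # (the real thing retries rather than exiting when every set with
        # pending work is momentarily claimed by another worker)
        while ( my $set = claim_raidset() ) {
            my @batch = $queue{$set}->dequeue_nb(50);   # up to 50 users
            backup_user($_) for @batch;
            release_raidset($set);
        }
    });
} 1 .. 8;

$_->join for @workers;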

FOR EACH USER:

* the backup server contacts the cyrus server's backupd, and:
  a) requests a listing of all folders for this user.
  b) SELECTs each folder - which involves statting each meta file
     (cyrus.index, cyrus.header, cyrus.expunge?).
     It also involves getting the mailbox UNIQUEID from the cyrus.header
     file.
  c) compares this stat data with the stat data stored for that UNIQUEID.
  d) if unchanged, just updates the mailbox name -> uniqueid pointer if
     required (to handle mailbox renames efficiently)
  e) if changed, fetches the FULL CONTENTS of each file, while holding
     an fcntl lock on each file as well, so the folder is locked against
     changes by Cyrus processes.
  f) parses the cyrus.index and cyrus.expunge files, and checks by
     GUID (sha1) that we already have all the files.  If any are not
     present, it fetches and checks them individually.
  g) also fetches any sieve scripts, .seen files, .sub files, etc.
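
In code, the per-user loop looks roughly like this.  The $conn/$state
methods are made up for the sketch (the backupd protocol itself isn't
shown in this mail), but the flow matches the steps above:

use strict;
use warnings;
use Digest::SHA qw(sha1_hex);

sub backup_user {
    my ($conn, $state, $user) = @_;  # backupd connection, per-user store

    for my $folder ($conn->list_folders($user)) {             # step a
        my $meta     = $conn->stat_folder($folder);           # step b
        my $uniqueid = $meta->{uniqueid};        # from cyrus.header

        if ($state->stat_unchanged($uniqueid, $meta)) {       # steps c/d
            # cheap rename handling: only the name -> uniqueid map moves
            $state->set_name($uniqueid, $folder);
            next;
        }

        # step e: folder changed - fetch cyrus.index / cyrus.expunge in
        # full while backupd holds an fcntl lock on them server-side
        my $index = $conn->fetch_meta($folder);

        for my $rec ($state->parse_index($index)) {           # step f
            next if $state->have_guid($rec->{guid});  # already stored
            my $msg = $conn->fetch_message($folder, $rec->{uid});
            die "GUID mismatch for $rec->{uid}"
                unless sha1_hex($msg) eq $rec->{guid};
            $state->append_message($rec->{guid}, $msg);  # into the .tar.gz
        }
        $state->record_stat($uniqueid, $folder, $meta);
    }

    # step g: sieve scripts, .seen and .sub files (not shown)
}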


So - we get de-duplication both within and across folders (but only within
a single user - each user is its own entity!), we get cheap rename support,
and we get super-efficient IO (we never have to stat a message file).

Once the backup server has finished backing up a user and is satisfied that
the backup is complete, it updates a "LastBackedUp" timestamp in the database
for that user.  Every day I get an email with a summary of the number of
backups bucketed by age in hours - just the output of a SQL query like this:

Age     COUNT(*)
NULL    179
4       3638
5       25049
6       50262
7       51353
8       50075
9       1340

Obviously if there's anything over 24 then we have a problem!  (NULL is
users created since the last backup run who haven't had a backup yet...)
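
The query itself is nothing fancy - something like this (simplified;
assumes a MySQL-style TIMESTAMPDIFF and a users.LastBackedUp column):

use strict;
use warnings;
use DBI;

my $dbh = DBI->connect('dbi:mysql:database=backups', 'user', 'secret',
                       { RaiseError => 1 });    # placeholder credentials

my $rows = $dbh->selectall_arrayref(q{
    SELECT TIMESTAMPDIFF(HOUR, LastBackedUp, NOW()) AS Age, COUNT(*)
      FROM users
     GROUP BY Age
     ORDER BY Age
});

printf "%-7s %d\n", (defined $_->[0] ? $_->[0] : 'NULL'), $_->[1]
    for @$rows;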

As you can see, we can do about 50,000 users per hour with this thing.  The
backup format is two files per user (plus a lock file while the backup is
running!)

The first file is a sqlite3 database file, containing indexed lookup data on
which files exist, how they're stored, etc. - including offsets into the
theoretically unzipped .tar file.
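
To give a flavour of it, think of something like the following - the real
schema has more in it, and these table and column names are illustrative
only:

use strict;
use warnings;
use DBI;

my $dbh = DBI->connect('dbi:SQLite:dbname=backupstate.sqlite3',
                       '', '', { RaiseError => 1 });

$dbh->do(q{
    CREATE TABLE IF NOT EXISTS messages (
        guid        TEXT PRIMARY KEY,  -- sha1 of the raw message
        tar_offset  INTEGER NOT NULL,  -- offset in the *unzipped* tar
        size        INTEGER NOT NULL
    )
});

$dbh->do(q{
    CREATE TABLE IF NOT EXISTS folders (
        uniqueid    TEXT PRIMARY KEY,  -- from cyrus.header
        name        TEXT NOT NULL,     -- current mailbox name
        meta_stat   TEXT               -- cached stat data for the meta files
    )
});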

The second is the .tar.gz file itself.  It turns out you can concatenate
.tar.gz files without the empty marker blocks on the end and they just work.
So - every backup run appends new records to the .tar.gz, including (by
abusing a few unused tar header fields) all the metadata we need.  The
.sqlite file can be rebuilt from scratch by streaming through the .tar file
if need be.
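
The append itself is just "write one more gzip member containing one more
tar entry".  A sketch - tar_member() here stands in for our real
header-writing code (which is where the unused-field abuse happens) and
returns a 512-byte header plus data padded to a 512-byte boundary, nothing
else:

use strict;
use warnings;
use IO::Compress::Gzip qw($GzipError);

sub append_record {
    my ($tarball, $name, $data) = @_;

    # tar_member() is hypothetical - crucially it must NOT append the
    # two 512-byte zero end-of-archive blocks
    my $block = tar_member($name, $data);

    # Append => 1 writes a new gzip member after the existing ones;
    # gunzip treats the concatenation as a single stream
    my $z = IO::Compress::Gzip->new($tarball, Append => 1)
        or die "gzip append failed: $GzipError";
    $z->print($block);
    $z->close;
}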

We also calculate what percentage of the file is "dirty" - stuff that no 
longer exists on the master and is over 2 weeks old.  When the file gets 
too dirty, we stream it through a processing function which only selects 
files that we want to keep.
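
That pass doesn't need anything clever either: read raw 512-byte tar
blocks off the gunzip stream, ask the index whether each file is still
wanted, and write the keepers to a fresh .tar.gz.  A stripped-down
sketch, with want_to_keep() standing in for the sqlite lookup (the real
version also has to rewrite the offsets stored in the sqlite file):

use strict;
use warnings;
use IO::Uncompress::Gunzip qw($GunzipError);
use IO::Compress::Gzip     qw($GzipError);

sub compact_backup {
    my ($old, $new, $want_to_keep) = @_;

    # MultiStream makes gunzip read across the concatenated gzip members
    my $in  = IO::Uncompress::Gunzip->new($old, MultiStream => 1)
        or die "gunzip failed: $GunzipError";
    my $out = IO::Compress::Gzip->new($new)
        or die "gzip failed: $GzipError";

    while ($in->read(my $hdr, 512) == 512) {
        last if $hdr eq "\0" x 512;          # end-of-archive, if present

        my $name  = unpack 'Z100', $hdr;
        my $octal = substr $hdr, 124, 12;    # size field, octal
        $octal =~ tr/0-7//cd;                # strip NULs and spaces
        my $size   = oct $octal;
        my $padded = $size + ((512 - $size % 512) % 512);

        my $body = '';
        if ($padded) {
            $in->read($body, $padded) == $padded
                or die "short read on $name";
        }
        $out->print($hdr . $body) if $want_to_keep->($name);
    }
    $out->close;
    $in->close;
}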

The overhead of the .sqlite file is pretty low - around the 4% mark here:

-rw-r--r--   1 fmuser   402618048 Feb 14 21:25 backupdata-1265583567.tar.gz
-rw-r--r--   1 fmuser   15737856 Feb 14 21:25 backupstate.sqlite3

Those are my backups.  The datestamp in the backupdata filename is when the
file was created.  Every time it gets cleaned, it gets a new name.

I do intend to abstract this stuff out at some point.  It's very nice, and
super efficient (well, I could make it more efficient by moving away from
pure Perl to XS, but that would be micro-optimising a bit too much) - it
has its own implementations of a bunch of formats built in: IndexFile,
HeaderFile and TarStream Perl modules that can read and write those
formats :)

Bron.

