UC Davis Cyrus Incident September 2007

Tue Oct 16 22:45:33 EDT 2007

On Wed, Oct 17, 2007 at 11:11:06AM +1000, Rob Mueller wrote:
> One option you have is that rather than creating separate "Zones" in the OS, 
> you just create separate cyrus instances yourself. We do this at FastMail. 
> Basically we've partitioned all our storage into 300G units, and each 
> partition has a completely separate config dir, cyrus.conf, imapd.conf, 
> mailboxes.db, partition, etc and runs a separate cyrus master process. It 
> required some automation (eg we have scripts that auto-generate the 
> appropriate cyrus.conf, imapd.conf and init scripts), and it means every 
> cyrus command run needs the -C parameter to tell which instance you're 
> dealing with, but it means that each instance of cyrus is much smaller (eg 
> much smaller mailboxes.db file, deliver.db file, etc) and there's less 
> "global" contention points.

It's amazing how much you can abstract away behind your toolkit
eventually.  It's a long time since I last had to type a -C parameter by
hand, since I can just say (in perl)

my $Slot = ME::ImapSlot->new($SlotName);
$Slot->RunCommand("reconstruct", "-r", $MailboxName);

Or from the command line:

% cyr -r store23 dbtool show user.brong.

(which will show all the folders for me on the replica for my store)
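
A wrapper of this kind can be quite small.  The following is only a
minimal sketch in Python (our real toolkit is Perl and is not shown
here).  The config layout under /etc/cyrus/ and the name cyrus_argv
are assumptions made up for illustration.  The one thing taken from
the above is that every Cyrus command gets a -C pointing at the right
instance's imapd.conf.

```python
from typing import List

# Hypothetical layout: one imapd.conf per slot under this directory.
CONFIG_ROOT = "/etc/cyrus"


def cyrus_argv(slot_name: str, command: str, *args: str) -> List[str]:
    """Build an argv for a Cyrus command, adding -C for the given slot."""
    config = f"{CONFIG_ROOT}/{slot_name}/imapd.conf"
    return [command, "-C", config, *args]
```

Calling cyrus_argv("slot608", "reconstruct", "-r", "user.brong") gives
a list that can be handed straight to subprocess.run(), so nobody has
to remember which config file belongs to which instance.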

NOTE: you want really clear terminology if you go this route to keep
the complexity under control.  Here's what we use.

"slot" - a pair of physical partitions on a machine.  One meta, one
data.  So for example:

imap6$ df | grep ta8
/dev/sdb8             14016208   9897628   4118580  71% /mnt/meta8
/dev/sdd4            292959500 243479660  49479840  84% /mnt/data8

These are named following a strict pattern based on the host and
partition numbers, so this is slot608 because it's the 8th slot
on imap server number 6.
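
Because the names are strictly patterned, the host and mount points
can be derived from the slot name alone.  A rough Python sketch,
assuming the last two digits are the partition number and everything
before them is the server number (inferred from names like slot608
and slot503, not a documented rule), and assuming hosts are called
imapN as in the prompt above:

```python
import re
from typing import NamedTuple


class Slot(NamedTuple):
    name: str
    host: str
    meta_path: str
    data_path: str


def parse_slot(name: str) -> Slot:
    """Split e.g. 'slot608' into server 6 and partition 8."""
    # Assumption: two-digit partition suffix, server number before it.
    match = re.fullmatch(r"slot(\d+)(\d{2})", name)
    if match is None:
        raise ValueError(f"not a slot name: {name!r}")
    server = int(match.group(1))
    partition = int(match.group(2))
    return Slot(
        name=name,
        host=f"imap{server}",
        meta_path=f"/mnt/meta{partition}",
        data_path=f"/mnt/data{partition}",
    )
```

With that, parse_slot("slot608") yields host imap6 with /mnt/meta8
and /mnt/data8, matching the df output above.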

"store" - a logical email server instance.  These consist of two slots
(a master and a replica).  The terms "Preferred Master" and "Current
Master" record which slot should normally be the master and which
one is the master right now.

Store: store23
Master: slot608
Replica: slot308

Store: store24
Master: slot908
Replica: slot503

My mailbox happens to be on store23.  The layout logic is a bit random -
basically I've tried to spread the load around so if any one machine is
taken out of service the masters go to many other machines so no one
machine takes a sudden load hit.

We use the IPaddr2 tool from linux-ha.org (heartbeat-2 package on
Debian) to assign IP addresses in our startup scripts.  These are
tied to hostnames, so we have hostnames "store23m.internal" and
"store23r.internal" tied to the IP addresses, and whichever slot is
currently master will bind to the store23m.internal IP address, so
no matter what the failover status you can connect to store23m.internal
and know you're talking to the master.

We also have a status tool that gives a breakdown of a bunch of
interesting layout things, including stores running "backwards", 
i.e. PreferredMaster != CurrentMaster.

MOVED:
store13   slot405 => slot703
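
The "backwards" check itself is just a comparison over the layout
table.  A small Python sketch, with sample data only: it assumes the
arrow reads preferred => current, and that slot405 is store13's
preferred master, neither of which is spelled out above.

```python
from typing import Dict, List, Tuple

# store name -> (preferred master slot, current master slot)
Layout = Dict[str, Tuple[str, str]]


def moved_report(layout: Layout) -> List[str]:
    """List stores whose current master differs from the preferred one."""
    lines = []
    for store in sorted(layout):
        preferred, current = layout[store]
        if preferred != current:
            lines.append(f"{store}   {preferred} => {current}")
    return lines


SAMPLE: Layout = {
    "store23": ("slot608", "slot608"),  # running on its preferred master
    "store13": ("slot405", "slot703"),  # assumed: failed over to slot703
}
```

Running moved_report(SAMPLE) returns only the store13 line, since
store23 is where it is meant to be.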

And the Perl APIs for dealing with these abstract stuff away very
nicely, so I can just say:

my $User = ME::User->new_cyrusname('brong');
my $Store = $User->ImapStore();
my $Slot = $Store->MasterSlot();
$Slot->RunCommand('cyr_synclog', '-u', 'brong');

and under the hood it will connect to the correct machine via ssh
and run the command to add "USER brong" to the sync log for slot608.
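
For anyone building something similar, the routing step looks roughly
like this in Python.  It is a sketch under assumptions, not our Perl
code: the three lookup tables stand in for whatever database holds
the real mapping, the /etc/cyrus/ config path is hypothetical, and
the host name imap6 is taken from the df prompt earlier.

```python
import shlex
from typing import List

# Stand-in lookup tables, using the examples from this message.
USER_STORE = {"brong": "store23"}
STORE_MASTER = {"store23": "slot608"}
SLOT_HOST = {"slot608": "imap6"}


def remote_argv(user: str, command: str, *args: str) -> List[str]:
    """Build an ssh argv that runs a Cyrus command on the user's master slot."""
    slot = STORE_MASTER[USER_STORE[user]]
    host = SLOT_HOST[slot]
    # Hypothetical per-slot config path; every command needs -C.
    config = f"/etc/cyrus/{slot}/imapd.conf"
    # ssh hands the remote side a single string, so quote it properly.
    remote = shlex.join([command, "-C", config, *args])
    return ["ssh", host, remote]
```

Here remote_argv("brong", "cyr_synclog", "-u", "brong") resolves
brong to store23, then to slot608 on imap6, without the caller
knowing any of those names.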

You really need this much infrastructure if you want to run this many
cyrus instances (over 100 at last count across all our machines - you
don't want to be doing ANYTHING by hand).

And right now, just one of those stores is running Cyrus 2.3.10
pre-release plus various funky bugfixes and our patches ported forwards.
It took about 10 lines of code to make this happen thanks to the
massively templated nature of everything - build a new cyrus package
that installs to /usr/cyrusnew/ and point that store there in all
the initscripts and config files.  Not only is it all automatic, but
there's a master config file that can be interrogated to tell us about
it.  Nothing gets changed by hand, everything is version controlled.

> Actually the reason for breaking up into smaller units was for replication 
> reasons, but despite that, I think it's been a good move in many respects. 
> Of course if you have global shared folders or want a murder environment, 
> then that probably just makes things harder and each instance needs a 
> "complete" mailboxes.db file I believe anyway, but we've decided that we 
> don't need the one main thing murder offers which is global shared folders.

The other nice thing is that we can actually restore one of these from
backups in under a day, or copy it to a new drive unit.  With 2TB
partitions (as we had before this change) it would take a week to do
anything to the partition - you are basically tied to that set of disks.
Users cope much better with a single day outage than a week outage in
the "worst case" - and besides we can usually recover them to multiple
different machines and turn it down to a couple of hours outage.

Bron ( and that's if we suffered a major corruption event that took
       out master AND replica for a store )