choosing a file system

Tue Dec 30 19:51:29 EST 2008

On Tue, Dec 30, 2008 at 02:43:14PM -0700, Shawn Nock wrote:
> Bron and the fastmail guys could tell you more about reiserfs... we've
> used RH&SuSE/reiserfs/EMC for quite a while and we are very happy.

Yeah, sure could :)

You can probably find plenty of stuff from me in the archives about our
setup - the basic things are:

* separate metadata on RAID1 10kRPM (or 15kRPM in the new boxes) drives.
* data files on RAID5 big slow drives - data IO isn't a limiting factor
* 300GB "slots" with 15GB associated meta drives, like this:

Filesystem           1K-blocks      Used Available Use% Mounted on
/dev/sdb6             14016208   8080360   5935848  58% /mnt/meta6
/dev/sdb7             14016208   8064848   5951360  58% /mnt/meta7
/dev/sdb8             14016208   8498812   5517396  61% /mnt/meta8
/dev/sdd2            292959500 248086796  44872704  85% /mnt/data6
/dev/sdd3            292959500 242722420  50237080  83% /mnt/data7
/dev/sdd4            292959500 248840432  44119068  85% /mnt/data8

As you can see, that balances out pretty nicely.  We also store
per-user bayes databases on the associated meta drives.

We balance our disk usage by moving users between stores when usage
reaches 88% on any partition.  We get emailed if it goes above 92%
and paged if it goes above 95%.
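
A minimal sketch of those thresholds in Python.  The percentages are the
ones given above; the function names, the action labels and the idea of
parsing df output are assumptions made for illustration, not the actual
monitoring code.

```python
# Thresholds from the post; function and action names are hypothetical.
MOVE_AT, EMAIL_AT, PAGE_AT = 88, 92, 95

def action_for(use_percent):
    """Return the most severe action triggered by a partition's Use%."""
    if use_percent > PAGE_AT:
        return "page"
    if use_percent > EMAIL_AT:
        return "email"
    if use_percent >= MOVE_AT:
        return "move-users"
    return "ok"

def parse_df_line(line):
    """Parse one df-style line into (mount point, Use% as an int)."""
    fields = line.split()
    return fields[5], int(fields[4].rstrip("%"))

# Two lines taken from the df listing above.
sample = """\
/dev/sdb8             14016208   8498812   5517396  61% /mnt/meta8
/dev/sdd2            292959500 248086796  44872704  85% /mnt/data6
"""
report = {mount: action_for(pct)
          for mount, pct in map(parse_df_line, sample.splitlines())}
```

Both sample partitions are below 88%, so neither would trigger a move yet.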

Replication.  We have multiple "slots" on each server, and since
they are all the same size, we have replication pairs spread pretty
randomly around the hosts, so the failure of any one drive unit
(SCSI-attached SATA) or imap server doesn't significantly overload
any one other machine.  By using Cyrus replication rather than,
say, DRBD, a filesystem corruption should only affect a single
partition, which won't take so long to fsck.
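
To illustrate why spreading the pairs helps, here is a rough Python sketch.
Note that it uses a simple round-robin rotation rather than the random
placement described above, and the host and slot names are made up.

```python
from collections import Counter
from itertools import cycle

def assign_replicas(slots_by_host):
    """Map each master slot to a replica host other than its own.

    Replicas of one host's slots are rotated across the remaining hosts,
    so losing that host spreads the failover load instead of doubling
    the load on a single partner.  Needs at least two hosts.
    """
    hosts = sorted(slots_by_host)
    placement = {}
    for i, host in enumerate(hosts):
        # Every host except this one, starting with its next neighbour.
        others = hosts[i + 1:] + hosts[:i]
        targets = cycle(others)
        for slot in slots_by_host[host]:
            placement[slot] = next(targets)
    return placement

# Hypothetical layout: four servers with three slots each.
slots = {
    "imap1": ["slot1", "slot2", "slot3"],
    "imap2": ["slot4", "slot5", "slot6"],
    "imap3": ["slot7", "slot8", "slot9"],
    "imap4": ["slot10", "slot11", "slot12"],
}
placement = assign_replicas(slots)
# How many of imap1's slots would fail over to each other host.
failover_load = Counter(placement[s] for s in slots["imap1"])
```

In this layout, losing imap1 sends one slot's worth of users to each of the
other three machines rather than three slots' worth to one.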

Moving users is easy - we run a sync_server on the Cyrus master, and
just create a custom config directory with symlinks into the tree on
the real server and a rewritten piece of mailboxes.db so we can
rename them during the move if needed.  It's all automatic.

We also have a "CheckReplication" perl module that can be used to
compare two ends to make sure everything is the same.  It does full
per-message flags checks, random sha1 integrity checks, etc.
It does require a custom patch to expose the GUID (as DIGEST.SHA1)
via IMAP.
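
The kind of comparison described above might look roughly like this in
Python.  This is not the CheckReplication module itself (that one is Perl);
the data layout, the function names and the assumption that the GUID is the
SHA-1 of the message body are all illustrative.

```python
import hashlib
import random

def compare_ends(master, replica, sample_size=2, rng=None):
    """Return a list of human-readable differences between two ends.

    Each end maps (mailbox, uid) -> {"flags": set, "guid": str, "body": bytes}.
    """
    rng = rng or random.Random()
    problems = []
    # Full pass: every message must exist on both ends with equal flags and GUID.
    for key in sorted(set(master) | set(replica)):
        if key not in replica:
            problems.append(f"{key}: missing on replica")
        elif key not in master:
            problems.append(f"{key}: missing on master")
        elif master[key]["flags"] != replica[key]["flags"]:
            problems.append(f"{key}: flags differ")
        elif master[key]["guid"] != replica[key]["guid"]:
            problems.append(f"{key}: GUID differs")
    # Spot check: for a random sample, the GUID should match the body's SHA-1.
    common = sorted(set(master) & set(replica))
    for key in rng.sample(common, min(sample_size, len(common))):
        for name, end in (("master", master), ("replica", replica)):
            if hashlib.sha1(end[key]["body"]).hexdigest() != end[key]["guid"]:
                problems.append(f"{key}: SHA-1 mismatch on {name}")
    return problems

def msg(body, *flags):
    """Build a hypothetical message record for the example."""
    return {"flags": set(flags), "guid": hashlib.sha1(body).hexdigest(), "body": body}

master = {("user.bron", 1): msg(b"hello", "\\Seen"), ("user.bron", 2): msg(b"world")}
replica = {("user.bron", 1): msg(b"hello"), ("user.bron", 2): msg(b"world")}
problems = compare_ends(master, replica, rng=random.Random(0))
```

Here the only difference is the missing \Seen flag on the replica's copy of
message 1, so that is the single problem reported.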

I lost an entire drive unit on the 26th.  It stopped responding.
8 x 1TB drives in it.

I tried rebooting everything, then switched the affected stores over
to their replicas.  Total downtime for those users was about 15
minutes, because I tried the reboot first just in case.  (There's a
chance that some messages were delivered and not yet replicated,
so it's better not to bring up the replica uncleanly until you're
sure there's no other choice.)

In the end I decided that it wasn't recoverable quickly enough to
be viable, so I chose new replica pairs for the slots that had been
on that drive unit (we keep some empty space on our machines for
just this eventuality).  Then I started up another handy little
script, "sync_all_users", which runs sync_client -u for every user
and then starts the rolling sync_client again at the end.  It took about
16 hours to bring everything back to fully replicated again.
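
A rough Python sketch of what a script like that could do.  Only
"sync_client -u" comes from the description above; the function signature,
the injected runner and the "-r" rolling-mode invocation in the example are
assumptions, and a real run would also need whatever server/config options
the site uses.

```python
import subprocess

def sync_all_users(users, rolling_cmd, run=subprocess.run):
    """Run `sync_client -u USER` for each user, then restart rolling sync.

    Returns the users whose sync exited non-zero so they can be retried.
    """
    failed = []
    for user in users:
        result = run(["sync_client", "-u", user])
        if result.returncode != 0:
            failed.append(user)
    # Only resume rolling replication after the full pass has finished.
    run(rolling_cmd)
    return failed

# Dry run with a fake runner that records commands instead of executing them.
class FakeResult:
    returncode = 0

calls = []

def fake_run(cmd):
    calls.append(cmd)
    return FakeResult()

failed = sync_all_users(["alice", "bob"], ["sync_client", "-r"], run=fake_run)
```

Passing the runner in as a parameter makes it easy to check the command
sequence before pointing it at a real server.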

Bron.