choosing a file system
Bron Gondwana
brong at fastmail.fm
Tue Dec 30 19:51:29 EST 2008
On Tue, Dec 30, 2008 at 02:43:14PM -0700, Shawn Nock wrote:
> Bron and the fastmail guys could tell you more about reiserfs... we've
> used RH&SuSE/reiserfs/EMC for quite a while and we are very happy.
Yeah, sure could :)
You can probably find plenty of stuff from me in the archives about our
setup - the basic things are:
* separate metadata on RAID1 10kRPM (or 15kRPM in the new boxes) drives.
* data files on RAID5 big slow drives - data IO isn't a limiting factor
* 300Gb "slots" with 15Gb associated meta drives, like this:
Filesystem    1K-blocks      Used Available Use% Mounted on
/dev/sdb6      14016208   8080360   5935848  58% /mnt/meta6
/dev/sdb7      14016208   8064848   5951360  58% /mnt/meta7
/dev/sdb8      14016208   8498812   5517396  61% /mnt/meta8
/dev/sdd2     292959500 248086796  44872704  85% /mnt/data6
/dev/sdd3     292959500 242722420  50237080  83% /mnt/data7
/dev/sdd4     292959500 248840432  44119068  85% /mnt/data8
As you can see, that balances out pretty nicely. We also store
per-user Bayes databases on the associated meta drives.
We balance our disk usage by moving users between stores when usage
reaches 88% on any partition. We get emailed if it goes above 92%
and paged if it goes above 95%.
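For what it's worth, the check itself is trivial - something along
these lines (a sketch only, not our actual tooling; the partition
list and the prints standing in for the email/pager alerts are made
up):

# Sketch of a usage check with the thresholds described above.
import os

PARTITIONS = ["/mnt/meta6", "/mnt/meta7", "/mnt/meta8",
              "/mnt/data6", "/mnt/data7", "/mnt/data8"]
MOVE_AT, EMAIL_AT, PAGE_AT = 88, 92, 95   # percent used

def percent_used(path):
    st = os.statvfs(path)
    return 100.0 * (st.f_blocks - st.f_bfree) / st.f_blocks

for part in PARTITIONS:
    pct = percent_used(part)
    if pct > PAGE_AT:
        print("PAGE: %s at %.0f%%" % (part, pct))    # stand-in for paging
    elif pct > EMAIL_AT:
        print("EMAIL: %s at %.0f%%" % (part, pct))   # stand-in for email
    elif pct > MOVE_AT:
        print("move users off %s (%.0f%%)" % (part, pct))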
Replication. We have multiple "slots" on each server, and since
they are all the same size, we have replication pairs spread pretty
randomly around the hosts, so the failure of any one drive unit
(SCSI-attached SATA) or IMAP server doesn't significantly overload
any other single machine. By using Cyrus replication rather than,
say, DRBD, a filesystem corruption should only affect a single
partition, which won't take so long to fsck.
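The placement itself doesn't need to be clever - the point is just
that one box's replicas are fanned out over many others. A toy
illustration (host names, slot counts and the random choice are all
invented, not our real placement code):

import random
from collections import Counter

hosts = ["imap1", "imap2", "imap3", "imap4", "imap5"]
slots_per_host = 8

# primary (host, slot) -> host holding its replica
pairs = {}
for host in hosts:
    for slot in range(slots_per_host):
        pairs[(host, slot)] = random.choice([h for h in hosts if h != host])

# If imap1 dies, its load spreads over the survivors rather than
# landing on one machine:
print(Counter(rep for (h, slot), rep in pairs.items() if h == "imap1"))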
Moving users is easy - we run a sync_server on the Cyrus master, and
just create a custom config directory with symlinks into the tree on
the real server and a rewritten piece of mailboxes.db so we can
rename them during the move if needed. It's all automatic.
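The per-user push then boils down to pointing sync_client at that
custom config and the destination - roughly like this (a sketch only;
the paths, target host and the preparation of the config directory
with its symlinks and rewritten mailboxes.db are glossed over and
made up here):

import subprocess

# Assumed to have been prepared already, as described above: a config
# whose partitions are symlinks into the real spool plus a rewritten
# mailboxes.db for this user.  Path and target host are hypothetical.
CUSTOM_CONF = "/var/tmp/move-slot7/imapd.conf"
TARGET = "imap5.internal"

def push_user(user):
    # sync_client -C <altconfig> -S <target> -u <user>
    subprocess.check_call(["/usr/cyrus/bin/sync_client",
                           "-C", CUSTOM_CONF, "-S", TARGET, "-u", user])

push_user("someuser")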
We also have a "CheckReplication" Perl module that can be used to
compare the two ends of a pair and make sure everything is the same.
It does full per-message flag checks, random SHA1 integrity checks,
etc. It does require a custom patch to expose the GUID (as
DIGEST.SHA1) via IMAP.
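The idea is nothing fancy - fetch UID, FLAGS and the patched-in
DIGEST.SHA1 item from both ends and diff them. A bare-bones
illustration (Python rather than the actual Perl module; hosts,
credentials and the response parsing are simplified and made up, and
DIGEST.SHA1 only exists with that patch applied):

import imaplib, re

FETCH_ITEMS = "(UID FLAGS DIGEST.SHA1)"
LINE_RE = re.compile(
    r"UID (\d+) FLAGS \(([^)]*)\) DIGEST\.SHA1 \(?\"?([0-9a-fA-F]+)\"?\)?")

def snapshot(host, user, password, mailbox="INBOX"):
    conn = imaplib.IMAP4_SSL(host)
    conn.login(user, password)
    conn.select(mailbox, readonly=True)
    typ, data = conn.uid("FETCH", "1:*", FETCH_ITEMS)
    msgs = {}
    for item in data:
        if not isinstance(item, bytes):
            continue
        m = LINE_RE.search(item.decode("utf-8", "replace"))
        if m:
            uid, flags, sha1 = m.groups()
            msgs[uid] = (frozenset(flags.split()), sha1.lower())
    conn.logout()
    return msgs

master = snapshot("imap3.internal", "someuser", "secret")
replica = snapshot("imap3-replica.internal", "someuser", "secret")
for uid in sorted(set(master) | set(replica), key=int):
    if master.get(uid) != replica.get(uid):
        print("UID %s differs: %r vs %r"
              % (uid, master.get(uid), replica.get(uid)))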
I lost an entire drive unit on the 26th - it just stopped
responding, with 8 x 1TB drives in it.
I tried rebooting everything, then switched the affected stores over
to their replicas. Total downtime for those users was about 15
minutes, because I tried the reboot first just in case (there's a
chance that some messages were delivered and not yet replicated,
so it's better not to bring up the replica uncleanly until you're
sure there's no other choice).
In the end I decided that it wasn't recoverable quickly enough to
be viable, so I chose new replica pairs for the slots that had been
on that drive unit (we keep some empty space on our machines for
just this eventuality) and started up another handy little script,
"sync_all_users", which runs sync_client -u for every user and then
starts the rolling sync_client again at the end. It took about
16 hours to bring everything back to fully replicated again.
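There's nothing magic in sync_all_users either - it's essentially a
loop like this (a sketch only; the binary path and the user listing
are stand-ins for our real config):

import subprocess

SYNC_CLIENT = "/usr/cyrus/bin/sync_client"   # path is an assumption

def list_users():
    # stand-in: in reality the user list comes from our own database
    return ["user1", "user2", "user3"]

# bulk push: one sync_client -u run per user
for user in list_users():
    subprocess.check_call([SYNC_CLIENT, "-u", user])

# then kick off rolling replication again (sync_client -r), which is
# normally started from cyrus.conf rather than by hand
subprocess.check_call([SYNC_CLIENT, "-r"])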
Bron.