choosing a file system
brong at fastmail.fm
Thu Jan 8 19:23:23 EST 2009
On Thu, Jan 08, 2009 at 05:20:00PM +0200, Janne Peltonen wrote:
> If I'm still following after reading through all this discussion,
> everyone who is actually using ReiserFS (v3) appears to be very content
> with it, even with very large installations. Apparently the fact that
> ReiserFS uses the BKL in places doesn't hurt performance too badly, even
> with multi core systems? Another thing I don't recall being mentioned
> was fragmentation - ext3 appears to have a problem with it, in typical
> Cyrus usage, but how does ReiserFS compare to it?
Yeah, I'm surprised the BKL hasn't hurt us more. Fragmentation, yeah
it does hurt performance a bit. We run a patch which causes a skiplist
checkpoint every time it runs a "recovery", which includes every
restart. We also tune skiplists to checkpoint more frequently in
everyday use. This helps reduce meta fragmentation.
For data fragmentation - we don't care. Honestly. Data IO is so rare.
The main time it matters is if someone does a body search.
Which leaves... index files. The worst case is files that are only
ever appended to, with no records ever deleted. Each time you expunge
a mailbox (even with delayed expunge) it causes a complete rewrite of
the cyrus.index file, which lays the file out afresh; a mailbox that
is never expunged never gets that rewrite.
I also wrote a filthy little script (attached) which can repack cyrus
meta directories. I'm not 100% certain that it's problem free though,
so I only run it on replicas. Besides, it's not "protected" like most
of our auto-system functions, which check the database to see if the
machine is reporting high load problems and choke themselves until the
load drops back down again.
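To make the two ideas above concrete (rewriting files so they get laid
out afresh, and backing off while the machine is busy), here is a
minimal Python sketch. It is not the attached script. The names
repack_file, repack_directory and wait_for_low_load are made up for
illustration, and os.getloadavg() stands in for the database load check
described above. It takes no Cyrus locks and does not preserve file
ownership, so treat it as something for a quiet replica at most.
Whether a plain copy really reduces fragmentation also depends on the
filesystem's allocator.

```python
import os
import tempfile
import time


def wait_for_low_load(get_load, max_load, poll_seconds=5.0, sleep=time.sleep):
    """Block until get_load() reports a value at or below max_load."""
    while get_load() > max_load:
        sleep(poll_seconds)


def repack_file(path):
    """Copy one file into a fresh temp file and atomically swap it in.

    Assumption: nothing else writes to the file while we copy it.
    """
    directory = os.path.dirname(os.path.abspath(path))
    fd, tmp_path = tempfile.mkstemp(dir=directory, prefix=".repack-")
    try:
        with os.fdopen(fd, "wb") as dst, open(path, "rb") as src:
            while True:
                chunk = src.read(1 << 20)
                if not chunk:
                    break
                dst.write(chunk)
            dst.flush()
            os.fsync(dst.fileno())
        # mkstemp creates the file with mode 0600; copy the original mode.
        os.chmod(tmp_path, os.stat(path).st_mode & 0o7777)
        os.replace(tmp_path, path)
    except BaseException:
        if os.path.exists(tmp_path):
            os.unlink(tmp_path)
        raise


def repack_directory(directory,
                     get_load=lambda: os.getloadavg()[0],
                     max_load=4.0,
                     sleep=time.sleep):
    """Repack every regular file in directory, pausing while load is high."""
    repacked = []
    for name in sorted(os.listdir(directory)):
        path = os.path.join(directory, name)
        if os.path.islink(path) or not os.path.isfile(path):
            continue
        wait_for_low_load(get_load, max_load, sleep=sleep)
        repack_file(path)
        repacked.append(name)
    return repacked
```

Passing get_load and sleep in as parameters keeps the load check
swappable, so the same loop could consult a database instead of the
local load average.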
> I'm using this happily, with 50k users, 24 distinct mailspools of 240G
> each. Full backups take quite a while to complete (~2 days), but normal
> usage is quite fast. There is the barrier problem, of course... I'm
> using noatime (implying nodiratime) and data=ordered, since
> data=writeback resulted in corrupted skiplist files on crash, while
> data=ordered mostly didn't.
Yeah, full backups. Ouch. I think the last time we had to do that it
took somewhat over a week. Mainly CPU limited on the backup server,
which is doing a LOT of gzipping!
Our incremental backups take about 4 hours. We could probably speed
this up a little more, but given that it's now down from about 12 hours
two weeks ago, I'm happy. We were actually rate limited by Perl
'unpack' and hash creation, believe it or not! I wound up rewriting
Cyrus::IndexFile to provide a raw interface, and unpacking just the
fields that I needed. I also asserted index file version == 10 in the
backup library so I can guarantee the offsets are correct.
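As a rough illustration of that "raw interface" idea, here is a Python
sketch (the real code is Perl, in Cyrus::IndexFile, and is not shown
here). The record layout, field names and sizes below are invented for
the example and are NOT the real cyrus.index format; only the version
check against 10 comes from the description above. The point is the
contrast between building a full dict per record and pulling out just
two fields at fixed offsets.

```python
import struct

# Hypothetical layout for illustration only, not the real cyrus.index.
HEADER = struct.Struct(">II")          # version, record count
FULL_RECORD = struct.Struct(">6I")     # six 32-bit big-endian fields
FIELD_NAMES = ("uid", "internaldate", "size", "flags", "modseq", "crc")
RECORD_SIZE = FULL_RECORD.size         # 24 bytes
EXPECTED_VERSION = 10

# Only uid (offset 0) and size (offset 8); "4x" skips the field between.
UID_AND_SIZE = struct.Struct(">I4xI")


def read_all_fields(buf):
    """Slow path: unpack every field and build a dict per record."""
    _version, count = HEADER.unpack_from(buf, 0)
    records = []
    for i in range(count):
        values = FULL_RECORD.unpack_from(buf, HEADER.size + i * RECORD_SIZE)
        records.append(dict(zip(FIELD_NAMES, values)))
    return records


def read_uid_and_size(buf):
    """Fast path: refuse unknown versions, then unpack two fields only."""
    version, count = HEADER.unpack_from(buf, 0)
    if version != EXPECTED_VERSION:
        # The fixed offsets are only valid for the layout we were built for.
        raise ValueError("unsupported index version %d" % version)
    return [
        UID_AND_SIZE.unpack_from(buf, HEADER.size + i * RECORD_SIZE)
        for i in range(count)
    ]
```

The version check is what makes the hard-coded offsets safe: if the
layout ever changes, the reader fails loudly instead of returning
garbage.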
I've described our backup system here before - it's _VERY_ custom,
based on a deep understanding of the Cyrus file structures. In this
case it's definitely worth it - it allows us to reconstruct partial
mailbox recoveries with flags intact. Unfortunately, "seen" information
is much trickier. I've been tempted for a while to patch cyrus's
seen support to store seen information for the user themselves in the
cyrus.index file, and only seen information for unowned folders in the
user.seen files. The way it works now seems optimised for the uncommon
case at the expense of the common. That always annoys me!
> Ext4 just got stable, so there is no real world Cyrus user experience on
> it. Among other things, it contains an online defragmenter. Journal
> checksumming might also help around the write barrier problem on LVM
> logical volumes, if I've understood correctly.
Yeah, it's interesting. Local fiddling suggests it's worse for my
Maildir performance than even btrfs, and btrfs feels more jerky than
reiser3, so I stick with reiser3.
> Reiser4 might have a future, at least Andrew Morton's -mm patch contains
> it and there are people developing it. But I don't know if it ever will
> be included in the "standard" kernel tree.
Yeah, the mailing list isn't massively active at the moment either... I
do keep an eye on it.
> Btrfs is in so early development that I don't know yet what to say about
> it, but the fact of ZFS's being incompatible with GPL might be mitigated
> by this.
Yeah, btrfs looks interesting. Especially with their work on improving
locking - even on my little dual processor laptop (yay core processors)
I would expect to see an improvement when they merge the new locking.
> I'm going to continue using ext3 for now, and probably ext4 when it's
> available from certain commercial enterprise linux vendor (personally,
> I'd be using Debian, but the department has an official policy of using
> RH / Centos). I'm eagerly waiting for btrfs to appear... I probably /would/
> switch to ReiserFS for now, if RH cluster would support ReiserFS FS
> resources. Hmm, maybe I should just start hacking... On the other hand,
> the upgrade path from ext3 to ext4 is quite easy, and I don't know yet
> which would be better, ReiserFS or ext4.
Sounds sane. If vendor support matters, then ext4 is probably the
good choice for the immediate future. It's had a fair bit of work.
I'm tempted to keep an eye on tux3 too. Exciting times in the linux
filesystem world at the moment.