ReiserFS and general cyrus filesystem usage information - was Re: best filesystem for imap server
Rob Mueller
robm at fastmail.fm
Thu Dec 2 14:38:38 EST 2004
> I didn't know reiser 3 would fully journal data (or that it has good enough
> write barriers and write optimization to make sure the filesystem never
> returns before a fsync really means everything including data is on disk).
> Is that correct? If it is, then reiser might be a better choice than ext3
> with hashing (as long as you do use a fast-as-heck nvram drive for the
> journal, of course).
We use reiserfs for our large cyrus installation. We changed from ext3
several years ago, when we ran into performance problems with ext3 on large
directories and also filesystem corruption with the htree directory hashing
patches that were available at that time (it was early days for the htree
patches, and unfortunately we couldn't really wait around for the bugs to
be fixed: http://www.spinics.net/lists/ext3/msg01656.html). So we tried
reiserfs and haven't looked back since. We do tend to be a bit on the
leading edge patch-wise, so I've been keeping track of what's been going on
with reiserfs for around 2 years now. (I'm cc'ing Chris Mason, one of the
reiserfs developers, so he can correct/confirm the information below.)
Originally reiserfs (v3) only had meta-data journaling. Sometime around
2.4.20 Chris Mason released a bunch of patches
(ftp://ftp.suse.com/pub/people/mason/patches/data-logging/) that introduced
data logging to reiserfs. I'm not sure if these ever made it into the 2.4
mainline, but I know at least suse included these patches in their kernels
for quite a while.
A different set of patches was required for the 2.6 series. These patches
finally went in as of 2.6.8.1 (along with some general allocator
improvements, I believe). So before 2.6.8.1 reiserfs only had meta-data
journaling. In >= 2.6.8.1 there are now 3 journaling modes:
Meta-data = You can get data corruption (but not filesystem corruption)
because meta-data changes can be committed to the journal (eg file size
change) before data is written. This was the only mode available in <
2.6.8.1.
Ordered = Data is written before the meta-data journal is committed. This
avoids filesystem and data corruption. This is now the default in >= 2.6.8.1.
Data = All data and meta-data is written to the journal.
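As a rough illustration of how the three modes above map onto a mount, here is a minimal Python sketch that reads the journaling mode out of a mount options string. Only data=ordered is named in this post. The option values data=writeback (for the meta-data only mode) and data=journal (for full data journaling) are my understanding of the reiserfs mount options and should be treated as assumptions. The function name reiserfs_data_mode is hypothetical.

```python
# Hypothetical helper: map a mount options string to the journaling modes
# described above. The option values "writeback" and "journal" are assumed;
# only "data=ordered" is mentioned in the post itself.
MODE_NAMES = {
    "writeback": "meta-data",  # assumed: only meta-data is journaled
    "ordered": "ordered",      # data written before meta-data is committed
    "journal": "data",         # assumed: data and meta-data both journaled
}


def reiserfs_data_mode(mount_options: str, default: str = "ordered") -> str:
    """Return the journaling mode implied by a comma-separated options string.

    With no data= option, fall back to the default, which the post says is
    "ordered" on >= 2.6.8.1 kernels.
    """
    for opt in mount_options.split(","):
        key, _, value = opt.strip().partition("=")
        if key == "data":
            if value not in MODE_NAMES:
                raise ValueError(f"unknown data mode: {value!r}")
            return MODE_NAMES[value]
    return MODE_NAMES[default]


if __name__ == "__main__":
    print(reiserfs_data_mode("rw,noatime,data=journal"))
    print(reiserfs_data_mode("rw,noatime"))
```

On a kernel older than 2.6.8.1 without the data-logging patches, the fallback would need to be "writeback" (meta-data only) instead of "ordered".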
Reiserfs does support external journals, and we have several nvram drives in
our systems that we've moved the journals on to. While that helped, it
turned out that's not the major IO bottleneck. We've found that the
mailboxes.db, .seen and quota databases generate the most IO. Putting these
on the nvram card significantly increased our performance and reduced our IO
wait time. Aggregating some output from iostat shows this:
Device:          tps  Blk_read/s  Blk_wrtn/s  Blk_read  Blk_wrtn
cyrusmeta     380.03       77.92     2963.97      9352    355736
rfsjournals   196.27        0.00     1570.13         0    188448
cyrusspool    206.36     1228.06     1206.53    147392    144808
As you can see, the cyrus "metadata" (mailboxes.db, .seen dbs, quota dbs)
consumes more write IO than the message spool directories and journals for
those directories combined. Something definitely to consider when rolling
out a big cyrus installation. (As a side note, I was curious why the
reiserfs journals had no read requests on them. I'm guessing that since
journal entries are very short lived, the data stays in main memory until
it is written to disk, so really the journal only needs to be read on a
reboot after a crash; otherwise it just ends up cached in main memory all
the time.)
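The comparison can be checked directly from the iostat figures. This small Python sketch uses the three rows exactly as posted and compares the Blk_wrtn/s column. The helper name write_rates is just for illustration.

```python
# The three aggregated iostat rows from the post:
# device, tps, Blk_read/s, Blk_wrtn/s, Blk_read, Blk_wrtn
IOSTAT = """\
cyrusmeta 380.03 77.92 2963.97 9352 355736
rfsjournals 196.27 0.00 1570.13 0 188448
cyrusspool 206.36 1228.06 1206.53 147392 144808
"""


def write_rates(text: str) -> dict:
    """Return {device: Blk_wrtn/s} for each well-formed six-column row."""
    rates = {}
    for line in text.splitlines():
        fields = line.split()
        if len(fields) != 6:
            continue  # skip headers or malformed lines
        rates[fields[0]] = float(fields[3])
    return rates


rates = write_rates(IOSTAT)
total = sum(rates.values())

if __name__ == "__main__":
    for device, rate in rates.items():
        print(f"{device:12s} {rate:8.2f} blk/s  {100 * rate / total:5.1f}%")
    others = rates["rfsjournals"] + rates["cyrusspool"]
    print("meta > journals + spool:", rates["cyrusmeta"] > others)
```

The meta partition accounts for just over half of all blocks written per second across the three devices.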
One other useful feature of reiserfs is the "tails" feature. This is on by
default, and it means that multiple small files can be stored in 1 disk
block. On a space limited nvram drive, this is very useful for the legacy
quota system, which uses 1 small file per quota root (eg usually per
user). Even with >100,000 files, we're only using about 20M of the nvram for
them. We had thought about using the skiplist db for quotas, but having
spoken to Ken, found that because the skiplist db uses global locking, it
wouldn't be appropriate. We could have used bdb, but generally have had lots
of problems with bdb so don't entirely trust it...
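To put the tails numbers in perspective, here is a back-of-the-envelope calculation in Python. It takes the figures above at face value (100,000 files as a lower bound, "20M" read as 20 MiB) and compares them with a layout where every file occupies at least one whole block. The 4096 byte block size is an assumption, not something stated in this post.

```python
# Figures from the post, taken at face value.
FILES = 100_000                  # ">100,000 files", so a lower bound
USED_BYTES = 20 * 1024 * 1024    # "about 20M", read here as 20 MiB

# Assumption: 4 KiB filesystem blocks (not stated in the post).
BLOCK_SIZE = 4096

# Average space per quota file with tail packing.
avg_bytes_per_file = USED_BYTES / FILES

# Minimum space if each small file took one whole block instead.
untailed_bytes = FILES * BLOCK_SIZE

if __name__ == "__main__":
    print(f"average per file with tails: {avg_bytes_per_file:.0f} bytes")
    print(f"one block per file:          {untailed_bytes / 2**20:.0f} MiB")
    print(f"ratio:                       {untailed_bytes / USED_BYTES:.1f}x")
```

Under these assumptions, one block per file would need roughly 20 times the space, which matters on a small nvram device.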
I should add a potential problem as well. There appears to be an issue on
heavily loaded linux servers with the way the cyrus skiplist db works.
Basically it can cause kernel deadlocks that leave unkillable processes
stuck in D state, which then requires a system reboot. While we observed
this intermittently with reiserfs (http://lkml.org/lkml/2004/7/20/127), the
same problem existed in ext3 as well
(http://www.ussg.iu.edu/hypermail/linux/kernel/0409.0/0966.html). It seems
this is a very rare problem though, since no-one else has reported it. There
are patches available to fix both in case anyone else has come across it.
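For anyone wanting to check whether they are seeing the same symptom, here is a minimal Python sketch that lists processes currently in D state by reading /proc/<pid>/stat. It is Linux-only. The function names are hypothetical, and it only reports the state at the moment it runs.

```python
import os


def parse_stat(stat_line: str):
    """Parse the start of a /proc/<pid>/stat line into (pid, comm, state).

    The command name is wrapped in parentheses and may itself contain
    spaces or ')' characters, so split on the first '(' and the last ')'.
    """
    lparen = stat_line.index("(")
    rparen = stat_line.rindex(")")
    pid = int(stat_line[:lparen])
    comm = stat_line[lparen + 1:rparen]
    state = stat_line[rparen + 1:].split()[0]
    return pid, comm, state


def d_state_processes(proc_root: str = "/proc"):
    """Return [(pid, comm)] for processes in uninterruptible sleep (D)."""
    try:
        entries = os.listdir(proc_root)
    except OSError:
        return []  # no procfs here (e.g. not Linux)
    stuck = []
    for entry in entries:
        if not entry.isdigit():
            continue
        try:
            with open(os.path.join(proc_root, entry, "stat")) as fh:
                pid, comm, state = parse_stat(fh.read())
        except (OSError, ValueError, IndexError):
            continue  # process exited or line was unreadable
        if state == "D":
            stuck.append((pid, comm))
    return stuck


if __name__ == "__main__":
    for pid, comm in d_state_processes():
        print(pid, comm)
```

Processes pass through D state briefly during normal disk IO, so a single hit means little. It is the same pids showing up run after run, and ignoring kill -9, that matches the problem described above.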
All up, we've been very happy with reiserfs and I'd recommend people use it,
especially in >= 2.6.8.1 kernels where data=ordered is now the default
option.
Rob