ReiserFS and general cyrus filesystem usage information - was Re: best filesystem for imap server

Thu Dec 2 14:38:38 EST 2004

> I didn't know reiser 3 would fully journal data (or that it has good enough
> write barriers and write optimization to make sure the filesystem never
> returns before a fsync really means everything including data is on disk).
> Is that correct?  If it is, then reiser might be a better choice than ext3
> with hashing (as long as you do use a fast-as-heck nvram drive for the
> journal, of course).

We use reiserfs for our large cyrus installation. We changed from ext3 
several years ago when we found the performance problems with ext3 on large 
directories, and also filesystem corruption with the htree directory hashing 
patches that were available at that time (it was early days for the htree 
patches, unfortunately we couldn't really wait around for them to fix the 
bugs - http://www.spinics.net/lists/ext3/msg01656.html). So we tried 
reiserfs and haven't looked back since. We do tend to be a bit on the 
leading edge patch wise, so I've been keeping track of what's been going on 
with reiserfs for around 2 years now (I'm cc'ing Chris Mason, one of the 
reiserfs developers, so he can correct/confirm the information below).

Originally reiserfs (v3) only had meta-data journaling. Sometime around 
2.4.20 Chris Mason released a bunch of patches 
(ftp://ftp.suse.com/pub/people/mason/patches/data-logging/) that introduced 
data logging to reiserfs. I'm not sure if these ever made it into the 2.4 
mainline, but I know at least suse included these patches in their kernels 
for quite a while.

A different set of patches was required for the 2.6 series. These patches 
finally made it in as of 2.6.8.1 (along with some general allocator 
improvements, I believe). So in < 2.6.8.1 reiserfs only had meta-data 
journaling. In >= 2.6.8.1 there are now 3 journaling modes.

Meta-data = You can get data corruption (but not filesystem corruption) 
because meta-data changes (eg a file size change) can be committed to the 
journal before the data is written. This was the only mode available in 
< 2.6.8.1.
Ordered = Data is written before the meta-data journal is committed. This 
avoids filesystem and data corruption. This is now the default in >= 2.6.8.1.
Data = All data and meta-data is written to the journal.
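
To make the three modes a bit more concrete, here is a minimal Python sketch 
that picks the journaling mode out of an fstab-style options string. It 
assumes the modes above correspond to the reiserfs mount options 
data=writeback (meta-data only), data=ordered and data=journal, and that 
ordered applies when no data= option is given, as described for >= 2.6.8.1. 
The function name and the sample option strings are made up for 
illustration.

```python
# Assumed mapping from the reiserfs "data=" mount option to the modes above.
MODES = {
    "writeback": "meta-data only: data may reach disk after the meta-data commit",
    "ordered": "data is written before the meta-data journal commit",
    "journal": "data and meta-data are both written to the journal",
}


def journaling_mode(options, default="ordered"):
    """Return the data= mode named in a comma-separated mount options string.

    Falls back to `default` when no data= option is present.
    """
    mode = default
    for opt in options.split(","):
        key, sep, value = opt.strip().partition("=")
        if key == "data" and sep:
            mode = value
    if mode not in MODES:
        raise ValueError("unknown data mode: %r" % mode)
    return mode


if __name__ == "__main__":
    # Hypothetical option strings, as they might appear in /etc/fstab.
    for opts in ("noatime,notail", "noatime,data=journal", "data=writeback"):
        mode = journaling_mode(opts)
        print("%-22s -> %s (%s)" % (opts, mode, MODES[mode]))
```

This only inspects the configured options; it does not ask the kernel which 
mode a mounted filesystem is actually running in.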

Reiserfs does support external journals, and we have several nvram drives in 
our systems that we've moved the journals on to. While that helped, it 
turned out that's not the major IO bottleneck. We've found that the 
mailboxes.db, .seen and quota databases generate the most IO. Putting these 
on the nvram card significantly increased our performance and reduced our IO 
wait time. Aggregating some output from iostat shows this:

Device:            tps   Blk_read/s   Blk_wrtn/s   Blk_read   Blk_wrtn
cyrusmeta        380.03       77.92      2963.97       9352     355736
rfsjournals      196.27        0.00      1570.13          0     188448
cyrusspool       206.36     1228.06      1206.53     147392     144808

As you can see, the cyrus "metadata" (mailboxes.db, .seen dbs, quota dbs) 
consumes more write IO than the message spool directories and journals for 
those directories combined. Something definitely to consider when rolling 
out a big cyrus installation. (As a side note, I was curious why the 
reiserfs journals had no read requests on them. I'm guessing that since 
journal contents are very short lived, the data stays cached in main memory 
until it is actually written to disk, so the journal only really needs to be 
read on a reboot after a crash.)
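
The comparison can be checked directly from the Blk_wrtn/s column. The 
following is a small Python sketch that parses the three rows quoted above 
and works out each device's share of the write traffic; the helper names are 
just for this example.

```python
# Per-device rows copied from the iostat output quoted above.
IOSTAT = """\
cyrusmeta        380.03       77.92      2963.97       9352     355736
rfsjournals      196.27        0.00      1570.13          0     188448
cyrusspool       206.36     1228.06      1206.53     147392     144808
"""


def write_rates(text):
    """Map device name -> Blk_wrtn/s (the fourth column of each row)."""
    rates = {}
    for line in text.splitlines():
        fields = line.split()
        if len(fields) == 6:
            rates[fields[0]] = float(fields[3])
    return rates


def write_shares(rates):
    """Map device name -> fraction of the total write rate."""
    total = sum(rates.values())
    return {dev: rate / total for dev, rate in rates.items()}


if __name__ == "__main__":
    rates = write_rates(IOSTAT)
    shares = write_shares(rates)
    for dev in rates:
        print("%-12s %8.2f blk/s  %5.1f%%" % (dev, rates[dev], 100 * shares[dev]))
```

With these figures cyrusmeta accounts for roughly 52% of the blocks written 
per second, against about 27% for the journals and 21% for the spool, ie 
2963.97 versus 1570.13 + 1206.53 = 2776.66.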

One other useful feature of reiserfs is the "tails" feature. This is on by 
default, and it means that multiple small files can be stored in 1 disk 
block. On a space limited nvram drive, this is very useful for the legacy 
quota system which uses 1 small file per quota root (eg usually per 
user). Even with >100,000 files, we're only using about 20M of the nvram for 
them. We had thought about using the skiplist db for quotas, but having 
spoken to Ken, found that because the skiplist db uses global locking, it 
wouldn't be appropriate. We could have used bdb, but generally have had lots 
of problems with bdb so don't entirely trust it...
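
As a back-of-the-envelope check on the tails saving, here is a short Python 
sketch. It takes the figures above (100,000 quota files in about 20M) and 
compares them with a filesystem that gives every file its own block. The 
4096 byte block size is an assumption for illustration, not something 
measured on our systems, and "20M" is read as 20 MiB.

```python
BLOCK_SIZE = 4096            # assumed block size in bytes (not from the post)
FILES = 100000               # "more than 100,000" quota files
USED_BYTES = 20 * 1024 ** 2  # "about 20M", read here as 20 MiB


def mib(nbytes):
    """Convert a byte count to MiB."""
    return nbytes / float(1024 ** 2)


# Average space actually consumed per quota file with tails enabled.
avg_per_file = USED_BYTES / float(FILES)

# Space needed if every small file occupied one whole block instead.
one_block_each = FILES * BLOCK_SIZE

if __name__ == "__main__":
    print("average per file with tails: %.1f bytes" % avg_per_file)
    print("one block per file:          %.1f MiB" % mib(one_block_each))
    print("ratio:                       %.1fx" % (one_block_each / float(USED_BYTES)))
```

Under those assumptions the quota files average about 210 bytes each, and 
storing one file per block would need around 390 MiB rather than 20 MiB.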

I should add a potential problem as well. There appears to be an issue on 
heavily loaded linux servers with the way the cyrus skiplist db works. 
Basically it can cause kernel deadlocks that leave unkillable processes 
stuck in D state, and only a system reboot clears them. While we observed this 
intermittently with reiserfs (http://lkml.org/lkml/2004/7/20/127) the same 
problem existed in ext3 as well 
(http://www.ussg.iu.edu/hypermail/linux/kernel/0409.0/0966.html). It seems 
this is a very rare problem though since no-one else has reported it. There 
are patches available to fix both in case anyone else has come across it.

All up, we've been very happy with reiserfs and I'd recommend people use it, 
especially in >= 2.6.8.1 kernels where data=ordered is now the default 
option.

Rob

---
Cyrus Home Page: http://asg.web.cmu.edu/cyrus
Cyrus Wiki/FAQ: http://cyruswiki.andrew.cmu.edu
List Archives/Info: http://asg.web.cmu.edu/cyrus/mailing-list.html