LARGE single-system Cyrus installs?

Fri Nov 23 11:03:07 EST 2007

Andrew McNamara wrote:
> Note that ext3 effectively does the same thing as ZFS on fsync() - because
> the journal layer is block based and does not know which block belongs
> to which file, the entire journal must be applied to the filesystem to
> achieve the expected fsync() semantics (at least, with data=ordered,
> it does).

Well, "does not know which block belongs to which file" sounds weird. :)

With data=ordered, the journal holds only metadata. If you fsync() a
file, "ordered" means that ext3 syncs that file's data blocks first
(with no extra overhead, just like any other filesystem; of course it
knows which blocks to write), then the journal.
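
For reference, this is roughly what the application side of an fsync()
looks like. It is a minimal sketch, not what Cyrus actually does; the
function name durable_write is made up for illustration. The call
returns only once the kernel reports the file's data (and the metadata
needed to read it back) as flushed.

```python
import os


def durable_write(path, payload):
    """Write payload to path and return only after fsync() has completed.

    Hypothetical helper for illustration; returns the number of bytes written.
    """
    fd = os.open(path, os.O_WRONLY | os.O_CREAT | os.O_TRUNC, 0o644)
    try:
        written = 0
        # os.write() may write fewer bytes than requested, so loop.
        while written < len(payload):
            written += os.write(fd, payload[written:])
        # Block until the filesystem has pushed this file to stable storage.
        # What else gets flushed along the way depends on the mount mode.
        os.fsync(fd)
        return written
    finally:
        os.close(fd)
```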

Now, yes, the journal possibly contains metadata updates for other files
too, and the "ordered" semantics requires the data blocks of those files
to be synced as well, before the journal sync.

I'm not sure whether an fsync() flushes the whole journal or just as
much as is necessary (that is, up to the last update on the file you're
fsync()ing).

data=writeback is what some (most) other journalled filesystems do.
Metadata updates are allowed to hit the disk _before_ data updates. So,
on fsync(), the FS writes all data blocks of that file (still required
by fsync() semantics), then the journal (or part of it), but if updates
of other files' metadata are included in the journal sync, there's no
need to write the corresponding data blocks. They'll be written later,
and they'll hit the disk _after_ the metadata changes.
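
To make the difference concrete, here is a toy model of how many data
blocks an fsync() forces to disk before the journal commit. It is only a
sketch of the description above, not ext3 code: the PendingFile type and
the function name are invented, and it assumes the worst case where the
whole journal is committed.

```python
from dataclasses import dataclass


@dataclass
class PendingFile:
    name: str
    dirty_data_blocks: int      # data blocks not yet written to disk
    metadata_in_journal: bool   # has uncommitted metadata in the journal


def data_blocks_forced_by_fsync(target, files, mode):
    """Count data blocks that must hit the disk before the journal commit.

    Toy model: assumes fsync() commits the entire journal (worst case).
    """
    if mode not in ("ordered", "writeback"):
        raise ValueError("mode must be 'ordered' or 'writeback'")
    total = 0
    for f in files:
        if f.name == target:
            # fsync() semantics: the target's data is always written.
            total += f.dirty_data_blocks
        elif mode == "ordered" and f.metadata_in_journal:
            # ordered: data of every file in the commit goes out first.
            total += f.dirty_data_blocks
        # writeback: other files' data can wait until later.
    return total
```

With one small file being fsync()ed next to a large file that also has
metadata in the journal, the ordered count includes the large file's
blocks and the writeback count does not.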

If power fails in between, you can end up with a file whose size/time is
updated but whose contents are not. That's the problem with
data=writeback, but it should be noted that's pretty normal for other
journalled filesystems, too. It applies only to files that were not
fsync()'ed.

I think that if you're running into performance problems, and your
system is doing a lot of fsync(), data=ordered is the worst option.

data=journal is fsync()-friendly in one sense: it does write
*everything* out, but in one nice sequential (thus extremely fast) shot.
Later, data blocks will be written again to the right places. It doubles
the I/O bandwidth requirements, but if you have a lot of bandwidth, it
may be a win. We're talking sequential write bandwidth, which is hardly
a problem.
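
The doubling is simple arithmetic. A back-of-the-envelope sketch,
assuming (my simplification) that metadata and journal framing overhead
are negligible next to the file data; the function name is made up:

```python
# How many times each byte of file data is written to disk, per mode.
# Assumption: metadata and journal overhead are ignored.
DATA_COPIES = {
    "journal": 2,    # once into the journal, once to its final place
    "ordered": 1,    # data goes straight to its final place
    "writeback": 1,  # same, only the ordering guarantee differs
}


def data_bytes_written(payload_bytes, mode):
    """Total bytes of file data written to disk for a given payload."""
    return payload_bytes * DATA_COPIES[mode]
```

So 10 MB of delivered mail costs about 20 MB of writes with
data=journal, half of them sequential journal writes.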

data=writeback is fsync() friendly in the sense that it writes only the
data blocks of the fsync()'ed file plus (all) metadata. It's the lowest
overhead option.

If you have heavy sustained write traffic _and_ lots of fsync()'s,
then data=writeback may be the only option.

I think some people are scared by data=writeback, but they don't realize
it's just what other journalled FS do. I'm not familiar with ReiserFS,
but I think it's metadata-only as well.

data=ordered is good for general purpose systems. For any application
that uses fsync() itself, it's useless overhead.

I've never hit performance problems. My numbers are 200 users with 2000
messages/day delivered to lmtp; _any_ decent PC handles that load
easily, and I've never considered switching from data=ordered to
data=writeback for my filesystems. Now that I think about it, I've also
forgotten to set noatime after the last HW upgrade (what a luxury!).

/me fires vi on /etc/fstab and adds 'noatime'
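
For anyone who'd rather script that than use vi, a small sketch of the
edit on a single fstab line. The function name is made up, and the
device and mount point in the usage line are just examples, not my real
setup. Note that it collapses the column alignment to single spaces.

```python
def add_mount_option(fstab_line, option="noatime"):
    """Return fstab_line with option appended to the mount options field.

    Comments, blank lines and lines with fewer than four fields are
    returned unchanged. Whitespace between fields becomes single spaces.
    """
    stripped = fstab_line.strip()
    if not stripped or stripped.startswith("#"):
        return fstab_line
    fields = stripped.split()
    if len(fields) < 4:
        return fstab_line
    # The fourth field holds the comma-separated mount options.
    options = fields[3].split(",")
    if option not in options:
        options.append(option)
    fields[3] = ",".join(options)
    return " ".join(fields)
```

For example, add_mount_option("/dev/sda1 /var/spool/imap ext3 defaults 0 2")
gives "/dev/sda1 /var/spool/imap ext3 defaults,noatime 0 2". The same
helper would work for data=writeback, if you decide you need it.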

.TM.