LARGE single-system Cyrus installs?

Mon Nov 26 20:51:14 EST 2007

>> Note that ext3 effectively does the same thing as ZFS on fsync() - because
>> the journal layer is block based and does not know which block belongs
>> to which file, the entire journal must be applied to the filesystem to
>> achieve the expected fsync() semantics (at least, with data=ordered,
>> it does).
>
>Well, "does not know which block belongs to which file" sounds weird. :)
>
>With data=ordered, the journal holds only metadata. If you fsync() a
>file, "ordered" means that ext3 syncs the data blocks first (with no
>overhead, just like any other filesystem, of course it knows what blocks
>to write), then the journal.
>
>Now, yes, the journal possibly contains metadata updates for other files
>too, and the "ordered" semantics requires the data blocks of those files
>to be synced as well, before the journal sync.
>
>I'm not sure if a fsync() flushes the whole journal or just up to the
>point it's necessary (that is, up to the last update on the file you're
>fsync()ing).

The ext3 journalling layer only knows about blocks. When using
data=ordered, only metadata *blocks* are tracked by the journalling layer.
The journalling layer does not know which data blocks correspond to
which metadata block, so everything is forced out.

Most other journalling file systems operate within the filesystem
abstraction, and journal atomic filesystem operations, which leaves them
better able to implement sane fsync() semantics.

>data=writeback is what some (most) other journalled filesystems do.
>Metadata updates are allowed to hit the disk _before_ data updates. So,
>on fsync(), the FS writes all data blocks (still required by fsync()
>semantics), then the journal (or part of it), but if updates of other
>files' metadata are included in the journal sync, there's no need to
>write the corresponding data blocks. They'll be written later, and
>they'll hit the disk _after_ the metadata changes.

This is possible because those other journals operate at the filesystem
level, not the block level.

>If power fails in between, you can have a file whose size/time is
>updated, but contents not. That's the problem with data=writeback, but
>it should be noted that's pretty normal for other journalled
>filesystems, too. It applies only to files that were not fsync()'ed.

And, in this case, you're no worse off than you would have been with a
traditional filesystem such as UFS.
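
To make the fsync()'ed versus not-fsync()'ed distinction concrete, here is
a minimal sketch of a write that only returns once fsync() has completed.
The name durable_write is hypothetical and purely for illustration; it is
not taken from Cyrus or any other application discussed here.

```python
import os

def durable_write(path, data):
    """Write data to path, returning only after fsync() has completed."""
    fd = os.open(path, os.O_WRONLY | os.O_CREAT | os.O_TRUNC, 0o600)
    try:
        view = memoryview(data)
        # os.write() may write fewer bytes than requested, so loop.
        while len(view) > 0:
            written = os.write(fd, view)
            view = view[written:]
        # Ask the kernel to push this file's data and metadata to disk.
        os.fsync(fd)
    finally:
        os.close(fd)
```

A file written without the os.fsync() call is the case described above:
after a power failure its size and timestamps may have been updated while
its contents were not.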

>I think that if you're running into performance problems, and your
>system is doing a lot of fsync(), data=ordered is the worst option.

You're assuming fsync() behaviour changes with the other data=
options - have you looked into it? I'm wary because the ext3 guys have
a long history of simply not getting what fsync() is for, what it's
supposed to do, and why it's important. I recently asked Andrew Morton
whether fsync() behaviour changed with data= options, but he couldn't
remember, and I haven't had time to look into it myself.
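
One way to check this empirically, rather than by reading the code, would
be to time fsync() on a filesystem mounted with each data= option in turn.
The following is only a rough sketch: time_fsync and its parameters are
hypothetical names, and the numbers will depend heavily on the hardware
and on whatever other write traffic is hitting the filesystem at the time.

```python
import os
import tempfile
import time

def time_fsync(directory, size=4096, rounds=20):
    """Append `size` bytes `rounds` times to a scratch file in `directory`,
    returning the fsync() latency in seconds for each append."""
    latencies = []
    fd, path = tempfile.mkstemp(dir=directory)
    try:
        block = b"x" * size
        for _ in range(rounds):
            os.write(fd, block)
            start = time.perf_counter()
            os.fsync(fd)
            latencies.append(time.perf_counter() - start)
    finally:
        os.close(fd)
        os.unlink(path)
    return latencies
```

Running it against a directory on the filesystem under test, while another
process writes large files to the same filesystem without calling fsync(),
should show whether the fsync() cost tracks the amount of unrelated dirty
data.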

>data=journal is fsync()-friendly in one sense, it does write
>*everything* out, but in one nice sequential (thus extremely fast) shot.
>Later, data blocks will be written again to the right places. It doubles
>the I/O bandwidth requirements, but if you have a lot of bandwidth, it
>may be a win. We're talking sequential write bandwidth, which is hardly
>a problem.

This is true, right up until the point you fill the journal... 8-)

>data=writeback is fsync() friendly in the sense that it writes only the
>data blocks of the fsync()'ed file plus (all) metadata. It's the lowest
>overhead option.
>
>If you have a heavy sustained write traffic _and_ lots of fsync()'s,
>then data=writeback may be the only option.
>
>I think some people are scared by data=writeback, but they don't realize
>it's just what other journalled FS do. I'm not familiar with ReiserFS,
>I think it's metadata-only as well.

Certainly data journalling is the exception, rather than the rule. Off
the top of my head, I can't think of another mainstream filesystem that
does it (aside from the various log-structured filesystems such as WAFL
and Reiser4).

>data=ordered is good, for general purpose systems. For any application
>that uses fsync(), it's useless overhead.

-- 
Andrew McNamara, Senior Developer, Object Craft
http://www.object-craft.com.au/