DBERROR: skiplist recovery errors

John A. Tamplin jtampli at sph.emory.edu
Thu Dec 12 20:32:50 EST 2002

Quoting Lawrence Greenfield <leg+ at andrew.cmu.edu>:

> Obviously disks do not write one byte at a time. Writes happen to
> blocks of data. However, the operating system will issue the block of
> data identically to the old block except for the byte (or word, or
> whatever) that I've changed.

Correct, but if power to the drive is cut during the block write, all bets are
off for the content of that block.  True, most of the time you won't get weird
failures, but then most of the time you don't need logging at all.

> I don't think I've ever heard of a filesystem that mingles more than
> one file in a single block. (If they do, it's certainly news to me,
> and no reasonable model can be made that will deal with it.) The "out
> of order writes problem" isn't a problem unless a disk claims to the
> operating system that it has performed a physical write when it has
> not; but if that is the case, obviously no durability claims can be
> ensured by higher level software.

AIX JFS allows fragments to be allocated that are smaller than a physical disk
block (and that in any case share a single write to the disk from the buffer
pool); these were used for the last block of a file or for compressed
filesystems.  Unless I am mistaken, the original 4.3 FFS had sub-block fragments
as well.

If you are doing transaction logging, and you want to make sure that the
pre-image of a particular page is committed to disk before you write to that
page, then going through the filesystem is dangerous because the OS may re-order
those writes.  You can get around it with sync and the like, but at a cost in
performance.
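
As a minimal sketch of that ordering constraint when going through the
filesystem, the Python below forces the pre-image out with fsync before the page
is overwritten.  The names update_page and PAGE_SIZE and the record layout
(8-byte page number followed by the old page) are assumptions made up for this
illustration, not anything Cyrus or skiplist actually uses.  It also assumes the
drive really has written the data when fsync returns; if the disk lies about
that, nothing at this level can help.

```python
import os

PAGE_SIZE = 4096  # assumed page size, for illustration only


def update_page(log_fd, data_fd, page_no, new_page):
    """Overwrite one page, forcing its pre-image to disk first.

    log_fd is assumed to be opened with O_APPEND, data_fd with O_RDWR.
    """
    if len(new_page) != PAGE_SIZE:
        raise ValueError("new_page must be exactly one page long")
    offset = page_no * PAGE_SIZE

    # Read the current contents of the page: this is the pre-image.
    pre_image = os.pread(data_fd, PAGE_SIZE, offset)

    # 1. Append the pre-image to the log and wait until it is on disk.
    #    Without this fsync the OS is free to write the data page first.
    os.write(log_fd, page_no.to_bytes(8, "big") + pre_image)
    os.fsync(log_fd)

    # 2. Only now overwrite the page in place.
    os.pwrite(data_fd, new_page, offset)
    os.fsync(data_fd)
```

Each update pays for two synchronous flushes, which is the performance cost
mentioned above.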

> No logging model is possible without some sort of model of the
> underlying device. The "single byte write" theory is actually not such
> a bad one. It inhibits optimizations you might make from "single block
> write" (but it's hard to figure out what the disk considers a "block")
> or even better the "single page write" (easier to figure out what a
> page is---but is it what the disk writes in?).

True, and you have to define what sort of things you are going to protect
against and which you won't.  I still think if you want a 100% guarantee of no
uncommitted updates visible after a crash and no committed updates lost, you
have to use full pre/post-image logging (preferably with layered logging to cut
down on the log volume) to raw devices.  That level of protection may not be
appropriate in all (or even most) cases, but if you need it then you need it.
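
To make the pre/post-image idea concrete, here is a rough in-memory sketch of
the recovery pass in Python.  LogRecord and recover are hypothetical names, the
"disk" is just a dict of pages, and the sketch assumes page-level locking so
that an uncommitted transaction and a committed one never touch the same page.
It leaves out layered logging and raw-device I/O entirely.

```python
from typing import NamedTuple, Optional


class LogRecord(NamedTuple):
    txid: int
    page_no: Optional[int]  # None marks a commit record for txid
    pre: bytes = b""        # page contents before the update
    post: bytes = b""       # page contents after the update


def recover(pages, log):
    """Bring pages to a consistent state after a crash, using the log."""
    committed = {r.txid for r in log if r.page_no is None}

    # Redo pass (forward): reapply post-images of committed transactions,
    # so no committed update is lost.
    for r in log:
        if r.page_no is not None and r.txid in committed:
            pages[r.page_no] = r.post

    # Undo pass (backward): restore pre-images of uncommitted transactions,
    # so no uncommitted update stays visible.
    for r in reversed(log):
        if r.page_no is not None and r.txid not in committed:
            pages[r.page_no] = r.pre

    return pages
```

For example, if transaction 1 updated page 0 and committed but its page never
reached disk, while transaction 2 updated page 1 on disk and never committed,
recover() rewrites page 0 with its post-image and puts page 1 back to its
pre-image.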

John A. Tamplin
Unix System Administrator
