DBERROR: skiplist recovery errors
Lawrence Greenfield
leg+ at andrew.cmu.edu
Thu Dec 12 20:07:45 EST 2002
Date: Thu, 12 Dec 2002 20:00:52 -0500
From: "John A. Tamplin" <jtampli at sph.emory.edu>
[...]
Disks don't write one byte at a time, so a system crash during a
write can result in indeterminate state for the entire block (and
it gets worse when you go through the filesystem rather than raw
access to the disk, since data important to your file could
possibly share a physical disk block and be updated without your
knowledge or control, not to mention the out-of-order writes
problem). I haven't looked into the skiplist implementation, but
fixing that problem isn't easy without a pre-image log and some
sort of timestamp/sequence number at both ends of the page. Once
you head down that road, you get very close to building a full
database system and then we are back to the SQL backend discussed
earlier.
Obviously disks do not write one byte at a time. Writes happen to
blocks of data. However, the operating system will issue a block that
is identical to the old block except for the byte (or word, or
whatever) that I've changed.
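To make that concrete, here is a minimal sketch of a one-byte change
issued as a whole-block write. The 512-byte block size, the in-memory
"disk", and the name write_byte are assumptions for illustration only;
nothing here is taken from the skiplist code or from a real kernel.

```python
from typing import Tuple

BLOCK_SIZE = 512  # assumed block size, for illustration only


def write_byte(disk: bytearray, offset: int, value: int) -> Tuple[int, bytes]:
    """Change one byte, but issue the write as a whole block.

    Returns (block_number, new_block) so the caller can inspect what
    was sent to the simulated disk.
    """
    block_no = offset // BLOCK_SIZE
    start = block_no * BLOCK_SIZE
    # Read the old block, patch the single byte, write the block back.
    block = bytearray(disk[start:start + BLOCK_SIZE])
    block[offset - start] = value
    disk[start:start + BLOCK_SIZE] = block
    return block_no, bytes(block)
```

The block that goes out differs from the old one in exactly one byte,
so if the write of that block is all-or-nothing, the result looks the
same as a single-byte write.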
I don't think I've ever heard of a filesystem that mingles more than
one file in a single block. (If they do, it's certainly news to me,
and no reasonable model can be made that will deal with it.) The "out
of order writes problem" isn't a problem unless a disk claims to the
operating system that it has performed a physical write when it has
not; but if that is the case, obviously no durability claims can be
ensured by higher level software.
The question is whether a physical disk can fail partway through
writing a block, so that neither the old block nor the new block is
completely there.
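The "sequence number at both ends of the page" idea from the quoted
message is one way to notice such a partial write. A rough sketch
follows; the page size, sector size, on-disk layout, and all function
names are hypothetical, and the crash is only simulated in memory.

```python
import struct

PAGE_SIZE = 4096            # assumed page size
SECTOR_SIZE = 512           # assumed unit the disk writes atomically
SEQ = struct.Struct(">Q")   # 8-byte big-endian sequence number


def make_page(seq: int, payload: bytes) -> bytes:
    """Build a page with the same sequence number at head and tail."""
    body_len = PAGE_SIZE - 2 * SEQ.size
    if len(payload) > body_len:
        raise ValueError("payload too large for one page")
    body = payload.ljust(body_len, b"\x00")
    return SEQ.pack(seq) + body + SEQ.pack(seq)


def is_torn(page: bytes) -> bool:
    """A page is suspect if its head and tail sequence numbers differ."""
    head, = SEQ.unpack_from(page, 0)
    tail, = SEQ.unpack_from(page, PAGE_SIZE - SEQ.size)
    return head != tail


def torn_write(old: bytes, new: bytes, sectors_written: int) -> bytes:
    """Simulate a crash after only the first N sectors reached the disk."""
    cut = sectors_written * SECTOR_SIZE
    return new[:cut] + old[cut:]
```

For example, is_torn(torn_write(make_page(1, b"old"), make_page(2, b"new"), 3))
is true, because the head carries sequence 2 and the tail still
carries sequence 1. This check assumes sectors land in order; it will
not notice a tear where the first and last sectors are written but one
in the middle is not, which would take something like a per-page
checksum.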
Berkeley DB's documentation has a long discussion of the various
models of disk behavior that software can assume.
No logging scheme is possible without some sort of model of the
underlying device. The "single byte write" theory is actually not such
a bad one. It rules out optimizations you could make under a "single
block write" model (but it's hard to figure out what the disk considers
a "block") or, even better, a "single page write" model (it's easier to
figure out what a page is, but is that the unit the disk writes in?).
Larry