DBERROR: skiplist recovery errors

Lawrence Greenfield leg+ at andrew.cmu.edu
Thu Dec 12 20:07:45 EST 2002

   Date: Thu, 12 Dec 2002 20:00:52 -0500
   From: "John A. Tamplin" <jtampli at sph.emory.edu>
   Disks don't write one byte at a time, so a system crash during a
   write can result in indeterminate state for the entire block (and
   it gets worse when you go through the filesystem rather than raw
   access to the disk, since data important to your file could
   possibly share a physical disk block and be updated without your
   knowledge or control, not to mention the out-of-order writes
   problem).  I haven't looked into the skiplist implementation, but
   fixing that problem isn't easy without a pre-image log and some
   sort of timestamp/sequence number at both ends of the page.  Once
   you head down that road, you get very close to building a full
   database system and then we are back to the SQL backend discussed

Obviously disks do not write one byte at a time. Writes happen to
blocks of data. However, the block the operating system issues will be
identical to the old block except for the byte (or word, or whatever)
that I've changed.
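
As a rough illustration of that point (an in-memory sketch, not code
from the skiplist backend), the block handed to the disk can be modeled
as the old block with a small patch applied. The 512-byte block size
and the names rewrite_block and changed_offsets are assumptions made
for this example only.

```python
BLOCK_SIZE = 512  # assumed block size, for illustration only


def rewrite_block(old_block, offset, new_bytes):
    """Return the block that would be issued: old contents plus one small change."""
    if len(old_block) != BLOCK_SIZE:
        raise ValueError("expected exactly one block")
    if offset < 0 or offset + len(new_bytes) > BLOCK_SIZE:
        raise ValueError("change must fit inside the block")
    block = bytearray(old_block)
    # Only the touched bytes differ; the rest is copied from the old block.
    block[offset:offset + len(new_bytes)] = new_bytes
    return bytes(block)


def changed_offsets(old_block, new_block):
    """List the byte offsets at which the two blocks differ."""
    return [i for i, (a, b) in enumerate(zip(old_block, new_block)) if a != b]
```

If the old and new blocks differ in only one byte, then a write that
lands partially still leaves every other byte with the value it had
before, which is why the byte-level view is not as naive as it sounds.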

I don't think I've ever heard of a filesystem that mingles more than
one file in a single block. (If they do, it's certainly news to me,
and no reasonable model can be made that will deal with it.) The "out
of order writes problem" isn't a problem unless a disk claims to the
operating system that it has performed a physical write when it has
not; but if that is the case, obviously no durability claims can be
ensured by higher level software.

The question is whether or not a physical disk can fail to write a
block in such a way that neither the old block nor the new block is
completely there.
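
The quoted message suggests a sequence number at both ends of the page
as a way to notice such a half-written block. A minimal sketch of that
idea follows; the page size, the 4-byte stamp, and the function names
are hypothetical, and the simulated failure assumes the page is
written front to back and interrupted partway.

```python
import struct

PAGE_SIZE = 4096             # assumed page size
SEQ = struct.Struct(">I")    # assumed 4-byte big-endian sequence number
PAYLOAD_SIZE = PAGE_SIZE - 2 * SEQ.size


def make_page(seq, payload):
    """Build a page with the same sequence number at the head and the tail."""
    if len(payload) > PAYLOAD_SIZE:
        raise ValueError("payload too large for one page")
    stamp = SEQ.pack(seq)
    return stamp + payload.ljust(PAYLOAD_SIZE, b"\0") + stamp


def is_torn(page):
    """A page is suspect when its head and tail stamps disagree."""
    return page[:SEQ.size] != page[-SEQ.size:]


def torn_write(old_page, new_page, bytes_written):
    """Simulate a write that stopped after bytes_written bytes (front to back)."""
    return new_page[:bytes_written] + old_page[bytes_written:]
```

This only detects the damage. Restoring the old contents would still
need something like the pre-image log mentioned in the quoted message,
and a disk that writes the sectors of a page out of order could defeat
the check.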

Berkeley DB's documentation has a long discussion of the various
models of disk behavior that software should assume.

No logging model is possible without some sort of model of the
underlying device. The "single byte write" theory is actually not such
a bad one. It inhibits optimizations you might make from "single block
write" (but it's hard to figure out what the disk considers a "block")
or even better the "single page write" (easier to figure out what a
page is---but is it what the disk writes in?).
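
To make the "single byte write" model concrete, here is a small
sketch under stated assumptions: bytes reach the disk one at a time
and in order, and the region being written was zero-filled beforehand.
The record layout (2-byte length, payload, trailing commit byte) and
all names are hypothetical and are not the skiplist file format.

```python
COMMIT = b"\xff"  # hypothetical commit marker, written last


def encode_record(payload):
    """Lay out a record as: 2-byte length, payload, commit marker."""
    if len(payload) > 0xFFFF:
        raise ValueError("payload too long for a 2-byte length")
    return len(payload).to_bytes(2, "big") + payload + COMMIT


def crash_after(record, k, region_size=64):
    """Region contents if the write stopped after k bytes (rest stays zero)."""
    if len(record) > region_size:
        raise ValueError("record does not fit in the region")
    return record[:k] + bytes(region_size - k)


def is_committed(region):
    """Accept the record only if the commit marker sits where the length says."""
    if len(region) < 2:
        return False
    end = 2 + int.from_bytes(region[:2], "big")
    return region[end:end + 1] == COMMIT
```

Under these assumptions every interrupted write leaves a record that
recovery can recognize as incomplete, because the marker is the last
byte to land. A block or page model would allow larger atomic updates,
but only if the size the disk really writes in is known.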

