DBERROR: skiplist recovery errors
Lawrence Greenfield
leg+ at andrew.cmu.edu
Thu Dec 12 21:38:04 EST 2002
Date: Thu, 12 Dec 2002 20:32:50 -0500
From: "John A. Tamplin" <jtampli at sph.emory.edu>
> Correct, but if the power gets cut to the drive during the block
> write, all bets are off for the content of that block. True, most
> of the time you won't get weird failures, but then most of the time
> you don't need logging at all.
Actually, for some drives the claim is that, if they are functioning
correctly, that _won't_ happen. After all, it takes an insignificant
amount of time to write a single block to a modern drive; there is
plenty of power left in the capacitors and the like for it to finish
writing the block.
> AIX JFS allows fragments to be allocated smaller than physical disk
> blocks (and in any case they share a single write to the disk from
> the buffer pool); these are used for the last block of a file or for
> compressed filesystems. Unless I am mistaken, the original 4.3 FFS
> had sub-block fragments as well.
The original FFS worked on devices that were assumed to have 512-byte
blocks. It allocated space in 8K chunks. The tail end of a file could
be allocated in 1K chunks, so a single 8K block could be shared by up
to 8 files, but no individual 1K chunk (and thus no 512-byte device
block) could hold data from more than one file.

Of course compressed filesystems are going to have unpredictable
allocations. The last time I read about JFS, nothing I saw indicated
that a single block is shared between files in the normal mode of
operation.
> If you are doing transaction logging, and you want to make sure
> that before you write to a particular page the pre-image of that
> page is committed to disk, then going through the filesystem is
> dangerous because the OS may re-order those writes. You can get
> around it by sync/etc, but at the cost of performance.
Umm, you have to "sync" whether or not you're going through the
filesystem. Of course I carefully order synchronous writes, just like
any other database that runs on filesystems. (You'll note, for
instance, that Oracle can run on filesystems; the reason not to isn't
correctness, it's performance.)
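To make the ordering concrete, here's a rough sketch of the idea: the
pre-image goes to the log and is fsync()ed before the page itself is
overwritten in place. This is invented illustration code, not anything
out of Cyrus; the function name, PAGE_SIZE, and file descriptors are
all made up:

    /* Sketch of ordered synchronous writes: force the pre-image of a
     * page to the log before overwriting the page in place. */
    #include <unistd.h>
    #include <sys/types.h>

    #define PAGE_SIZE 4096

    int update_page(int datafd, int logfd, off_t pageno,
                    const char *newpage)
    {
        char preimage[PAGE_SIZE];
        off_t off = pageno * PAGE_SIZE;

        /* 1. read the current contents of the page */
        if (pread(datafd, preimage, PAGE_SIZE, off) != PAGE_SIZE)
            return -1;

        /* 2. append the pre-image to the log and force it to disk */
        if (write(logfd, preimage, PAGE_SIZE) != PAGE_SIZE) return -1;
        if (fsync(logfd) < 0) return -1;

        /* 3. only now overwrite the page in place, and force that too */
        if (pwrite(datafd, newpage, PAGE_SIZE, off) != PAGE_SIZE)
            return -1;
        return fsync(datafd);
    }

The point is just that step 2 has to reach stable storage before step 3
starts; without the explicit fsync() the OS is free to reorder the two
writes.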
> True, and you have to define what sort of things you are going to
> protect against and which you won't. I still think if you want a
> 100% guarantee of no uncommitted updates visible after a crash and
> no committed updates lost, you have to use full pre/post-image
> logging (preferably with layered logging to cut down on the log
> volume) to raw devices. That level of protection may not be
> appropriate in all (or even most) cases, but if you need it then
> you need it.
I'm not sure what "full pre/post-image logging" or "layered logging"
are. Gray's Transaction Processing doesn't know what they are,
either. He refers to 'before-image logging'; depending on your model
of the disk, this is appropriate.
Obviously Cyrus is not going to get 100% guarantees of anything, nor
does it try to. We're merely trying to make sure we don't lose mail
that people didn't want lost, or at least not lose it without
low-probability events occurring. Part of not losing mail is making
sure that the index files referencing the mail don't get lost.
cyrusdb_flat relies on the underlying filesystem to ensure correctness
of metadata.
cyrusdb_skiplist relies on the underlying filesystem (when it
checkpoints) and on the assumption that block writes are atomic.
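As a toy illustration of what that assumption buys you: if a record is
written with a single write() and never straddles a disk block
boundary, an atomic block write means a crash leaves it either entirely
on disk or entirely absent. This is invented example code, not the
actual skiplist record layout; DISK_BLOCK and the padding scheme are
assumptions made for the sketch:

    /* Toy illustration of the atomic-block-write assumption; not the
     * actual cyrusdb_skiplist format. */
    #include <unistd.h>
    #include <sys/types.h>

    #define DISK_BLOCK 512

    /* Append a record of len <= DISK_BLOCK bytes so that it never
     * crosses a block boundary, padding with zeroes if it would. */
    int append_record(int fd, const void *rec, size_t len)
    {
        static const char zeroes[DISK_BLOCK];
        off_t end = lseek(fd, 0, SEEK_END);
        size_t left = DISK_BLOCK - (size_t)(end % DISK_BLOCK);

        if (len > left) {
            if (write(fd, zeroes, left) != (ssize_t)left) return -1;
        }
        if (write(fd, rec, len) != (ssize_t)len) return -1;
        return fsync(fd);
    }

Under the atomic-block-write assumption, such a record can never be
half-written after a crash; it is either all there or not there at all.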
cyrusdb_db3 has different requirements. Sleepycat assumes that
database pages are written atomically, but allows for the possibility
that a partial write to a page corrupts other data on that page in a
failure. (The page size is selectable; Cyrus uses whatever
Sleepycat's default is.)
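If you did want something other than the default, the Sleepycat C API
lets you pick the page size before the database file is created. A
minimal sketch; the 4096 here is an arbitrary example value, not a
recommendation and not what Cyrus uses:

    /* Minimal sketch: create a DB handle and set an explicit page
     * size before DB->open().  Error handling abbreviated. */
    #include <db.h>

    int create_with_pagesize(DB **dbpp)
    {
        DB *dbp;
        int r;

        if ((r = db_create(&dbp, NULL, 0)) != 0)
            return r;

        /* must be called before DB->open(); it only takes effect
         * when the database file is first created */
        if ((r = dbp->set_pagesize(dbp, 4096)) != 0) {
            dbp->close(dbp, 0);
            return r;
        }

        /* ... then open the database with DB->open() as usual ... */
        *dbpp = dbp;
        return 0;
    }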
Obviously the filesystem also makes some sort of guess about what
atomic write size the underlying storage system provides.
Larry