DBERROR: skiplist recovery errors
Lawrence Greenfield
leg+ at andrew.cmu.edu
Thu Dec 12 21:38:04 EST 2002
Date: Thu, 12 Dec 2002 20:32:50 -0500
From: "John A. Tamplin" <jtampli at sph.emory.edu>
> Correct, but if the power gets cut to the drive during the block
> write, all bets are off for the content of that block. True, most
> of the time you won't get weird failures, but then most of the time
> you don't need logging at all.
Actually, for some drives the claim is that, if they are functioning
correctly, that _won't_ happen. After all, it takes an insignificant
amount of time to write a single block to a modern drive; there is
plenty of power left in the capacitors and the like for it to finish
writing the block.
> AIX JFS allows fragments to be allocated smaller than physical disk
> blocks (and in any case they share a single write to the disk from
> the buffer pool); these are used for the last block of a file or for
> compressed filesystems. Unless I am mistaken, the original 4.3 FFS
> had sub-block fragments as well.
The original FFS worked on devices that were assumed to have 512-byte
blocks. It allocated space in 8K chunks. The tail end of a file could
be allocated in 1K chunks, so a single 8K block could be shared by up
to 8 files, but no individual 1K chunk (and thus no 512-byte device
block) could hold data from more than one file.

Of course compressed filesystems are going to have unpredictable
allocations. The last time I read about JFS, nothing I saw indicated
that a single block is shared between files in the normal mode of
operation.
> If you are doing transaction logging, and you want to make sure
> that before you write to a particular page the pre-image of that
> page is committed to disk, then going through the filesystem is
> dangerous because the OS may re-order those writes. You can get
> around it by sync/etc, but at the cost of performance.
Umm, you have to "sync" whether or not you're going through the
filesystem. Of course I carefully order synchronous writes, just like
any other database that runs on filesystems. (You'll note, for
instance, that Oracle can run on filesystems; the reason not to isn't
correctness, it's performance.)
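To make the ordering concrete, here's a rough sketch of the idea: the
pre-image goes to the log and is fsync()ed before the page itself is
overwritten in place. This is invented illustration code, not anything
out of Cyrus; the function name, PAGE_SIZE, and file descriptors are
all made up:

    /* Sketch of ordered synchronous writes: force the pre-image of a
     * page to the log before overwriting the page in place. */
    #include <unistd.h>
    #include <sys/types.h>

    #define PAGE_SIZE 4096

    int update_page(int datafd, int logfd, off_t pageno,
                    const char *newpage)
    {
        char preimage[PAGE_SIZE];
        off_t off = pageno * PAGE_SIZE;

        /* 1. read the current contents of the page */
        if (pread(datafd, preimage, PAGE_SIZE, off) != PAGE_SIZE)
            return -1;

        /* 2. append the pre-image to the log and force it to disk */
        if (write(logfd, preimage, PAGE_SIZE) != PAGE_SIZE) return -1;
        if (fsync(logfd) < 0) return -1;

        /* 3. only now overwrite the page in place, and force that too */
        if (pwrite(datafd, newpage, PAGE_SIZE, off) != PAGE_SIZE)
            return -1;
        return fsync(datafd);
    }

The point is just that step 2 has to reach stable storage before step 3
starts; without the explicit fsync() the OS is free to reorder the two
writes.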
> True, and you have to define what sort of things you are going to
> protect against and which you won't. I still think if you want a
> 100% guarantee of no uncommitted updates visible after a crash and
> no committed updates lost, you have to use full pre/post-image
> logging (preferably with layered logging to cut down on the log
> volume) to raw devices. That level of protection may not be
> appropriate in all (or even most) cases, but if you need it then
> you need it.
I'm not sure what "full pre/post-image logging" or "layered logging"
are. Gray's Transaction Processing doesn't know what they are,
either. He refers to 'before-image logging'; depending on your model
of the disk, this is appropriate.
Obviously Cyrus is not going to get 100% guarantees of anything, nor
does it try to. We're merely trying to make sure we don't lose mail
that people didn't want lost, or at least not lose it without
low-probability events occurring. Part of not losing mail is making
sure that the index files referencing the mail don't get lost.
cyrusdb_flat relies on the underlying filesystem to ensure correctness
of metadata.
cyrusdb_skiplist relies on the underlying filesystem (when it
checkpoints) and on the assumption that block writes are atomic.
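As a toy illustration of what that assumption buys you: if a record is
written with a single write() and never straddles a disk block
boundary, an atomic block write means a crash leaves it either entirely
on disk or entirely absent. This is invented example code, not the
actual skiplist record layout; DISK_BLOCK and the padding scheme are
assumptions made for the sketch:

    /* Toy illustration of the atomic-block-write assumption; not the
     * actual cyrusdb_skiplist format. */
    #include <unistd.h>
    #include <sys/types.h>

    #define DISK_BLOCK 512

    /* Append a record of len <= DISK_BLOCK bytes so that it never
     * crosses a block boundary, padding with zeroes if it would. */
    int append_record(int fd, const void *rec, size_t len)
    {
        static const char zeroes[DISK_BLOCK];
        off_t end = lseek(fd, 0, SEEK_END);
        size_t left = DISK_BLOCK - (size_t)(end % DISK_BLOCK);

        if (len > left) {
            if (write(fd, zeroes, left) != (ssize_t)left) return -1;
        }
        if (write(fd, rec, len) != (ssize_t)len) return -1;
        return fsync(fd);
    }

Under the atomic-block-write assumption, such a record can never be
half-written after a crash; it is either all there or not there at all.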
cyrusdb_db3 has different requirements. Sleepycat assumes that
database pages are written atomically, but allows for the possibility
that a partial write to a page corrupts other data on that page in a
failure. (The page size is selectable; Cyrus uses whatever
Sleepycat's default is.)
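If you did want something other than the default, the Sleepycat C API
lets you pick the page size before the database file is created. A
minimal sketch; the 4096 here is an arbitrary example value, not a
recommendation and not what Cyrus uses:

    /* Minimal sketch: create a DB handle and set an explicit page
     * size before DB->open().  Error handling abbreviated. */
    #include <db.h>

    int create_with_pagesize(DB **dbpp)
    {
        DB *dbp;
        int r;

        if ((r = db_create(&dbp, NULL, 0)) != 0)
            return r;

        /* must be called before DB->open(); it only takes effect
         * when the database file is first created */
        if ((r = dbp->set_pagesize(dbp, 4096)) != 0) {
            dbp->close(dbp, 0);
            return r;
        }

        /* ... then open the database with DB->open() as usual ... */
        *dbpp = dbp;
        return 0;
    }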
Obviously the filesystem also makes some sort of guess about what
atomic write size the underlying storage system provides.
Larry