DBERROR: skiplist recovery errors

Thu Dec 12 22:14:00 EST 2002

Quoting Lawrence Greenfield <leg+ at andrew.cmu.edu>:

> Actually, for some drives the claim is that, if they are correctly
> functioning, then that _won't_ happen. After all, it takes an
> insignificant amount of time to write a single block to a modern
> drive---there is plenty of power in capacitors et al for it to
> finish writing a block.

Perhaps current drives detect a power failure and refuse to start a write
when they are running on capacitor power (otherwise the same problem exists).
When I tested this quite a few years ago by cutting power to a drive (Seagate
Barracudas, if I remember correctly) while it was writing, the majority of cases
were as you suggest (the block held entirely the old contents or entirely the
new), a few cases had a mix of old and new data, and sporadically I got blocks
with apparently random content.  Maybe this is no longer an issue, but it was at
that time.

>    If you are doing transaction logging, and you want to make sure
>    that before you write to a particular page the pre-image of that
>    page is committed to disk, then going through the filesystem is
>    dangerous because the OS may re-order those writes.  You can get
>    around it by sync/etc, but at the cost of performance.
> 
> Umm, you have to "sync" whether or not you're going through the
> filesystem. Of course I carefully order synchronous writes, just like
> any other database that runs on filesystems. (You'll note, for
> instance, that Oracle can run on filesystems---and the reason not to
> isn't for correctness, it's for performance.)

If you are writing to raw devices, you don't have to do a sync because your
writes never go through the OS buffer cache, so there is nothing to flush.

I have experienced data corruption when running databases on a filesystem, and
not when going to raw disks.  I assume the reason is that the database does not
sync after every write to ensure the ordering but only does so periodically,
which still leaves it vulnerable to some out-of-order write problems during
recovery.
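
To make the ordering concrete, here is a minimal sketch in Python of what
"pre-image on disk before the page is touched" looks like when going through
the filesystem.  The file names, the 4096-byte page size, and the log record
layout (8-byte page number followed by the old page) are all assumptions made
up for illustration, not how any particular database does it.

```python
import os

PAGE_SIZE = 4096                       # assumed page size, illustrative only
O_BINARY = getattr(os, "O_BINARY", 0)  # only exists (and matters) on Windows


def update_page_with_preimage(log_path, data_path, page_no, new_page):
    """Overwrite one page in place, forcing its pre-image to the log first.

    Returns the pre-image that was logged.
    """
    if len(new_page) != PAGE_SIZE:
        raise ValueError("new_page must be exactly one page long")
    data_fd = os.open(data_path, os.O_RDWR | O_BINARY)
    log_fd = os.open(log_path,
                     os.O_WRONLY | os.O_CREAT | os.O_APPEND | O_BINARY, 0o600)
    try:
        # Read the current (old) contents of the page.
        os.lseek(data_fd, page_no * PAGE_SIZE, os.SEEK_SET)
        old_page = os.read(data_fd, PAGE_SIZE)

        # 1. Append the pre-image to the log and force it out of the OS cache.
        os.write(log_fd, page_no.to_bytes(8, "big") + old_page)
        os.fsync(log_fd)

        # 2. Only now overwrite the real page, and force that out as well.
        os.lseek(data_fd, page_no * PAGE_SIZE, os.SEEK_SET)
        os.write(data_fd, new_page)
        os.fsync(data_fd)
    finally:
        os.close(log_fd)
        os.close(data_fd)
    return old_page
```

The two fsync calls per page update are the performance cost being discussed;
syncing only periodically is faster but gives up the ordering guarantee between
step 1 and step 2.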

>    True, and you have to define what sort of things you are going to
>    protect against and which you won't.  I still think if you want
>    100% guarantee of no uncommitted updates visible after a crash and no
>    committed updates lost, you have to use full pre/post-image logging
>    (preferably with layered logging to cut down on the log volume) to
>    raw devices.  That level of protection may not be appropriate in
>    all (or even most) cases, but if you need it then you need it.
> 
> I'm not sure what "full pre/post-image logging" or "layered logging"
> is. Gray's Transaction Processing doesn't know what they are,
> either. He refers to 'before image logging'; depending on your model
> of the disk this is appropriate.

The traditional way of getting atomic transactions is to keep a log of physical
page images.  Before you touch a page, you have to put the pre-image of that page
in the log with the associated transaction id.  When the transaction commits,
you have to make sure all the affected pages are on disk before you can return
success.  This can be done either by writing a post-image of each page to the log
or by making sure the real page is updated.  Sequential writes to the log are
faster than random writes to the database, so you might choose to log post-image
pages instead and then coalesce multiple writes to the database.  With a large
buffer cache, you can turn those random writes into larger sequential blocks,
which makes up for the extra write, yet if the system crashes you can still
recover by rolling forward the post-image pages of all committed transactions.
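
As a rough illustration of the recovery side, here is a toy sketch in Python.
The record format ("pre"/"post"/"commit" tuples) and the function name are
invented for this example, and it assumes a page is only ever touched by one
uncommitted transaction at a time (i.e. page locks are held until commit).

```python
def recover(disk, log):
    """Rebuild a consistent page map from a physical page-image log.

    disk: dict of page_no -> bytes, as found on disk after the crash
    log:  list of records in the order they reached the log:
          ("pre", txid, page_no, image), ("post", txid, page_no, image),
          ("commit", txid)
    """
    committed = {rec[1] for rec in log if rec[0] == "commit"}
    pages = dict(disk)  # work on a copy; the input is left untouched

    # Undo pass: walk the log backwards and restore the pre-image of every
    # page touched by a transaction that never committed.
    for rec in reversed(log):
        if rec[0] == "pre" and rec[1] not in committed:
            _, _, page_no, image = rec
            pages[page_no] = image

    # Redo pass: walk the log forwards and roll forward the post-images of
    # committed transactions whose pages may not have reached the database.
    for rec in log:
        if rec[0] == "post" and rec[1] in committed:
            _, _, page_no, image = rec
            pages[page_no] = image

    return pages


log = [
    ("pre", 1, 0, b"old0"), ("post", 1, 0, b"new0"), ("commit", 1),
    ("pre", 2, 1, b"old1"),          # transaction 2 never committed
]
crashed = {0: b"old0", 1: b"torn"}   # page 1 was half-written at the crash
print(recover(crashed, log))         # {0: b'new0', 1: b'old1'}
```

Page 0 is rolled forward from the committed post-image even though the real
page never made it to disk, and page 1 is put back to its pre-image.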

Layered logging has a physical page log and a logical log.  The physical page
log works the same way, but the logical log now keeps track of higher level
operations such as "insert xxx", "btree split", "delete", etc.  As soon as a
logical operation is complete and its log record is on disk, the associated
physical log pages are no longer needed since they can be reproduced.  Informix
implements this type of logging, as do experimental databases that support
nested transactions.  With this approach, the buffer pool manager has to have
greater control over the ordering of writes as there are more dependencies. If
the only control it has is "flush everything in memory for this filehandle",
then it has to flush more often, resulting in more disk traffic, more random
writes, and lower hit rates.  If the buffer manager has complete control over
write order, it can also do interesting things like queue a pre-image but cancel
it if the transaction is committed before the page gets written to disk.
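
The pre-image cancellation trick can be sketched in a few lines of Python.
This is only a toy model with made-up names: dicts and a list stand in for the
database, the cache, and the log, and it assumes one uncommitted transaction
per page at a time.  It is not taken from Informix or any other real system.

```python
class BufferManager:
    """Toy buffer manager that delays pre-image log writes."""

    def __init__(self, disk):
        self.disk = disk    # page_no -> bytes, stands in for the database
        self.log = []       # stands in for the on-disk log
        self.cache = {}     # page_no -> newest image held in memory
        self.pending = {}   # page_no -> (txid, pre_image) queued, not logged

    def update(self, txid, page_no, new_image):
        # Queue the pre-image instead of logging it right away.
        if page_no not in self.pending:
            current = self.cache.get(page_no, self.disk[page_no])
            self.pending[page_no] = (txid, current)
        self.cache[page_no] = new_image

    def flush_page(self, page_no):
        if page_no not in self.cache:
            return
        # The pre-image must reach the log before uncommitted data
        # overwrites the real page.
        if page_no in self.pending:
            txid, pre_image = self.pending.pop(page_no)
            self.log.append(("pre", txid, page_no, pre_image))
        self.disk[page_no] = self.cache.pop(page_no)

    def commit(self, txid):
        # Pages not yet flushed get a post-image in the log; their queued
        # pre-images are cancelled because they are no longer needed.
        for page_no, (owner, _) in list(self.pending.items()):
            if owner == txid:
                self.log.append(("post", txid, page_no, self.cache[page_no]))
                del self.pending[page_no]
        self.log.append(("commit", txid))
```

If a transaction updates pages 0 and 1, page 0 is flushed early, and then the
transaction commits, only page 0 ever costs a pre-image write; page 1 gets a
post-image and its queued pre-image is dropped.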

> Obviously the filesystem also makes some sort of guess at what sort of
> atomic write size the underlying storage system does.

Even most journaled filesystems make no attempt to ensure that user data is
consistent in the event of a system crash, but instead just protect the
filesystem metadata.  There are certainly many levels of protection and they are
appropriate for different applications.  I suspect there are some large
installations that would prefer the additional protection of a real database,
which gets back to the other poster's request for an SQL backend.

-- 
John A. Tamplin
Unix System Administrator