annotations.db corruption

Fri Apr 7 10:35:44 EDT 2006

Bernhard Reiter wrote:
> Hello *,
> 
> we are currently debugging a problem with the Kolab Groupware server
> <http://www.kolab.org/>.  Cyrus imapd is a important core component of
> the Kolab server and Kolab makes heavy use of mailbox annotations.
> 
> We are using skiplist as backend for the annotations.db and some users
> experienced corruptions of the annotations.db.
> 
> These problems always follow the same pattern:
> 
> 1. Something goes wrong (this is the most interesting part as we
>    don't know by now _what_ exactly goes wrong.)
> 
> 2. A notice on a partial transaction is written to the log: 
>    imap[16722]: skiplist recovery /kolab/var/imapd/annotations.db: found partial txn, not replaying
> 
> 3. When trying to restart imapd it fails and writes to the log:
>    imap[16753]: DBERROR: skiplist recovery /kolab/var/imapd/annotations.db: A9FC8 should be ADD or DELETE
> 
> The actual defect in the annotations.db skiplist file always follows
> a constant pattern too:
> 
> In the "log" part of the skiplist file appears a bunch of nullbytes,
> exactly between two valid transaction, with roughly the length of an
> ADD node:
> 
>     0012e910: 7365 7276 6572 2064 6965 0074 6578 742f
>     0012e920: 706c 6169 6e00 4424 2e1e 0000 0012 e9c4
>     0012e930: 0013 47f4 0013 496c 0013 4ce8 ffff ffff
>                                             ^^^^^^^^^
>                                regular end of ADD entry
>     0012e940: 0000 00ff 0000 0004 0011 79b4 0000 00ff
>               ^^^^^^^^^ ^^^^^^^^^^^^^^^^^^^ ^^^^^^^^^
>               COMMIT    DELETE entry        COMMIT
>     0012e950: 0000 0000 0000 0000 0000 0000 0000 0000
>     0012e960: 0000 0000 0000 0000 0000 0000 0000 0000
>     0012e970: 0000 0000 0000 0000 0000 0000 0000 0000
>     0012e980: 0000 0000 0000 0000 0000 0000 0000 0000
>     0012e990: 0000 0000 0000 0000 0000 0000 0000 0000
>     0012e9a0: 0000 0000 0000 0000 0000 0000 0000 0000
>     0012e9b0: 0000 0000 0000 0000 0000 0000 0000 0000
>     0012e9c0: 0000 0000 0000 0002 0000 0032 706f 6c79
>                         ^^^^^^^^^
>     Regular start of ADD entry
> 
> Issue 840 in the official Kolab bug tracker contains some more
> in-depth information on the problem and our analysis so far:
> https://intevation.de/roundup/kolab/issue840
> 
> Has anyone here on the list a idea what the reason of this could
> possibly be?  Any hints on how to continue debugging and any sound
> theory on the background of this problem would be highly appreciated.

I haven't dug into the skiplist code enough to have any *good* thoughts 
on this, but my gut is that either there is a locking problem, an 
off-by-one error or a process if dying in the middle of a txn.

The Kolab stuff probably hits annotations.db a lot more than a 
standalone Cyrus server, and is probably uncovering a race condition or 
something else related to heavy use.

> One more catch: we have found no way so fare to reproduce this
> problem.  But we have customers how experience it on a more or less
> regular basis (about once a month).

This obviously makes it tough to track down.  Is there any pattern to 
when it happens?

-- 
Kenneth Murchison
Systems Programmer
Project Cyrus Developer/Maintainer
Carnegie Mellon University