Does anyone else see skiplist recovery errors?

Thu Jun 15 02:26:15 EDT 2006

> I'm trying to find out if anyone else sees intermittent skiplist recovery
> problems as we sometimes do, usually after a cyrus restart.
>
> Our particular setup is cyrus 2.3.6, but we've seen this problem with
> everything from 2.1 onwards. This is on x86 with linux 2.6.x, various
> versions. We use skiplists for the seen state db, and that's where the
> problem always occurs.
>
> Basically it seems that after restarting cyrus, sometimes none, sometimes
> a couple, no more than 2 or 3 of the user seen state databases will be
> corrupted. Cyrus will report something along the lines of the following in
> the log:
>
> Jun 13 18:31:00 imapx imap[21178]: DBERROR: skiplist recovery: 01F8 should be INORDER
> Jun 13 18:31:00 imapx imap[21178]: DBERROR: opening /var/cyrus/imapx/user/x/xxx.seen: cyrusdb error
> Jun 13 18:31:00 imapx imap[21178]: DBERROR: skiplist recovery: 01F8 should be INORDER
> Jun 13 18:31:00 imapx imap[21178]: DBERROR: opening /var/cyrus/imapx/user/x/xxx.seen: cyrusdb error
>
> Now checking the mailing list, it appears we're not the only ones who have
> seen this:
>
> http://www.irbs.net/internet/info-cyrus/0507/0075.html
>
> Doing as suggested there (truncating up to the problem point) does seem to
> make the error go away, and at least the skiplist can be recovered and
> written to again. Without that, the skiplist is effectively "dead", and no
> new data can be written to it, annoying in itself. Once the error is
> fixed, and if cyrus runs normally, then we don't see new corrupted seen
> state databases forming. It seems to ONLY happen after a cyrus restart, so
> it sounds like when an imapd shuts down due to a kill signal, it doesn't
> always clean up the skiplist properly or something like that.
>
> What I wanted to do is get an idea of how common this is with people,
> since speaking with Ken the other day he says that they've never seen this
> problem in production on their systems. So I was hoping people who also
> see this could report their cyrus version and OS + hardware config.
> Hopefully we can narrow this down a bit.

I have had problems with skiplist dbs on RedHat distributions from RedHat
7.2 to RHEL4. But IIRC it has _only_ happened under an abnormal system
condition, namely a full filesystem or a hard system crash. I have
never seen any problem with a normal restart of cyrus-imapd.
However, it may also depend on the way cyrus-imapd is stopped by the
system. At least on RedHat/Fedora, the function used by the init scripts
sends a TERM to the master and, if it hasn't died after some time, sends a
KILL, which _could_ result in corrupt on-disk data if I understand it
correctly. Maybe on very large and busy servers, the method used by
RedHat/Fedora is not so good. Maybe the stop function is really important
and should be tuned like those usually used with other slow-stopping
daemons such as squid.
How exactly do you stop cyrus?
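
To make the TERM-then-KILL idea concrete, here is a minimal sketch of a
more patient stop routine. It is not the RedHat/Fedora init function; the
names stop_gracefully and pid_alive, the 60 second timeout and the
allow_kill switch are all made up for illustration. It only shows the
general pattern: send TERM, poll for a generous period, and fall back to
KILL only when explicitly asked to.

```python
import os
import signal
import time


def pid_alive(pid):
    """Return True if a process with this pid still exists."""
    try:
        os.kill(pid, 0)  # signal 0 delivers nothing, it only checks the pid
    except ProcessLookupError:
        return False
    except PermissionError:
        return True  # the process exists but belongs to another user
    return True


def stop_gracefully(pid, timeout=60.0, poll=0.5, allow_kill=False):
    """Send TERM and wait up to `timeout` seconds for the process to exit.

    KILL is sent only if `allow_kill` is set (hypothetical policy switch).
    Returns "exited", "killed" or "still-running".
    """
    os.kill(pid, signal.SIGTERM)
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        if not pid_alive(pid):
            return "exited"
        time.sleep(poll)
    if not pid_alive(pid):
        return "exited"
    if allow_kill:
        os.kill(pid, signal.SIGKILL)
        return "killed"
    return "still-running"
```

You would call stop_gracefully() with the pid of the cyrus master, however
you obtain it on your system. Returning "still-running" instead of killing
by default leaves the decision to the admin, which may be preferable if a
KILL really can leave a skiplist half-written.
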
Anyway, I'm not happy with how we can handle skiplist dbs. There are no
easy recovery tools for fixing things other than doing it by hand. I mean
something which can be automated easily.
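
As a rough idea of what automating the by-hand truncation could look like,
here is a sketch. It rests on assumptions that should be checked first:
that the value in the "should be INORDER" message (01F8 above) is a
hexadecimal byte offset into the file, and that cutting the file at that
offset is the right "problem point". The helper names are hypothetical,
and this is not an existing cyrus tool. It keeps a copy of the original
file and should only be run while cyrus is stopped.

```python
import os
import re
import shutil

# Matches the recovery message quoted above; the hex-offset reading is an assumption.
LOG_RE = re.compile(r"skiplist recovery: ([0-9A-Fa-f]+)\s+should\s+be\s+INORDER")


def offset_from_log(line):
    """Return the value from a recovery log line, parsed as hex, or None."""
    match = LOG_RE.search(line)
    return int(match.group(1), 16) if match else None


def truncate_with_backup(path, offset):
    """Copy `path` to `path + '.bak'`, then truncate `path` to `offset` bytes.

    Refuses offsets outside the file so a bad value cannot empty or extend it.
    Returns the backup path.
    """
    size = os.path.getsize(path)
    if not 0 < offset < size:
        raise ValueError(f"offset {offset} is outside the file (size {size})")
    backup = path + ".bak"
    shutil.copy2(path, backup)  # keep the damaged original for inspection
    os.truncate(path, offset)
    return backup
```

Anything stored after the cut is lost, which for a seen state db means some
seen flags may be forgotten, so keeping the .bak copy around is worthwhile.
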

Simon