BDB and errors...

Tue Mar 14 15:20:29 EST 2006

We're using cyrus 2.3 and everything works fine, except we seem to have 
intermittent problems with BDB 4.2 (specifically the RPM db4-4.2.52-3.1). We 
only use BDB for the delivery db.

In general it works fine. However, if a server crashes and we have to 
reboot it, we almost always then have a problem with the DB.

Probably best to show a sequence of events.

1. Server froze up, so we forced a hard reset
2. Server boots up and remounts everything fine. All partitions are reiserfs 
and mount ok with journal playback
3. We start cyrus. Since the delivery DB is temporary and non-critical, the 
start script explicitly does:

     rm -f /var/imap/db/log.*
     rm -f /var/imap/db/__db*
     rm -f /var/imap/deliver.db

This is to clean out all existing BDB state and information. I can confirm 
that the only files left in the /var/imap/db dir are DB_CONFIG and 
skipstamp, so there appears to be no BDB environment state left.
4. cyrus appears to start fine, but intermittently we see errors in the log 
like:

Mar 14 13:47:25 server1 lmtp[2514]: DBERROR: mystore: error storing 
<441702FE.6070601 at googlemail.com>: DB_PAGE_NOTFOUND: Requested page not 
found

Each time an error like this occurs, it seems to leave a transaction open. 
Running:

(cd /var/imap/db; /usr/bin/db_stat -t -h .)

Normally this shows "Active transactions" as 0, but each time one of the 
above errors appears in the log, the count increases and never decreases. 
Eventually this causes problems: processes appear to get stuck waiting for 
the transaction in a semi-busy loop inside BDB (continuous calls to select 
with a 1/10th of a second timeout), and the checkpointing process can't 
clean up old log files that have open transactions in them. In the end, 
either the transaction count reaches the set_tx_max value and BDB goes into 
an error state, or the server load increases a lot because of the semi-busy 
wait loop.

5. Stopping cyrus, then starting it again with the exact same start script, 
usually fixes the problem.
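
The active transaction count from step 4 could be watched with something 
like the sketch below. The helper name active_txn_count is made up for 
illustration, and the match assumes db_stat prints the count on a line of 
the form "<number> Active transactions"; the label may differ between BDB 
versions, so treat this as a starting point rather than a tested monitor.

```shell
#!/bin/sh
# Sketch only: active_txn_count is a hypothetical helper, not part of cyrus or BDB.
# It reads `db_stat -t` output on stdin and prints the active transaction count.
# Assumption: the count is on a line of the form "<number> Active transactions";
# adjust the match if your db_stat labels the line differently.
active_txn_count() {
    awk '{
        desc = tolower($0)
        sub(/^[ \t]*[0-9]+[ \t]+/, "", desc)   # drop the leading number
        sub(/[. \t]+$/, "", desc)              # drop trailing period/space
        if ($1 ~ /^[0-9]+$/ && desc == "active transactions") {
            print $1
            found = 1
            exit
        }
    }
    END { if (!found) exit 1 }'   # non-zero exit if no matching line was seen
}
```

It would be fed from the same command as above, e.g. 
(cd /var/imap/db; /usr/bin/db_stat -t -h .) | active_txn_count, and the 
result compared against some threshold below set_tx_max.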

That's the bit I don't get. Why would restarting again change anything? It 
seems that we're clearing out exactly the same data in each case, yet 
there's definitely some weird state getting left behind after a hard reboot 
that causes the errors, and I don't know where or why.
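
One way to check whether the same files really are left behind in both 
cases would be to have the cleanup step print what remains, so the listing 
after a hard reset can be compared with the one after a normal restart. 
This is only a sketch: clean_bdb_state and its directory argument are 
illustrative names, not what the real start script contains, and it only 
shows files on disk, not any other state.

```shell
#!/bin/sh
# Sketch only: clean_bdb_state is a hypothetical wrapper around the three
# rm commands from the start script. $1 is the config directory
# (defaults to /var/imap).
clean_bdb_state() {
    confdir=${1:-/var/imap}
    rm -f "$confdir"/db/log.*
    rm -f "$confdir"/db/__db*
    rm -f "$confdir"/deliver.db
    # List whatever is left (including dotfiles) so it can be logged
    # and compared between a post-crash start and a normal restart.
    ls -A "$confdir/db"
}
```

Calling clean_bdb_state /var/imap from the start script and logging the 
output should show DB_CONFIG and skipstamp and nothing else if the 
directory contents really are identical in both cases.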

Has anyone seen anything similar with their servers, or does anyone have 
any idea what could be causing this?

Rob