What version of BDB are people using?

Robert Mueller robm at fastmail.fm
Fri Jun 9 08:37:45 EDT 2006


I'm just trying to get an informal survey of which version of Berkeley DB 
people are using successfully in large cyrus environments. We're currently 
using:

db4-4.2.52-3.1 - old redhat based machines
libdb4.2.52-18 - newer debian based machines

Both of them seem to be a bit flaky. We only use BDB for the deliver_db 
and use:

duplicate_db: berkeley-nosync

For the others we use the recommended skiplist (mailboxes, seen) or flat 
file (sub).

Basically what we see is that every now and then something goes wrong 
somewhere inside BDB and causes lots of processes to get caught in a "busy 
wait" loop. Stracing those processes, you see something like this:

select(0, NULL, NULL, NULL, {0, 2000}) = 0 (Timeout)
select(0, NULL, NULL, NULL, {0, 2000}) = 0 (Timeout)
select(0, NULL, NULL, NULL, {0, 2000}) = 0 (Timeout)
...

This repeats over and over very quickly, since each sleep is only 2 ms (the 
{0, 2000} timeout is 0 seconds plus 2000 microseconds). Once this starts 
happening, lots of processes get caught in this state very quickly and the 
load on the machine skyrockets. If you run the BDB tool "db_stat" on the 
environment, you'll see the transaction count quickly increase towards 
whatever is set as set_tx_max in DB_CONFIG. Once it hits that, BDB goes into 
an error state and starts filling the cyrus logs with errors, and you have to 
completely restart cyrus and delete the dbs. It happens anywhere between 
twice a week and once every 2 months per machine. It's very unpredictable, 
and it's hard to work out what's causing it or what's going on.
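
For anyone who wants to check for the same symptom, here is a rough Python 
sketch that scans saved strace output for long runs of the select() timeout 
shown above. It is only an illustration: the function names and the threshold 
of 100 are made up, and the pattern matches only the strace format quoted in 
this message (other strace versions may print the timeout differently).

```python
import re

# Matches the zero-fd select() timeouts in the strace format quoted above.
MICROSLEEP = re.compile(
    r"select\(0, NULL, NULL, NULL, \{\d+, \d+\}\)\s*=\s*0 \(Timeout\)"
)


def longest_microsleep_run(strace_lines):
    """Return the longest run of consecutive select() microsleep timeouts."""
    longest = current = 0
    for line in strace_lines:
        if MICROSLEEP.search(line):
            current += 1
            longest = max(longest, current)
        else:
            # Any other syscall (or a blank line) ends the current run.
            current = 0
    return longest


def looks_stuck(strace_lines, threshold=100):
    # threshold is an arbitrary, hypothetical cut-off; tune it for your system.
    return longest_microsleep_run(strace_lines) >= threshold
```

You would feed it the lines of a file written with "strace -o", for example 
looks_stuck(open("trace.txt")), where trace.txt is a placeholder name.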

Given the way it's calling select() over and over as a "microsleep" 
mechanism, it looks like it's waiting for a flag in shared memory that never 
gets set, perhaps because of a deadlock, which in turn causes every other 
process accessing the db to end up busy waiting as well. Of course, that's 
just a guess.
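
To make the guess above concrete, here is a minimal Python sketch of the 
pattern on a Unix system: a select() call with no file descriptors used as a 
short sleep, inside a loop that polls a flag. This is not BDB's actual code, 
and the names microsleep, wait_for_flag and max_wait are invented for the 
example. The max_wait limit is an addition so the sketch terminates; a loop 
without such a limit would spin forever if the flag is never set, which is 
the behaviour guessed at above.

```python
import select
import time


def microsleep(seconds=0.002):
    # select() with no file descriptors simply waits for the timeout.
    # 0.002 s corresponds to the {0, 2000} timeout seen in the strace output.
    select.select([], [], [], seconds)


def wait_for_flag(is_set, max_wait=0.05):
    """Poll is_set() with microsleeps; give up after max_wait seconds.

    Returns (flag_was_set, number_of_polls).
    """
    deadline = time.monotonic() + max_wait
    polls = 0
    while not is_set():
        if time.monotonic() >= deadline:
            return False, polls
        microsleep()
        polls += 1
    return True, polls
```

Calling wait_for_flag(lambda: False) stands in for a waiter whose flag is 
never set: every poll is one more select() timeout like those in the trace.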

So what I'm wondering is:
1. Has anyone else seen this sort of behaviour?
2. What versions of BDB are other people using successfully?
3. What size installation are you using it on (number of mailboxes? messages 
per minute delivered?)?
4. Has anyone had any success using the berkeley-hash-nosync option? I tried 
that, and it gave me errors about "invalid page 0 type" or something like 
that pretty quickly.

I'm hoping we can build up some consensus of what the most stable version of 
BDB to use with cyrus is...

Thanks

Rob


