What version of BDB are people using?
Robert Mueller
robm at fastmail.fm
Fri Jun 9 08:37:45 EDT 2006
I'm just trying to get an informal survey of which version of Berkeley DB
people are using successfully in large cyrus environments. We're currently
using:
db4-4.2.52-3.1 - old redhat based machines
libdb4.2.52-18 - newer debian based machines
Both of them seem to be a bit "flaky". We only use BDB for the deliver_db
and use:
duplicate_db: berkeley-nosync
For the others we use the recommended skiplist (mailboxes, seen) or flat
file (sub).
Basically what we see is that every now and then something goes wrong
somewhere inside BDB and causes lots of processes to get caught in a "busy
wait" loop. Stracing those processes, you see something like this:
select(0, NULL, NULL, NULL, {0, 2000}) = 0 (Timeout)
select(0, NULL, NULL, NULL, {0, 2000}) = 0 (Timeout)
select(0, NULL, NULL, NULL, {0, 2000}) = 0 (Timeout)
...
Just over and over again very quickly (since each sleep is only 2000
microseconds, i.e. 2 ms). Once this starts happening, lots of processes start
getting caught in this state very quickly and the load on the machine
skyrockets. If you run the BDB tool "db_stat" on the environment, you'll see
the transaction count quickly increase towards whatever is set as set_tx_max
in DB_CONFIG. Once it hits that, BDB goes into an error state, starts
filling the cyrus logs with errors, and you have to completely restart cyrus
and delete the dbs. It tends to happen between twice a week and once every 2
months per machine. It's very unpredictable when it happens, and hard to
actually work out what's causing it or what's going on.
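For anyone wanting to watch for the runaway transaction count described above, here is a rough Python sketch that runs "db_stat -t" against the environment and compares the active transaction count with the configured maximum. The environment path, the helper names and the two label strings are assumptions for illustration (the labels are what I'd expect from a 4.x db_stat, so check them against your own output); it's a sketch, not a tested monitoring tool.

```python
import subprocess


def run_db_stat(env_home):
    """Run `db_stat -t` for the given BDB environment and return its output.

    env_home is a hypothetical path such as the directory holding DB_CONFIG.
    """
    result = subprocess.run(
        ["db_stat", "-t", "-h", env_home],
        capture_output=True, text=True, check=True,
    )
    return result.stdout


def parse_txn_stats(text):
    """Turn lines of the form '<number> <label>' into a {label: number} dict.

    Lines whose first field is not a plain decimal number are skipped.
    """
    stats = {}
    for line in text.splitlines():
        parts = line.strip().split(None, 1)
        if len(parts) == 2 and parts[0].isdigit():
            stats[parts[1].strip()] = int(parts[0])
    return stats


def txn_headroom(
    stats,
    # Both labels are assumptions about db_stat's wording; adjust as needed.
    active_label="Active transactions",
    max_label="Maximum number of active transactions configured",
):
    """Return how many more transactions can start before the limit is hit."""
    return stats[max_label] - stats[active_label]
```

Something like txn_headroom(parse_txn_stats(run_db_stat("/path/to/env"))) could then be polled from cron, raising an alert when the headroom drops towards zero, which is the point where we see BDB go into its error state.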
Given the way it's calling select() over and over as a "microsleep"
mechanism, it seems like it's waiting for some flag to be set in some shared
memory that's never being set due to a deadlock or something, thus causing
every other process accessing the db to busy wait deadlock as well. Of
course, that's just a guess.
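To illustrate the pattern I'm guessing at, here is a minimal Python sketch (Unix only) of a poll-and-microsleep loop built on select() with no file descriptors, which produces exactly the kind of strace output shown above. The function names and the max_polls cut-off are made up for the example; the real loop inside BDB presumably has no such limit, which is why a flag that never gets set turns into an endless busy wait.

```python
import select


def microsleep(seconds):
    """Sleep for a short time by calling select() with no file descriptors.

    With nothing to watch, select() just blocks until the timeout expires.
    """
    select.select([], [], [], seconds)


def wait_for_flag(is_set, interval=0.002, max_polls=5):
    """Poll is_set() until it returns True, sleeping `interval` between polls.

    Returns the number of polls it took, or None if max_polls was reached
    without the flag ever being set (the "stuck" case).
    """
    for polls in range(1, max_polls + 1):
        if is_set():
            return polls
        microsleep(interval)  # shows up in strace as select(0, NULL, ...)
    return None
```

If whichever process is supposed to set the flag has died or is itself blocked, every waiter ends up in the None case forever, and each new process that touches the db joins the pile.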
So what I'm wondering is:
1. Has anyone else seen this sort of behaviour?
2. What versions of BDB are other people using successfully?
3. What size installation are you using it on (number of mailboxes? messages
per minute delivered?)
4. Has anyone had any success using the berkeley-hash-nosync option? I tried
that, and it gave me errors about "invalid page 0 type" or something like
that pretty quickly.
I'm hoping we can build up some consensus of what the most stable version of
BDB to use with cyrus is...
Thanks
Rob