Experience with duplicate delivery database deadlocks?
Rob Carter
rob at duke.edu
Fri Aug 20 11:19:18 EDT 2004
Gentlefolk,
Does anyone have experiences they'd be willing to share with combatting
deadlocks within a BDB 3.3 duplicate delivery database on a high-traffic Cyrus
v2.1.16 (or earlier 2.1.x) server? We're running a 60,000+ user/1.2 million
message/day Cyrus postoffice on an 8-way Solaris system, and recently, we've
started running into increasingly frequent deadlock problems with the
duplicate suppression database.
The symptoms we're seeing are probably what you'd expect -- our cyrus.conf is
set to allow up to 120 lmtpd children to run simultaneously, and when we hit a
deadlock condition in the duplicate suppression database, we find that all 120
of our running lmtpds lock up waiting for write locks in the database.
"truss" shows them all stuck in "lwp_sema_wait()" calls. Inspection of the
duplicate database after the fact sometimes shows corruption (usually null
page pointers reported by db_verify), but sometimes shows nothing -- it's
possible that we're seeing two different problems with the same end effect,
but I suspect the database corruption is actually a side-effect of the
deadlock problem...
We've come up with a work-around that at least allows us to correct the
situation without performing a master restart (with 4000+ simultaneous IMAPS
connections, a master restart isn't something we can routinely do,
unfortunately) -- renaming the duplicate delivery database and its log and
__db* files and then kill -15'ing all the running lmtpds seems to get us back
to a functional state with a fresh duplicate suppression database. We're up
to seeing this happen a bit more than once a day now, though, and it's
becoming seriously annoying.
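For anyone wanting to script the same recovery, here is a rough Python sketch of the two steps (set the files aside, then kill -15 the lmtpds). It is only an illustration: the helper names set_aside and terminate are made up for this example, the paths in the usage note are placeholders rather than our real layout, and how you collect the lmtpd pids is left to the caller.

```python
import os
import signal
import time
from pathlib import Path


def set_aside(paths, suffix=None):
    """Rename each existing path to '<name>.<suffix>'; return the new paths.

    Hypothetical helper: paths that do not exist are skipped silently.
    """
    suffix = suffix or time.strftime("stale-%Y%m%d%H%M%S")
    moved = []
    for path in map(Path, paths):
        if path.exists():
            target = path.with_name(f"{path.name}.{suffix}")
            path.rename(target)
            moved.append(target)
    return moved


def terminate(pids, sig=signal.SIGTERM):
    """Send sig (SIGTERM is what kill -15 sends) to each pid.

    Returns the pids that were signalled; pids that no longer exist
    are skipped.
    """
    signalled = []
    for pid in pids:
        try:
            os.kill(pid, sig)
            signalled.append(pid)
        except ProcessLookupError:
            pass
    return signalled
```

Usage would look something like set_aside(["/path/to/deliver.db", *Path("/path/to/db").glob("__db.*")]) followed by terminate(lmtpd_pids), where both the paths and lmtpd_pids are placeholders you would fill in for your own installation.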
We're using the db3_nosync mechanism (with BDB version 3.3.11) for our dup
suppression database -- one option we're strongly considering is switching to
the regular "db3" mechanism (without the nosync option) to try to avoid the
deadlocks, but we're a bit concerned about what that may do to lmtp
throughput. Turning off duplicate suppression is...politically untenable...at
this point...
We've also considered running the db3 "db_deadlock" utility to periodically
detect and try to correct deadlock conditions in the duplicate suppression
database, but that's also somewhat scary -- it's unclear to us exactly what
the behavior of an lmtpd awaiting a lock in the duplicate suppression database
would be when its waiting lock got terminated by the db_deadlock daemon...
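As far as I can tell from the Berkeley DB documentation, a lock request that the deadlock detector rejects comes back to the caller with a DB_LOCK_DEADLOCK error, and the caller is then expected to abort its transaction (releasing its locks) and retry. Whether lmtpd in 2.1.x actually does that is the open question. The Python sketch below only illustrates the general abort-and-retry pattern; DeadlockVictim and run_with_retry are hypothetical names invented for the example, not anything from Cyrus or Berkeley DB.

```python
import time


class DeadlockVictim(Exception):
    """Hypothetical stand-in for Berkeley DB's DB_LOCK_DEADLOCK error."""


def run_with_retry(operation, abort, max_attempts=5, delay=0.0):
    """Run operation(); if it is chosen as a deadlock victim, abort and retry.

    operation: callable doing the database work (an assumption of this sketch)
    abort:     callable that rolls back and releases the victim's locks
    Re-raises DeadlockVictim once max_attempts have all failed.
    """
    for attempt in range(1, max_attempts + 1):
        try:
            return operation()
        except DeadlockVictim:
            abort()  # give up our locks so the other waiters can proceed
            if attempt == max_attempts:
                raise
            time.sleep(delay * attempt)  # simple linear backoff before retrying
```

The point of the pattern is that the victim gives up all of its locks before trying again; a process that simply re-requested the same lock while still holding its others would likely recreate the deadlock.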
Anyone have any experience or wisdom to share about either possible solution,
or about other things that you've seen work in similar situations? At this
point, upgrading to 2.2.x is on our radar, but probably not something we can
approach before mid-semester (2-3 months out), so suggestions for solutions
with Cyrus v2.1.x would be most appreciated...
--Thanx much,
--Rob Carter--