Experience with duplicate delivery database deadlocks?

Fri Aug 20 11:19:18 EDT 2004

Gentlefolk,

Does anyone have experiences they'd be willing to share with combatting 
deadlocks within a BDB 3.3 duplicate delivery database on a high-traffic Cyrus 
v2.1.16 (or earlier 2.1.x) server?  We're running a 60,000+ user/1.2 million 
message/day Cyrus postoffice on an 8-way Solaris system, and recently, we've 
started running into increasingly frequent deadlock problems with the 
duplicate suppression database.

The symptoms we're seeing are probably what you'd expect -- our cyrus.conf is 
set to allow up to 120 lmtpd children to run simultaneously, and when we hit a 
deadlock condition in the duplicate suppression database, we find that all 120 
of our running lmtpds lock up waiting for write locks in the database. 
"truss" shows them all stuck in "lwp_sema_wait()" calls.  Inspection of the 
duplicate database after the fact sometimes shows corruption (usually null 
page pointers reported by db_verify), but sometimes shows nothing -- it's 
possible that we're seeing two different problems with the same end effect, 
but I suspect the database corruption is actually a side-effect of the 
deadlock problem...

We've come up with a work-around that at least allows us to correct the 
situation without performing a master restart (with 4000+ simultaneous IMAPS 
connections, a master restart isn't something we can routinely do, 
unfortunately) -- renaming the duplicate delivery database and its log and 
__db* files and then kill -15'ing all the running lmtpds seems to get us back 
to a functional state with a fresh duplicate suppression database.  We're up 
to seeing this happen a bit more than once a day now, though, and it's 
becoming seriously annoying.
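
A minimal Python sketch of the recovery sequence described above, under stated 
assumptions: the caller supplies the full paths of the duplicate delivery 
database, its log files, and its __db* files, along with the PIDs of the 
running lmtpds. The helper names (set_aside, terminate) and the rename suffix 
are hypothetical and not part of Cyrus; how the paths and PIDs are discovered 
is deliberately left out, since it depends on the local installation.

```python
import os
import signal
import time


def set_aside(paths, suffix=None):
    """Rename each existing path to path + suffix and return the new names.

    `paths` is whatever the caller considers the duplicate delivery
    database plus its log and __db* files (an assumption of this sketch).
    """
    suffix = suffix or time.strftime(".stuck-%Y%m%d-%H%M%S")
    moved = []
    for path in paths:
        if os.path.exists(path):
            os.rename(path, path + suffix)
            moved.append(path + suffix)
    return moved


def terminate(pids):
    """Send SIGTERM (the equivalent of kill -15) to each PID.

    PIDs that no longer exist are skipped; the PIDs actually signalled
    are returned.
    """
    signalled = []
    for pid in pids:
        try:
            os.kill(pid, signal.SIGTERM)
            signalled.append(pid)
        except ProcessLookupError:
            pass
    return signalled
```

The files are renamed rather than deleted, so the old database remains 
available for later inspection with db_verify.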

We're using the db3_nosync mechanism (with BDB version 3.3.11) for our dup 
suppression database -- one option we're strongly considering is switching to 
the regular "db3" mechanism (without the nosync option) to try to avoid the 
deadlocks, but we're a bit concerned about what that may do to lmtp 
throughput.  Turning off duplicate suppression is...politically untenable...at 
this point...

We've also considered running the BDB "db_deadlock" utility to periodically 
detect and try to break deadlocks in the duplicate suppression database, but 
that's also somewhat scary -- it's unclear to us exactly how an lmtpd waiting 
on a lock in the duplicate suppression database would behave when its lock 
request was rejected by the db_deadlock daemon...
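
For context, Berkeley DB's documented contract is that when the deadlock 
detector rejects a lock request, the affected call returns DB_LOCK_DEADLOCK, 
and the application is expected to abort its transaction and retry. Whether 
lmtpd in 2.1.x actually does this is the open question here. The pattern can 
be sketched in Python as below; DeadlockVictim, run_with_retry, and the 
operation/abort callables are hypothetical stand-ins, not Cyrus or BDB APIs.

```python
import random
import time


class DeadlockVictim(Exception):
    """Hypothetical stand-in for Berkeley DB's DB_LOCK_DEADLOCK error."""


def run_with_retry(operation, abort, max_attempts=5, base_delay=0.01):
    """Run operation(); if it is chosen as a deadlock victim, abort and retry.

    `operation` performs the database work inside a transaction and
    `abort` rolls that transaction back. Both are supplied by the caller.
    """
    for attempt in range(1, max_attempts + 1):
        try:
            return operation()
        except DeadlockVictim:
            # Release the locks held by the failed attempt before retrying.
            abort()
            if attempt == max_attempts:
                raise
            # Randomized, growing pause so competing processes are less
            # likely to collide again on the next attempt.
            time.sleep(base_delay * attempt * random.random())
```

A caller that does not handle the error this way would presumably fail the 
delivery attempt instead of retrying it.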

Anyone have any experience or wisdom to share about either possible solution, 
or about other things that you've seen work in similar situations?  At this 
point, upgrading to 2.2.x is on our radar, but probably not something we can 
approach before mid-semester (2-3 months out), so suggestions for solutions 
with Cyrus v2.1.x would be most appreciated...

--Thanx much,
--Rob Carter--
---
Cyrus Home Page: http://asg.web.cmu.edu/cyrus
Cyrus Wiki/FAQ: http://cyruswiki.andrew.cmu.edu
List Archives/Info: http://asg.web.cmu.edu/cyrus/mailing-list.html