Experience with duplicate delivery database deadlocks?

Paul M Fleming pfleming at siumed.edu
Fri Aug 20 13:16:40 EDT 2004


We had to limit the number of lmtpd processes and let sendmail do the
queuing. We're on smaller hardware with far fewer accounts and
messages/day, but on a PIII with 1 GB of RAM we found that 10-12 lmtpd
processes was the sweet spot for consistently preventing deadlocks.
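
As a rough sketch, a cap like this would normally be set with maxchild
on the lmtp service line in the SERVICES section of cyrus.conf. The
service name and socket path below are illustrative assumptions, not
copied from a real config; maxchild is the only setting that matters
here:

    SERVICES {
      # cap concurrent lmtpd children; mail beyond this waits in the MTA queue
      lmtpunix  cmd="lmtpd" listen="/var/imap/socket/lmtp" prefork=0 maxchild=12
    }

The right number will depend on hardware and load, so treat 12 as a
starting point to tune rather than a recommendation.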

Rob Carter wrote:
> 
> Gentlefolk,
> 
> Does anyone have experiences they'd be willing to share with combatting
> deadlocks within a BDB 3.3 duplicate delivery database on a high-traffic Cyrus
> v2.1.16 (or earlier 2.1.x) server?  We're running a 60,000+ user/1.2 million
> message/day Cyrus postoffice on an 8-way Solaris system, and recently, we've
> started running into increasingly frequent deadlock problems with the
> duplicate suppression database.
> 
> The symptoms we're seeing are probably what you'd expect -- our cyrus.conf is
> set to allow up to 120 lmtpd children to run simultaneously, and when we hit a
> deadlock condition in the duplicate suppression database, we find that all 120
> of our running lmtpds lock up waiting for write locks in the database.
> "truss" shows them all stuck in "lwp_sema_wait()" calls.  Inspection of the
> duplicate database after the fact sometimes shows corruption (usually null
> page pointers reported by db_verify), but sometimes shows nothing -- it's
> possible that we're seeing two different problems with the same end effect,
> but I suspect the database corruption is actually a side-effect of the
> deadlock problem...
> 
> We've come up with a work-around that at least allows us to correct the
> situation without performing a master restart (with 4000+ simultaneous IMAPS
> connections, a master restart isn't something we can routinely do,
> unfortunately) -- renaming the duplicate delivery database and its log and
> __db* files and then kill -15'ing all the running lmtpds seems to get us back
> to a functional state with a fresh duplicate suppression database.  We're up
> to seeing this happen a bit more than once a day now, though, and it's
> becoming seriously annoying.
> 
> We're using the db3_nosync mechanism (with BDB version 3.3.11) for our dup
> suppression database -- one option we're strongly considering is switching to
> the regular "db3" mechanism (without the nosync option) to try to avoid the
> deadlocks, but we're a bit concerned about what that may do to lmtp
> throughput.  Turning off duplicate suppression is...politically untenable...at
> this point...
> 
> We've also considered running the db3 "db_deadlock" routine to periodically
> detect and try to correct deadlock conditions in the duplicate suppression
> database, but that's also somewhat scary -- it's unclear to us exactly what
> the behavior of an lmtpd awaiting a lock in the duplicate suppression database
> would be when its waiting lock got terminated by the db_deadlock daemon...
> 
> Anyone have any experience or wisdom to share about either possible solution,
> or about other things that you've seen work in similar situations?  At this
> point, upgrading to 2.2.x is on our radar, but probably not something we can
> approach before mid-semester (2-3 months out), so suggestions for solutions
> with Cyrus v2.1.x would be most appreciated...
> 
> --Thanx much,
> --Rob Carter--
> ---
> Cyrus Home Page: http://asg.web.cmu.edu/cyrus
> Cyrus Wiki/FAQ: http://cyruswiki.andrew.cmu.edu
> List Archives/Info: http://asg.web.cmu.edu/cyrus/mailing-list.html




More information about the Info-cyrus mailing list