Massive Problems (lmtp, db corruption) - Ahh!

Joe Finkle scrasher21212121 at hotmail.com
Wed Oct 30 15:40:38 EST 2002


Hey all,
We're using Postfix 1.1.11 and Cyrus 2.1.9 on a RH 7.3 (ext3) box. We've been 
having endless problems since we experienced some data corruption a few days 
ago. After resolving some other problems, we're down to the key one: mail 
takes >12h to be delivered (and some is never delivered, just queued), and 
after an hour or so of running, users are no longer able to log in, or they 
log in and have no folders and no mail.

I'm grouping the two issues together because I believe the first is the cause 
of the second. Specifically, the box shows a lot of lmtp, lmtpd, and unix -t 
processes (sometimes up to 2000 if the box runs for several hours without a 
restart).
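
One way to track how those process counts grow over time is to tally command names from ps output. This is a minimal sketch, not part of the original setup: the helper names count_commands and snapshot are hypothetical, and it assumes a ps that supports `-eo comm=` (procps on Linux does).

```python
import subprocess
from collections import Counter


def count_commands(ps_output):
    """Count processes per command name from `ps -eo comm=` style output."""
    return Counter(
        line.strip() for line in ps_output.splitlines() if line.strip()
    )


def snapshot():
    """Run ps and return a Counter of command names (assumes procps ps)."""
    # `comm=` prints one command name per process with no header line.
    out = subprocess.run(
        ["ps", "-eo", "comm="], capture_output=True, text=True, check=True
    ).stdout
    return count_commands(out)
```

Calling snapshot() every few minutes and logging the counts for lmtp and lmtpd would show whether the pile-up is steady or comes in bursts.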

Simultaneously, the error logs from Postfix's lmtp show a lot of these:

Oct 30 16:10:35 bicep postfix/lmtp[1886]: 0113E1641FF: 
to=<user1 at ourhost.com>, 
relay=/export/cyrus/imap/socket/lmtp[/export/cyrus/imap/socket/lmtp], 
delay=0, status=deferred (host 
/export/cyrus/imap/socket/lmtp[/export/cyrus/imap/socket/lmtp] 
said: 451 4.3.0 System I/O error)
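
To see what fraction of deliveries end up deferred, and with which reply, entries like the one above can be tallied from the mail log. This is a rough sketch that assumes each entry is a single unwrapped line in the log file (the wrapping above comes from the mail client); LMTP_RE and tally_lmtp_status are hypothetical names.

```python
import re
from collections import Counter

# Matches one unwrapped Postfix lmtp delivery record.
LMTP_RE = re.compile(
    r"postfix/lmtp\[\d+\]: (?P<qid>[0-9A-F]+): "
    r"to=<(?P<rcpt>[^>]+)>,.*?status=(?P<status>\w+) \((?P<reason>.*)\)\s*$"
)


def tally_lmtp_status(lines):
    """Count (status, reply text) pairs across Postfix lmtp log lines."""
    counts = Counter()
    for line in lines:
        m = LMTP_RE.search(line)
        if m:
            # Keep only the server reply after "said: " when it is present.
            reason = m.group("reason").split("said: ", 1)[-1]
            counts[(m.group("status"), reason)] += 1
    return counts
```

Feeding it the lines of the mail log (the path varies by system) gives a count per status and reply, which makes it easy to see whether "451 4.3.0 System I/O error" accounts for most of the deferrals.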


After an hour or so of Cyrus and Postfix running, the second part of the 
problem occurs: users aren't able to log in, or they can log in but see no 
mail and no folders (except inbox). When this occurs, lots of the following 
are printed to the error logs:

Oct 30 16:18:38 bicep imapd[2146]: DBERROR: error fetching user.gary: 
DB_RUNRECOVERY: Fatal error, run database recovery
Oct 30 16:18:38 bicep imapd[2146]: DBERROR: error fetching user.gary: 
cyrusdb error
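
To find out whether these errors hit every mailbox or only a few, the DBERROR entries can be grouped by mailbox name. This is a sketch under the same assumption as before (one log entry per line in the actual log file); DBERROR_RE and dberrors_by_mailbox are hypothetical names.

```python
import re
from collections import defaultdict

# Matches "<proc>[pid]: DBERROR: error fetching <mailbox>: <message>".
DBERROR_RE = re.compile(
    r"(?P<proc>\w+)\[\d+\]: DBERROR: error fetching "
    r"(?P<mbox>[^:]+): (?P<msg>.*)$"
)


def dberrors_by_mailbox(lines):
    """Map each mailbox name to the set of DBERROR messages logged for it."""
    errors = defaultdict(set)
    for line in lines:
        m = DBERROR_RE.search(line.rstrip())
        if m:
            errors[m.group("mbox")].add(m.group("msg"))
    return dict(errors)
```

If nearly every mailbox shows DB_RUNRECOVERY at once, that points at the shared database environment rather than at individual mailboxes.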

Stopping Cyrus and Postfix for a minute and then restarting them seems to 
remedy the problem for another hour (the mail delivery problem is still 
there, but users are able to log in and read mail).

I tried to run "./ctl_cyrusdb -r" but I get the following errors:

ctl_cyrusdb: unable to init environment
fatal error: can't read mailboxes file

strace shows the problem occurs after reading and closing the mailboxes.db 
file. I dumped and reimported mailboxes.db using ctl_mboxlist, but it didn't 
help me run ctl_cyrusdb. Coincidentally, running ./ctl_cyrusdb -c under strace 
shows it fails on deliver.db.

I'm using an ext3 filesystem, and I've run fsck several times without errors.

It seems like one or more Cyrus files are corrupt, which causes lmtp mail 
delivery to hang on a large fraction of the messages being delivered. 
Because the lmtp processes hang forever, thousands are active, and the 
system's I/O resources become exhausted.

This is my theory, at least. Does anyone have any ideas about what is going 
on and how to go about fixing it?

Thanks,
Lee

P.S. I've also run reconstruct -r on each top-level mailbox. No help.







More information about the Info-cyrus mailing list