Massive Problems (lmtp, db corruption) - Ahh!
Joe Finkle
scrasher21212121 at hotmail.com
Wed Oct 30 15:40:38 EST 2002
Hey all,
We're using Postfix 1.1.11 and Cyrus 2.1.9 on a RH7.3 (ext3) box. We've been
having endless problems since we experienced some data corruption a few days
ago. After resolving some other problems, we're down to the key one: mail
takes >12h to be delivered (and some is never delivered, just queued), and
after an hour or so of running, users are no longer able to log in, or they
log in and have no folders and no mail.
I'm grouping the two issues together because I believe the first is the cause
of the second. Specifically, the box shows a LOT of lmtp, lmtpd, and unix -t
processes (sometimes up to 2000 if the box has been running for several hours
w/o a restart).
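To put numbers on the process pile-up, here is a minimal sketch that counts running processes per command name. It assumes a procps-style `ps` that accepts `-e -o comm=`; the function names `count_commands` and `snapshot` are made up for illustration and are not part of postfix or cyrus.

```python
import subprocess
from collections import Counter


def count_commands(ps_lines):
    """Count processes per command name, given one command name per line."""
    return Counter(line.strip() for line in ps_lines if line.strip())


def snapshot():
    """Take one process-count snapshot of the local box.

    Assumption: `ps -e -o comm=` prints one command name per process
    with no header line (true for procps on Linux).
    """
    result = subprocess.run(
        ["ps", "-e", "-o", "comm="],
        capture_output=True,
        text=True,
        check=True,
    )
    return count_commands(result.stdout.splitlines())


if __name__ == "__main__":
    counts = snapshot()
    # Only the two names mentioned in the post are reported here.
    for name in ("lmtp", "lmtpd"):
        print(name, counts.get(name, 0))
```

Run from cron every few minutes, something like this would show whether the lmtp/lmtpd counts climb steadily or jump at a particular time.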
Simultaneously, the error logs from postfix's lmtp show a lot of these:
Oct 30 16:10:35 bicep postfix/lmtp[1886]: 0113E1641FF: to=<user1 at ourhost.com>,
relay=/export/cyrus/imap/socket/lmtp[/export/cyrus/imap/socket/lmtp], delay=0,
status=deferred (host /export/cyrus/imap/socket/lmtp[/export/cyrus/imap/socket/lmtp]
said: 451 4.3.0 System I/O error)
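To see how many deliveries are hitting this versus other failures, a rough sketch like the following could tally postfix/lmtp log entries by status and reply text. It assumes each entry sits on a single line in the real log file (the excerpt above is only wrapped by the mail client) and that entries follow the layout shown above; `LINE_RE` and `tally_lmtp` are hypothetical names.

```python
import re
from collections import Counter

# Matches postfix/lmtp delivery lines of the shape quoted in the post.
# Assumption: one log entry per line, ending in "status=<word> (<detail>)".
LINE_RE = re.compile(
    r"postfix/lmtp\[\d+\]: (?P<qid>[0-9A-F]+): "
    r"to=<(?P<rcpt>[^>]*)>,.*?status=(?P<status>\w+) \((?P<detail>.*)\)\s*$"
)


def tally_lmtp(lines):
    """Count log entries per (status, reply) pair.

    The reply is whatever follows "said: " in the detail text; if that
    marker is absent, the whole detail string is used instead.
    """
    counts = Counter()
    for line in lines:
        match = LINE_RE.search(line)
        if match is None:
            continue  # not an lmtp delivery line
        reply = match.group("detail").split("said: ", 1)[-1]
        counts[(match.group("status"), reply)] += 1
    return counts
```

Feeding it the lines of the mail log and printing `most_common()` on the result would show whether "451 4.3.0 System I/O error" accounts for most of the deferrals.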
After an hour or so of cyrus and postfix running, the second part of the
problem occurs: users aren't able to log in, or can log in but see no mail
and no folders (except inbox). When this occurs, lots of the following are
printed to the error logs:
Oct 30 16:18:38 bicep imapd[2146]: DBERROR: error fetching user.gary:
DB_RUNRECOVERY: Fatal error, run database recovery
Oct 30 16:18:38 bicep imapd[2146]: DBERROR: error fetching user.gary:
cyrusdb error
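A similar sketch could show which mailbox keys trigger these errors and when the first one appears after a restart. It assumes single-line syslog entries in the "DBERROR: error fetching <key>: <message>" shape quoted above; `DBERROR_RE` and `dberrors_by_key` are illustrative names only.

```python
import re
from collections import Counter

# Assumption: entries look like
#   "<timestamp> <host> imapd[<pid>]: DBERROR: error fetching <key>: <message>"
DBERROR_RE = re.compile(
    r"^(?P<ts>\w{3}\s+\d+ \d\d:\d\d:\d\d) \S+ imapd\[\d+\]: "
    r"DBERROR: error fetching (?P<key>\S+): (?P<msg>.*)$"
)


def dberrors_by_key(lines):
    """Return (counts per (key, message), timestamp of first match or None)."""
    counts = Counter()
    first_seen = None
    for line in lines:
        match = DBERROR_RE.match(line.rstrip("\n"))
        if match is None:
            continue  # not an imapd DBERROR line
        if first_seen is None:
            first_seen = match.group("ts")
        counts[(match.group("key"), match.group("msg"))] += 1
    return counts, first_seen
```

If the errors turn out to cover every mailbox rather than a handful, that would point at the shared database environment rather than individual mailboxes.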
Stopping cyrus and postfix for a minute and then restarting them seems to
remedy the problem for another hour (the mail delivery problem is still
there, but users are able to log in and read mail).
I tried to run "./ctl_cyrusdb -r" but I get the following errors:
ctl_cyrusdb: unable to init environment
fatal error: can't read mailboxes file
Strace shows the problem occurs after reading and closing the mailboxes.db
file. I dumped/reimported the mailboxes.db using ctl_mboxlist and it didn't
help me run ctl_cyrusdb. Coincidentally, running ./ctl_cyrusdb -c in strace
shows it fails on deliver.db.
I'm using an ext3 fs, and I've run fsck several times w/o error.
It seems like one or more cyrus files are corrupt, which is causing lmtp mail
delivery to hang on a large fraction of incoming messages. Because the lmtp
processes hang forever, thousands are active, and the system's IO resources
become exhausted.
This is my theory at least. Does anyone have any ideas about what is going on
and how to go about fixing it?
Thanks,
Lee
P.S. I've also run reconstruct -r on each top-level mailbox. No help.
More information about the Info-cyrus mailing list