IMAP/LMTP/Quota locking problem
Igor Brezac
igor at ipass.net
Mon May 17 17:32:14 EDT 2004
On Mon, 17 May 2004, Scott Adkins wrote:
> I do know that as of 2.2.1, the problem still exists. I find it
> impossible to trigger the bug on my own in our test environment, and we
> won't be doing an upgrade to 2.2.3 (or whatever happens to be out) until
> the summer time frame.
>
> Currently, we manage the problem by watching the sendmail syslog file for
> System I/O errors on the Cyrus mailer (grep for 'mailer=cyrus.*System.I.O'),
> and when we see some, we run another script that tells us which Cyrus
> process has the write lock on the user's quota file, and we kill that
> process.
>
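A rough sketch of that kind of watcher, in case anyone wants to script it
themselves (the log and quota paths below are assumptions, and this is not
Scott's actual script):

#!/bin/sh
# Watch the sendmail syslog for Cyrus mailer I/O errors and, when they
# appear, list the processes that have files open under the quota
# directory so an admin can decide which one to kill.
# Both paths are assumptions and will differ per installation.
MAILLOG=/var/log/maillog
QUOTADIR=/var/lib/imap/quota

if grep -q 'mailer=cyrus.*System.I.O' "$MAILLOG"; then
    echo "Cyrus mailer I/O errors found; processes with quota files open:"
    # lsof +D recursively reports open files under the directory; on Linux,
    # /proc/locks will additionally tell you which PID holds the write lock.
    lsof +D "$QUOTADIR"
fi
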
> The problem occurs regularly, with some users more reliably triggering the
> problem than others.
>
> For those who are curious, our Cyrus server runs on a cluster of 3 machines
> running Tru64 UNIX 5.2 on Alpha.
I have not seen this happen with lmtpd, but I do get an occasional error of
this kind when a mail client (Pine or Outlook Express) saves a message to
the sent folder via the IMAP protocol:

May 17 10:35:56 imaps[9711]: [ID 136705 local6.error] IOERROR: opening /export/cyrus/mail/M/user/igor/sent-mail/cyrus.header: No such file or directory

Unfortunately, I cannot reproduce this problem because it occurs very
infrequently. This particular server runs Cyrus v2.2.2. I will try CVS HEAD.
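
When a cyrus.header goes missing like that, the usual way to get the mailbox
back into shape is to run reconstruct as the cyrus user. A rough sketch (the
binary path is the stock CMU install location and may well differ on a
packaged build):

# Rebuild the mailbox metadata (cyrus.index, and usually a missing
# cyrus.header as well) from the message files and the mailboxes.db entry.
# The path to reconstruct is an assumption; run it as the cyrus user.
su cyrus -c "/usr/cyrus/bin/reconstruct -r user.igor.sent-mail"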
-Igor
> Scott
>
> --On Monday, May 17, 2004 2:47 PM +0200 Sebastian Hagedorn
> <Hagedorn at uni-koeln.de> wrote:
>
> > Hi,
> >
> > sorry, this is quite long ... I'm looking for the updated status of the
> > problem discussed in the threads "stuck lmtpd processes":
> >
> > <http://asg.web.cmu.edu/archive/message.php?mailbox=archive.info-cyrus&msg=24927>
> >
> > and "followup: stuck lmtpd processes":
> >
> > <http://asg.web.cmu.edu/archive/message.php?mailbox=archive.info-cyrus&searchterm=stuck%20lmtpd&msg=24934>
> >
> > AFAICT Cyrus IMAP 2.1.x hasn't been updated since. Could anyone comment
> > on whether this is fixed in 2.2.3 and if we *have* to update in order to
> > get that fix? In the last two weeks we've had this (or a very similar)
> > problem twice with 2.1.16 and Red Hat AS 2.1.
> >
> > For some reason in both cases tons of sendmail processes were building
> > up. We have configured sendmail to deliver in "interactive" mode, so each
> > instance tries to connect to lmtpd. I would've expected those attempts to
> > time out relatively quickly. I'm not sure why they don't. Anyway, that's
> > somewhat off-topic ...
> >
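For what it's worth, the delivery mode is an option in sendmail.cf
(confDELIVERY_MODE in the m4 config), and it is easy to confirm what a
machine is actually running with (the path is the usual default and may
differ):

# Show the configured delivery mode; with "interactive", every incoming
# message keeps its sendmail process alive while it waits on the cyrus
# LMTP mailer, which is why sendmail processes pile up when lmtpd stops
# answering.  /etc/mail/sendmail.cf is an assumed path.
grep -i '^O DeliveryMode' /etc/mail/sendmail.cf
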
> > The first time it happened I was around and collected some info while it
> > was happening. I'd noticed that mail was no longer being delivered. We
> > have maxprocs set to 30 for lmtpd:
> >
> > [root@lvr1 root]# ps -aef|grep lmtp
> > cyrus 1383 2095 0 10:10 ? 00:00:18 lmtpd
> > cyrus 1974 2095 0 10:11 ? 00:00:00 lmtpd
> > cyrus 10631 2095 0 10:36 ? 00:00:00 lmtpd
> > cyrus 17162 2095 0 10:52 ? 00:00:00 lmtpd
> > cyrus 17262 2095 0 10:52 ? 00:00:06 lmtpd
> > cyrus 25687 2095 0 11:17 ? 00:00:00 lmtpd
> > cyrus 28377 2095 0 11:23 ? 00:00:09 lmtpd
> > cyrus 31251 2095 0 11:32 ? 00:00:00 lmtpd
> > cyrus 3824 2095 0 11:46 ? 00:00:07 lmtpd
> > cyrus 7534 2095 0 11:56 ? 00:00:00 lmtpd
> > cyrus 8248 2095 0 11:58 ? 00:00:08 lmtpd
> > cyrus 22103 2095 0 12:35 ? 00:00:00 lmtpd
> > cyrus 23725 2095 0 12:39 ? 00:00:01 lmtpd
> > cyrus 25366 2095 0 12:45 ? 00:00:00 lmtpd
> > cyrus 6108 2095 0 13:23 ? 00:00:00 lmtpd
> > cyrus 10832 2095 0 13:37 ? 00:00:00 lmtpd
> > cyrus 12972 2095 0 13:42 ? 00:00:00 lmtpd
> > cyrus 16114 2095 0 13:53 ? 00:00:00 lmtpd
> > cyrus 21377 2095 0 14:07 ? 00:00:09 lmtpd
> > cyrus 24581 2095 0 14:16 ? 00:00:00 lmtpd
> > cyrus 26200 2095 0 14:21 ? 00:00:14 lmtpd
> > cyrus 26577 2095 0 14:22 ? 00:00:00 lmtpd
> > cyrus 31893 2095 0 14:35 ? 00:00:03 lmtpd
> > cyrus 32575 2095 0 14:36 ? 00:00:07 lmtpd
> > cyrus 32659 2095 0 14:37 ? 00:00:01 lmtpd
> > cyrus 14146 2095 0 15:16 ? 00:00:00 lmtpd
> > cyrus 21049 2095 0 15:36 ? 00:00:00 lmtpd
> > cyrus 26969 2095 0 15:56 ? 00:00:11 lmtpd
> > cyrus 13275 2095 0 15:15 ? 00:00:04 lmtpd
> > cyrus 17679 2095 0 13:57 ? 00:00:00 lmtpd
> >
> > Obviously something was wrong.
> >
> > [root@lvr1 root]# lsof -p 26969
> > COMMAND PID USER FD TYPE DEVICE SIZE NODE NAME
> > lmtpd 26969 cyrus cwd DIR 9,0 4096 393296 /root
> > <snip>
> > lmtpd 26969 cyrus mem REG 8,73 3841576 30
> > /var/lib/imap/mailboxes.db
> > lmtpd 26969 cyrus mem REG 8,49 194 145326099
> > /var/spool/imap/M/user/therbst/cyrus.header
> > <snip>
> > lmtpd 26969 cyrus 9u REG 8,49 3130 56013510
> > /var/spool/imap/stage./26969-1083938870
> > lmtpd 26969 cyrus 10u unix 0xef1ce040 614472274 socket
> > lmtpd 26969 cyrus 11u REG 8,49 194 145326099
> > /var/spool/imap/M/user/therbst/cyrus.header
> >
> > I gathered that /var/spool/imap/M/user/therbst/cyrus.header was causing
> > the problems. Its node number (is that the inode?) is 145326099:
> >
> > [root@lvr1 root]# grep 145326099 /proc/locks
> > 1162: POSIX ADVISORY WRITE 1974 08:31:145326099 0 EOF e9b58d54 e9b587f4
> > cd5a7964 00000000 d1c324c0
> > 1162: -> POSIX ADVISORY WRITE 1383 08:31:145326099 0 EOF d1c324b4
> > c7fd745c f3bd0964 e9b58d54 c7fd7464
> > 1162: -> POSIX ADVISORY WRITE 10631 08:31:145326099 0 EOF c7fd7458
> > d1c3245c d1c324b8 e9b58d54 d1c32464
> > 1162: -> POSIX ADVISORY WRITE 17162 08:31:145326099 0 EOF d1c32458
> > d6989400 c7fd745c e9b58d54 d6989408
> > 1162: -> POSIX ADVISORY WRITE 17262 08:31:145326099 0 EOF d69893fc
> > dfb2eca0 d1c3245c e9b58d54 dfb2eca8
> > 1162: -> POSIX ADVISORY WRITE 25687 08:31:145326099 0 EOF dfb2ec9c
> > f3972be8 d6989400 e9b58d54 f3972bf0
> > 1162: -> POSIX ADVISORY WRITE 28377 08:31:145326099 0 EOF f3972be4
> > c47d96e0 dfb2eca0 e9b58d54 c47d96e8
> > 1162: -> POSIX ADVISORY WRITE 31251 08:31:145326099 0 EOF c47d96dc
> > cd5a7d58 f3972be8 e9b58d54 cd5a7d60
> > 1162: -> POSIX ADVISORY WRITE 7534 08:31:145326099 0 EOF cd5a7d54
> > c47d98ac c47d96e0 e9b58d54 c47d98b4
> > 1162: -> POSIX ADVISORY WRITE 3824 08:31:145326099 0 EOF c47d98a8
> > c7fd7400 cd5a7d58 e9b58d54 c7fd7408
> > 1162: -> POSIX ADVISORY WRITE 8248 08:31:145326099 0 EOF c7fd73fc
> > c8733be8 c47d98ac e9b58d54 c8733bf0
> > 1162: -> POSIX ADVISORY WRITE 22103 08:31:145326099 0 EOF c8733be4
> > cd197f24 c7fd7400 e9b58d54 cd197f2c
> > 1162: -> POSIX ADVISORY WRITE 25366 08:31:145326099 0 EOF cd197f20
> > d1c32290 c8733be8 e9b58d54 d1c32298
> > 1162: -> POSIX ADVISORY WRITE 23725 08:31:145326099 0 EOF d1c3228c
> > c47d9964 cd197f24 e9b58d54 c47d996c
> > 1162: -> POSIX ADVISORY WRITE 6108 08:31:145326099 0 EOF c47d9960
> > f3bd0234 d1c32290 e9b58d54 f3bd023c
> > 1162: -> POSIX ADVISORY WRITE 10832 08:31:145326099 0 EOF f3bd0230
> > c47d9290 c47d9964 e9b58d54 c47d9298
> > 1162: -> POSIX ADVISORY WRITE 12972 08:31:145326099 0 EOF c47d928c
> > d1b450c4 f3bd0234 e9b58d54 d1b450cc
> > 1162: -> POSIX ADVISORY WRITE 16114 08:31:145326099 0 EOF d1b450c0
> > dfb2e120 c47d9290 e9b58d54 dfb2e128
> > 1162: -> POSIX ADVISORY WRITE 26577 08:31:145326099 0 EOF dfb2e11c
> > f6bfb7f4 d1b450c4 e9b58d54 f6bfb7fc
> > 1162: -> POSIX ADVISORY WRITE 24581 08:31:145326099 0 EOF f6bfb7f0
> > f6bfb3a4 dfb2e120 e9b58d54 f6bfb3ac
> > 1162: -> POSIX ADVISORY WRITE 21377 08:31:145326099 0 EOF f6bfb3a0
> > c87333a4 f6bfb7f4 e9b58d54 c87333ac
> > 1162: -> POSIX ADVISORY WRITE 32575 08:31:145326099 0 EOF c87333a0
> > d1c32628 f6bfb3a4 e9b58d54 d1c32630
> > 1162: -> POSIX ADVISORY WRITE 32659 08:31:145326099 0 EOF d1c32624
> > edb35e10 c87333a4 e9b58d54 edb35e18
> > 1162: -> POSIX ADVISORY WRITE 31893 08:31:145326099 0 EOF edb35e0c
> > edb35798 d1c32628 e9b58d54 edb357a0
> > 1162: -> POSIX ADVISORY WRITE 14146 08:31:145326099 0 EOF edb35794
> > cdf5ff24 edb35e10 e9b58d54 cdf5ff2c
> > 1162: -> POSIX ADVISORY WRITE 21049 08:31:145326099 0 EOF cdf5ff20
> > e8adbca0 edb35798 e9b58d54 e8adbca8
> > 1162: -> POSIX ADVISORY WRITE 26200 08:31:145326099 0 EOF e8adbc9c
> > cd5a7798 cdf5ff24 e9b58d54 cd5a77a0
> > 1162: -> POSIX ADVISORY WRITE 26969 08:31:145326099 0 EOF cd5a7794
> > d1695628 e8adbca0 e9b58d54 d1695630
> > 1162: -> POSIX ADVISORY WRITE 13275 08:31:145326099 0 EOF d1695624
> > d1b45e6c cd5a7798 e9b58d54 d1b45e74
> > 1162: -> POSIX ADVISORY WRITE 17679 08:31:145326099 0 EOF d1b45e68
> > cdf5f5cc d1695628 e9b58d54 e9b58d60
> >
> > The PIDs waiting for locks correspond to the PIDs of the stuck lmtpd
> > processes. At that point I restarted Cyrus, because I wanted to get the
> > system going again. Only later did I read that the actual problem is more
> > likely caused by a locked quota file. Unfortunately I did not further
> > examine PID 1974, which held the lock.
> >
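On Linux, /proc/locks makes the holder easy to pick out: the line without
the "->" prefix is the process that holds the lock, and the "->" lines are
the waiters. A small sketch using the inode from the lsof output above
(field positions are an assumption based on what current kernels print):

# Print the PID holding the POSIX lock on the given inode (the entry
# without "->") and show what that process is.
INODE=145326099
HOLDER=$(grep ":$INODE " /proc/locks | grep -v -e '->' | awk '{print $5}')
echo "lock on inode $INODE held by PID $HOLDER:"
ps -p "$HOLDER" -o pid,user,lstart,cmd
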
> > Anyway, this past Saturday it happened again when I was not around.
> > Ultimately the system shut itself down, because the Red Hat cluster
> > software wasn't able to update its raw device any longer. So when I
> > examined the case, all I could go on were log files. Still I was able to
> > piece it together, I think.
> >
> > It started here:
> >
> > May 14 09:04:59 lvr1 imapd[14301]: login:
> > vpn82-4.vpn.Uni-Koeln.DE[134.95.82.4] s0283928 plaintext+TLS
> > May 14 09:04:59 lvr1 imapd[14301]: seen_db: user s0283928 opened
> > /var/lib/imap/user/B/s0283928.seen
> > May 14 09:04:59 lvr1 imapd[14301]: open: user s0283928 opened Drafts
> >
> > As you can see, a VPN dial-up connection was used. I think it likely that
> > the notorious APPEND problem happened, i.e. the connection was dropped by
> > the client while the server was doing an APPEND. This is bug 1270
> > <https://bugzilla.andrew.cmu.edu/show_bug.cgi?id=1270> and is marked
> > FIXED. For completeness' sake: after the IMAP session above, the lmtpd
> > processes started getting stuck when delivering to this user:
> >
> > May 14 15:59:44 lvr1 lmtpd[3879]: duplicate_check:
> > <F7152D0EDA04D3119A210008C789EB1113302803 at pcaple01.vw.com.mx>
> > user.s0283928
> > 0
> > May 14 16:42:57 lvr1 lmtpd[26559]: duplicate_check:
> > <F7152D0EDA04D3119A210008C789EB1113302803 at pcaple01.vw.com.mx>
> > user.s0283928
> > 0
> > May 14 17:26:45 lvr1 lmtpd[28635]: duplicate_check:
> > <F7152D0EDA04D3119A210008C789EB1113302803 at pcaple01.vw.com.mx>
> > user.s0283928
> > 0
> > May 14 18:08:01 lvr1 lmtpd[32017]: duplicate_check:
> > <F7152D0EDA04D3119A210008C789EB1113302803 at pcaple01.vw.com.mx>
> > user.s0283928
> > 0
> > May 14 18:43:25 lvr1 lmtpd[6860]: duplicate_check:
> > <F7152D0EDA04D3119A210008C789EB1113302803 at pcaple01.vw.com.mx>
> > user.s0283928
> > 0
> > May 14 19:34:23 lvr1 lmtpd[28641]: duplicate_check:
> > <F7152D0EDA04D3119A210008C789EB1113302803 at pcaple01.vw.com.mx>
> > user.s0283928
> > 0
> > May 14 19:49:37 lvr1 lmtpd[26587]: duplicate_check:
> > <F7152D0EDA04D3119A210008C789EB1113302803 at pcaple01.vw.com.mx>
> > user.s0283928
> > 0
> > May 14 20:44:45 lvr1 lmtpd[2960]: duplicate_check:
> > <F7152D0EDA04D3119A210008C789EB1113302803 at pcaple01.vw.com.mx>
> > user.s0283928
> > 0
> > May 14 21:24:31 lvr1 lmtpd[16921]: duplicate_check:
> > <F7152D0EDA04D3119A210008C789EB1113302803 at pcaple01.vw.com.mx>
> > user.s0283928
> > 0
> > May 14 21:45:35 lvr1 lmtpd[3571]: duplicate_check:
> > <F7152D0EDA04D3119A210008C789EB1113302803 at pcaple01.vw.com.mx>
> > user.s0283928
> > 0
> > May 14 22:25:25 lvr1 lmtpd[31517]: duplicate_check:
> > <F7152D0EDA04D3119A210008C789EB1113302803 at pcaple01.vw.com.mx>
> > user.s0283928
> > 0
> > May 14 22:59:53 lvr1 lmtpd[26060]: duplicate_check:
> > <F7152D0EDA04D3119A210008C789EB1113302803 at pcaple01.vw.com.mx>
> > user.s0283928
> > 0
> > May 14 23:37:39 lvr1 lmtpd[4317]: duplicate_check:
> > <F7152D0EDA04D3119A210008C789EB1113302803 at pcaple01.vw.com.mx>
> > user.s0283928
> > 0
> > May 15 00:08:15 lvr1 lmtpd[930]: duplicate_check:
> > <F7152D0EDA04D3119A210008C789EB1113302803 at pcaple01.vw.com.mx>
> > user.s0283928
> >
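Those repeated duplicate_check lines for one and the same Message-ID are a
good fingerprint for this failure mode, since they show the same message
being retried against the same mailbox over and over. A quick way to spot
such loops in the logs (the log path is an assumption for a stock syslog
setup):

# Count duplicate_check entries per Message-ID; a large count for a single
# Message-ID usually means deliveries to that mailbox are stuck and the MTA
# keeps retrying.  /var/log/maillog is an assumed syslog destination.
grep 'duplicate_check:' /var/log/maillog \
  | sed 's/.*duplicate_check: //' \
  | awk '{print $1}' \
  | sort | uniq -c | sort -rn | head
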
> > So, is this really fixed in 2.2.3? Has anybody seen something like this
> > since? And would it be OK to apply the patch to 2.1.16? Or does it depend
> > on other stuff that happened in CVS between 2.1.16 and the time the patch
> > was created?
> >
> > Thanks, Sebastian Hagedorn
> > --
> > Sebastian Hagedorn M.A. - RZKR-R1 (Gebäude 52), Zimmer 18
> > Zentrum für angewandte Informatik - Universitätsweiter Service RRZK
> > Universität zu Köln / Cologne University - Tel. +49-221-478-5587
>
--
Igor
---
Cyrus Home Page: http://asg.web.cmu.edu/cyrus
Cyrus Wiki/FAQ: http://cyruswiki.andrew.cmu.edu
List Archives/Info: http://asg.web.cmu.edu/cyrus/mailing-list.html