IMAP/LMTP/Quota locking problem

Igor Brezac igor at ipass.net
Mon May 17 17:32:14 EDT 2004



On Mon, 17 May 2004, Scott Adkins wrote:

> I do know that as of 2.2.1, the problem still exists.  I find it
> impossible to trigger the bug on my own in our test environment, and we
> won't be doing an upgrade to 2.2.3 (or whatever happens to be out) until
> the summer time frame.
>
> Currently, we manage the problem by watching the sendmail syslog file
> for System I/O errors on the Cyrus mailer (grep for
> 'mailer=cyrus.*System.I.O') and when we see some, we run another script
> that tells us which cyrus proc has the write lock on their quota file
> and we kill that process.
>
> The problem occurs regularly, with some users more reliably triggering
> the problem than others.
>
> For those who are curious, our Cyrus server runs on a cluster of 3
> machines running with Alpha Tru64 5.2.
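The workaround Scott describes (find the process holding the write lock on a file and kill it) can be sketched roughly as below. This is not Scott's actual script; the `find_lock_holder` helper name is made up here, the log path is illustrative, and the /proc/locks column layout assumed is the one from Linux (Tru64 would need a different lookup, e.g. via its own lock tools):

```shell
#!/bin/sh
# find_lock_holder: given an inode number, print the PID that holds a
# POSIX lock on that inode, reading /proc/locks-format text from stdin.
# Holder lines have the PID in field 5 and dev:inode in field 6;
# waiter lines (marked "->") are skipped.
find_lock_holder() {
  awk -v inode="$1" '$2 != "->" && $6 ~ (":" inode "$") { print $5; exit }'
}

# Illustrative workaround loop (paths and inode are examples only):
# if grep -q 'mailer=cyrus.*System.I.O' /var/log/maillog; then
#   holder=$(find_lock_holder 145326099 < /proc/locks)
#   [ -n "$holder" ] && kill "$holder"
# fi
```

The inode for the quota file would first be obtained with something like `ls -i`, then matched against /proc/locks.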

I have not seen this happen with lmtpd, but I do get an occasional error
of this kind when a mail client (Pine or OE) saves a message to the sent
folder via the IMAP protocol:

May 17 10:35:56 imaps[9711]: [ID 136705 local6.error] IOERROR: opening /export/cyrus/mail/M/user/igor/sent-mail/cyrus.header: No such file or directory

Unfortunately, I cannot reproduce this problem because it occurs very
infrequently.  This particular server runs Cyrus v2.2.2.  I will try CVS
HEAD.

-Igor

> Scott
>
> --On Monday, May 17, 2004 2:47 PM +0200 Sebastian Hagedorn
> <Hagedorn at uni-koeln.de> wrote:
>
> > Hi,
> >
> > sorry, this is quite long ... I'm looking for the updated status of
> > the problem discussed in the threads "stuck lmtpd processes":
> >
> > <http://asg.web.cmu.edu/archive/message.php?mailbox=archive.info-cyrus&msg=24927>
> >
> > and "followup: stuck lmtpd processes":
> >
> > <http://asg.web.cmu.edu/archive/message.php?mailbox=archive.info-cyrus&searchterm=stuck%20lmtpd&msg=24934>
> >
> > AFAICT Cyrus IMAP 2.1.x hasn't been updated since. Could anyone
> > comment on whether this is fixed in 2.2.3 and if we *have* to update
> > in order to get that fix? In the last two weeks we've had this (or a
> > very similar) problem twice with 2.1.16 and Red Hat AS 2.1.
> >
> > For some reason in both cases tons of sendmail processes were
> > building up. We have configured sendmail to deliver in "interactive"
> > mode, so each instance tries to connect to lmtpd. I would've expected
> > those attempts to time out relatively quickly. I'm not sure why they
> > don't. Anyway, that's somewhat off-topic ...
> >
> > The first time it happened I was around and collected some info while
> > it was happening. I'd noticed that mail was no longer being
> > delivered. We have maxprocs set to 30 for lmtpd:
> >
> > [root at lvr1 root]# ps -aef|grep lmtp
> > cyrus     1383  2095  0 10:10 ?        00:00:18 lmtpd
> > cyrus     1974  2095  0 10:11 ?        00:00:00 lmtpd
> > cyrus    10631  2095  0 10:36 ?        00:00:00 lmtpd
> > cyrus    17162  2095  0 10:52 ?        00:00:00 lmtpd
> > cyrus    17262  2095  0 10:52 ?        00:00:06 lmtpd
> > cyrus    25687  2095  0 11:17 ?        00:00:00 lmtpd
> > cyrus    28377  2095  0 11:23 ?        00:00:09 lmtpd
> > cyrus    31251  2095  0 11:32 ?        00:00:00 lmtpd
> > cyrus     3824  2095  0 11:46 ?        00:00:07 lmtpd
> > cyrus     7534  2095  0 11:56 ?        00:00:00 lmtpd
> > cyrus     8248  2095  0 11:58 ?        00:00:08 lmtpd
> > cyrus    22103  2095  0 12:35 ?        00:00:00 lmtpd
> > cyrus    23725  2095  0 12:39 ?        00:00:01 lmtpd
> > cyrus    25366  2095  0 12:45 ?        00:00:00 lmtpd
> > cyrus     6108  2095  0 13:23 ?        00:00:00 lmtpd
> > cyrus    10832  2095  0 13:37 ?        00:00:00 lmtpd
> > cyrus    12972  2095  0 13:42 ?        00:00:00 lmtpd
> > cyrus    16114  2095  0 13:53 ?        00:00:00 lmtpd
> > cyrus    21377  2095  0 14:07 ?        00:00:09 lmtpd
> > cyrus    24581  2095  0 14:16 ?        00:00:00 lmtpd
> > cyrus    26200  2095  0 14:21 ?        00:00:14 lmtpd
> > cyrus    26577  2095  0 14:22 ?        00:00:00 lmtpd
> > cyrus    31893  2095  0 14:35 ?        00:00:03 lmtpd
> > cyrus    32575  2095  0 14:36 ?        00:00:07 lmtpd
> > cyrus    32659  2095  0 14:37 ?        00:00:01 lmtpd
> > cyrus    14146  2095  0 15:16 ?        00:00:00 lmtpd
> > cyrus    21049  2095  0 15:36 ?        00:00:00 lmtpd
> > cyrus    26969  2095  0 15:56 ?        00:00:11 lmtpd
> > cyrus    13275  2095  0 15:15 ?        00:00:04 lmtpd
> > cyrus    17679  2095  0 13:57 ?        00:00:00 lmtpd
> >
> > Obviously something was wrong.
> >
> > [root at lvr1 root]# lsof -p 26969
> > COMMAND   PID  USER   FD   TYPE     DEVICE     SIZE      NODE NAME
> > lmtpd   26969 cyrus  cwd    DIR        9,0     4096    393296 /root
> > <snip>
> > lmtpd   26969 cyrus  mem    REG       8,73  3841576        30
> > /var/lib/imap/mailboxes.db
> > lmtpd   26969 cyrus  mem    REG       8,49      194 145326099
> > /var/spool/imap/M/user/therbst/cyrus.header
> > <snip>
> > lmtpd   26969 cyrus    9u   REG       8,49     3130  56013510
> > /var/spool/imap/stage./26969-1083938870
> > lmtpd   26969 cyrus   10u  unix 0xef1ce040          614472274 socket
> > lmtpd   26969 cyrus   11u   REG       8,49      194 145326099
> > /var/spool/imap/M/user/therbst/cyrus.header
> >
> > I gathered that /var/spool/imap/M/user/therbst/cyrus.header was
> > causing the problems. Its node number (is that the inode??) is
> > 145326099:
> >
> > [root at lvr1 root]# grep 145326099 /proc/locks
> > 1162: POSIX  ADVISORY  WRITE 1974 08:31:145326099 0 EOF e9b58d54 e9b587f4
> > cd5a7964 00000000 d1c324c0
> > 1162: -> POSIX  ADVISORY  WRITE 1383 08:31:145326099 0 EOF d1c324b4
> > c7fd745c f3bd0964 e9b58d54 c7fd7464
> > 1162: -> POSIX  ADVISORY  WRITE 10631 08:31:145326099 0 EOF c7fd7458
> > d1c3245c d1c324b8 e9b58d54 d1c32464
> > 1162: -> POSIX  ADVISORY  WRITE 17162 08:31:145326099 0 EOF d1c32458
> > d6989400 c7fd745c e9b58d54 d6989408
> > 1162: -> POSIX  ADVISORY  WRITE 17262 08:31:145326099 0 EOF d69893fc
> > dfb2eca0 d1c3245c e9b58d54 dfb2eca8
> > 1162: -> POSIX  ADVISORY  WRITE 25687 08:31:145326099 0 EOF dfb2ec9c
> > f3972be8 d6989400 e9b58d54 f3972bf0
> > 1162: -> POSIX  ADVISORY  WRITE 28377 08:31:145326099 0 EOF f3972be4
> > c47d96e0 dfb2eca0 e9b58d54 c47d96e8
> > 1162: -> POSIX  ADVISORY  WRITE 31251 08:31:145326099 0 EOF c47d96dc
> > cd5a7d58 f3972be8 e9b58d54 cd5a7d60
> > 1162: -> POSIX  ADVISORY  WRITE 7534 08:31:145326099 0 EOF cd5a7d54
> > c47d98ac c47d96e0 e9b58d54 c47d98b4
> > 1162: -> POSIX  ADVISORY  WRITE 3824 08:31:145326099 0 EOF c47d98a8
> > c7fd7400 cd5a7d58 e9b58d54 c7fd7408
> > 1162: -> POSIX  ADVISORY  WRITE 8248 08:31:145326099 0 EOF c7fd73fc
> > c8733be8 c47d98ac e9b58d54 c8733bf0
> > 1162: -> POSIX  ADVISORY  WRITE 22103 08:31:145326099 0 EOF c8733be4
> > cd197f24 c7fd7400 e9b58d54 cd197f2c
> > 1162: -> POSIX  ADVISORY  WRITE 25366 08:31:145326099 0 EOF cd197f20
> > d1c32290 c8733be8 e9b58d54 d1c32298
> > 1162: -> POSIX  ADVISORY  WRITE 23725 08:31:145326099 0 EOF d1c3228c
> > c47d9964 cd197f24 e9b58d54 c47d996c
> > 1162: -> POSIX  ADVISORY  WRITE 6108 08:31:145326099 0 EOF c47d9960
> > f3bd0234 d1c32290 e9b58d54 f3bd023c
> > 1162: -> POSIX  ADVISORY  WRITE 10832 08:31:145326099 0 EOF f3bd0230
> > c47d9290 c47d9964 e9b58d54 c47d9298
> > 1162: -> POSIX  ADVISORY  WRITE 12972 08:31:145326099 0 EOF c47d928c
> > d1b450c4 f3bd0234 e9b58d54 d1b450cc
> > 1162: -> POSIX  ADVISORY  WRITE 16114 08:31:145326099 0 EOF d1b450c0
> > dfb2e120 c47d9290 e9b58d54 dfb2e128
> > 1162: -> POSIX  ADVISORY  WRITE 26577 08:31:145326099 0 EOF dfb2e11c
> > f6bfb7f4 d1b450c4 e9b58d54 f6bfb7fc
> > 1162: -> POSIX  ADVISORY  WRITE 24581 08:31:145326099 0 EOF f6bfb7f0
> > f6bfb3a4 dfb2e120 e9b58d54 f6bfb3ac
> > 1162: -> POSIX  ADVISORY  WRITE 21377 08:31:145326099 0 EOF f6bfb3a0
> > c87333a4 f6bfb7f4 e9b58d54 c87333ac
> > 1162: -> POSIX  ADVISORY  WRITE 32575 08:31:145326099 0 EOF c87333a0
> > d1c32628 f6bfb3a4 e9b58d54 d1c32630
> > 1162: -> POSIX  ADVISORY  WRITE 32659 08:31:145326099 0 EOF d1c32624
> > edb35e10 c87333a4 e9b58d54 edb35e18
> > 1162: -> POSIX  ADVISORY  WRITE 31893 08:31:145326099 0 EOF edb35e0c
> > edb35798 d1c32628 e9b58d54 edb357a0
> > 1162: -> POSIX  ADVISORY  WRITE 14146 08:31:145326099 0 EOF edb35794
> > cdf5ff24 edb35e10 e9b58d54 cdf5ff2c
> > 1162: -> POSIX  ADVISORY  WRITE 21049 08:31:145326099 0 EOF cdf5ff20
> > e8adbca0 edb35798 e9b58d54 e8adbca8
> > 1162: -> POSIX  ADVISORY  WRITE 26200 08:31:145326099 0 EOF e8adbc9c
> > cd5a7798 cdf5ff24 e9b58d54 cd5a77a0
> > 1162: -> POSIX  ADVISORY  WRITE 26969 08:31:145326099 0 EOF cd5a7794
> > d1695628 e8adbca0 e9b58d54 d1695630
> > 1162: -> POSIX  ADVISORY  WRITE 13275 08:31:145326099 0 EOF d1695624
> > d1b45e6c cd5a7798 e9b58d54 d1b45e74
> > 1162: -> POSIX  ADVISORY  WRITE 17679 08:31:145326099 0 EOF d1b45e68
> > cdf5f5cc d1695628 e9b58d54 e9b58d60
> >
> > The PIDs waiting for locks correspond to the PIDs of the stuck lmtpd
> > processes. At that point I restarted Cyrus, because I wanted to get
> > the system going again. Only later did I read that the actual problem
> > is more likely caused by a locked quota file. Unfortunately I did not
> > further examine PID 1974, which held the lock.
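The mapping Sebastian did by eye, from the `/proc/locks` dump above to the set of blocked lmtpd PIDs, can be automated. A minimal sketch, assuming the Linux /proc/locks layout shown above, where waiter lines are prefixed with `->` (which shifts the PID to field 6 and dev:inode to field 7); the `list_waiters` helper name is hypothetical:

```shell
#!/bin/sh
# list_waiters: print the PIDs of processes blocked waiting for a lock
# on the given inode, reading /proc/locks-format text from stdin.
# In Linux's listing, "-> " marks an entry queued behind the holder.
list_waiters() {
  awk -v inode="$1" '$2 == "->" && $7 ~ (":" inode "$") { print $6 }'
}

# Typical use (illustrative):
#   list_waiters 145326099 < /proc/locks
```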
> >
> > Anyway, this past Saturday it happened again when I was not around.
> > Ultimately the system shut itself down, because the Red Hat cluster
> > software wasn't able to update its raw device any longer. So when I
> > examined the case, all I could go on were log files. Still I was able
> > to piece it together, I think.
> >
> > It started here:
> >
> > May 14 09:04:59 lvr1 imapd[14301]: login:
> > vpn82-4.vpn.Uni-Koeln.DE[134.95.82.4] s0283928 plaintext+TLS
> > May 14 09:04:59 lvr1 imapd[14301]: seen_db: user s0283928 opened
> > /var/lib/imap/user/B/s0283928.seen
> > May 14 09:04:59 lvr1 imapd[14301]: open: user s0283928 opened Drafts
> >
> > As you can see, a VPN dial-up connection was used. I think it likely
> > that the notorious APPEND problem happened, i.e. the connection was
> > dropped by the client while the server was doing an APPEND. This is
> > bug 1270 <https://bugzilla.andrew.cmu.edu/show_bug.cgi?id=1270> and
> > is marked FIXED. For completeness' sake: after the IMAP session
> > above, the lmtpd processes started getting stuck when delivering to
> > this user:
> >
> > May 14 15:59:44 lvr1 lmtpd[3879]: duplicate_check:
> > <F7152D0EDA04D3119A210008C789EB1113302803 at pcaple01.vw.com.mx>
> > user.s0283928
> >        0
> > May 14 16:42:57 lvr1 lmtpd[26559]: duplicate_check:
> > <F7152D0EDA04D3119A210008C789EB1113302803 at pcaple01.vw.com.mx>
> > user.s0283928
> >         0
> > May 14 17:26:45 lvr1 lmtpd[28635]: duplicate_check:
> > <F7152D0EDA04D3119A210008C789EB1113302803 at pcaple01.vw.com.mx>
> > user.s0283928
> >         0
> > May 14 18:08:01 lvr1 lmtpd[32017]: duplicate_check:
> > <F7152D0EDA04D3119A210008C789EB1113302803 at pcaple01.vw.com.mx>
> > user.s0283928
> >         0
> > May 14 18:43:25 lvr1 lmtpd[6860]: duplicate_check:
> > <F7152D0EDA04D3119A210008C789EB1113302803 at pcaple01.vw.com.mx>
> > user.s0283928
> >        0
> > May 14 19:34:23 lvr1 lmtpd[28641]: duplicate_check:
> > <F7152D0EDA04D3119A210008C789EB1113302803 at pcaple01.vw.com.mx>
> > user.s0283928
> >         0
> > May 14 19:49:37 lvr1 lmtpd[26587]: duplicate_check:
> > <F7152D0EDA04D3119A210008C789EB1113302803 at pcaple01.vw.com.mx>
> > user.s0283928
> >         0
> > May 14 20:44:45 lvr1 lmtpd[2960]: duplicate_check:
> > <F7152D0EDA04D3119A210008C789EB1113302803 at pcaple01.vw.com.mx>
> > user.s0283928
> >        0
> > May 14 21:24:31 lvr1 lmtpd[16921]: duplicate_check:
> > <F7152D0EDA04D3119A210008C789EB1113302803 at pcaple01.vw.com.mx>
> > user.s0283928
> >         0
> > May 14 21:45:35 lvr1 lmtpd[3571]: duplicate_check:
> > <F7152D0EDA04D3119A210008C789EB1113302803 at pcaple01.vw.com.mx>
> > user.s0283928
> >        0
> > May 14 22:25:25 lvr1 lmtpd[31517]: duplicate_check:
> > <F7152D0EDA04D3119A210008C789EB1113302803 at pcaple01.vw.com.mx>
> > user.s0283928
> >         0
> > May 14 22:59:53 lvr1 lmtpd[26060]: duplicate_check:
> > <F7152D0EDA04D3119A210008C789EB1113302803 at pcaple01.vw.com.mx>
> > user.s0283928
> >         0
> > May 14 23:37:39 lvr1 lmtpd[4317]: duplicate_check:
> > <F7152D0EDA04D3119A210008C789EB1113302803 at pcaple01.vw.com.mx>
> > user.s0283928
> >        0
> > May 15 00:08:15 lvr1 lmtpd[930]: duplicate_check:
> > <F7152D0EDA04D3119A210008C789EB1113302803 at pcaple01.vw.com.mx>
> > user.s0283928
> >
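The pattern in these logs, each new lmtpd blocking behind a lock a dead process never released, can be reproduced in miniature. A sketch using util-linux `flock(1)`; note this is an assumption for illustration, since Cyrus takes fcntl-style POSIX record locks through its own locking code, not `flock`, though the blocking semantics are analogous:

```shell
#!/bin/sh
# One client takes an exclusive advisory lock and holds it; a second
# client's non-blocking attempt then fails, just as each lmtpd stalled
# behind the lock held by the stuck process.
lock=/tmp/lock-demo.$$

# First client: take an exclusive lock and hold it for 2 seconds.
flock -x "$lock" -c 'sleep 2' &
holder=$!
sleep 1

# Second client: non-blocking attempt while the lock is still held.
if flock -n "$lock" -c 'true'; then
  result=acquired
else
  result=blocked
fi
echo "$result"

wait "$holder"
rm -f "$lock"
```

With `flock -n` the second client fails fast; the stuck lmtpds instead used blocking lock calls, so they queued up indefinitely.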
> > So, is this really fixed in 2.2.3? Has anybody seen something like
> > this since? And would it be OK to apply the patch to 2.1.16? Or does
> > it depend on other stuff that happened in CVS between 2.1.16 and the
> > time the patch was created?
> >
> > Thanks, Sebastian Hagedorn
> > --
> > Sebastian Hagedorn M.A. - RZKR-R1 (Gebäude 52), Zimmer 18
> > Zentrum für angewandte Informatik - Universitätsweiter Service RRZK
> > Universität zu Köln / Cologne University - Tel. +49-221-478-5587
>
>
>
>

-- 
Igor

---
Cyrus Home Page: http://asg.web.cmu.edu/cyrus
Cyrus Wiki/FAQ: http://cyruswiki.andrew.cmu.edu
List Archives/Info: http://asg.web.cmu.edu/cyrus/mailing-list.html
