IMAP/LMTP/Quota locking problem
Scott Adkins
adkinss at ohio.edu
Mon May 17 11:03:18 EDT 2004
I do know that as of 2.2.1, the problem still exists. I find it impossible
to trigger the bug on my own in our test environment, and we won't be doing
an upgrade to 2.2.3 (or whatever happens to be out) until the summer time
frame.
Currently, we manage the problem by watching the sendmail syslog file for
System I/O errors on the Cyrus mailer (grep for 'mailer=cyrus.*System.I.O')
and when we see some, we run another script that tells us which cyrus proc
has the write lock on their quota file and we kill that process.
The problem occurs regularly, with some users more reliably triggering the
problem than others.
For those who are curious, our Cyrus server runs on a cluster of 3 machines
running with Alpha Tru64 5.2.
Scott
--On Monday, May 17, 2004 2:47 PM +0200 Sebastian Hagedorn
<Hagedorn at uni-koeln.de> wrote:
> Hi,
>
> sorry, this is quite long ... I'm looking for the updated status of the
> problem discussed in the threads "stuck lmtpd processes":
>
> <http://asg.web.cmu.edu/archive/message.php?mailbox=archive.info-cyrus&ms
> g=24927>
>
> and "followup: stuck lmtpd processes":
>
> <http://asg.web.cmu.edu/archive/message.php?mailbox=archive.info-cyrus&se
> archterm=stuck%20lmtpd&msg=24934>
>
> AFAICT Cyrus IMAP 2.1.x hasn't been updated since. Could anyone comment
> on whether this is fixed in 2.2.3 and if we *have* to update in order to
> get that fix? In the last two weeks we've had this (or a very similar)
> problem twice with 2.1.16 und Red Hat AS 2.1.
>
> For some reason in both cases tons of sendmail processes were building
> up. We have configured sendmail to deliver in "interactive" mode, so each
> instance tries to connect to lmtpd. I would've expected those attempts to
> time out relatively quickly. I'm not sure why they don't. Anyway, that's
> somewhat off-topic ...
>
> The first time it happened I was around and collected some info while it
> was happening. I'd noticed that mail was no longer being delivered. We
> have maxprocs set to 30 for lmtpd:
>
> [root at lvr1 root]# ps -aef|grep lmtp
> cyrus 1383 2095 0 10:10 ? 00:00:18 lmtpd
> cyrus 1974 2095 0 10:11 ? 00:00:00 lmtpd
> cyrus 10631 2095 0 10:36 ? 00:00:00 lmtpd
> cyrus 17162 2095 0 10:52 ? 00:00:00 lmtpd
> cyrus 17262 2095 0 10:52 ? 00:00:06 lmtpd
> cyrus 25687 2095 0 11:17 ? 00:00:00 lmtpd
> cyrus 28377 2095 0 11:23 ? 00:00:09 lmtpd
> cyrus 31251 2095 0 11:32 ? 00:00:00 lmtpd
> cyrus 3824 2095 0 11:46 ? 00:00:07 lmtpd
> cyrus 7534 2095 0 11:56 ? 00:00:00 lmtpd
> cyrus 8248 2095 0 11:58 ? 00:00:08 lmtpd
> cyrus 22103 2095 0 12:35 ? 00:00:00 lmtpd
> cyrus 23725 2095 0 12:39 ? 00:00:01 lmtpd
> cyrus 25366 2095 0 12:45 ? 00:00:00 lmtpd
> cyrus 6108 2095 0 13:23 ? 00:00:00 lmtpd
> cyrus 10832 2095 0 13:37 ? 00:00:00 lmtpd
> cyrus 12972 2095 0 13:42 ? 00:00:00 lmtpd
> cyrus 16114 2095 0 13:53 ? 00:00:00 lmtpd
> cyrus 21377 2095 0 14:07 ? 00:00:09 lmtpd
> cyrus 24581 2095 0 14:16 ? 00:00:00 lmtpd
> cyrus 26200 2095 0 14:21 ? 00:00:14 lmtpd
> cyrus 26577 2095 0 14:22 ? 00:00:00 lmtpd
> cyrus 31893 2095 0 14:35 ? 00:00:03 lmtpd
> cyrus 32575 2095 0 14:36 ? 00:00:07 lmtpd
> cyrus 32659 2095 0 14:37 ? 00:00:01 lmtpd
> cyrus 14146 2095 0 15:16 ? 00:00:00 lmtpd
> cyrus 21049 2095 0 15:36 ? 00:00:00 lmtpd
> cyrus 26969 2095 0 15:56 ? 00:00:11 lmtpd
> cyrus 13275 2095 0 15:15 ? 00:00:04 lmtpd
> cyrus 17679 2095 0 13:57 ? 00:00:00 lmtpd
>
> Obviously something was wrong.
>
> [root at lvr1 root]# lsof -p 26969
> COMMAND PID USER FD TYPE DEVICE SIZE NODE NAME
> lmtpd 26969 cyrus cwd DIR 9,0 4096 393296 /root
> <snip>
> lmtpd 26969 cyrus mem REG 8,73 3841576 30
> /var/lib/imap/mailboxes.db
> lmtpd 26969 cyrus mem REG 8,49 194 145326099
> /var/spool/imap/M/user/therbst/cyrus.header
> <snip>
> lmtpd 26969 cyrus 9u REG 8,49 3130 56013510
> /var/spool/imap/stage./26969-1083938870
> lmtpd 26969 cyrus 10u unix 0xef1ce040 614472274 socket
> lmtpd 26969 cyrus 11u REG 8,49 194 145326099
> /var/spool/imap/M/user/therbst/cyrus.header
>
> I gathered that /var/spool/imap/M/user/therbst/cyrus.header was causing
> the problems.Its node number (is that the inode??) is 145326099:
>
> [root at lvr1 root]# grep 145326099 /proc/locks
> 1162: POSIX ADVISORY WRITE 1974 08:31:145326099 0 EOF e9b58d54 e9b587f4
> cd5a7964 00000000 d1c324c0
> 1162: -> POSIX ADVISORY WRITE 1383 08:31:145326099 0 EOF d1c324b4
> c7fd745c f3bd0964 e9b58d54 c7fd7464
> 1162: -> POSIX ADVISORY WRITE 10631 08:31:145326099 0 EOF c7fd7458
> d1c3245c d1c324b8 e9b58d54 d1c32464
> 1162: -> POSIX ADVISORY WRITE 17162 08:31:145326099 0 EOF d1c32458
> d6989400 c7fd745c e9b58d54 d6989408
> 1162: -> POSIX ADVISORY WRITE 17262 08:31:145326099 0 EOF d69893fc
> dfb2eca0 d1c3245c e9b58d54 dfb2eca8
> 1162: -> POSIX ADVISORY WRITE 25687 08:31:145326099 0 EOF dfb2ec9c
> f3972be8 d6989400 e9b58d54 f3972bf0
> 1162: -> POSIX ADVISORY WRITE 28377 08:31:145326099 0 EOF f3972be4
> c47d96e0 dfb2eca0 e9b58d54 c47d96e8
> 1162: -> POSIX ADVISORY WRITE 31251 08:31:145326099 0 EOF c47d96dc
> cd5a7d58 f3972be8 e9b58d54 cd5a7d60
> 1162: -> POSIX ADVISORY WRITE 7534 08:31:145326099 0 EOF cd5a7d54
> c47d98ac c47d96e0 e9b58d54 c47d98b4
> 1162: -> POSIX ADVISORY WRITE 3824 08:31:145326099 0 EOF c47d98a8
> c7fd7400 cd5a7d58 e9b58d54 c7fd7408
> 1162: -> POSIX ADVISORY WRITE 8248 08:31:145326099 0 EOF c7fd73fc
> c8733be8 c47d98ac e9b58d54 c8733bf0
> 1162: -> POSIX ADVISORY WRITE 22103 08:31:145326099 0 EOF c8733be4
> cd197f24 c7fd7400 e9b58d54 cd197f2c
> 1162: -> POSIX ADVISORY WRITE 25366 08:31:145326099 0 EOF cd197f20
> d1c32290 c8733be8 e9b58d54 d1c32298
> 1162: -> POSIX ADVISORY WRITE 23725 08:31:145326099 0 EOF d1c3228c
> c47d9964 cd197f24 e9b58d54 c47d996c
> 1162: -> POSIX ADVISORY WRITE 6108 08:31:145326099 0 EOF c47d9960
> f3bd0234 d1c32290 e9b58d54 f3bd023c
> 1162: -> POSIX ADVISORY WRITE 10832 08:31:145326099 0 EOF f3bd0230
> c47d9290 c47d9964 e9b58d54 c47d9298
> 1162: -> POSIX ADVISORY WRITE 12972 08:31:145326099 0 EOF c47d928c
> d1b450c4 f3bd0234 e9b58d54 d1b450cc
> 1162: -> POSIX ADVISORY WRITE 16114 08:31:145326099 0 EOF d1b450c0
> dfb2e120 c47d9290 e9b58d54 dfb2e128
> 1162: -> POSIX ADVISORY WRITE 26577 08:31:145326099 0 EOF dfb2e11c
> f6bfb7f4 d1b450c4 e9b58d54 f6bfb7fc
> 1162: -> POSIX ADVISORY WRITE 24581 08:31:145326099 0 EOF f6bfb7f0
> f6bfb3a4 dfb2e120 e9b58d54 f6bfb3ac
> 1162: -> POSIX ADVISORY WRITE 21377 08:31:145326099 0 EOF f6bfb3a0
> c87333a4 f6bfb7f4 e9b58d54 c87333ac
> 1162: -> POSIX ADVISORY WRITE 32575 08:31:145326099 0 EOF c87333a0
> d1c32628 f6bfb3a4 e9b58d54 d1c32630
> 1162: -> POSIX ADVISORY WRITE 32659 08:31:145326099 0 EOF d1c32624
> edb35e10 c87333a4 e9b58d54 edb35e18
> 1162: -> POSIX ADVISORY WRITE 31893 08:31:145326099 0 EOF edb35e0c
> edb35798 d1c32628 e9b58d54 edb357a0
> 1162: -> POSIX ADVISORY WRITE 14146 08:31:145326099 0 EOF edb35794
> cdf5ff24 edb35e10 e9b58d54 cdf5ff2c
> 1162: -> POSIX ADVISORY WRITE 21049 08:31:145326099 0 EOF cdf5ff20
> e8adbca0 edb35798 e9b58d54 e8adbca8
> 1162: -> POSIX ADVISORY WRITE 26200 08:31:145326099 0 EOF e8adbc9c
> cd5a7798 cdf5ff24 e9b58d54 cd5a77a0
> 1162: -> POSIX ADVISORY WRITE 26969 08:31:145326099 0 EOF cd5a7794
> d1695628 e8adbca0 e9b58d54 d1695630
> 1162: -> POSIX ADVISORY WRITE 13275 08:31:145326099 0 EOF d1695624
> d1b45e6c cd5a7798 e9b58d54 d1b45e74
> 1162: -> POSIX ADVISORY WRITE 17679 08:31:145326099 0 EOF d1b45e68
> cdf5f5cc d1695628 e9b58d54 e9b58d60
>
> The PIDs waiting for locks correspond to the PIDs of the stuck lmtpd
> processes. At that point I restarted Cyrus, because I wanted to get the
> system going again. Only later did I read that the actual problem is more
> likely caused by a locked quota file. Unfortunately I did not further
> examine PID 1974, which held the lock.
>
> Anyway, this past Saturday it happened again when I was not around.
> Ultimately the system shut itself down, because the Red Hat cluster
> software wasn't able to update its raw device any longer. So when I
> examined the case, all I could go on were log files. Still I was able to
> piece it together, I think.
>
> It started here:
>
> May 14 09:04:59 lvr1 imapd[14301]: login:
> vpn82-4.vpn.Uni-Koeln.DE[134.95.82.4] s0283928 plaintext+TLS
> May 14 09:04:59 lvr1 imapd[14301]: seen_db: user s0283928 opened
> /var/lib/imap/user/B/s0283928.seen
> May 14 09:04:59 lvr1 imapd[14301]: open: user s0283928 opened Drafts
>
> As you can see a VPN dial-up connection was used. I think it likely that
> the notorious APPEND problem happened. i.e. the connection was dropped by
> the client while the server was doing an APPEND. This is bug 1270
> <https://bugzilla.andrew.cmu.edu/show_bug.cgi?id=1270> and is marked
> FIXED. For completeness' sake: after the IMAP session above, hte lmtpd
> processes started getting stuck when delivering to this user:
>
> May 14 15:59:44 lvr1 lmtpd[3879]: duplicate_check:
> <F7152D0EDA04D3119A210008C789EB1113302803 at pcaple01.vw.com.mx>
> user.s0283928
> 0
> May 14 16:42:57 lvr1 lmtpd[26559]: duplicate_check:
> <F7152D0EDA04D3119A210008C789EB1113302803 at pcaple01.vw.com.mx>
> user.s0283928
> 0
> May 14 17:26:45 lvr1 lmtpd[28635]: duplicate_check:
> <F7152D0EDA04D3119A210008C789EB1113302803 at pcaple01.vw.com.mx>
> user.s0283928
> 0
> May 14 18:08:01 lvr1 lmtpd[32017]: duplicate_check:
> <F7152D0EDA04D3119A210008C789EB1113302803 at pcaple01.vw.com.mx>
> user.s0283928
> 0
> May 14 18:43:25 lvr1 lmtpd[6860]: duplicate_check:
> <F7152D0EDA04D3119A210008C789EB1113302803 at pcaple01.vw.com.mx>
> user.s0283928
> 0
> May 14 19:34:23 lvr1 lmtpd[28641]: duplicate_check:
> <F7152D0EDA04D3119A210008C789EB1113302803 at pcaple01.vw.com.mx>
> user.s0283928
> 0
> May 14 19:49:37 lvr1 lmtpd[26587]: duplicate_check:
> <F7152D0EDA04D3119A210008C789EB1113302803 at pcaple01.vw.com.mx>
> user.s0283928
> 0
> May 14 20:44:45 lvr1 lmtpd[2960]: duplicate_check:
> <F7152D0EDA04D3119A210008C789EB1113302803 at pcaple01.vw.com.mx>
> user.s0283928
> 0
> May 14 21:24:31 lvr1 lmtpd[16921]: duplicate_check:
> <F7152D0EDA04D3119A210008C789EB1113302803 at pcaple01.vw.com.mx>
> user.s0283928
> 0
> May 14 21:45:35 lvr1 lmtpd[3571]: duplicate_check:
> <F7152D0EDA04D3119A210008C789EB1113302803 at pcaple01.vw.com.mx>
> user.s0283928
> 0
> May 14 22:25:25 lvr1 lmtpd[31517]: duplicate_check:
> <F7152D0EDA04D3119A210008C789EB1113302803 at pcaple01.vw.com.mx>
> user.s0283928
> 0
> May 14 22:59:53 lvr1 lmtpd[26060]: duplicate_check:
> <F7152D0EDA04D3119A210008C789EB1113302803 at pcaple01.vw.com.mx>
> user.s0283928
> 0
> May 14 23:37:39 lvr1 lmtpd[4317]: duplicate_check:
> <F7152D0EDA04D3119A210008C789EB1113302803 at pcaple01.vw.com.mx>
> user.s0283928
> 0
> May 15 00:08:15 lvr1 lmtpd[930]: duplicate_check:
> <F7152D0EDA04D3119A210008C789EB1113302803 at pcaple01.vw.com.mx>
> user.s0283928
>
> So, is this really fixed in 2.2.3? Has anybody seen something like this
> since? And would it be OK to apply the patch to 2.1.16? Or does it depend
> on other stuff that happened in CVS between 2.1.16 and the time the patch
> was created?
>
> Thanks, Sebastian Hagedorn
> --
> Sebastian Hagedorn M.A. - RZKR-R1 (Gebäude 52), Zimmer 18
> Zentrum für angewandte Informatik - Universitätsweiter Service RRZK
> Universität zu Köln / Cologne University - Tel. +49-221-478-5587
--
+-----------------------------------------------------------------------+
Scott W. Adkins http://www.cns.ohiou.edu/~sadkins/
UNIX Systems Engineer mailto:adkinss at ohio.edu
ICQ 7626282 Work (740)593-9478 Fax (740)593-1944
+-----------------------------------------------------------------------+
PGP Public Key available at http://www.cns.ohiou.edu/~sadkins/pgp/
-------------- next part --------------
A non-text attachment was scrubbed...
Name: not available
Type: application/pgp-signature
Size: 231 bytes
Desc: not available
Url : https://lists.andrew.cmu.edu/mailman/private/info-cyrus/attachments/20040517/da3ccd9a/attachment.bin
More information about the Info-cyrus
mailing list