IMAP/LMTP/Quota locking problem

Scott Adkins adkinss at ohio.edu
Mon May 17 11:03:18 EDT 2004


I do know that as of 2.2.1, the problem still exists.  I find it impossible
to trigger the bug on my own in our test environment, and we won't be doing
an upgrade to 2.2.3 (or whatever happens to be out) until the summer time
frame.

Currently, we manage the problem by watching the sendmail syslog file for
System I/O errors on the Cyrus mailer (grep for 'mailer=cyrus.*System.I.O')
and when we see some, we run another script that tells us which cyrus proc
has the write lock on their quota file and we kill that process.

The problem occurs regularly, with some users more reliably triggering the
problem than others.

For those who are curious, our Cyrus server runs on a cluster of 3 machines
running with Alpha Tru64 5.2.

Scott

--On Monday, May 17, 2004 2:47 PM +0200 Sebastian Hagedorn 
<Hagedorn at uni-koeln.de> wrote:

> Hi,
>
> sorry, this is quite long ... I'm looking for the updated status of the
> problem discussed in the threads "stuck lmtpd processes":
>
> <http://asg.web.cmu.edu/archive/message.php?mailbox=archive.info-cyrus&ms
> g=24927>
>
> and "followup: stuck lmtpd processes":
>
> <http://asg.web.cmu.edu/archive/message.php?mailbox=archive.info-cyrus&se
> archterm=stuck%20lmtpd&msg=24934>
>
> AFAICT Cyrus IMAP 2.1.x hasn't been updated since. Could anyone comment
> on whether this is fixed in 2.2.3 and if we *have* to update in order to
> get that fix? In the last two weeks we've had this (or a very similar)
> problem twice with 2.1.16 und Red Hat AS 2.1.
>
> For some reason in both cases tons of sendmail processes were building
> up. We have configured sendmail to deliver in "interactive" mode, so each
> instance tries to connect to lmtpd. I would've expected those attempts to
> time out relatively quickly. I'm not sure why they don't. Anyway, that's
> somewhat off-topic ...
>
> The first time it happened I was around and collected some info while it
> was happening. I'd noticed that mail was no longer being delivered. We
> have maxprocs set to 30 for lmtpd:
>
> [root at lvr1 root]# ps -aef|grep lmtp
> cyrus     1383  2095  0 10:10 ?        00:00:18 lmtpd
> cyrus     1974  2095  0 10:11 ?        00:00:00 lmtpd
> cyrus    10631  2095  0 10:36 ?        00:00:00 lmtpd
> cyrus    17162  2095  0 10:52 ?        00:00:00 lmtpd
> cyrus    17262  2095  0 10:52 ?        00:00:06 lmtpd
> cyrus    25687  2095  0 11:17 ?        00:00:00 lmtpd
> cyrus    28377  2095  0 11:23 ?        00:00:09 lmtpd
> cyrus    31251  2095  0 11:32 ?        00:00:00 lmtpd
> cyrus     3824  2095  0 11:46 ?        00:00:07 lmtpd
> cyrus     7534  2095  0 11:56 ?        00:00:00 lmtpd
> cyrus     8248  2095  0 11:58 ?        00:00:08 lmtpd
> cyrus    22103  2095  0 12:35 ?        00:00:00 lmtpd
> cyrus    23725  2095  0 12:39 ?        00:00:01 lmtpd
> cyrus    25366  2095  0 12:45 ?        00:00:00 lmtpd
> cyrus     6108  2095  0 13:23 ?        00:00:00 lmtpd
> cyrus    10832  2095  0 13:37 ?        00:00:00 lmtpd
> cyrus    12972  2095  0 13:42 ?        00:00:00 lmtpd
> cyrus    16114  2095  0 13:53 ?        00:00:00 lmtpd
> cyrus    21377  2095  0 14:07 ?        00:00:09 lmtpd
> cyrus    24581  2095  0 14:16 ?        00:00:00 lmtpd
> cyrus    26200  2095  0 14:21 ?        00:00:14 lmtpd
> cyrus    26577  2095  0 14:22 ?        00:00:00 lmtpd
> cyrus    31893  2095  0 14:35 ?        00:00:03 lmtpd
> cyrus    32575  2095  0 14:36 ?        00:00:07 lmtpd
> cyrus    32659  2095  0 14:37 ?        00:00:01 lmtpd
> cyrus    14146  2095  0 15:16 ?        00:00:00 lmtpd
> cyrus    21049  2095  0 15:36 ?        00:00:00 lmtpd
> cyrus    26969  2095  0 15:56 ?        00:00:11 lmtpd
> cyrus    13275  2095  0 15:15 ?        00:00:04 lmtpd
> cyrus    17679  2095  0 13:57 ?        00:00:00 lmtpd
>
> Obviously something was wrong.
>
> [root at lvr1 root]# lsof -p 26969
> COMMAND   PID  USER   FD   TYPE     DEVICE     SIZE      NODE NAME
> lmtpd   26969 cyrus  cwd    DIR        9,0     4096    393296 /root
> <snip>
> lmtpd   26969 cyrus  mem    REG       8,73  3841576        30
> /var/lib/imap/mailboxes.db
> lmtpd   26969 cyrus  mem    REG       8,49      194 145326099
> /var/spool/imap/M/user/therbst/cyrus.header
> <snip>
> lmtpd   26969 cyrus    9u   REG       8,49     3130  56013510
> /var/spool/imap/stage./26969-1083938870
> lmtpd   26969 cyrus   10u  unix 0xef1ce040          614472274 socket
> lmtpd   26969 cyrus   11u   REG       8,49      194 145326099
> /var/spool/imap/M/user/therbst/cyrus.header
>
> I gathered that /var/spool/imap/M/user/therbst/cyrus.header was causing
> the problems.Its node number (is that the inode??) is 145326099:
>
> [root at lvr1 root]# grep 145326099 /proc/locks
> 1162: POSIX  ADVISORY  WRITE 1974 08:31:145326099 0 EOF e9b58d54 e9b587f4
> cd5a7964 00000000 d1c324c0
> 1162: -> POSIX  ADVISORY  WRITE 1383 08:31:145326099 0 EOF d1c324b4
> c7fd745c f3bd0964 e9b58d54 c7fd7464
> 1162: -> POSIX  ADVISORY  WRITE 10631 08:31:145326099 0 EOF c7fd7458
> d1c3245c d1c324b8 e9b58d54 d1c32464
> 1162: -> POSIX  ADVISORY  WRITE 17162 08:31:145326099 0 EOF d1c32458
> d6989400 c7fd745c e9b58d54 d6989408
> 1162: -> POSIX  ADVISORY  WRITE 17262 08:31:145326099 0 EOF d69893fc
> dfb2eca0 d1c3245c e9b58d54 dfb2eca8
> 1162: -> POSIX  ADVISORY  WRITE 25687 08:31:145326099 0 EOF dfb2ec9c
> f3972be8 d6989400 e9b58d54 f3972bf0
> 1162: -> POSIX  ADVISORY  WRITE 28377 08:31:145326099 0 EOF f3972be4
> c47d96e0 dfb2eca0 e9b58d54 c47d96e8
> 1162: -> POSIX  ADVISORY  WRITE 31251 08:31:145326099 0 EOF c47d96dc
> cd5a7d58 f3972be8 e9b58d54 cd5a7d60
> 1162: -> POSIX  ADVISORY  WRITE 7534 08:31:145326099 0 EOF cd5a7d54
> c47d98ac c47d96e0 e9b58d54 c47d98b4
> 1162: -> POSIX  ADVISORY  WRITE 3824 08:31:145326099 0 EOF c47d98a8
> c7fd7400 cd5a7d58 e9b58d54 c7fd7408
> 1162: -> POSIX  ADVISORY  WRITE 8248 08:31:145326099 0 EOF c7fd73fc
> c8733be8 c47d98ac e9b58d54 c8733bf0
> 1162: -> POSIX  ADVISORY  WRITE 22103 08:31:145326099 0 EOF c8733be4
> cd197f24 c7fd7400 e9b58d54 cd197f2c
> 1162: -> POSIX  ADVISORY  WRITE 25366 08:31:145326099 0 EOF cd197f20
> d1c32290 c8733be8 e9b58d54 d1c32298
> 1162: -> POSIX  ADVISORY  WRITE 23725 08:31:145326099 0 EOF d1c3228c
> c47d9964 cd197f24 e9b58d54 c47d996c
> 1162: -> POSIX  ADVISORY  WRITE 6108 08:31:145326099 0 EOF c47d9960
> f3bd0234 d1c32290 e9b58d54 f3bd023c
> 1162: -> POSIX  ADVISORY  WRITE 10832 08:31:145326099 0 EOF f3bd0230
> c47d9290 c47d9964 e9b58d54 c47d9298
> 1162: -> POSIX  ADVISORY  WRITE 12972 08:31:145326099 0 EOF c47d928c
> d1b450c4 f3bd0234 e9b58d54 d1b450cc
> 1162: -> POSIX  ADVISORY  WRITE 16114 08:31:145326099 0 EOF d1b450c0
> dfb2e120 c47d9290 e9b58d54 dfb2e128
> 1162: -> POSIX  ADVISORY  WRITE 26577 08:31:145326099 0 EOF dfb2e11c
> f6bfb7f4 d1b450c4 e9b58d54 f6bfb7fc
> 1162: -> POSIX  ADVISORY  WRITE 24581 08:31:145326099 0 EOF f6bfb7f0
> f6bfb3a4 dfb2e120 e9b58d54 f6bfb3ac
> 1162: -> POSIX  ADVISORY  WRITE 21377 08:31:145326099 0 EOF f6bfb3a0
> c87333a4 f6bfb7f4 e9b58d54 c87333ac
> 1162: -> POSIX  ADVISORY  WRITE 32575 08:31:145326099 0 EOF c87333a0
> d1c32628 f6bfb3a4 e9b58d54 d1c32630
> 1162: -> POSIX  ADVISORY  WRITE 32659 08:31:145326099 0 EOF d1c32624
> edb35e10 c87333a4 e9b58d54 edb35e18
> 1162: -> POSIX  ADVISORY  WRITE 31893 08:31:145326099 0 EOF edb35e0c
> edb35798 d1c32628 e9b58d54 edb357a0
> 1162: -> POSIX  ADVISORY  WRITE 14146 08:31:145326099 0 EOF edb35794
> cdf5ff24 edb35e10 e9b58d54 cdf5ff2c
> 1162: -> POSIX  ADVISORY  WRITE 21049 08:31:145326099 0 EOF cdf5ff20
> e8adbca0 edb35798 e9b58d54 e8adbca8
> 1162: -> POSIX  ADVISORY  WRITE 26200 08:31:145326099 0 EOF e8adbc9c
> cd5a7798 cdf5ff24 e9b58d54 cd5a77a0
> 1162: -> POSIX  ADVISORY  WRITE 26969 08:31:145326099 0 EOF cd5a7794
> d1695628 e8adbca0 e9b58d54 d1695630
> 1162: -> POSIX  ADVISORY  WRITE 13275 08:31:145326099 0 EOF d1695624
> d1b45e6c cd5a7798 e9b58d54 d1b45e74
> 1162: -> POSIX  ADVISORY  WRITE 17679 08:31:145326099 0 EOF d1b45e68
> cdf5f5cc d1695628 e9b58d54 e9b58d60
>
> The PIDs waiting for locks correspond to the PIDs of the stuck lmtpd
> processes. At that point I restarted Cyrus, because I wanted to get the
> system going again. Only later did I read that the actual problem is more
> likely caused by a locked quota file. Unfortunately I did not further
> examine PID 1974, which held the lock.
>
> Anyway, this past Saturday it happened again when I was not around.
> Ultimately the system shut itself down, because the Red Hat cluster
> software wasn't able to update its raw device any longer. So when I
> examined the case, all I could go on were log files. Still I was able to
> piece it together, I think.
>
> It started here:
>
> May 14 09:04:59 lvr1 imapd[14301]: login:
> vpn82-4.vpn.Uni-Koeln.DE[134.95.82.4] s0283928 plaintext+TLS
> May 14 09:04:59 lvr1 imapd[14301]: seen_db: user s0283928 opened
> /var/lib/imap/user/B/s0283928.seen
> May 14 09:04:59 lvr1 imapd[14301]: open: user s0283928 opened Drafts
>
> As you can see a VPN dial-up connection was used. I think it likely that
> the notorious APPEND problem happened. i.e. the connection was dropped by
> the client while the server was doing an APPEND. This is bug 1270
> <https://bugzilla.andrew.cmu.edu/show_bug.cgi?id=1270> and is marked
> FIXED. For completeness' sake: after the IMAP session above, hte lmtpd
> processes started getting stuck when delivering to this user:
>
> May 14 15:59:44 lvr1 lmtpd[3879]: duplicate_check:
> <F7152D0EDA04D3119A210008C789EB1113302803 at pcaple01.vw.com.mx>
> user.s0283928
>        0
> May 14 16:42:57 lvr1 lmtpd[26559]: duplicate_check:
> <F7152D0EDA04D3119A210008C789EB1113302803 at pcaple01.vw.com.mx>
> user.s0283928
>         0
> May 14 17:26:45 lvr1 lmtpd[28635]: duplicate_check:
> <F7152D0EDA04D3119A210008C789EB1113302803 at pcaple01.vw.com.mx>
> user.s0283928
>         0
> May 14 18:08:01 lvr1 lmtpd[32017]: duplicate_check:
> <F7152D0EDA04D3119A210008C789EB1113302803 at pcaple01.vw.com.mx>
> user.s0283928
>         0
> May 14 18:43:25 lvr1 lmtpd[6860]: duplicate_check:
> <F7152D0EDA04D3119A210008C789EB1113302803 at pcaple01.vw.com.mx>
> user.s0283928
>        0
> May 14 19:34:23 lvr1 lmtpd[28641]: duplicate_check:
> <F7152D0EDA04D3119A210008C789EB1113302803 at pcaple01.vw.com.mx>
> user.s0283928
>         0
> May 14 19:49:37 lvr1 lmtpd[26587]: duplicate_check:
> <F7152D0EDA04D3119A210008C789EB1113302803 at pcaple01.vw.com.mx>
> user.s0283928
>         0
> May 14 20:44:45 lvr1 lmtpd[2960]: duplicate_check:
> <F7152D0EDA04D3119A210008C789EB1113302803 at pcaple01.vw.com.mx>
> user.s0283928
>        0
> May 14 21:24:31 lvr1 lmtpd[16921]: duplicate_check:
> <F7152D0EDA04D3119A210008C789EB1113302803 at pcaple01.vw.com.mx>
> user.s0283928
>         0
> May 14 21:45:35 lvr1 lmtpd[3571]: duplicate_check:
> <F7152D0EDA04D3119A210008C789EB1113302803 at pcaple01.vw.com.mx>
> user.s0283928
>        0
> May 14 22:25:25 lvr1 lmtpd[31517]: duplicate_check:
> <F7152D0EDA04D3119A210008C789EB1113302803 at pcaple01.vw.com.mx>
> user.s0283928
>         0
> May 14 22:59:53 lvr1 lmtpd[26060]: duplicate_check:
> <F7152D0EDA04D3119A210008C789EB1113302803 at pcaple01.vw.com.mx>
> user.s0283928
>         0
> May 14 23:37:39 lvr1 lmtpd[4317]: duplicate_check:
> <F7152D0EDA04D3119A210008C789EB1113302803 at pcaple01.vw.com.mx>
> user.s0283928
>        0
> May 15 00:08:15 lvr1 lmtpd[930]: duplicate_check:
> <F7152D0EDA04D3119A210008C789EB1113302803 at pcaple01.vw.com.mx>
> user.s0283928
>
> So, is this really fixed in 2.2.3? Has anybody seen something like this
> since? And would it be OK to apply the patch to 2.1.16? Or does it depend
> on other stuff that happened in CVS between 2.1.16 and the time the patch
> was created?
>
> Thanks, Sebastian Hagedorn
> --
> Sebastian Hagedorn M.A. - RZKR-R1 (Gebäude 52), Zimmer 18
> Zentrum für angewandte Informatik - Universitätsweiter Service RRZK
> Universität zu Köln / Cologne University - Tel. +49-221-478-5587



-- 
 +-----------------------------------------------------------------------+
      Scott W. Adkins                http://www.cns.ohiou.edu/~sadkins/
   UNIX Systems Engineer                  mailto:adkinss at ohio.edu
        ICQ 7626282                 Work (740)593-9478 Fax (740)593-1944
 +-----------------------------------------------------------------------+
     PGP Public Key available at http://www.cns.ohiou.edu/~sadkins/pgp/
-------------- next part --------------
A non-text attachment was scrubbed...
Name: not available
Type: application/pgp-signature
Size: 231 bytes
Desc: not available
Url : https://lists.andrew.cmu.edu/mailman/private/info-cyrus/attachments/20040517/da3ccd9a/attachment.bin


More information about the Info-cyrus mailing list