IMAP/LMTP/Quota locking problem

Sebastian Hagedorn Hagedorn at uni-koeln.de
Mon May 17 08:47:33 EDT 2004


Hi,

sorry, this is quite long ... I'm looking for the updated status of the 
problem discussed in the threads "stuck lmtpd processes":

<http://asg.web.cmu.edu/archive/message.php?mailbox=archive.info-cyrus&msg=24927>

and "followup: stuck lmtpd processes":

<http://asg.web.cmu.edu/archive/message.php?mailbox=archive.info-cyrus&searchterm=stuck%20lmtpd&msg=24934>

AFAICT Cyrus IMAP 2.1.x hasn't been updated since. Could anyone comment on 
whether this is fixed in 2.2.3 and if we *have* to update in order to get 
that fix? In the last two weeks we've had this (or a very similar) problem 
twice with 2.1.16 und Red Hat AS 2.1.

For some reason in both cases tons of sendmail processes were building up. 
We have configured sendmail to deliver in "interactive" mode, so each 
instance tries to connect to lmtpd. I would've expected those attempts to 
time out relatively quickly. I'm not sure why they don't. Anyway, that's 
somewhat off-topic ...

The first time it happened I was around and collected some info while it 
was happening. I'd noticed that mail was no longer being delivered. We have 
maxprocs set to 30 for lmtpd:

[root at lvr1 root]# ps -aef|grep lmtp
cyrus     1383  2095  0 10:10 ?        00:00:18 lmtpd
cyrus     1974  2095  0 10:11 ?        00:00:00 lmtpd
cyrus    10631  2095  0 10:36 ?        00:00:00 lmtpd
cyrus    17162  2095  0 10:52 ?        00:00:00 lmtpd
cyrus    17262  2095  0 10:52 ?        00:00:06 lmtpd
cyrus    25687  2095  0 11:17 ?        00:00:00 lmtpd
cyrus    28377  2095  0 11:23 ?        00:00:09 lmtpd
cyrus    31251  2095  0 11:32 ?        00:00:00 lmtpd
cyrus     3824  2095  0 11:46 ?        00:00:07 lmtpd
cyrus     7534  2095  0 11:56 ?        00:00:00 lmtpd
cyrus     8248  2095  0 11:58 ?        00:00:08 lmtpd
cyrus    22103  2095  0 12:35 ?        00:00:00 lmtpd
cyrus    23725  2095  0 12:39 ?        00:00:01 lmtpd
cyrus    25366  2095  0 12:45 ?        00:00:00 lmtpd
cyrus     6108  2095  0 13:23 ?        00:00:00 lmtpd
cyrus    10832  2095  0 13:37 ?        00:00:00 lmtpd
cyrus    12972  2095  0 13:42 ?        00:00:00 lmtpd
cyrus    16114  2095  0 13:53 ?        00:00:00 lmtpd
cyrus    21377  2095  0 14:07 ?        00:00:09 lmtpd
cyrus    24581  2095  0 14:16 ?        00:00:00 lmtpd
cyrus    26200  2095  0 14:21 ?        00:00:14 lmtpd
cyrus    26577  2095  0 14:22 ?        00:00:00 lmtpd
cyrus    31893  2095  0 14:35 ?        00:00:03 lmtpd
cyrus    32575  2095  0 14:36 ?        00:00:07 lmtpd
cyrus    32659  2095  0 14:37 ?        00:00:01 lmtpd
cyrus    14146  2095  0 15:16 ?        00:00:00 lmtpd
cyrus    21049  2095  0 15:36 ?        00:00:00 lmtpd
cyrus    26969  2095  0 15:56 ?        00:00:11 lmtpd
cyrus    13275  2095  0 15:15 ?        00:00:04 lmtpd
cyrus    17679  2095  0 13:57 ?        00:00:00 lmtpd

Obviously something was wrong.

[root at lvr1 root]# lsof -p 26969
COMMAND   PID  USER   FD   TYPE     DEVICE     SIZE      NODE NAME
lmtpd   26969 cyrus  cwd    DIR        9,0     4096    393296 /root
<snip>
lmtpd   26969 cyrus  mem    REG       8,73  3841576        30 
/var/lib/imap/mailboxes.db
lmtpd   26969 cyrus  mem    REG       8,49      194 145326099 
/var/spool/imap/M/user/therbst/cyrus.header
<snip>
lmtpd   26969 cyrus    9u   REG       8,49     3130  56013510 
/var/spool/imap/stage./26969-1083938870
lmtpd   26969 cyrus   10u  unix 0xef1ce040          614472274 socket
lmtpd   26969 cyrus   11u   REG       8,49      194 145326099 
/var/spool/imap/M/user/therbst/cyrus.header

I gathered that /var/spool/imap/M/user/therbst/cyrus.header was causing the 
problems.Its node number (is that the inode??) is 145326099:

[root at lvr1 root]# grep 145326099 /proc/locks
1162: POSIX  ADVISORY  WRITE 1974 08:31:145326099 0 EOF e9b58d54 e9b587f4 
cd5a7964 00000000 d1c324c0
1162: -> POSIX  ADVISORY  WRITE 1383 08:31:145326099 0 EOF d1c324b4 
c7fd745c f3bd0964 e9b58d54 c7fd7464
1162: -> POSIX  ADVISORY  WRITE 10631 08:31:145326099 0 EOF c7fd7458 
d1c3245c d1c324b8 e9b58d54 d1c32464
1162: -> POSIX  ADVISORY  WRITE 17162 08:31:145326099 0 EOF d1c32458 
d6989400 c7fd745c e9b58d54 d6989408
1162: -> POSIX  ADVISORY  WRITE 17262 08:31:145326099 0 EOF d69893fc 
dfb2eca0 d1c3245c e9b58d54 dfb2eca8
1162: -> POSIX  ADVISORY  WRITE 25687 08:31:145326099 0 EOF dfb2ec9c 
f3972be8 d6989400 e9b58d54 f3972bf0
1162: -> POSIX  ADVISORY  WRITE 28377 08:31:145326099 0 EOF f3972be4 
c47d96e0 dfb2eca0 e9b58d54 c47d96e8
1162: -> POSIX  ADVISORY  WRITE 31251 08:31:145326099 0 EOF c47d96dc 
cd5a7d58 f3972be8 e9b58d54 cd5a7d60
1162: -> POSIX  ADVISORY  WRITE 7534 08:31:145326099 0 EOF cd5a7d54 
c47d98ac c47d96e0 e9b58d54 c47d98b4
1162: -> POSIX  ADVISORY  WRITE 3824 08:31:145326099 0 EOF c47d98a8 
c7fd7400 cd5a7d58 e9b58d54 c7fd7408
1162: -> POSIX  ADVISORY  WRITE 8248 08:31:145326099 0 EOF c7fd73fc 
c8733be8 c47d98ac e9b58d54 c8733bf0
1162: -> POSIX  ADVISORY  WRITE 22103 08:31:145326099 0 EOF c8733be4 
cd197f24 c7fd7400 e9b58d54 cd197f2c
1162: -> POSIX  ADVISORY  WRITE 25366 08:31:145326099 0 EOF cd197f20 
d1c32290 c8733be8 e9b58d54 d1c32298
1162: -> POSIX  ADVISORY  WRITE 23725 08:31:145326099 0 EOF d1c3228c 
c47d9964 cd197f24 e9b58d54 c47d996c
1162: -> POSIX  ADVISORY  WRITE 6108 08:31:145326099 0 EOF c47d9960 
f3bd0234 d1c32290 e9b58d54 f3bd023c
1162: -> POSIX  ADVISORY  WRITE 10832 08:31:145326099 0 EOF f3bd0230 
c47d9290 c47d9964 e9b58d54 c47d9298
1162: -> POSIX  ADVISORY  WRITE 12972 08:31:145326099 0 EOF c47d928c 
d1b450c4 f3bd0234 e9b58d54 d1b450cc
1162: -> POSIX  ADVISORY  WRITE 16114 08:31:145326099 0 EOF d1b450c0 
dfb2e120 c47d9290 e9b58d54 dfb2e128
1162: -> POSIX  ADVISORY  WRITE 26577 08:31:145326099 0 EOF dfb2e11c 
f6bfb7f4 d1b450c4 e9b58d54 f6bfb7fc
1162: -> POSIX  ADVISORY  WRITE 24581 08:31:145326099 0 EOF f6bfb7f0 
f6bfb3a4 dfb2e120 e9b58d54 f6bfb3ac
1162: -> POSIX  ADVISORY  WRITE 21377 08:31:145326099 0 EOF f6bfb3a0 
c87333a4 f6bfb7f4 e9b58d54 c87333ac
1162: -> POSIX  ADVISORY  WRITE 32575 08:31:145326099 0 EOF c87333a0 
d1c32628 f6bfb3a4 e9b58d54 d1c32630
1162: -> POSIX  ADVISORY  WRITE 32659 08:31:145326099 0 EOF d1c32624 
edb35e10 c87333a4 e9b58d54 edb35e18
1162: -> POSIX  ADVISORY  WRITE 31893 08:31:145326099 0 EOF edb35e0c 
edb35798 d1c32628 e9b58d54 edb357a0
1162: -> POSIX  ADVISORY  WRITE 14146 08:31:145326099 0 EOF edb35794 
cdf5ff24 edb35e10 e9b58d54 cdf5ff2c
1162: -> POSIX  ADVISORY  WRITE 21049 08:31:145326099 0 EOF cdf5ff20 
e8adbca0 edb35798 e9b58d54 e8adbca8
1162: -> POSIX  ADVISORY  WRITE 26200 08:31:145326099 0 EOF e8adbc9c 
cd5a7798 cdf5ff24 e9b58d54 cd5a77a0
1162: -> POSIX  ADVISORY  WRITE 26969 08:31:145326099 0 EOF cd5a7794 
d1695628 e8adbca0 e9b58d54 d1695630
1162: -> POSIX  ADVISORY  WRITE 13275 08:31:145326099 0 EOF d1695624 
d1b45e6c cd5a7798 e9b58d54 d1b45e74
1162: -> POSIX  ADVISORY  WRITE 17679 08:31:145326099 0 EOF d1b45e68 
cdf5f5cc d1695628 e9b58d54 e9b58d60

The PIDs waiting for locks correspond to the PIDs of the stuck lmtpd 
processes. At that point I restarted Cyrus, because I wanted to get the 
system going again. Only later did I read that the actual problem is more 
likely caused by a locked quota file. Unfortunately I did not further 
examine PID 1974, which held the lock.

Anyway, this past Saturday it happened again when I was not around. 
Ultimately the system shut itself down, because the Red Hat cluster 
software wasn't able to update its raw device any longer. So when I 
examined the case, all I could go on were log files. Still I was able to 
piece it together, I think.

It started here:

May 14 09:04:59 lvr1 imapd[14301]: login: 
vpn82-4.vpn.Uni-Koeln.DE[134.95.82.4] s0283928 plaintext+TLS
May 14 09:04:59 lvr1 imapd[14301]: seen_db: user s0283928 opened 
/var/lib/imap/user/B/s0283928.seen
May 14 09:04:59 lvr1 imapd[14301]: open: user s0283928 opened Drafts

As you can see a VPN dial-up connection was used. I think it likely that 
the notorious APPEND problem happened. i.e. the connection was dropped by 
the client while the server was doing an APPEND. This is bug 1270 
<https://bugzilla.andrew.cmu.edu/show_bug.cgi?id=1270> and is marked FIXED. 
For completeness' sake: after the IMAP session above, hte lmtpd processes 
started getting stuck when delivering to this user:

May 14 15:59:44 lvr1 lmtpd[3879]: duplicate_check: 
<F7152D0EDA04D3119A210008C789EB1113302803 at pcaple01.vw.com.mx> user.s0283928
       0
May 14 16:42:57 lvr1 lmtpd[26559]: duplicate_check: 
<F7152D0EDA04D3119A210008C789EB1113302803 at pcaple01.vw.com.mx> user.s0283928
        0
May 14 17:26:45 lvr1 lmtpd[28635]: duplicate_check: 
<F7152D0EDA04D3119A210008C789EB1113302803 at pcaple01.vw.com.mx> user.s0283928
        0
May 14 18:08:01 lvr1 lmtpd[32017]: duplicate_check: 
<F7152D0EDA04D3119A210008C789EB1113302803 at pcaple01.vw.com.mx> user.s0283928
        0
May 14 18:43:25 lvr1 lmtpd[6860]: duplicate_check: 
<F7152D0EDA04D3119A210008C789EB1113302803 at pcaple01.vw.com.mx> user.s0283928
       0
May 14 19:34:23 lvr1 lmtpd[28641]: duplicate_check: 
<F7152D0EDA04D3119A210008C789EB1113302803 at pcaple01.vw.com.mx> user.s0283928
        0
May 14 19:49:37 lvr1 lmtpd[26587]: duplicate_check: 
<F7152D0EDA04D3119A210008C789EB1113302803 at pcaple01.vw.com.mx> user.s0283928
        0
May 14 20:44:45 lvr1 lmtpd[2960]: duplicate_check: 
<F7152D0EDA04D3119A210008C789EB1113302803 at pcaple01.vw.com.mx> user.s0283928
       0
May 14 21:24:31 lvr1 lmtpd[16921]: duplicate_check: 
<F7152D0EDA04D3119A210008C789EB1113302803 at pcaple01.vw.com.mx> user.s0283928
        0
May 14 21:45:35 lvr1 lmtpd[3571]: duplicate_check: 
<F7152D0EDA04D3119A210008C789EB1113302803 at pcaple01.vw.com.mx> user.s0283928
       0
May 14 22:25:25 lvr1 lmtpd[31517]: duplicate_check: 
<F7152D0EDA04D3119A210008C789EB1113302803 at pcaple01.vw.com.mx> user.s0283928
        0
May 14 22:59:53 lvr1 lmtpd[26060]: duplicate_check: 
<F7152D0EDA04D3119A210008C789EB1113302803 at pcaple01.vw.com.mx> user.s0283928
        0
May 14 23:37:39 lvr1 lmtpd[4317]: duplicate_check: 
<F7152D0EDA04D3119A210008C789EB1113302803 at pcaple01.vw.com.mx> user.s0283928
       0
May 15 00:08:15 lvr1 lmtpd[930]: duplicate_check: 
<F7152D0EDA04D3119A210008C789EB1113302803 at pcaple01.vw.com.mx> user.s0283928 


So, is this really fixed in 2.2.3? Has anybody seen something like this 
since? And would it be OK to apply the patch to 2.1.16? Or does it depend 
on other stuff that happened in CVS between 2.1.16 and the time the patch 
was created?

Thanks, Sebastian Hagedorn
--
Sebastian Hagedorn M.A. - RZKR-R1 (Gebäude 52), Zimmer 18
Zentrum für angewandte Informatik - Universitätsweiter Service RRZK
Universität zu Köln / Cologne University - Tel. +49-221-478-5587
-------------- next part --------------
A non-text attachment was scrubbed...
Name: not available
Type: application/pgp-signature
Size: 194 bytes
Desc: not available
Url : https://lists.andrew.cmu.edu/mailman/private/info-cyrus/attachments/20040517/87cf76ee/attachment.bin


More information about the Info-cyrus mailing list