Cyrus-imapd 2.4.17: processes stick on mailbox locking resulting in total mailsystem failure
Ivan Kuznetsov
kia at solvo.ru
Thu Dec 13 09:50:02 EST 2018
Jonk, thank you for the idea. Somewhat looks strange as old mail server
worked w/o this problem 5+ years. But the system environment changed
dramatically, may be some filesystem quircks are significant for this
locks...
I will try gdb'ing the process when problem occurs once more
13.12.2018 17:34, John Wade пишет:
> Without running gdb on the process, I have no idea, but your problem
> sounds similar to something we hit a very long time ago:
>
> See https://www.oakton.edu/user/3/jwade/cyrus/Readme.html
>
> In our cases, the problem was the imapd process that was holding the
> lock was trying to obtain a second lock on the same file. What does a
> stack trace look like on the imapd process that is holding the lock? It
> would appear the lock process has changed since I last looked at this,
> so this may not be a help at all.
>
> Good luck,
> John
>
>
>
> On 12/13/2018 5:21 AM, Ivan Kuznetsov wrote:
>> Hello
>>
>> We had a company mail server under Oracle Linux 6 (a clone of RHEL6)
>> with cyrus-imapd 2.3.16 working for years w/o problems. There are >9M
>> messages in ~9500 mailboxes in two partitions. Daily mail traffic is
>> 20-50K messages.
>>
>> Some weeks ago we migrated the server to a new one under Oracle Linux
>> 7 with cyrus-imapd 2.4.17 (cyrus-imapd-2.4.17-13.el7.x86_64 rpm) and
>> now have problem. Some times a week one of imapd processes locks an
>> "INBOX" mailbox with corresponding /var/lib/imap/lock/user/<user>.lock
>> file and does not unlock it anymore. Other imapd processes trying to
>> access this mailbox sticks waiting to obtain the lock. The bigger
>> problem is that lmtpd processes trying to deliver new mail to this
>> mailbox hangs too. The number of lmtpd processes is limited
>> (maxchild=32) to limit the server load, so free lmtpd pool become
>> empty after a time, and mail delivery stopsto all the mailboxes. MTA
>> (postfix) mail queue blows up quickly
>>
>> Example lsof output:
>>
>> lmtpd 8182 cyrus 18uR REG 249,0 0 202944402
>> /var/lib/imap/lock/user/smilovsky.lock
>> lmtpd 8183 cyrus 18uR REG 249,0 0 202944402
>> /var/lib/imap/lock/user/smilovsky.lock
>> lmtpd 8187 cyrus 18uR REG 249,0 0 202944402
>> /var/lib/imap/lock/user/smilovsky.lock
>> lmtpd 8771 cyrus 18uR REG 249,0 0 202944402
>> /var/lib/imap/lock/user/smilovsky.lock
>> lmtpd 8834 cyrus 18uR REG 249,0 0 202944402
>> /var/lib/imap/lock/user/smilovsky.lock
>> lmtpd 9123 cyrus 18uR REG 249,0 0 202944402
>> /var/lib/imap/lock/user/smilovsky.lock
>> lmtpd 9631 cyrus 18uR REG 249,0 0 202944402
>> /var/lib/imap/lock/user/smilovsky.lock
>> lmtpd 10343 cyrus 18uR REG 249,0 0 202944402
>> /var/lib/imap/lock/user/smilovsky.lock
>> lmtpd 10437 cyrus 18uR REG 249,0 0 202944402
>> /var/lib/imap/lock/user/smilovsky.lock
>> lmtpd 11411 cyrus 18uR REG 249,0 0 202944402
>> /var/lib/imap/lock/user/smilovsky.lock
>> lmtpd 11413 cyrus 18uR REG 249,0 0 202944402
>> /var/lib/imap/lock/user/smilovsky.lock
>> lmtpd 11506 cyrus 18uR REG 249,0 0 202944402
>> /var/lib/imap/lock/user/smilovsky.lock
>>
>> [root at hippo4 bin]# lsof /var/lib/imap/lock/user/smilovsky.lock | grep
>> '^imapd'
>> imapd 9132 cyrus 19uR REG 249,0 0 202944402
>> /var/lib/imap/lock/user/smilovsky.lock
>> imapd 9154 cyrus 19uR REG 249,0 0 202944402
>> /var/lib/imap/lock/user/smilovsky.lock
>> imapd 10311 cyrus 19uR REG 249,0 0 202944402
>> /var/lib/imap/lock/user/smilovsky.lock
>> imapd 10746 cyrus 19uR REG 249,0 0 202944402
>> /var/lib/imap/lock/user/smilovsky.lock
>> imapd 11843 cyrus 19uR REG 249,0 0 202944402
>> /var/lib/imap/lock/user/smilovsky.lock
>> imapd 12059 cyrus 19uR REG 249,0 0 202944402
>> /var/lib/imap/lock/user/smilovsky.lock
>> imapd 21885 cyrus 19uR REG 249,0 0 202944402
>> /var/lib/imap/lock/user/smilovsky.lock
>>
>> In this case the root cause imapd process is 21885. strace shows that
>> the process waits on a futex forever:
>>
>> [root at hippo4 bin]# strace -tt -f -p 21885
>> strace: Process 21885 attached
>> 13:07:28.264713 futex(0x7f6dcc5ac760, FUTEX_WAIT_PRIVATE, 2,
>> NULL^Cstrace: Process 21885 detached
>> <detached ...>
>>
>> Killing the process frees the lock, all the other stuck processes runs
>> and the problem disappears in a few seconds.
>>
>> Investigation shows that this process has a long-term connection from
>> a user MUA (Thunderbird) and deletes mail messages time-to-time. This
>> may be a Thunderbird 'filter' thread
>>
>> [root at hippo4 bin]# grep 'imap\[21885\]' /var/log/maillog
>> [...]
>> Dec 13 09:59:11 hippo4 imap[21885]: starttls: TLSv1.2 with cipher
>> DHE-RSA-AES256-SHA (256/256 bits new) no authentication
>> Dec 13 09:59:11 hippo4 imap[21885]: login: thunder.solvo.ru
>> [172.16.85.69] smilovsky CRAM-MD5+TLS User logged in
>> SESSIONID=<hippo.solvo.ru-21885-1544684350-1>
>> Dec 13 09:59:11 hippo4 imap[21885]: client id: "name" "Thunderbird"
>> "version" "52.5.2"
>> Dec 13 09:59:11 hippo4 imap[21885]: Expunged 1 messages from
>> user.smilovsky
>> Dec 13 09:59:12 hippo4 imap[21885]: Expunged 1 messages from
>> user.smilovsky
>> Dec 13 09:59:12 hippo4 imap[21885]: Expunged 1 messages from
>> user.smilovsky
>> Dec 13 09:59:12 hippo4 imap[21885]: Expunged 1 messages from
>> user.smilovsky
>> Dec 13 10:01:45 hippo4 imap[21885]: Expunged 1 messages from
>> user.smilovsky
>> Dec 13 10:01:53 hippo4 imap[21885]: Expunged 1 messages from
>> user.smilovsky
>> Dec 13 10:03:17 hippo4 imap[21885]: Expunged 1 messages from
>> user.smilovsky
>> Dec 13 10:09:06 hippo4 imap[21885]: Expunged 1 messages from
>> user.smilovsky
>> Dec 13 10:09:19 hippo4 imap[21885]: Expunged 1 messages from
>> user.smilovsky
>>
>> Problem is sporadic, I didn't see any correlation with server load,
>> time of day, user name and so on
>>
>> Removing "maxchild=" parameter for lmtp just delay the final as the
>> server system resources are limited. Now I wrote a patrol script
>> counting locks and killing imapd if locks number is growing up. The
>> trouble is I can't reliably determine the root cause imapd and kill
>> them one-by-one
>>
>> Any ideas more?
>>
>> Ivan Kuznetsov, SOLVO ltd
>> ----
>> Cyrus Home Page: http://www.cyrusimap.org/
>> List Archives/Info: http://lists.andrew.cmu.edu/pipermail/info-cyrus/
>> To Unsubscribe:
>> https://lists.andrew.cmu.edu/mailman/listinfo/info-cyrus
>
--
С уважением, Иван Кузнецов
Руководитель технического отдела
Компания "СОЛВО"
+7(812)60-60-555
+7(495)66-83-003
+7(921)740-72-61
http://www.solvo.ru
More information about the Info-cyrus
mailing list