Cyrus-imapd 2.4.17: processes stick on mailbox locking resulting in total mailsystem failure

Ivan Kuznetsov kia at solvo.ru
Fri Feb 22 10:06:03 EST 2019


Hi Ellie

Thanks a lot, I will try to build and test 2.4.20

22.02.2019 04:20, ellie timoney writes:
> Hi Ivan,
> 
>> #7  0x000056193ed6382e in idle_update ()
> 
>> with cyrus-imapd 2.4.17 (cyrus-imapd-2.4.17-13.el7.x86_64 rpm)
> 
> It looks like this version is old enough to still be using signal handlers in its IMAP IDLE implementation.  This is known to be a problem, and IDLE was rewritten to not use signals for 2.5 and later.
> 
> Thomas Jarosch kindly backported the rewrite for 2.4, and it has been included in releases since 2.4.19 (the current 2.4 release is 2.4.20).
> 
> Cheers,
> 
> ellie
> 
> On Wed, Feb 20, 2019, at 1:15 AM, Ivan Kuznetsov wrote:
>> Hello
>>
>> I have gdb'ed the locked process; the backtrace is below. It seems the
>> problem occurs when the imapd process catches a signal while inside a
>> malloc() call. The signal handler calls malloc() too, so the two
>> malloc() calls deadlock against each other.
>>
>> I had only a little time to investigate because the list of locked
>> processes grows dramatically, so I didn't find out what the signal was.
>> But it seems there is a bug in the imapd code...
>>
>> (gdb) thread apply all bt
>>
>> Thread 1 (Thread 0x7ff6ead5f840 (LWP 22980)):
>> #0  0x00007ff6e83e282c in __lll_lock_wait_private () from /lib64/libc.so.6
>> #1  0x00007ff6e835e7e5 in _L_lock_16773 () from /lib64/libc.so.6
>> #2  0x00007ff6e835b833 in malloc () from /lib64/libc.so.6
>> #3  0x000056193edc9910 in xzmalloc ()
>> #4  0x000056193edb2664 in seqset_init ()
>> #5  0x000056193ed821b0 in index_tellchanges ()
>> #6  0x000056193ed8232b in index_check ()
>> #7  0x000056193ed6382e in idle_update ()
>> #8  <signal handler called>
>> #9  0x00007ff6e83580f0 in _int_malloc () from /lib64/libc.so.6
>> #10 0x00007ff6e835b7dc in malloc () from /lib64/libc.so.6
>> #11 0x00007ff6e956f4b8 in CRYPTO_malloc () from /lib64/libcrypto.so.10
>> #12 0x00007ff6e96397ec in EVP_PKEY_CTX_dup () from /lib64/libcrypto.so.10
>> #13 0x00007ff6e962b360 in EVP_MD_CTX_copy_ex () from /lib64/libcrypto.so.10
>> #14 0x00007ff6e999ab3d in tls1_mac () from /lib64/libssl.so.10
>> #15 0x00007ff6e998db02 in ssl3_read_bytes () from /lib64/libssl.so.10
>> #16 0x00007ff6e998a644 in ssl3_read_internal () from /lib64/libssl.so.10
>> #17 0x000056193edb9745 in prot_fill ()
>> #18 0x000056193eda9ad7 in getword ()
>> #19 0x000056193ed670e7 in cmd_idle ()
>> #20 0x000056193ed7848d in cmdloop ()
>> #21 0x000056193ed79769 in service_main ()
>> #22 0x000056193ed62875 in main ()
>>
>>
>> 13.12.2018 17:50, Ivan Kuznetsov writes:
>>> John, thank you for the idea. It looks somewhat strange, as the old mail
>>> server worked without this problem for 5+ years. But the system
>>> environment changed dramatically; maybe some filesystem quirks are
>>> significant for these locks...
>>>
>>> I will try gdb'ing the process when the problem occurs once more
>>>
>>> 13.12.2018 17:34, John Wade writes:
>>>> Without running gdb on the process, I have no idea, but your problem
>>>> sounds similar to something we hit a very long time ago:
>>>>
>>>> See https://www.oakton.edu/user/3/jwade/cyrus/Readme.html
>>>>
>>>> In our case, the problem was that the imapd process holding the
>>>> lock was trying to obtain a second lock on the same file. What does
>>>> a stack trace look like on the imapd process that is holding the lock?
>>>> It would appear the locking code has changed since I last looked at
>>>> this, so this may not be any help at all.
>>>>
>>>> Good luck,
>>>> John
>>>>
>>>>
>>>>
>>>> On 12/13/2018 5:21 AM, Ivan Kuznetsov wrote:
>>>>> Hello
>>>>>
>>>>> We had a company mail server under Oracle Linux 6 (a clone of RHEL6)
>>>>> with cyrus-imapd 2.3.16 working for years w/o problems. There are >9M
>>>>> messages in ~9500 mailboxes in two partitions. Daily mail traffic is
>>>>> 20-50K messages.
>>>>>
>>>>> Some weeks ago we migrated the server to a new one under Oracle Linux
>>>>> 7 with cyrus-imapd 2.4.17 (cyrus-imapd-2.4.17-13.el7.x86_64 rpm) and
>>>>> now have a problem. A few times a week, one of the imapd processes
>>>>> locks an "INBOX" mailbox with the corresponding
>>>>> /var/lib/imap/lock/user/<user>.lock file and never unlocks it.
>>>>> Other imapd processes trying to access this mailbox get stuck
>>>>> waiting for the lock. The bigger problem is that lmtpd processes
>>>>> trying to deliver new mail to this mailbox hang too. The number of
>>>>> lmtpd processes is limited (maxchild=32) to limit the server load,
>>>>> so the free lmtpd pool empties after a while and mail delivery
>>>>> stops to all mailboxes. The MTA (postfix) mail queue blows up
>>>>> quickly.
>>>>>
>>>>> Example lsof output:
>>>>>
>>>>> lmtpd    8182 cyrus   18uR  REG  249,0  0 202944402 /var/lib/imap/lock/user/smilovsky.lock
>>>>> lmtpd    8183 cyrus   18uR  REG  249,0  0 202944402 /var/lib/imap/lock/user/smilovsky.lock
>>>>> lmtpd    8187 cyrus   18uR  REG  249,0  0 202944402 /var/lib/imap/lock/user/smilovsky.lock
>>>>> lmtpd    8771 cyrus   18uR  REG  249,0  0 202944402 /var/lib/imap/lock/user/smilovsky.lock
>>>>> lmtpd    8834 cyrus   18uR  REG  249,0  0 202944402 /var/lib/imap/lock/user/smilovsky.lock
>>>>> lmtpd    9123 cyrus   18uR  REG  249,0  0 202944402 /var/lib/imap/lock/user/smilovsky.lock
>>>>> lmtpd    9631 cyrus   18uR  REG  249,0  0 202944402 /var/lib/imap/lock/user/smilovsky.lock
>>>>> lmtpd   10343 cyrus   18uR  REG  249,0  0 202944402 /var/lib/imap/lock/user/smilovsky.lock
>>>>> lmtpd   10437 cyrus   18uR  REG  249,0  0 202944402 /var/lib/imap/lock/user/smilovsky.lock
>>>>> lmtpd   11411 cyrus   18uR  REG  249,0  0 202944402 /var/lib/imap/lock/user/smilovsky.lock
>>>>> lmtpd   11413 cyrus   18uR  REG  249,0  0 202944402 /var/lib/imap/lock/user/smilovsky.lock
>>>>> lmtpd   11506 cyrus   18uR  REG  249,0  0 202944402 /var/lib/imap/lock/user/smilovsky.lock
>>>>>
>>>>> [root@hippo4 bin]# lsof /var/lib/imap/lock/user/smilovsky.lock | grep '^imapd'
>>>>> imapd    9132 cyrus   19uR  REG  249,0  0 202944402 /var/lib/imap/lock/user/smilovsky.lock
>>>>> imapd    9154 cyrus   19uR  REG  249,0  0 202944402 /var/lib/imap/lock/user/smilovsky.lock
>>>>> imapd   10311 cyrus   19uR  REG  249,0  0 202944402 /var/lib/imap/lock/user/smilovsky.lock
>>>>> imapd   10746 cyrus   19uR  REG  249,0  0 202944402 /var/lib/imap/lock/user/smilovsky.lock
>>>>> imapd   11843 cyrus   19uR  REG  249,0  0 202944402 /var/lib/imap/lock/user/smilovsky.lock
>>>>> imapd   12059 cyrus   19uR  REG  249,0  0 202944402 /var/lib/imap/lock/user/smilovsky.lock
>>>>> imapd   21885 cyrus   19uR  REG  249,0  0 202944402 /var/lib/imap/lock/user/smilovsky.lock
>>>>>
>>>>> In this case the root cause imapd process is 21885. strace shows that
>>>>> the process waits on a futex forever:
>>>>>
>>>>> [root@hippo4 bin]# strace -tt -f -p 21885
>>>>> strace: Process 21885 attached
>>>>> 13:07:28.264713 futex(0x7f6dcc5ac760, FUTEX_WAIT_PRIVATE, 2,
>>>>> NULL^Cstrace: Process 21885 detached
>>>>>   <detached ...>
>>>>>
>>>>> Killing the process frees the lock, all the other stuck processes
>>>>> run, and the problem disappears in a few seconds.
>>>>>
>>>>> Investigation shows that this process has a long-term connection from
>>>>> a user MUA (Thunderbird) and deletes mail messages from time to time.
>>>>> This may be a Thunderbird 'filter' thread.
>>>>>
>>>>> [root@hippo4 bin]# grep 'imap\[21885\]' /var/log/maillog
>>>>> [...]
>>>>> Dec 13 09:59:11 hippo4 imap[21885]: starttls: TLSv1.2 with cipher DHE-RSA-AES256-SHA (256/256 bits new) no authentication
>>>>> Dec 13 09:59:11 hippo4 imap[21885]: login: thunder.solvo.ru [172.16.85.69] smilovsky CRAM-MD5+TLS User logged in SESSIONID=<hippo.solvo.ru-21885-1544684350-1>
>>>>> Dec 13 09:59:11 hippo4 imap[21885]: client id: "name" "Thunderbird" "version" "52.5.2"
>>>>> Dec 13 09:59:11 hippo4 imap[21885]: Expunged 1 messages from user.smilovsky
>>>>> Dec 13 09:59:12 hippo4 imap[21885]: Expunged 1 messages from user.smilovsky
>>>>> Dec 13 09:59:12 hippo4 imap[21885]: Expunged 1 messages from user.smilovsky
>>>>> Dec 13 09:59:12 hippo4 imap[21885]: Expunged 1 messages from user.smilovsky
>>>>> Dec 13 10:01:45 hippo4 imap[21885]: Expunged 1 messages from user.smilovsky
>>>>> Dec 13 10:01:53 hippo4 imap[21885]: Expunged 1 messages from user.smilovsky
>>>>> Dec 13 10:03:17 hippo4 imap[21885]: Expunged 1 messages from user.smilovsky
>>>>> Dec 13 10:09:06 hippo4 imap[21885]: Expunged 1 messages from user.smilovsky
>>>>> Dec 13 10:09:19 hippo4 imap[21885]: Expunged 1 messages from user.smilovsky
>>>>>
>>>>> The problem is sporadic; I don't see any correlation with server load,
>>>>> time of day, user name, and so on.
>>>>>
>>>>> Removing the "maxchild=" parameter for lmtp just delays the inevitable,
>>>>> as the server's system resources are limited. For now I have written a
>>>>> patrol script that counts locks and kills imapd processes when the
>>>>> number of locks grows. The trouble is that I can't reliably determine
>>>>> the root-cause imapd, so I kill them one by one.
>>>>>
>>>>> Any more ideas?
>>>>>
>>>>> Ivan Kuznetsov, SOLVO ltd
>>>>> ----
>>>>> Cyrus Home Page: http://www.cyrusimap.org/
>>>>> List Archives/Info: http://lists.andrew.cmu.edu/pipermail/info-cyrus/
>>>>> To Unsubscribe:
>>>>> https://lists.andrew.cmu.edu/mailman/listinfo/info-cyrus
>>>>
>>>
>>
>> -- 
>> Best regards, Ivan Kuznetsov
>> Head of Technical Department
>>
>> SOLVO Ltd
>> +7(812)60-60-555
>> +7(495)66-83-003
>> +7(921)740-72-61
>> http://www.solvo.ru
> 

-- 
Best regards, Ivan Kuznetsov
Head of Technical Department

SOLVO Ltd
+7(812)60-60-555
+7(495)66-83-003
+7(921)740-72-61
http://www.solvo.ru


More information about the Info-cyrus mailing list