Cyrus-imapd 2.4.17: processes stick on mailbox locking resulting in total mailsystem failure

Ivan Kuznetsov kia at solvo.ru
Tue Feb 19 09:10:53 EST 2019


Hello

I have gdb'ed the locked process, backtrace is below. It seems that 
problem occurs when imapd process catch a signal when is inside malloc() 
call. The signal handler has a malloc() call too, so finally there is 
interlock between two mallocs

I have only a few time to investigate because locked processes list 
grows up dramatically. So I didn't found what the signal was. But it 
seems that there is a bug in imapd code...

(gdb) thread apply all bt

Thread 1 (Thread 0x7ff6ead5f840 (LWP 22980)):
#0  0x00007ff6e83e282c in __lll_lock_wait_private () from /lib64/libc.so.6
#1  0x00007ff6e835e7e5 in _L_lock_16773 () from /lib64/libc.so.6
#2  0x00007ff6e835b833 in malloc () from /lib64/libc.so.6
#3  0x000056193edc9910 in xzmalloc ()
#4  0x000056193edb2664 in seqset_init ()
#5  0x000056193ed821b0 in index_tellchanges ()
#6  0x000056193ed8232b in index_check ()
#7  0x000056193ed6382e in idle_update ()
#8  <signal handler called>
#9  0x00007ff6e83580f0 in _int_malloc () from /lib64/libc.so.6
#10 0x00007ff6e835b7dc in malloc () from /lib64/libc.so.6
#11 0x00007ff6e956f4b8 in CRYPTO_malloc () from /lib64/libcrypto.so.10
#12 0x00007ff6e96397ec in EVP_PKEY_CTX_dup () from /lib64/libcrypto.so.10
#13 0x00007ff6e962b360 in EVP_MD_CTX_copy_ex () from /lib64/libcrypto.so.10
#14 0x00007ff6e999ab3d in tls1_mac () from /lib64/libssl.so.10
#15 0x00007ff6e998db02 in ssl3_read_bytes () from /lib64/libssl.so.10
#16 0x00007ff6e998a644 in ssl3_read_internal () from /lib64/libssl.so.10
#17 0x000056193edb9745 in prot_fill ()
#18 0x000056193eda9ad7 in getword ()
#19 0x000056193ed670e7 in cmd_idle ()
#20 0x000056193ed7848d in cmdloop ()
#21 0x000056193ed79769 in service_main ()
#22 0x000056193ed62875 in main ()


13.12.2018 17:50, Ivan Kuznetsov пишет:
> Jonk, thank you for the idea. Somewhat looks strange as old mail server 
> worked w/o this problem 5+ years. But the system environment changed 
> dramatically, may be some filesystem quircks are significant for this 
> locks...
> 
> I will try gdb'ing the process when problem occurs once more
> 
> 13.12.2018 17:34, John Wade пишет:
>> Without running gdb on the process, I have no idea, but your problem 
>> sounds similar to something we hit a very long time ago:
>>
>> See https://www.oakton.edu/user/3/jwade/cyrus/Readme.html
>>
>> In our cases, the problem was the imapd process that was holding the 
>> lock was trying to obtain a second lock on the same file.   What does 
>> a stack trace look like on the imapd process that is holding the lock? 
>> It would appear the lock process has changed since I last looked at 
>> this, so this may not be a help at all.
>>
>> Good luck,
>> John
>>
>>
>>
>> On 12/13/2018 5:21 AM, Ivan Kuznetsov wrote:
>>> Hello
>>>
>>> We had a company mail server under Oracle Linux 6 (a clone of RHEL6) 
>>> with cyrus-imapd 2.3.16 working for years w/o problems. There are >9M 
>>> messages in ~9500 mailboxes in two partitions. Daily mail traffic is 
>>> 20-50K messages.
>>>
>>> Some weeks ago we migrated the server to a new one under Oracle Linux 
>>> 7 with cyrus-imapd 2.4.17 (cyrus-imapd-2.4.17-13.el7.x86_64 rpm) and 
>>> now have problem. Some times a week one of imapd processes locks an 
>>> "INBOX" mailbox with corresponding 
>>> /var/lib/imap/lock/user/<user>.lock file and does not unlock it 
>>> anymore. Other imapd processes trying to access this mailbox sticks 
>>> waiting to obtain the lock. The bigger problem is that lmtpd 
>>> processes trying to deliver new mail to this mailbox hangs too. The 
>>> number of lmtpd processes is limited (maxchild=32) to limit the 
>>> server load, so free lmtpd pool become empty after a time, and mail 
>>> delivery stopsto all the mailboxes. MTA (postfix) mail queue blows up 
>>> quickly
>>>
>>> Example lsof output:
>>>
>>> lmtpd   8182 cyrus   18uR  REG              249,0         0 202944402 
>>> /var/lib/imap/lock/user/smilovsky.lock
>>> lmtpd   8183 cyrus   18uR  REG              249,0         0 202944402 
>>> /var/lib/imap/lock/user/smilovsky.lock
>>> lmtpd   8187 cyrus   18uR  REG              249,0         0 202944402 
>>> /var/lib/imap/lock/user/smilovsky.lock
>>> lmtpd   8771 cyrus   18uR  REG              249,0         0 202944402 
>>> /var/lib/imap/lock/user/smilovsky.lock
>>> lmtpd   8834 cyrus   18uR  REG              249,0         0 202944402 
>>> /var/lib/imap/lock/user/smilovsky.lock
>>> lmtpd   9123 cyrus   18uR  REG              249,0         0 202944402 
>>> /var/lib/imap/lock/user/smilovsky.lock
>>> lmtpd   9631 cyrus   18uR  REG              249,0         0 202944402 
>>> /var/lib/imap/lock/user/smilovsky.lock
>>> lmtpd   10343 cyrus   18uR  REG              249,0         0 
>>> 202944402 /var/lib/imap/lock/user/smilovsky.lock
>>> lmtpd   10437 cyrus   18uR  REG              249,0         0 
>>> 202944402 /var/lib/imap/lock/user/smilovsky.lock
>>> lmtpd   11411 cyrus   18uR  REG              249,0         0 
>>> 202944402 /var/lib/imap/lock/user/smilovsky.lock
>>> lmtpd   11413 cyrus   18uR  REG              249,0         0 
>>> 202944402 /var/lib/imap/lock/user/smilovsky.lock
>>> lmtpd   11506 cyrus   18uR  REG              249,0         0 
>>> 202944402 /var/lib/imap/lock/user/smilovsky.lock
>>>
>>> [root at hippo4 bin]# lsof /var/lib/imap/lock/user/smilovsky.lock | grep 
>>> '^imapd'
>>> imapd    9132 cyrus   19uR  REG  249,0        0 202944402 
>>> /var/lib/imap/lock/user/smilovsky.lock
>>> imapd    9154 cyrus   19uR  REG  249,0        0 202944402 
>>> /var/lib/imap/lock/user/smilovsky.lock
>>> imapd   10311 cyrus   19uR  REG  249,0        0 202944402 
>>> /var/lib/imap/lock/user/smilovsky.lock
>>> imapd   10746 cyrus   19uR  REG  249,0        0 202944402 
>>> /var/lib/imap/lock/user/smilovsky.lock
>>> imapd   11843 cyrus   19uR  REG  249,0        0 202944402 
>>> /var/lib/imap/lock/user/smilovsky.lock
>>> imapd   12059 cyrus   19uR  REG  249,0        0 202944402 
>>> /var/lib/imap/lock/user/smilovsky.lock
>>> imapd   21885 cyrus   19uR  REG  249,0        0 202944402 
>>> /var/lib/imap/lock/user/smilovsky.lock
>>>
>>> In this case the root cause imapd process is 21885. strace shows that 
>>> the process waits on a futex forever:
>>>
>>> [root at hippo4 bin]# strace -tt -f -p 21885
>>> strace: Process 21885 attached
>>> 13:07:28.264713 futex(0x7f6dcc5ac760, FUTEX_WAIT_PRIVATE, 2, 
>>> NULL^Cstrace: Process 21885 detached
>>>  <detached ...>
>>>
>>> Killing the process frees the lock, all the other stuck processes 
>>> runs and the problem disappears in a few seconds.
>>>
>>> Investigation shows that this process has a long-term connection from 
>>> a user MUA (Thunderbird) and deletes mail messages time-to-time. This 
>>> may be a Thunderbird 'filter' thread
>>>
>>> [root at hippo4 bin]# grep 'imap\[21885\]' /var/log/maillog
>>> [...]
>>> Dec 13 09:59:11 hippo4 imap[21885]: starttls: TLSv1.2 with cipher 
>>> DHE-RSA-AES256-SHA (256/256 bits new) no authentication
>>> Dec 13 09:59:11 hippo4 imap[21885]: login: thunder.solvo.ru 
>>> [172.16.85.69] smilovsky CRAM-MD5+TLS User logged in 
>>> SESSIONID=<hippo.solvo.ru-21885-1544684350-1>
>>> Dec 13 09:59:11 hippo4 imap[21885]: client id: "name" "Thunderbird" 
>>> "version" "52.5.2"
>>> Dec 13 09:59:11 hippo4 imap[21885]: Expunged 1 messages from 
>>> user.smilovsky
>>> Dec 13 09:59:12 hippo4 imap[21885]: Expunged 1 messages from 
>>> user.smilovsky
>>> Dec 13 09:59:12 hippo4 imap[21885]: Expunged 1 messages from 
>>> user.smilovsky
>>> Dec 13 09:59:12 hippo4 imap[21885]: Expunged 1 messages from 
>>> user.smilovsky
>>> Dec 13 10:01:45 hippo4 imap[21885]: Expunged 1 messages from 
>>> user.smilovsky
>>> Dec 13 10:01:53 hippo4 imap[21885]: Expunged 1 messages from 
>>> user.smilovsky
>>> Dec 13 10:03:17 hippo4 imap[21885]: Expunged 1 messages from 
>>> user.smilovsky
>>> Dec 13 10:09:06 hippo4 imap[21885]: Expunged 1 messages from 
>>> user.smilovsky
>>> Dec 13 10:09:19 hippo4 imap[21885]: Expunged 1 messages from 
>>> user.smilovsky
>>>
>>> Problem is sporadic, I didn't see any correlation with server load, 
>>> time of day, user name and so on
>>>
>>> Removing "maxchild=" parameter for lmtp just delay the final as the 
>>> server system resources are limited. Now I wrote a patrol script 
>>> counting locks and killing imapd if locks number is growing up. The 
>>> trouble is I can't reliably determine the root cause imapd and kill 
>>> them one-by-one
>>>
>>> Any ideas more?
>>>
>>> Ivan Kuznetsov, SOLVO ltd
>>> ----
>>> Cyrus Home Page: http://www.cyrusimap.org/
>>> List Archives/Info: http://lists.andrew.cmu.edu/pipermail/info-cyrus/
>>> To Unsubscribe:
>>> https://lists.andrew.cmu.edu/mailman/listinfo/info-cyrus
>>
> 

-- 
С уважением, Иван Кузнецов
Руководитель технического отдела

Компания "СОЛВО"
+7(812)60-60-555
+7(495)66-83-003
+7(921)740-72-61
http://www.solvo.ru


More information about the Info-cyrus mailing list