Cyrus-imapd 2.4.17: processes stick on mailbox locking resulting in total mailsystem failure

Ivan Kuznetsov kia at solvo.ru
Thu Dec 13 06:21:04 EST 2018


Hello

We had a company mail server under Oracle Linux 6 (a clone of RHEL6) 
with cyrus-imapd 2.3.16 working for years w/o problems. There are >9M 
messages in ~9500 mailboxes in two partitions. Daily mail traffic is 
20-50K messages.

Some weeks ago we migrated the server to a new one under Oracle Linux 7 
with cyrus-imapd 2.4.17 (cyrus-imapd-2.4.17-13.el7.x86_64 rpm) and now 
have problem. Some times a week one of imapd processes locks an "INBOX" 
mailbox with corresponding  /var/lib/imap/lock/user/<user>.lock file and 
does not unlock it anymore. Other imapd processes trying to access this 
mailbox sticks waiting to obtain the lock. The bigger problem is that 
lmtpd processes trying to deliver new mail to this mailbox hangs too. 
The number of lmtpd processes is limited (maxchild=32) to limit the 
server load, so free lmtpd pool become empty after a time, and mail 
delivery stopsto all the mailboxes. MTA (postfix) mail queue blows up 
quickly

Example lsof output:

lmtpd   8182 cyrus   18uR  REG              249,0         0 202944402 
/var/lib/imap/lock/user/smilovsky.lock
lmtpd   8183 cyrus   18uR  REG              249,0         0 202944402 
/var/lib/imap/lock/user/smilovsky.lock
lmtpd   8187 cyrus   18uR  REG              249,0         0 202944402 
/var/lib/imap/lock/user/smilovsky.lock
lmtpd   8771 cyrus   18uR  REG              249,0         0 202944402 
/var/lib/imap/lock/user/smilovsky.lock
lmtpd   8834 cyrus   18uR  REG              249,0         0 202944402 
/var/lib/imap/lock/user/smilovsky.lock
lmtpd   9123 cyrus   18uR  REG              249,0         0 202944402 
/var/lib/imap/lock/user/smilovsky.lock
lmtpd   9631 cyrus   18uR  REG              249,0         0 202944402 
/var/lib/imap/lock/user/smilovsky.lock
lmtpd   10343 cyrus   18uR  REG              249,0         0 202944402 
/var/lib/imap/lock/user/smilovsky.lock
lmtpd   10437 cyrus   18uR  REG              249,0         0 202944402 
/var/lib/imap/lock/user/smilovsky.lock
lmtpd   11411 cyrus   18uR  REG              249,0         0 202944402 
/var/lib/imap/lock/user/smilovsky.lock
lmtpd   11413 cyrus   18uR  REG              249,0         0 202944402 
/var/lib/imap/lock/user/smilovsky.lock
lmtpd   11506 cyrus   18uR  REG              249,0         0 202944402 
/var/lib/imap/lock/user/smilovsky.lock

[root at hippo4 bin]# lsof /var/lib/imap/lock/user/smilovsky.lock | grep 
'^imapd'
imapd    9132 cyrus   19uR  REG  249,0        0 202944402 
/var/lib/imap/lock/user/smilovsky.lock
imapd    9154 cyrus   19uR  REG  249,0        0 202944402 
/var/lib/imap/lock/user/smilovsky.lock
imapd   10311 cyrus   19uR  REG  249,0        0 202944402 
/var/lib/imap/lock/user/smilovsky.lock
imapd   10746 cyrus   19uR  REG  249,0        0 202944402 
/var/lib/imap/lock/user/smilovsky.lock
imapd   11843 cyrus   19uR  REG  249,0        0 202944402 
/var/lib/imap/lock/user/smilovsky.lock
imapd   12059 cyrus   19uR  REG  249,0        0 202944402 
/var/lib/imap/lock/user/smilovsky.lock
imapd   21885 cyrus   19uR  REG  249,0        0 202944402 
/var/lib/imap/lock/user/smilovsky.lock

In this case the root cause imapd process is 21885. strace shows that 
the process waits on a futex forever:

[root at hippo4 bin]# strace -tt -f -p 21885
strace: Process 21885 attached
13:07:28.264713 futex(0x7f6dcc5ac760, FUTEX_WAIT_PRIVATE, 2, 
NULL^Cstrace: Process 21885 detached
  <detached ...>

Killing the process frees the lock, all the other stuck processes runs 
and the problem disappears in a few seconds.

Investigation shows that this process has a long-term connection from a 
user MUA (Thunderbird) and deletes mail messages time-to-time. This may 
be a Thunderbird 'filter' thread

[root at hippo4 bin]# grep 'imap\[21885\]' /var/log/maillog
[...]
Dec 13 09:59:11 hippo4 imap[21885]: starttls: TLSv1.2 with cipher 
DHE-RSA-AES256-SHA (256/256 bits new) no authentication
Dec 13 09:59:11 hippo4 imap[21885]: login: thunder.solvo.ru 
[172.16.85.69] smilovsky CRAM-MD5+TLS User logged in 
SESSIONID=<hippo.solvo.ru-21885-1544684350-1>
Dec 13 09:59:11 hippo4 imap[21885]: client id: "name" "Thunderbird" 
"version" "52.5.2"
Dec 13 09:59:11 hippo4 imap[21885]: Expunged 1 messages from user.smilovsky
Dec 13 09:59:12 hippo4 imap[21885]: Expunged 1 messages from user.smilovsky
Dec 13 09:59:12 hippo4 imap[21885]: Expunged 1 messages from user.smilovsky
Dec 13 09:59:12 hippo4 imap[21885]: Expunged 1 messages from user.smilovsky
Dec 13 10:01:45 hippo4 imap[21885]: Expunged 1 messages from user.smilovsky
Dec 13 10:01:53 hippo4 imap[21885]: Expunged 1 messages from user.smilovsky
Dec 13 10:03:17 hippo4 imap[21885]: Expunged 1 messages from user.smilovsky
Dec 13 10:09:06 hippo4 imap[21885]: Expunged 1 messages from user.smilovsky
Dec 13 10:09:19 hippo4 imap[21885]: Expunged 1 messages from user.smilovsky

Problem is sporadic, I didn't see any correlation with server load, time 
of day, user name and so on

Removing "maxchild=" parameter for lmtp just delay the final as the 
server system resources are limited. Now I wrote a patrol script 
counting locks and killing imapd if locks number is growing up. The 
trouble is I can't reliably determine the root cause imapd and kill them 
one-by-one

Any ideas more?

Ivan Kuznetsov, SOLVO ltd


More information about the Info-cyrus mailing list