Cyrus-imapd 2.4.17: processes stick on mailbox locking resulting in total mailsystem failure
Ivan Kuznetsov
kia at solvo.ru
Thu Dec 13 06:21:04 EST 2018
Hello
We had a company mail server under Oracle Linux 6 (a clone of RHEL6)
with cyrus-imapd 2.3.16 working for years w/o problems. There are >9M
messages in ~9500 mailboxes in two partitions. Daily mail traffic is
20-50K messages.
Some weeks ago we migrated the server to a new one under Oracle Linux 7
with cyrus-imapd 2.4.17 (cyrus-imapd-2.4.17-13.el7.x86_64 rpm) and now
have problem. Some times a week one of imapd processes locks an "INBOX"
mailbox with corresponding /var/lib/imap/lock/user/<user>.lock file and
does not unlock it anymore. Other imapd processes trying to access this
mailbox sticks waiting to obtain the lock. The bigger problem is that
lmtpd processes trying to deliver new mail to this mailbox hangs too.
The number of lmtpd processes is limited (maxchild=32) to limit the
server load, so free lmtpd pool become empty after a time, and mail
delivery stopsto all the mailboxes. MTA (postfix) mail queue blows up
quickly
Example lsof output:
lmtpd 8182 cyrus 18uR REG 249,0 0 202944402
/var/lib/imap/lock/user/smilovsky.lock
lmtpd 8183 cyrus 18uR REG 249,0 0 202944402
/var/lib/imap/lock/user/smilovsky.lock
lmtpd 8187 cyrus 18uR REG 249,0 0 202944402
/var/lib/imap/lock/user/smilovsky.lock
lmtpd 8771 cyrus 18uR REG 249,0 0 202944402
/var/lib/imap/lock/user/smilovsky.lock
lmtpd 8834 cyrus 18uR REG 249,0 0 202944402
/var/lib/imap/lock/user/smilovsky.lock
lmtpd 9123 cyrus 18uR REG 249,0 0 202944402
/var/lib/imap/lock/user/smilovsky.lock
lmtpd 9631 cyrus 18uR REG 249,0 0 202944402
/var/lib/imap/lock/user/smilovsky.lock
lmtpd 10343 cyrus 18uR REG 249,0 0 202944402
/var/lib/imap/lock/user/smilovsky.lock
lmtpd 10437 cyrus 18uR REG 249,0 0 202944402
/var/lib/imap/lock/user/smilovsky.lock
lmtpd 11411 cyrus 18uR REG 249,0 0 202944402
/var/lib/imap/lock/user/smilovsky.lock
lmtpd 11413 cyrus 18uR REG 249,0 0 202944402
/var/lib/imap/lock/user/smilovsky.lock
lmtpd 11506 cyrus 18uR REG 249,0 0 202944402
/var/lib/imap/lock/user/smilovsky.lock
[root at hippo4 bin]# lsof /var/lib/imap/lock/user/smilovsky.lock | grep
'^imapd'
imapd 9132 cyrus 19uR REG 249,0 0 202944402
/var/lib/imap/lock/user/smilovsky.lock
imapd 9154 cyrus 19uR REG 249,0 0 202944402
/var/lib/imap/lock/user/smilovsky.lock
imapd 10311 cyrus 19uR REG 249,0 0 202944402
/var/lib/imap/lock/user/smilovsky.lock
imapd 10746 cyrus 19uR REG 249,0 0 202944402
/var/lib/imap/lock/user/smilovsky.lock
imapd 11843 cyrus 19uR REG 249,0 0 202944402
/var/lib/imap/lock/user/smilovsky.lock
imapd 12059 cyrus 19uR REG 249,0 0 202944402
/var/lib/imap/lock/user/smilovsky.lock
imapd 21885 cyrus 19uR REG 249,0 0 202944402
/var/lib/imap/lock/user/smilovsky.lock
In this case the root cause imapd process is 21885. strace shows that
the process waits on a futex forever:
[root at hippo4 bin]# strace -tt -f -p 21885
strace: Process 21885 attached
13:07:28.264713 futex(0x7f6dcc5ac760, FUTEX_WAIT_PRIVATE, 2,
NULL^Cstrace: Process 21885 detached
<detached ...>
Killing the process frees the lock, all the other stuck processes runs
and the problem disappears in a few seconds.
Investigation shows that this process has a long-term connection from a
user MUA (Thunderbird) and deletes mail messages time-to-time. This may
be a Thunderbird 'filter' thread
[root at hippo4 bin]# grep 'imap\[21885\]' /var/log/maillog
[...]
Dec 13 09:59:11 hippo4 imap[21885]: starttls: TLSv1.2 with cipher
DHE-RSA-AES256-SHA (256/256 bits new) no authentication
Dec 13 09:59:11 hippo4 imap[21885]: login: thunder.solvo.ru
[172.16.85.69] smilovsky CRAM-MD5+TLS User logged in
SESSIONID=<hippo.solvo.ru-21885-1544684350-1>
Dec 13 09:59:11 hippo4 imap[21885]: client id: "name" "Thunderbird"
"version" "52.5.2"
Dec 13 09:59:11 hippo4 imap[21885]: Expunged 1 messages from user.smilovsky
Dec 13 09:59:12 hippo4 imap[21885]: Expunged 1 messages from user.smilovsky
Dec 13 09:59:12 hippo4 imap[21885]: Expunged 1 messages from user.smilovsky
Dec 13 09:59:12 hippo4 imap[21885]: Expunged 1 messages from user.smilovsky
Dec 13 10:01:45 hippo4 imap[21885]: Expunged 1 messages from user.smilovsky
Dec 13 10:01:53 hippo4 imap[21885]: Expunged 1 messages from user.smilovsky
Dec 13 10:03:17 hippo4 imap[21885]: Expunged 1 messages from user.smilovsky
Dec 13 10:09:06 hippo4 imap[21885]: Expunged 1 messages from user.smilovsky
Dec 13 10:09:19 hippo4 imap[21885]: Expunged 1 messages from user.smilovsky
Problem is sporadic, I didn't see any correlation with server load, time
of day, user name and so on
Removing "maxchild=" parameter for lmtp just delay the final as the
server system resources are limited. Now I wrote a patrol script
counting locks and killing imapd if locks number is growing up. The
trouble is I can't reliably determine the root cause imapd and kill them
one-by-one
Any ideas more?
Ivan Kuznetsov, SOLVO ltd
More information about the Info-cyrus
mailing list