cyrus 2.4 deadlock identified: SIGALRM race
thomas.jarosch at intra2net.com
Tue Sep 1 08:31:23 EDT 2015
Thanks to the recent lock debugging tool and some very good luck,
I was able to track down the mysterious cyrus 2.4 (and earlier) deadlock.
Here's the output from the lock debugger:
/usr/cyrus/bin/imapd (pid 3301) holding WRITE lock for /datastore/imap-mails/user/projects/cyrus.index
/usr/cyrus/bin/imapd (pid 21130) ++WAITING++ for WRITE lock on /datastore/imap-mails/user/projects/cyrus.index
/usr/cyrus/bin/imapd (pid 20536) ++WAITING++ for WRITE lock on /datastore/imap-mails/user/projects/cyrus.index
Backtrace of process 3301:
#0 0xb77c9428 in __kernel_vsyscall ()
#1 0xb735af91 in __lll_lock_wait_private () from /lib/libc.so.6
#2 0xb72c88fe in _L_lock_9705 () from /lib/libc.so.6
#3 0xb72c66f0 in malloc () from /lib/libc.so.6
#4 0x080b7557 in xzmalloc (size=32) at xmalloc.c:68
#5 0x080a27b6 in seqset_init (maxval=0, flags=1) at sequence.c:59
#6 0x0806d152 in index_tellexpunge (state=0x9421ca8) at index.c:2319
#7 index_tellchanges (state=0x9421ca8, canexpunge=1, printuid=0) at index.c:2370
#8 0x08071041 in index_check (state=0x9421ca8, usinguid=1, printuid=0) at index.c:682
#9 0x080515ae in idle_update (flags=(IDLE_MAILBOX | IDLE_ALERT)) at imapd.c:2833
#10 0x0809abc5 in idle_handler (sig=14) at idle.c:197
#11 <signal handler called>
#12 0xb72c52d4 in _int_malloc () from /lib/libc.so.6
#13 0xb72c66fa in malloc () from /lib/libc.so.6
#14 0xb74bb21c in ?? () from /usr/lib/libcrypto.so.1.0.0
Backtrace stopped: previous frame inner to this frame (corrupt stack?)
Tadaaa! We are in the middle of a malloc() call when SIGALRM fires
for imap idle. The handler then calls malloc() again and deadlocks
on the allocator lock that the interrupted call already holds.
-> never ever put complex code in signal handlers.
Only set a volatile sig_atomic_t flag and be done with it.
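
To illustrate the "only set a flag" rule, here is a minimal sketch.
It is not taken from the Cyrus sources: alarm_pending, on_alarm(),
install_alarm_handler(), take_alarm() and alarm_flag_demo() are
hypothetical names. The handler does nothing but store to a
sig_atomic_t, and the real work would run later from the normal
main loop, where calling malloc() is safe.

```c
#define _POSIX_C_SOURCE 200809L
#include <signal.h>
#include <string.h>

/* Hypothetical flag; written only by the handler and take_alarm(). */
static volatile sig_atomic_t alarm_pending = 0;

/* Handler: a single store to a sig_atomic_t.
 * No malloc(), no stdio, no locks. */
static void on_alarm(int sig)
{
    (void)sig;
    alarm_pending = 1;
}

/* Install the handler for SIGALRM. Returns 0 on success, -1 on error. */
static int install_alarm_handler(void)
{
    struct sigaction sa;

    memset(&sa, 0, sizeof(sa));
    sa.sa_handler = on_alarm;
    sigemptyset(&sa.sa_mask);
    sa.sa_flags = 0;
    return sigaction(SIGALRM, &sa, NULL);
}

/* Called from the main loop, outside signal context.
 * Returns 1 if the alarm fired since the last call, else 0. */
static int take_alarm(void)
{
    if (!alarm_pending)
        return 0;
    alarm_pending = 0;
    return 1;
}

/* Tiny demo: in a single-threaded program raise() does not return
 * until the handler has run, so the flag is set afterwards. */
static int alarm_flag_demo(void)
{
    if (install_alarm_handler() != 0)
        return -1;
    raise(SIGALRM);
    return take_alarm();
}
```

In an imapd-like loop, the code that currently runs inside the
handler would instead run wherever take_alarm() returns 1.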
After I killed process 3301, all the other processes resumed operation as normal.
The good news: This specific deadlock shouldn't happen anymore in 2.5+
as the idle code was refactored a few years ago:
Author: Greg Banks <gnb at fastmail.fm>
Date: Fri Mar 23 17:27:32 2012 +1100
idle: don't use signals, use AF_UNIX dgrams
Communications back from idled to imapds are via a message sent on the
AF_UNIX socket. The IDLE command is implemented as a select() loop, and
there's absolutely nothing that needs to be done in signal handler
context. Best of all, no more unexpected delivery of SIGUSR1 or
SIGUSR2, assassinating innocent bystander processes.
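
For readers who have not seen the pattern, the following is a rough
sketch of what "a select() loop on an AF_UNIX datagram socket" can
look like. It is not the actual 2.5 code: wait_for_notify() and
notify_demo() are hypothetical names, and a socketpair() stands in
for the real socket between idled and imapd. The point is that the
wakeup arrives as ordinary readable data, so nothing has to run in
signal handler context.

```c
#include <sys/select.h>
#include <sys/socket.h>
#include <sys/types.h>
#include <unistd.h>

/* Wait up to timeout_sec seconds for one datagram on fd.
 * Returns 1 if a message was read, 0 on timeout, -1 on error. */
static int wait_for_notify(int fd, int timeout_sec, char *buf, size_t buflen)
{
    fd_set rfds;
    struct timeval tv;
    int r;

    FD_ZERO(&rfds);
    FD_SET(fd, &rfds);
    tv.tv_sec = timeout_sec;
    tv.tv_usec = 0;

    r = select(fd + 1, &rfds, NULL, NULL, &tv);
    if (r <= 0)
        return r;               /* 0: timeout, -1: error (maybe EINTR) */

    if (recv(fd, buf, buflen, 0) < 0)
        return -1;
    return 1;
}

/* Demo with a socketpair standing in for the idled <-> imapd socket.
 * If send_first is nonzero, one datagram is queued before waiting. */
static int notify_demo(int send_first)
{
    int sv[2];
    char buf[16];
    int got;

    if (socketpair(AF_UNIX, SOCK_DGRAM, 0, sv) != 0)
        return -1;

    if (send_first && send(sv[0], "x", 1, 0) != 1) {
        close(sv[0]);
        close(sv[1]);
        return -1;
    }

    /* Zero timeout: just poll, so the demo never blocks. */
    got = wait_for_notify(sv[1], 0, buf, sizeof(buf));

    close(sv[0]);
    close(sv[1]);
    return got;
}
```

A real IDLE loop would also put the client connection into the same
fd_set, so that both client input and mailbox notifications are
handled from one place in normal context.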
@Ken: The keep_alive() function in httpd.c (CalDAV)
probably suffers from the same signal handler issue.