possible self-deadlock in idle signal handler

Sat Mar 28 09:37:56 EDT 2009

We're experiencing some problems, particularly with a small number of 
users, which manifest themselves in the dreaded "one deadlocked, 
hundreds waiting" process logjam.  The keystone process appears to be 
an imapd deadlocked on itself in this manner (this is Solaris 9):

-> pstack 19090
19090:  imapd
  febc5994 lwp_park (0, 0, 0)
  febc206c slow_lock (fecc05a8, feba0000, 0, fecbc000, 14, 0) + 58
  fec46e70 malloc   (c, 0, 13d668, 13d66c, 28cc, 13d790) + 18
  00078ac0 xmalloc  (c, 13d790, 0, 0, 0, 0) + 4
  00074a64 lock_or_refresh (13d660, 1364b4, 107400, 0, 0, 0) + 10c
  00074d50 myfetch  (13d660, 1bbe58, 10, ffbfb25c, ffbfb254, 1364b4) + 44
  00060d74 seen_readit (1364a0, ffbfb2ec, ffbfb2e8, 1252bc, ffbfb2e4, 1) + 60
  0003d0c4 index_checkseen (123a00, 0, 0, 603, 1e5a4c, 87fd0) + 4c
  0003e298 index_check (123a00, 0, 1, 125000, ffbfc370, 125000) + 234
  0002c574 idle_update (3, 0, 0, 0, 0, 0) + 24
  0005f7cc idle_handler (e, 0, ffbfcb20, 0, 0, 0) + 5c
  febc5bac __sighndlr (e, 0, ffbfcb20, 5f770, 0, 0) + c
  febbf804 call_user_handler (e, 0, ffbfcb20, 0, 0, 0) + 234
  febbf9b4 sigacthandler (e, 0, ffbfcb20, 8, 1bd7c0, 0) + 64
  --- called from signal handler with signal 14 (SIGALRM) ---
  fec470d4 _malloc_unlocked (64, 0, 0, fecbc000, 0, 0) + 240
  fec46e78 malloc   (64, ff0a07d0, a3, 1c4d0d, db, 6d) + 20
  fefc5820 default_malloc_ex (64, ff0b17b0, ca, ca, 0, ffe43088) + 20
  fefc61e4 CRYPTO_malloc (0, ff0b17b0, ca, 1bcff0, 1bcf78, 1bcf78) + 84
  ff036efc EVP_DigestInit_ex (ffbfd150, ff0dfbb0, 0, fffffff8, 0, ffbfd1fd) + 13c
  fefdabec HMAC_Init_ex (ffbfd13c, ffbfd150, ffbfd048, ff0dfbb0, 0, 0) + cc
  ff160b70 tls1_mac (1bea88, ffbfd288, 0, 20, 0, 1) + 90
  ff15cfa4 ssl3_read_bytes (1bea88, 17, ffbfd288, 8c, 1c4d03, 0) + 524
  ff15a9c4 ssl3_read (1bea88, 13aef0, 1000, 0, 378, 0) + 44
  ff16a30c SSL_read (0, 13aef0, 1000, 0, ffbfd5bc, ffbfd5b1) + 6c
  0006bd5c prot_fill (13ae78, 0, 0, 0, ffbfd5bc, ffbfd428) + ec
  0005e564 getword  (13ae78, 125108, 1, 1a9e0, 2c8dc, 125000) + ac
  0002c8f0 cmd_idle (13d358, 7dc00, 0, 0, 730061, 0) + 2e8
  0002ea6c cmdloop  (0, 1360d8, 8bc60, 8bc60, 123c00, 125000) + df0
  00030d34 service_main (123c00, 132080, ffbffc2c, 0, 1aa50, 11a800) + 180
  0001aaf8 main     (ffbff2b4, 7c000, fa, 27667, 2602e4, 49c71400) + 640
  0001a2ec _start   (0, 0, 0, 0, 0, 0) + 5c

From what I can find online, the problem appears to be that the SSL
stack was in the middle of a malloc() call when the SIGALRM went off.
The signal handler then tried to open the seen file, which resulted in
a second malloc() call.  That second call requests the process-wide
malloc mutex (part of Solaris's thread internals), but the mutex is
already held by the interrupted first call, which cannot resume until
the handler returns.  The lock request therefore never returns, and the
process is permanently hung while still holding the lock for the
mailbox.
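
For reference, the general pattern I understand to be safe is to have
the handler do nothing except set a flag, and to run the real work
(anything that can reach malloc()) from the main loop once the
interrupted call has finished.  The following is only a minimal sketch
of that idea, not Cyrus code: idle_poll_pending, idle_alarm_handler,
install_idle_alarm, run_pending_idle_update and sample_update are all
hypothetical names, and sample_update merely stands in for whatever
idle_update() does.

```c
/* Ask for the POSIX declarations of sigaction() on strict compilers. */
#define _POSIX_C_SOURCE 200809L

#include <signal.h>
#include <string.h>

/* Set by the signal handler, consumed by the main loop.
 * volatile sig_atomic_t is the type the C standard allows a
 * handler to write safely. */
static volatile sig_atomic_t idle_poll_pending = 0;

/* Hypothetical handler: it only records that the alarm fired.
 * It makes no library calls, so it cannot re-enter malloc(). */
static void idle_alarm_handler(int sig)
{
    (void)sig;
    idle_poll_pending = 1;
}

/* Install the handler for SIGALRM.  SA_RESTART is left unset so that
 * a blocking read() is interrupted with EINTR and the caller gets a
 * chance to look at the flag.  Returns 0 on success, -1 on error. */
static int install_idle_alarm(void)
{
    struct sigaction sa;

    memset(&sa, 0, sizeof(sa));
    sa.sa_handler = idle_alarm_handler;
    sigemptyset(&sa.sa_mask);
    sa.sa_flags = 0;
    return sigaction(SIGALRM, &sa, NULL);
}

/* Stand-in for the real mailbox check (assumption, for illustration). */
static int updates_run = 0;
static void sample_update(void)
{
    updates_run++;
}

/* Called from the main loop, outside any signal handler.
 * Runs the update if the alarm fired since the last call.
 * Returns 1 if the update ran, 0 otherwise. */
static int run_pending_idle_update(void (*update)(void))
{
    if (!idle_poll_pending)
        return 0;
    idle_poll_pending = 0;
    update();
    return 1;
}
```

The main loop would call run_pending_idle_update() each time its read
returns, including when it fails with EINTR.  How that interacts with
SSL_read() retrying internally is something I have not verified, so
treat this as an outline rather than a patch.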

Would anyone happen to have any tips on getting out from under this?

Thanks,
Michael Bacon
ITS Messaging
UNC Chapel Hill