Debugging Deadlocks

Tue Nov 19 15:56:49 EST 2019

Hello,

I run Cyrus IMAP 3.0.x with some private changes.

Sometimes when I stop the master process, it uses 100% of one CPU core for 5 minutes.  After the fifth minute, systemd
enforces kill -9.  When I attach to the master process, I see that some janitor does some work, but I have not checked
the details.  Has anybody experienced this?

I have very few users, but one of the users (me) uses many clients simultaneously.  Let's say two IMAP clients, making
4-6 connections in parallel, and three CalDAV clients, making an estimated 3-6 connections in parallel.  The httpd
process is behind a proxy, and most of the time the proxy server manages to serialize the requests, so in fact a single
httpd process handles them.  At least, under normal circumstances I do not see a second running httpd process.  Under
normal circumstances I also see a single lmtpd process and many imapd processes.

On some days I observe that the IMAP client cannot obtain the list of new messages; it just times out.  This could be
because of deadlocks, but it could also be because on that particular day the IO is extremely slow and thus the problem
is not within Cyrus.  Sometimes I observe afterwards that the INBOX index is being rebuilt.  Sometimes, after the INBOX
index is rebuilt, things start working.

So on such days I suspect that there is some deadlock.  Let's say, if there are two or more long-running lmtpd
processes, then I suspect a deadlock.  What approach can I use to find where the deadlock is, and how can I get rid of
it?

I can attach to a process with strace, get the current backtrace and variable values with gdb, and see (e.g. with
lsof) which files are opened in which mode.  But I do not know what to look for.  Or rather, when I try investigating,
I almost always see a process rebuilding my INBOX index, and after waiting, waiting, waiting, eventually the index is
rebuilt.  How can I find out why the index was broken?
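
One possible starting point, sketched here under assumptions rather than as a known Cyrus procedure: on Linux,
/proc/locks lists the fcntl and flock file locks currently held, and marks blocked lock requests with "->".  Assuming
the suspected deadlock involves such file locks, a small script could show which PIDs hold a lock and which PIDs wait
on the same file.  The function names parse_locks, contended and report are hypothetical helpers made up for this
illustration.

```python
from collections import defaultdict


def parse_locks(text):
    """Parse the text of /proc/locks.

    Returns {file_id: {"holders": [pid, ...], "waiters": [pid, ...]}},
    where file_id is the "major:minor:inode" field of each line.
    """
    locks = defaultdict(lambda: {"holders": [], "waiters": []})
    for line in text.splitlines():
        fields = line.split()
        if len(fields) < 8:
            continue  # skip empty or unexpected lines
        # A blocked request has "->" right after the lock id.
        blocked = fields[1] == "->"
        if blocked:
            del fields[1]
        # Now: id, class, ADVISORY/MANDATORY, READ/WRITE, pid, maj:min:inode, start, end
        pid, file_id = int(fields[4]), fields[5]
        locks[file_id]["waiters" if blocked else "holders"].append(pid)
    return dict(locks)


def contended(locks):
    """Keep only the files that have at least one blocked waiter."""
    return {f: v for f, v in locks.items() if v["waiters"]}


def report(path="/proc/locks"):
    """Print holders and waiters for every contended file (Linux only)."""
    with open(path) as fh:
        for file_id, v in contended(parse_locks(fh.read())).items():
            print(file_id, "held by", v["holders"], "waited on by", v["waiters"])
```

Calling report() as root while a client is timing out would print one line per contended file.  The inode in the
file id can then be matched against the lsof output for the holding PID, and gdb attached to that PID to see what it
is doing while it holds the lock.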

Greetings
  Дилян