The master janitor goes crazy / Re: Debugging Deadlocks

ellie timoney ellie at fastmail.com
Tue Dec 3 22:22:11 EST 2019


So, using my strace output from the other week as an example:

> pselect6(13, [8 9 11 12], NULL, NULL, NULL, {[], 8}) = 1 (in [11])

The arguments here are:

* nfds: 13 (12+1)
* readfds: [8 9 11 12]
* writefds: NULL
* exceptfds: NULL
* timeout: NULL
* sigmask: {[], 8}

The interesting bits are: 

* we don't have a timeout (so this pselect would block forever if nothing became ready)
* we're only waiting for fds to become readable (not writeable or having exceptions)
* we don't have a sigmask set (empty array of 8-byte objects)

The return value of 1 means that 1 of the fds was ready, and I surmise that "(in [11])" is telling us that it was fd 11 from the readfds set that was ready (for reading).

The fact these pselect calls are all the same tells me that either: a lot is happening on fd 11 and we're not keeping up, or that there's data waiting on fd 11 and we keep ignoring it (so it keeps telling us it's there).

The gdb backtrace isn't really useful here I don't think, I think it's coincidental that when we each attached a debugger we both happened to be at that particular line in child_janitor.  Once we're in shutdown, child_janitor is the only thing doing much work, and that line is the top of its loop.

I think the really useful information to collect next time this happens (and while the master process is still running) is:

* What does lsof tell us about that ready file descriptor (in the example, fd 11)?  I would be very interested to know if it's a client socket, or an internal messaging socket (that service processes use to tell master their status).

* If you can attach a debugger and step through a couple of iterations of master's big "for (;;) {" loop, what path is it taking?  What decisions is it making?

* Without the debugger, if you let it run like this for 30 seconds or more, does a syslog line like this appear? https://github.com/cyrusimap/cyrus-imapd/blob/96d194de82d3dbe124a359069bd21f5cba7519ba/master/master.c#L1240-L1244

Cheers,

ellie

On Mon, Dec 2, 2019, at 11:18 PM, Дилян Палаузов wrote:
> Hello Ellie,
> 
> this is exactly what I see (countless pselect calls), but I have as 
> second parameter of pselect a much larger array.  I
> just observed that on killing master, it terminates all cyrus processes 
> but two (httpd and notifyd).  Then I try to
> connect to that processes (gdb).  This does not work, however, since 
> the processes are moved to zombie status.
> 
> Greetings
>   Дилян
> 
> On Thu, 2019-11-28 at 10:34 +1100, ellie timoney wrote:
> > Saw something similar just now when I killed a cassandane run off prematurely. One cyrus master process wound up spinning like this:
> > 
> > pselect6(13, [8 9 11 12], NULL, NULL, NULL, {[], 8}) = 1 (in [11])
> > pselect6(13, [8 9 11 12], NULL, NULL, NULL, {[], 8}) = 1 (in [11])
> > pselect6(13, [8 9 11 12], NULL, NULL, NULL, {[], 8}) = 1 (in [11])
> > pselect6(13, [8 9 11 12], NULL, NULL, NULL, {[], 8}) = 1 (in [11])
> > pselect6(13, [8 9 11 12], NULL, NULL, NULL, {[], 8}) = 1 (in [11])
> > pselect6(13, [8 9 11 12], NULL, NULL, NULL, {[], 8}) = 1 (in [11])
> > pselect6(13, [8 9 11 12], NULL, NULL, NULL, {[], 8}) = 1 (in [11])
> > pselect6(13, [8 9 11 12], NULL, NULL, NULL, {[], 8}) = 1 (in [11])
> > pselect6(13, [8 9 11 12], NULL, NULL, NULL, {[], 8}) = 1 (in [11])
> > 
> > 0x0000555ac7124a97 in child_janitor (now=...) at master/master.c:1221
> > 1221	        janitor_position = janitor_position % child_table_size;
> > (gdb) bt
> > #0  0x0000555ac7124a97 in child_janitor (now=...) at master/master.c:1221
> > #1  0x0000555ac712a67a in main (argc=10, argv=0x7ffdc1fe78b8)
> >     at master/master.c:2812
> > 
> > Haven't dug further yet, but it looks similar to your report
> > 
> > On Wed, Nov 27, 2019, at 9:17 AM, ellie timoney wrote:
> > > Can you strace the master process next time it's spinning at 100%?  
> > > What is it doing at that time?
> > > 
> > > On Tue, Nov 26, 2019, at 1:29 AM, Дилян Палаузов wrote:
> > > > Hello,
> > > > 
> > > > > I run cyrus imap 3.0.x with some private changes.
> > > > > 
> > > > > Sometimes when stop the master process, the master process utilizes one CPU core to 100% for 5 minutes.  After the fifth
> > > > > minute, systemd enforces kill -9. When I attach to the maste process, I see that it some janitor does some work, but I
> > > > > have not checked the details.  Has anybody experienced this?
> > > > 
> > > > I run cyrus imap.   At some moment I recompile and reinstall the 
> > > > binaries, which in theory means that the binaries
> > > > detect this change and restart theirselves.  At some moment I call 
> > > > "systemctl stop cyrus-imap" which I guess sends
> > > > SIGTERM to the master process.   Then the CPU utilization of the master 
> > > > process goes to 100%.  In the systemd service
> > > > file I have TimeoutStopSec=320 . After this time, the master process 
> > > > continues running and systemd sends 9/SIGKILL.  It
> > > > is not necessary that on re-installing the binaries, and then shutting 
> > > > down the CPU goes to 100%: it is possible that
> > > > the CPU goes to 100%, without reinstalling (and thus triggering 
> > > > self-restarting) of the imapd/httpd binaries.
> > > > 
> > > > It is often, but not always, that this 100% CPU loop is entered on shutdown.
> > > > 
> > > > I have a webmail client and to speedup things it uses SquirrelMail's 
> > > > IMAP Proxy (http://www.imapproxy.org/ a Caching
> > > > IMAP proxy).  It is recommended in the installation manual of 
> > > > Horde/IMP.  The IMAP caching proxy connects to
> > > > 127.0.0.2:143 (and is therefore permitted to skip the TLS overload).  
> > > > In master conf I have a line
> > > > “imaplocal     cmd="imapd -C /usr/local/etc/cyrus/imapdlocal.conf" 
> > > > listen="127.0.0.2:imap" prefork=0”.
> > > > 
> > > > When the CPU goes to 100% on shutdown I connect with gdb to the master 
> > > > process.  Below is the full backtrace.  Does
> > > > somebody have an explanation why the master process enters a never 
> > > > ending loop?
> > > > 
> > > > I do not say that all above information has to be involved in the 
> > > > anwer.  Has somebody else experienced this effects? 
> > > > Any suggestions how to investigate this deeper?
> > > > 
> > > > Greetings
> > > >   Дилян
> > > > 
> > > > ---
> > > > warning: Could not load vsyscall page because no executable was 
> > > > specified
> > > > Reading symbols from /usr/local/libexec/master...
> > > > Attaching to program: /usr/local/libexec/master, process 9247
> > > > Reading symbols from /usr/local/lib/libcyrus_min.so.0...
> > > > Reading symbols from /lib/libuuid.so.1...
> > > > Reading symbols from /usr/local/lib/libgssapi_krb5.so.2...
> > > > Reading symbols from /usr/local/lib/libkrb5.so.3...
> > > > Reading symbols from /usr/local/lib/libk5crypto.so.3...
> > > > Reading symbols from /usr/local/lib/libcom_err.so.3...
> > > > Reading symbols from /usr/local/lib/libkrb5support.so.0...
> > > > Reading symbols from /usr/local/lib/libpcreposix.so.0...
> > > > (No debugging symbols found in /usr/local/lib/libpcreposix.so.0)
> > > > Reading symbols from /usr/local/lib/libpcre.so.1...
> > > > (No debugging symbols found in /usr/local/lib/libpcre.so.1)
> > > > Reading symbols from /usr/local/lib/libxml2.so.2...
> > > > Reading symbols from /usr/local/lib/liblzma.so.5...
> > > > (No debugging symbols found in /usr/local/lib/liblzma.so.5)
> > > > Reading symbols from /usr/local/lib/libical.so.3...
> > > > Reading symbols from /usr/local/lib/libicalss.so.3...
> > > > Reading symbols from /usr/local/lib/libicalvcal.so.3...
> > > > Reading symbols from /usr/local/lib/libicui18n.so.63...
> > > > Reading symbols from /usr/local/lib/libicuuc.so.63...
> > > > Reading symbols from /usr/local/lib/libicudata.so.63...
> > > > (No debugging symbols found in /usr/local/lib/libicudata.so.63)
> > > > Reading symbols from /usr/local/lib/libsqlite3.so.0...
> > > > (No debugging symbols found in /usr/local/lib/libsqlite3.so.0)
> > > > Reading symbols from /usr/local/lib/libz.so.1...
> > > > (No debugging symbols found in /usr/local/lib/libz.so.1)
> > > > Reading symbols from /lib64/libm.so.6...
> > > > Reading symbols from /lib64/libdl.so.2...
> > > > Reading symbols from /lib64/libpthread.so.0...
> > > > [Thread debugging using libthread_db enabled]
> > > > Using host libthread_db library "/lib64/libthread_db.so.1".
> > > > Reading symbols from /lib64/libc.so.6...
> > > > Reading symbols from /lib64/ld-linux-x86-64.so.2...
> > > > Reading symbols from /lib64/libresolv.so.2...
> > > > Reading symbols from /usr/local/lib/libdb-18.1.so...
> > > > Reading symbols from /usr/local/lib64/libstdc++.so.6...
> > > > Reading symbols from /usr/local/lib64/libgcc_s.so.1...
> > > > Reading symbols from /usr/local/lib64/libssl.so.1.1...
> > > > Reading symbols from /usr/local/lib64/libcrypto.so.1.1...
> > > > Reading symbols from /lib64/libnss_db.so.2...
> > > > Reading symbols from /lib64/libnss_files.so.2...
> > > > Reading symbols from /lib64/libnss_dns.so.2...
> > > > 0x0000000000405406 in child_janitor (now=...) at master/master.c:1192
> > > > 1192	        janitor_position = janitor_position % child_table_size;
> > > > ™(gdb) bt f
> > > >   Id   Target Id                                 Frame 
> > > > * 1    Thread 0x7f6a08759780 (LWP 9247) "master" 0x0000000000405406 in 
> > > > child_janitor (now=...) at master/master.c:1192
> > > > #0  0x0000000000405406 in child_janitor (now=...) at 
> > > > master/master.c:1192
> > > >         i = 9299
> > > >         p = 0x4132e0 <ctable+16224>
> > > >         c = 0x0
> > > > #1  0x0000000000409dd7 in main (argc=4, argv=0x7ffea3075108) at 
> > > > master/master.c:2600
> > > >         i = 14
> > > >         ready_fds = 3
> > > >         total_children = 11
> > > >         tv = {
> > > >           tv_sec = 0,
> > > >           tv_usec = 0
> > > >         }
> > > >         msg = {
> > > >           message = 1,
> > > >           service_pid = 28219
> > > >         }
> > > >         maxfd = 41
> > > >         tvptr = 0x0
> > > >         interrupted = 0
> > > >         pidfile = 0x40c4f0 "/var/run/cyrus-master.pid"
> > > >         pidfile_lock = 0x2135ba0 "/usr/local/etc/cyrus/imapdlocal.conf"
> > > >         startup_pipe = {6, 7}
> > > >         pidlock_fd = -1
> > > >         i = 14
> > > >         opt = -1
> > > >         close_std = 1
> > > >         daemon_mode = 1
> > > >         error_log = 0x0
> > > >         alt_config = 0x0
> > > >         fd = 3
> > > >         rfds = {
> > > >           fds_bits = {266272, 0 <repeats 15 times>}
> > > >         }
> > > >         r = 1
> > > >         now = {
> > > >           tv_sec = 1574690925,
> > > >           tv_usec = 958878
> > > >         }
> > > >         p = 0x0
> > > > quit
> > > > Detaching from program: /usr/local/libexec/master, process 9247
> > > > [Inferior 1 (process 9247) detached]
> > > > 
> > > > 
> > > > 
> > > > 
> 
>


More information about the Cyrus-devel mailing list