The master janitor goes crazy / Re: Debugging Deadlocks

Дилян Палаузов dilyan.palauzov at aegee.org
Mon Dec 2 07:18:13 EST 2019


Hello Ellie,

this is exactly what I see (countless pselect calls), but I have as second parameter of pselect a much larger array.  I
just observed that on killing master, it terminates all cyrus processes but two (httpd and notifyd).  Then I try to
connect to that processes (gdb).  This does not work, however, since the processes are moved to zombie status.

Greetings
  Дилян

On Thu, 2019-11-28 at 10:34 +1100, ellie timoney wrote:
> Saw something similar just now when I killed a cassandane run off prematurely. One cyrus master process wound up spinning like this:
> 
> pselect6(13, [8 9 11 12], NULL, NULL, NULL, {[], 8}) = 1 (in [11])
> pselect6(13, [8 9 11 12], NULL, NULL, NULL, {[], 8}) = 1 (in [11])
> pselect6(13, [8 9 11 12], NULL, NULL, NULL, {[], 8}) = 1 (in [11])
> pselect6(13, [8 9 11 12], NULL, NULL, NULL, {[], 8}) = 1 (in [11])
> pselect6(13, [8 9 11 12], NULL, NULL, NULL, {[], 8}) = 1 (in [11])
> pselect6(13, [8 9 11 12], NULL, NULL, NULL, {[], 8}) = 1 (in [11])
> pselect6(13, [8 9 11 12], NULL, NULL, NULL, {[], 8}) = 1 (in [11])
> pselect6(13, [8 9 11 12], NULL, NULL, NULL, {[], 8}) = 1 (in [11])
> pselect6(13, [8 9 11 12], NULL, NULL, NULL, {[], 8}) = 1 (in [11])
> 
> 0x0000555ac7124a97 in child_janitor (now=...) at master/master.c:1221
> 1221	        janitor_position = janitor_position % child_table_size;
> (gdb) bt
> #0  0x0000555ac7124a97 in child_janitor (now=...) at master/master.c:1221
> #1  0x0000555ac712a67a in main (argc=10, argv=0x7ffdc1fe78b8)
>     at master/master.c:2812
> 
> Haven't dug further yet, but it looks similar to your report
> 
> On Wed, Nov 27, 2019, at 9:17 AM, ellie timoney wrote:
> > Can you strace the master process next time it's spinning at 100%?  
> > What is it doing at that time?
> > 
> > On Tue, Nov 26, 2019, at 1:29 AM, Дилян Палаузов wrote:
> > > Hello,
> > > 
> > > > I run cyrus imap 3.0.x with some private changes.
> > > > 
> > > > Sometimes when stop the master process, the master process utilizes one CPU core to 100% for 5 minutes.  After the fifth
> > > > minute, systemd enforces kill -9. When I attach to the maste process, I see that it some janitor does some work, but I
> > > > have not checked the details.  Has anybody experienced this?
> > > 
> > > I run cyrus imap.   At some moment I recompile and reinstall the 
> > > binaries, which in theory means that the binaries
> > > detect this change and restart theirselves.  At some moment I call 
> > > "systemctl stop cyrus-imap" which I guess sends
> > > SIGTERM to the master process.   Then the CPU utilization of the master 
> > > process goes to 100%.  In the systemd service
> > > file I have TimeoutStopSec=320 . After this time, the master process 
> > > continues running and systemd sends 9/SIGKILL.  It
> > > is not necessary that on re-installing the binaries, and then shutting 
> > > down the CPU goes to 100%: it is possible that
> > > the CPU goes to 100%, without reinstalling (and thus triggering 
> > > self-restarting) of the imapd/httpd binaries.
> > > 
> > > It is often, but not always, that this 100% CPU loop is entered on shutdown.
> > > 
> > > I have a webmail client and to speedup things it uses SquirrelMail's 
> > > IMAP Proxy (http://www.imapproxy.org/ a Caching
> > > IMAP proxy).  It is recommended in the installation manual of 
> > > Horde/IMP.  The IMAP caching proxy connects to
> > > 127.0.0.2:143 (and is therefore permitted to skip the TLS overload).  
> > > In master conf I have a line
> > > “imaplocal     cmd="imapd -C /usr/local/etc/cyrus/imapdlocal.conf" 
> > > listen="127.0.0.2:imap" prefork=0”.
> > > 
> > > When the CPU goes to 100% on shutdown I connect with gdb to the master 
> > > process.  Below is the full backtrace.  Does
> > > somebody have an explanation why the master process enters a never 
> > > ending loop?
> > > 
> > > I do not say that all above information has to be involved in the 
> > > anwer.  Has somebody else experienced this effects? 
> > > Any suggestions how to investigate this deeper?
> > > 
> > > Greetings
> > >   Дилян
> > > 
> > > ---
> > > warning: Could not load vsyscall page because no executable was 
> > > specified
> > > Reading symbols from /usr/local/libexec/master...
> > > Attaching to program: /usr/local/libexec/master, process 9247
> > > Reading symbols from /usr/local/lib/libcyrus_min.so.0...
> > > Reading symbols from /lib/libuuid.so.1...
> > > Reading symbols from /usr/local/lib/libgssapi_krb5.so.2...
> > > Reading symbols from /usr/local/lib/libkrb5.so.3...
> > > Reading symbols from /usr/local/lib/libk5crypto.so.3...
> > > Reading symbols from /usr/local/lib/libcom_err.so.3...
> > > Reading symbols from /usr/local/lib/libkrb5support.so.0...
> > > Reading symbols from /usr/local/lib/libpcreposix.so.0...
> > > (No debugging symbols found in /usr/local/lib/libpcreposix.so.0)
> > > Reading symbols from /usr/local/lib/libpcre.so.1...
> > > (No debugging symbols found in /usr/local/lib/libpcre.so.1)
> > > Reading symbols from /usr/local/lib/libxml2.so.2...
> > > Reading symbols from /usr/local/lib/liblzma.so.5...
> > > (No debugging symbols found in /usr/local/lib/liblzma.so.5)
> > > Reading symbols from /usr/local/lib/libical.so.3...
> > > Reading symbols from /usr/local/lib/libicalss.so.3...
> > > Reading symbols from /usr/local/lib/libicalvcal.so.3...
> > > Reading symbols from /usr/local/lib/libicui18n.so.63...
> > > Reading symbols from /usr/local/lib/libicuuc.so.63...
> > > Reading symbols from /usr/local/lib/libicudata.so.63...
> > > (No debugging symbols found in /usr/local/lib/libicudata.so.63)
> > > Reading symbols from /usr/local/lib/libsqlite3.so.0...
> > > (No debugging symbols found in /usr/local/lib/libsqlite3.so.0)
> > > Reading symbols from /usr/local/lib/libz.so.1...
> > > (No debugging symbols found in /usr/local/lib/libz.so.1)
> > > Reading symbols from /lib64/libm.so.6...
> > > Reading symbols from /lib64/libdl.so.2...
> > > Reading symbols from /lib64/libpthread.so.0...
> > > [Thread debugging using libthread_db enabled]
> > > Using host libthread_db library "/lib64/libthread_db.so.1".
> > > Reading symbols from /lib64/libc.so.6...
> > > Reading symbols from /lib64/ld-linux-x86-64.so.2...
> > > Reading symbols from /lib64/libresolv.so.2...
> > > Reading symbols from /usr/local/lib/libdb-18.1.so...
> > > Reading symbols from /usr/local/lib64/libstdc++.so.6...
> > > Reading symbols from /usr/local/lib64/libgcc_s.so.1...
> > > Reading symbols from /usr/local/lib64/libssl.so.1.1...
> > > Reading symbols from /usr/local/lib64/libcrypto.so.1.1...
> > > Reading symbols from /lib64/libnss_db.so.2...
> > > Reading symbols from /lib64/libnss_files.so.2...
> > > Reading symbols from /lib64/libnss_dns.so.2...
> > > 0x0000000000405406 in child_janitor (now=...) at master/master.c:1192
> > > 1192	        janitor_position = janitor_position % child_table_size;
> > > ™(gdb) bt f
> > >   Id   Target Id                                 Frame 
> > > * 1    Thread 0x7f6a08759780 (LWP 9247) "master" 0x0000000000405406 in 
> > > child_janitor (now=...) at master/master.c:1192
> > > #0  0x0000000000405406 in child_janitor (now=...) at 
> > > master/master.c:1192
> > >         i = 9299
> > >         p = 0x4132e0 <ctable+16224>
> > >         c = 0x0
> > > #1  0x0000000000409dd7 in main (argc=4, argv=0x7ffea3075108) at 
> > > master/master.c:2600
> > >         i = 14
> > >         ready_fds = 3
> > >         total_children = 11
> > >         tv = {
> > >           tv_sec = 0,
> > >           tv_usec = 0
> > >         }
> > >         msg = {
> > >           message = 1,
> > >           service_pid = 28219
> > >         }
> > >         maxfd = 41
> > >         tvptr = 0x0
> > >         interrupted = 0
> > >         pidfile = 0x40c4f0 "/var/run/cyrus-master.pid"
> > >         pidfile_lock = 0x2135ba0 "/usr/local/etc/cyrus/imapdlocal.conf"
> > >         startup_pipe = {6, 7}
> > >         pidlock_fd = -1
> > >         i = 14
> > >         opt = -1
> > >         close_std = 1
> > >         daemon_mode = 1
> > >         error_log = 0x0
> > >         alt_config = 0x0
> > >         fd = 3
> > >         rfds = {
> > >           fds_bits = {266272, 0 <repeats 15 times>}
> > >         }
> > >         r = 1
> > >         now = {
> > >           tv_sec = 1574690925,
> > >           tv_usec = 958878
> > >         }
> > >         p = 0x0
> > > quit
> > > Detaching from program: /usr/local/libexec/master, process 9247
> > > [Inferior 1 (process 9247) detached]
> > > 
> > > 
> > > 
> > > 



More information about the Cyrus-devel mailing list