Build failed in Jenkins: cyrus-imapd-master #410

Thu Feb 23 22:40:13 EST 2012

On Thu, Feb 23, 2012, at 05:07 AM, Jenkins wrote:
> See <http://ci.cyrusimap.org/job/cyrus-imapd-master/410/>
> 
> ------------------------------------------
> [...]
> + exitcode=1
> + '[' -x jenkins-xml-summary.pl ']'
> + ./jenkins-xml-summary.pl
> --build-url=http://ci.cyrusimap.org/job/cyrus-imapd-master/410/
> Test failures and errors summary
> ================================

Ok, so the good news first.

 * these test failure report URLs are now correct

 * the stray process killer in autobuild.sh is working

Now for the test failures.  Thirty-four of these thirty-five appear to be of the nature

> test_dup_keep_keep(Cassandane::Cyrus::Sieve)
>      Perl exception: Some process is already listening on 127.0.0.1:9101

which AFAICS are all cascading failures due to Cyrus failing to shut down correctly here:

> 
> Cassandane::Cyrus::Bug3072.copy_longset
>     http://ci.cyrusimap.org/job/cyrus-imapd-master/410//testReport/%28root%29/Cassandane__Cyrus__Bug3072/test_copy_longset/

which is thorny one.  The error message is

Error Message

Perl exception: Cannot shut down master pid 14360     

Stacktrace

test_copy_longset(Cassandane::Cyrus::Bug3072)
     Perl exception: Cannot shut down master pid 14360
     at Cassandane/Instance.pm line 831
    	Cassandane::Instance::stop('Cassandane::Instance=HASH(0x1db7eb80)') called at Cassandane/Cyrus/TestCase.pm line 255
    	Cassandane::Cyrus::TestCase::tear_down('Cassandane::Cyrus::Bug3072=HASH(0x1c2642d0)') called at Cassandane/Cyrus/Bug3072.pm line 64
    	Cassandane::Cyrus::Bug3072::tear_down('Cassandane::Cyrus::Bug3072=HASH(0x1c2642d0)') called at /usr/lib/perl5/vendor_perl/5.8.8/Test/Unit/TestCase.pm line 65

So Cassandane::Instance::_stop_pid() returned false.  This happens because it fails to shut
down the master process gracefully, where "graceful" means the master terminates within about
20 seconds of receiving SIGQUIT.

We can see the sequence in syslog

Feb 23 05:04:55 hudson-01 cassandane: =====> Instance[826] stop 
Feb 23 05:04:55 hudson-01 cassandane: =====> Instance[804] _stop_pid: sending signal 3 to 14360 
Feb 23 05:05:24 hudson-01 cassandane: =====> Instance[811] _stop_pid: failed to shut down pid 14360 with signal 3 
Feb 23 05:05:24 hudson-01 cassandane: =====> Instance[804] _stop_pid: sending signal 4 to 14360 
Feb 23 05:05:24 hudson-01 cassandane: =====> Util::Wait[75] Waited 0.011758 sec for unknown condition 

At 05:04:55, _stop_pid() sent SIGQUIT (signal 3) to the master process.  Twenty-nine seconds
later, it gave up waiting and sent SIGILL (signal 4) in the hope of getting a core file.  Twelve
milliseconds later that hope was fulfilled.

There's nothing else interesting in the log.  In particular, no unusual messages from master.
The previous message which mentions instance starting is

Feb 23 05:03:17 hudson-01 cassandane: =====> Instance[688] start main instance for test test_copy_longset: basedir /var/tmp/cass/1003145

I logged in earlier this morning and grabbed a copy of that directory before the next
build ran (each build removes the leftovers from the previous one and I'm now wondering
whether that's a good thing).

gnb at hudson-01 2002 ls -l 1003145/conf/cores/
total 840
-rw------- 1 gnb gnb 851968 Feb 23 05:05 core.14360

gnb at hudson-01 2004 file 1003145/conf/cores/core.14360 
1003145/conf/cores/core.14360: ELF 64-bit LSB core file AMD x86-64, version 1 (SYSV), SVR4-style, from 'master'

gnb at hudson-01 2005 gdb /var/lib/jenkins/jobs/cyrus-imapd-master/workspace/inst/usr/cyrus/bin/master 1003145/conf/cores/core.14360
...
Program terminated with signal 4, Illegal instruction.
#0  0x0000000000406903 in child_janitor (now=...) at master.c:1060
1060		janitor_position = janitor_position % child_table_size;

(gdb) bt
#0  0x0000000000406903 in child_janitor (now=...) at master.c:1060
#1  0x000000000040c916 in main (argc=12, argv=0x7fffade2ce88) at master.c:2272

At this point I poked around and eventually worked out that master was busy
looping into and out of the main select loop and the janitor code, but the janitor
code itself wasn't stuck in a loop.

The child table contains exactly one entry.

(gdb) p ctable
$1 = {0x0 <repeats 4366 times>, 0x13be10a0, 0x0 <repeats 5633 times>}

(gdb) p *(struct centry *)0x13be10a0
$2 = {pid = 14366, service_state = SERVICE_STATE_READY, janitor_deadline = 0, si = 0, next = 0x0}

According to syslog, pid 14366 was an imapd

Feb 23 05:03:17 hudson-01 1003145/imap[14366]: login: localhost.localdomain [127.0.0.1] admin plaintext User logged in SESSIONID=<1003145-14366-1329991397-1>

This explains why master didn't die, it's waiting for this child imapd to die.  Indeed,
we see the struct service has the correct nactive and ready_workers counts

(gdb) p Services[0]
$5 = {name = 0x13be12c0 "imap", listen = 0x13be1120 "127.0.0.1:9104", proto = 0x13be1280 "tcp", exec = 0x13be1360, babysit = 0, 
  associate = 0, family = 2, socket = 5, stat = {7, 9}, desired_workers = 0, max_workers = 2147483647, maxfds = 256, 
  maxforkrate = 0, ready_workers = 1, nforks = 1, nactive = 1, nconnections = 2, forkrate = 0, nreadyfails = 0, 
  last_interval_start = {tv_sec = 1329991397, tv_usec = 543872}, interval_forks = 1}

So what went wrong?

1078	static void sigquit_handler(int sig __attribute__((unused)))
1079	{
...
1095	    /* Set a flag so main loop knows to shut down when
1096	       all children have exited */
1097	    in_shutdown = 1;

(gdb) p in_shutdown
$6 = 1

So we did receive SIGQUIT and run sigquit_handler, which re-sends SIGQUIT to
master's process group.  That should have killed off imapd.  Because autobuild.sh
now cleans up stray processes, there's no evidence that it failed to shut down.  We
don't have 'auditlog' set in imapd.conf so there's no message in syslog when an
imapd shuts down gracefully, and so no evidence that it succeeded in shutting
down.  What we have in fact is a big question mark.

There's a number of possible failure modes here, but basically either

a) imapd might have died and master failed to notice.

b) imapd might have decided not to die.

So did happen in imapd?  From syslog we see

Feb 23 05:04:55 hudson-01 1003145/imap[14366]: USAGE cassandane user: 2.518617 sys: 1.865716

so telemetry_rusage() was called.  We don't create any shutdown file in this test,
so this means imapd processed a LOGOUT command, which is to be expected as
Cassandane is running the test's tear_down method. This means cmdloop() would
have returned, and service_main() also, back to main() in master/service.c.

At this point, imapd would have sent a MASTER_SERVICE_AVAILABLE status
message to master, which is consistent with the centry being in state
SERVICE_STATE_READY.  Imapd then goes back around and accept()s.  What
probably happens here is that accept() returns with EINTR and signals_poll() is
called.

It seems like signals_poll() should be driving the whole sequence of dealing
with SIGQUIT, by detecting it and calling shut_down(). Here's what it does

107 int signals_poll(void)
108 {
109     switch (gotsignal) {
110     case SIGINT:
111     case SIGQUIT:
112         if (shutdown_cb) shutdown_cb(EC_TEMPFAIL);
113         else exit(EC_TEMPFAIL);
114         break;
115     default:
116         return gotsignal;
117         break;
118     }           
119     return 0; /* compiler warning stupidity */
120 }

and where is gotsignal set?

 55 static volatile sig_atomic_t gotsignal = 0;
 56 
 57 static void sighandler(int sig)
 58 {
 59     /* syslog(LOG_DEBUG, "got signal %d", sig); */
 60     gotsignal = sig;
 61 }

and guess what, sighandler is set up as the signal handler for SIGQUIT,
SIGALRM, SIGHUP and SIGINT.  So we handle only the *last received*
of all those signals.  If SIGQUIT is received, followed by (say) SIGALRM,
we never even notice SIGQUIT. 

Hmm, that's definitely a bug and would explain the observed behaviour,
although it's not the only possibility.  I'll raise a Bugzilla ticket.

-- 
Greg.