Severe issue with high-volume IMAP server stalling, possible patch

Thu Apr 14 22:54:54 EDT 2016

Hi Jens,

I applied this patch (to the current cyrus-imapd-2.5 HEAD) and tried it
out.

It is failing our Cassandane tests, because the test suite is unable to
cleanly shutdown the master process after each test finishes.

Are you able to cleanly shut down the master process with a SIGTERM in
your environment?  Can you cleanly shut one down that has very little
activity on it?

I had a look in the log from the tests, and the "attempting clean
shutdown on signal" message appears from the master process, so it looks
like begin_shutdown() (which emits this message, and sets in_shutdown=1)
is being called correctly, and the signal handling is working okay.  But
for some reason the process no longer exits.

I'm having a look at the way tvptr is set.  If the process is
in_shutdown, then tvptr remains NULL (which tells myselect() to block
indefinitely and never time out).  I'm not sure if this is going to turn
out to be significant here.  I'm also not sure why we do this
differently for this case.

When Cassandane is unable to shut master down cleanly, it kills it hard
with SIGILL ("illegal instruction"), which master doesn't handle, and so
we get (by kernel default) a core to look at.  Looking at the core, it
was blocked in myselect() [actually, __pselect()].  Looking at main()'s
execution frame, interrupted was currently 4 at the time Cassandane
nuked it.  So it would have needed another two less-severe signals for
interrupted to exceed 5 and the loop to end.

What I think is happening is that, once shutdown is initiated, the only
way myselect() can return is if one of the rfds becomes active, or if a
signal arrives.  It will no longer time out.  But, once shutdown is
initiated, at a certain point rfds are no longer becoming active
(because all the children have themselves shut down), and signals are no
longer occurring, so myselect() ends up blocked forever.

Maybe the answer is to break out of the myselect loop after one
iteration if we're in_shutdown, so that we don't block indefinitely
waiting for iterations that will never occur.  I think just changing the
loop condition from 
    while (r == -1)
to
    while (!in_shutdown && r == -1)
might do the trick.  

And I've just tried this: the Cassandane tests now work again, but I
don't know what effect this will have on the problem you've been
experiencing.  My version is here:
https://github.com/elliefm/cyrus-imapd/tree/v25/jens-master-eintr

This behaviour also suggests that a reasonable limit for interruptions
might be one that scales with the number of child processes (rather than
the arbitrary 5), but I don't think this alone would be enough to avoid
the problem I just encountered.  If the remaining children all exit at
the same time, and get processed by a single myselect(), then there's
still n-1 loop iterations before shutdown would occur -- same problem.

Cheers,

ellie

On Tue, Apr 12, 2016, at 07:43 PM, Jens Erat wrote:
> Hi Ellie,
> 
> find the patch attached (developed against Cyrus IMAP 2.5.7 tarball
> release).
> 
> I've additionally removed the unnecessary bookkeeping for our debug
> output and lowered the maximum retries to a less arbitrary 5 (which
> seems still pretty arbitrary). Only retrying once might at least reduce
> response times for some weird setups, we never experienced more than a
> single retry. This might still be subject of discussion and seems
> something to be moved in a constant and/or header file?
> 
> Regards,
> Jens
> 
> Am 12.04.2016 um 01:32 schrieb ellie timoney via Cyrus-devel:
> > Hi Jens,
> > 
> >> after some weeks of running a stable service with this patch, I'd like
> >> to propose merging this (we'd really like to switch back to an unpatched
> >> Cyrus release in future).
> > 
> > This would be good.  Can you attach it as a patch file please?  CC me
> > and I'll get it merged.
> > 
> > Thanks for your hard work figuring this out. :)
> > 
> > Cheers,
> > 
> > ellie
> > 
> > On Mon, Apr 11, 2016, at 05:52 PM, Jens Erat via Cyrus-devel wrote:
> >> Hi all,
> >>
> >> after some weeks of running a stable service with this patch, I'd like
> >> to propose merging this (we'd really like to switch back to an unpatched
> >> Cyrus release in future). This is not only an issue to us, but we know
> >> some other universities running similar Solaris setups and waiting to
> >> upgrade to Cyrus 2.5 because of our issues.
> >>
> >> This is only an issue with Cyrus IMAP 2.5 and probably 3.0; Cyrus 2.4
> >> also does not properly repeat select if it failed, but it's not an issue
> >> here as interrupts can occur anywhere.
> >>
> >> We didn't have those master loop hick ups any more. In fact, the while
> >> loop even seems superfluous; just running myselect again if `r==0` would
> >> be fine, but this might be different with other combinations of
> >> operating system kernels, C libraries and compilers, and the loop seems
> >> the safer thing to do. The limit of 100 retries seems a little bit
> >> arbitrary, but I can't really argument for any number, something around
> >> five should probably also be fine to cover all scenarios that are not
> >> completely broken, anyway.
> >>
> >> We dumped the interrupted and again counters each time the master loop
> >> ran (and several other metrics), and never had more than one EINTR or
> >> EAGAIN in a row (so the next `myselect` immediately returned the FDSET).
> >>
> >> Kind regards from Lake Constance, Germany,
> >> Jens
> >>
> >> Am 10.02.2016 um 13:39 schrieb Jens Erat via Cyrus-devel:
> >>> Hi Bron,
> >>>
> >>> digging deeper, we realized there are still issues with the loop you
> >>> proposed. `select` will only decrement tvptr on Linux (and maybe _some_
> >>> other systems, definitely not all of them), pselect will never do -- you
> >>> might end up waiting forever here if nothing happens.
> >>>
> >>> On the other hand, we partially got a grasp of understanding what was
> >>> going on. Because when `pselect` is available, signalling was disabled
> >>> but during `pselect`, thus with some weird combination of load on the
> >>> different services, each `pselect` returned with `EINTR` and the message
> >>> handling code was never reached
> >>>
> >>> We ended up putting the whole deadline calculation in a queue, and have
> >>> this code live at the moment:
> >>>
> >>>         int interrupted = 0;
> >>>         int again = 0;
> >>>         do {
> >>>             /* how long to wait? - do now so that any scheduled wakeup
> >>>              * calls get accounted for*/
> >>>             gettimeofday(&now, 0);
> >>>             tvptr = NULL;
> >>>             if (schedule && !in_shutdown) {
> >>>                 double delay = timesub(&now, &schedule->mark);
> >>>                 if (!interrupted && delay > 0.0) {
> >>>                     timeval_set_double(&tv, delay);
> >>>                 }
> >>>                 else {
> >>>                     tv.tv_sec = 0;
> >>>                     tv.tv_usec = 0;
> >>>                 }
> >>>                 tvptr = &tv;
> >>>             }
> >>>
> >>> #if defined(HAVE_UCDSNMP) || defined(HAVE_NETSNMP)
> >>>             if (tvptr == NULL) blockp = 1;
> >>>             snmp_select_info(&maxfd, &rfds, tvptr, &blockp);
> >>> #endif
> >>>             errno = 0;
> >>>             r = myselect(maxfd, &rfds, NULL, NULL, tvptr);
> >>>
> >>>             if (r == -1) switch(errno) {
> >>>             /* Try again to get valid rfds, this time without blocking so we
> >>>              * will definitely process messages without getting interrupted
> >>>              * again. */
> >>>             case EINTR:
> >>>                 interrupted++;
> >>>                 if (interrupted > 100) {
> >>>                     syslog(LOG_WARNING, "Repeatedly interrupted, too
> >>> many signals?");
> >>>                     /* Fake a timeout */
> >>>                     r = 0;
> >>>                     FD_ZERO(&rfds);
> >>>                 }
> >>>                 break;
> >>>             /* Try again. */
> >>>             case EAGAIN: again++; break;
> >>>             /* uh oh */
> >>>             default: fatalf(1, "select failed: %m");
> >>>             }
> >>>         } while (r == -1);
> >>>
> >>> We also dump interrupted and again together with some other variables at
> >>> the end of each loop. The server is running fine for two days now, and
> >>> the dumped information looks fine (we never experienced any delays >4
> >>> seconds since patching the server, we often found the master loop not
> >>> completely running for minutes). I will report again if we didn't suffer
> >>> from any interruptions for the next one or two weeks.
> >>>
> >>> Some remarks to the code:
> >>>
> >>> - We need to update the current timestamp `now` for decrementing the
> >>> passed time.
> >>> - `snmp_select_info` or's its FDs to rfds; thus it may be run repeatedly
> >>> to update `tvptr` and not run over any SNMP events.
> >>> - We are compiling without SNMP, thus this code is untested.
> >>> - If interrupted, we try again to receive FDs, but without waiting
> >>> (tvptr set to zero). Because of this, no large interruptions should
> >>> occur -- but we also won't skip over message handling when flooded with
> >>> signals.
> >>> - After 100 retries after being interrupted, we empty rfds and continue
> >>> to work, but print a warning. This is probably subject of discussion and
> >>> _much_ too large: we're currenlty dumping the number of interruptions,
> >>> and never had more than one at all before catching file descriptors (or
> >>> none).
> >>> - `int again` can and should be removed for live code, does not seem
> >>> relevant at all (but for our own debug purpose).
> >>>
> >>> Kind regards from Lake Constance, Germany,
> >>> Jens
> >>>
> >>>
> >>> Am 06.02.2016 um 13:00 schrieb Bron Gondwana:
> >>>> I could buy repeating the whole thing for EINTR but not for EAGAIN.  Obviously everything is fine under non-heavy load no matter what.
> >>>>
> >>>> I'm pretty sure your patch is wrong, because it will process the fdset that the select was supposed to fill even if the select returned an error.
> >>>>
> >>>> You may have convinced me that my patch is wrong as well, and that we have to always check for the in_shutdown and sigquit magic in that tight loop in case it was a signal that needs to be responded to.
> >>>>
> >>>> Bron.
> >>>>
> >>>> On Sat, Feb 6, 2016, at 20:52, Jens Erat wrote:
> >>>>> Hi Bron,
> >>>>>
> >>>>> thank you for the fast reply.
> >>>>>
> >>>>>> One thing I would suggest is running something like nginx in front of
> >>>>> Cyrus, even if you only need it for authentication checks.
> >>>>>
> >>>>> We'll discuss this internally.
> >>>>>
> >>>>>> I believe this is the correct patch. If the select was interrupted
> >>>>> then it needs to be re-run, but there is no need to repeat all the other
> >>>>> work, so:
> >>>>>>
> >>>>>> - r = myselect(maxfd, &rfds, NULL, NULL, tvptr);
> >>>>>> - if (r == -1 && errno == EAGAIN) continue;
> >>>>>> - if (r == -1 && errno == EINTR) continue;
> >>>>>> + do {
> >>>>>> + r = myselect(maxfd, &rfds, NULL, NULL, tvptr);
> >>>>>> + } while (r == -1 && (errno == EAGAIN || errno == EINTR));
> >>>
> >>
> >> -- 
> >> Jens Erat
> >> Universität Konstanz
> >> Kommunikations-, Infomations-, Medienzentrum (KIM)
> >> Abteilung Basisdienste
> >> D-78457 Konstanz
> >> Mail: jens.erat at uni-konstanz.de
> >>
> >> Email had 1 attachment:
> >> + smime.p7s
> >>   7k (application/pkcs7-signature)
> 
> -- 
> Jens Erat
> Universität Konstanz
> Kommunikations-, Infomations-, Medienzentrum (KIM)
> Abteilung Basisdienste
> D-78457 Konstanz
> Mail: jens.erat at uni-konstanz.de
> Email had 2 attachments:
> + master.c-EINTR.patch
>   2k (text/x-patch)
> + smime.p7s
>   7k (application/pkcs7-signature)