Severe issue with high-volume IMAP server stalling, possible patch

ellie timoney ellie at fastmail.com
Wed Apr 27 00:30:53 EDT 2016


Hi again,

> I'll read through the thread again and see if it becomes clear, and
> won't push the patch to the 2.5 branch just yet in case this changes it.

I've read back through the thread, and it looks like our goals here are
to:

   * acknowledge that, with pselect() available, the SIGCHLD, SIGQUIT,
   SIGINT and SIGTERM signals will -only- arrive during a call to
   myselect(), because they are blocked at all other times (see the
   sketch after this list) -- so we must expect a server with lots of
   short-lived processes to be interrupted out of myselect() frequently
   * handle signals as soon as possible after they arrive
   * allow child_janitor() an opportunity to run at least sometimes
   during interrupt-heavy periods
   * minimise useless work

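As a reminder of what that first point means in practice, here's a
minimal sketch of the pselect() pattern (my own illustration, not the
actual master.c code; the real myselect() takes different arguments):

    #include <signal.h>
    #include <stddef.h>
    #include <sys/select.h>
    #include <time.h>

    static sigset_t blocked_mask;    /* held blocked for the whole loop */
    static sigset_t unblocked_mask;  /* the pre-block mask; only restored
                                        inside pselect() */

    static void block_signals(void)
    {
        sigemptyset(&blocked_mask);
        sigaddset(&blocked_mask, SIGCHLD);
        sigaddset(&blocked_mask, SIGQUIT);
        sigaddset(&blocked_mask, SIGINT);
        sigaddset(&blocked_mask, SIGTERM);
        sigprocmask(SIG_BLOCK, &blocked_mask, &unblocked_mask);
    }

    static int myselect(int nfds, fd_set *rfds,
                        const struct timespec *timeout)
    {
        /* pselect() atomically swaps in the unblocked mask while it
         * waits, and restores the blocking mask before returning, so
         * these four signals can only ever be delivered right here */
        return pselect(nfds, rfds, NULL, NULL, timeout, &unblocked_mask);
    }
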
In reading about how select, pselect, signals and EAGAIN interact, I've
noticed something: we currently treat the return value from myselect()
as either success or failure.  We handle a few classes of failure but
treat all successes as identical.

But the return value is actually the number of ready descriptors (with
-1 indicating failure).  We can use this to apply only the post-select
processing that is actually relevant to whatever woke us up.  This
simplifies a few things (a rough sketch of the resulting loop follows
the list):

   * no more jump-back-to-top-on-signal, so we never skip the end of the
   loop entirely
   * we only need to loop through the services if we know there are
   going to be ready file descriptors for at least some of them
   * child_janitor() gets to run even during interrupt storms
   * SNMP handling can be selective (based on examples from SNMP docs,
   but untested)
   * no need for an inner loop around myselect(), so no contortions to
   make that safe, and signals don't get stuck
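
The loop shape I have in mind is roughly this -- a sketch only: the
helper names and signatures are placeholders, and the real code is in
the branch linked below:

    #include <errno.h>
    #include <sys/select.h>
    #include <sysexits.h>
    #include <time.h>

    /* stand-ins for the real master.c machinery; these declarations
     * are invented for the sketch and don't match the real code */
    extern int  myselect(int nfds, fd_set *rfds, struct timespec *tvptr);
    extern void handle_signals(void);
    extern void process_ready_services(fd_set *rfds, int nready);
    extern void child_janitor(time_t now);
    extern void do_scheduled_events(time_t now);
    extern void fatal(const char *msg, int code);

    static void main_loop(int maxfd, fd_set *allfds)
    {
        for (;;) {
            fd_set rfds = *allfds;
            struct timespec *tvptr = NULL;   /* or a computed timeout */
            int r = myselect(maxfd + 1, &rfds, tvptr);

            if (r < 0) {
                if (errno == EINTR || errno == EAGAIN) {
                    /* a signal arrived (or a spurious wakeup): treat it
                     * as "no fds ready" and fall through so the rest of
                     * the loop still runs */
                    r = 0;
                }
                else fatal("select failed", EX_OSERR);
            }

            /* act on whatever signals arrived during myselect() */
            handle_signals();

            /* only walk Services[] if we know at least one fd is ready */
            if (r > 0) process_ready_services(&rfds, r);

            /* housekeeping runs on every iteration, even during an
             * interrupt storm; child_janitor() limits itself to one
             * round per second */
            child_janitor(time(NULL));
            do_scheduled_events(time(NULL));
        }
    }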

An alternative patch implementing this approach is here:
https://github.com/elliefm/cyrus-imapd/tree/v25/alt-master-eintr

It seems fine to run child_janitor() at high frequency -- it already
contains calculations to limit its own processing to one round per
second, based on time elapsed since the last run.
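
For illustration, that guard is of this general shape (a sketch, not
the real child_janitor() source):

    #include <time.h>

    void janitor_round_sketch(void)
    {
        static time_t last_run = 0;
        time_t now = time(NULL);

        /* called on every loop iteration, but only does real work at
         * most once per second */
        if (now - last_run < 1) return;
        last_run = now;

        /* ... one round of dead-child table cleanup would go here ... */
    }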

Thoughts?

ellie

On Wed, Apr 27, 2016, at 12:20 PM, ellie timoney via Cyrus-devel wrote:
> Hi Jens,
> 
> > your argumentation seems legit.
> 
> Thanks :)
> 
> > I'm not sure, though, whether the master
> > should _ever_ wait forever (NULL pointer). Maybe this should never
> > happen, but I'd propose that _not_ waiting would be safer than waiting
> > forever in some weird cases. And it should never happen except during
> > startup and shutdown. Not waiting might result in some CPU cycles
> > wasted, but the general code might prove more robust against other minor
> > issues.
> 
> I can see a plausible argument here both ways: on the one hand, if we
> never use a NULL timeout, then we'll never hang like we did during
> shutdown in this particular case.  On the other hand, waking up master
> periodically when it has no work to do will potentially degrade
> performance for the actual services that are trying to run.  
> 
> Also, hanging at shutdown in the way we did helped us identify the
> problem really quickly and fix it before it made it into a release.
> 
> I'm not really sure how frequently master is likely to wake up during
> "normal" operation, but I expect it varies a lot by environment and
> configuration.  It's hard to estimate where a reasonable compromise
> between "don't block indefinitely" and "don't block at all" might lie.
> 
> I could see the addition of some kind of hang-notification wakeup,
> whereby if master has been asleep for longer than some configurable
> duration, it wakes up and logs a warning.  The natural spot to do this,
> though, would be as another scheduled event, which wouldn't do anything
> in the particular case of "no schedule" that I think you're interested
> in.  So to be useful at all it would require bigger changes to the tvptr
> calculation.
> 
> I've had a look through the history of master.c, and the timeout has
> been NULL since the master process was conceived.  Only the addition of
> the event scheduler introduced the optional timeout.  Given it's been
> working well enough like this basically forever, I don't want to change
> it on our stable branch.  I would consider a feature like the above for
> inclusion in 3.0, though, if anyone was interested enough to contribute
> a patch?
> 
> > You have the better understanding of the overall master loop logic and
> > what to do best
> 
> You flatter me, but I'm figuring this out as I go too :)
> 
> > I'm fine with either solution (as long as it works).
> 
> I'll push what we've got (your patch plus my shutdown fix) to the 2.5
> branch for inclusion in the next release, and also forward port it to
> the master branch...
> 
> Though: I've just noticed another possible issue.  We might get stuck in
> that inner loop if select keeps raising EAGAIN (it has no counter, and
> the "break" statement breaks out of the switch, not the inner loop).
> 
> My system's documentation for select claims that "some other UNIX
> systems" can raise EAGAIN (despite POSIX not specifying this error for
> select), and suggests portable programs should retry as if for EINTR in
> this case.
> 
> I guess EAGAIN didn't occur in your profiling, but maybe your system
> can't raise EAGAIN from select either?
> 
> The old (prior to patch) behaviour for EAGAIN was to resume the outer
> loop from the top, same as EINTR.
> 
> Right now I'm not sure if we should try to keep the old behaviour (retry
> from the top on EAGAIN), treat it the same as EINTR (so give it a
> counter and a limit), or do something new (like just pretend it was a
> timeout, as we now do after too many interrupts, so that the end-of-loop
> processing can occur).  Any thoughts?
> 
> I'll read through the thread again and see if it becomes clear, and
> won't push the patch to the 2.5 branch just yet in case this changes it.
> 
> Cheers,
> 
> ellie
> 
> On Tue, Apr 26, 2016, at 08:08 PM, Jens Erat wrote:
> > Hi Ellie,
> > 
> > your argumentation seems legit. I'm not sure, though, whether the master
> > should _ever_ wait forever (NULL pointer). Maybe this should never
> > happen, but I'd propose that _not_ waiting would be safer than waiting
> > forever in some weird cases. And it should never happen except during
> > startup and shutdown. Not waiting might result in some CPU cycles
> > wasted, but the general code might prove more robust against other minor
> > issues.
> > 
> > You have the better understanding of the overall master loop logic and
> > what to do best; I'm fine with either solution (as long as it works).
> > 
> > Regards,
> > Jens
> > 
> > On 26.04.2016 at 07:18, ellie timoney wrote:
> > > Hi Jens,
> > > 
> > >>> I'm having a look at the way tvptr is set.  If the process is
> > >>> in_shutdown, then tvptr remains NULL (which tells myselect() to block
> > >>> indefinitely and never time out).  I'm not sure if this is going to turn
> > >>> out to be significant here.  I'm also not sure why we do this
> > >>> differently for this case.
> > >>>
> > >>> [...]
> > >>>
> > >>> What I think is happening is that, once shutdown is initiated, the only
> > >>> way myselect() can return is if one of the rfds becomes active, or if a
> > >>> signal arrives.  It will no longer time out.  But, once shutdown is
> > >>> initiated, at a certain point rfds are no longer becoming active
> > >>> (because all the children have themselves shut down), and signals are no
> > >>> longer occurring, so myselect() ends up blocked forever.
> > >>
> > >> I don't think we should block at all in this case. The master loop
> > >> should keep running and clean up. This is definitely a bug in my patch.
> > > 
> > > I've read through the old/new code some more, and I don't think this is
> > > a bug in your patch.
> > > 
> > > The logic prior to your patch was that myselect would get a NULL timeout
> > > by default (and thus block indefinitely waiting for a signal or ready
> > > file descriptors).  This makes sense: we want to block until there is
> > > something to do.
> > > 
> > > If there were a schedule (and we were not shutting down), then we would
> > > inspect it and set up a timeout, so that we would also unblock when it
> > > was time to process that.  The timeout might be 0,0 if the next event
> > > were already overdue, so that it could be processed immediately.  
> > > 
> > > If we were shutting down, we'd no longer be processing scheduled events,
> > > so a scheduled event being due was not cause for a wakeup in this case.
> > > 
> > > Your patch preserves this behaviour, correctly, I think.
> > > 
> > > Where it causes problems is that it would now potentially call (and
> > > block in) myselect multiple times in a row without doing any other
> > > processing in between.  Which seems fine and good for the case we're
> > > trying to fix, but turns out to be bad during shutdown.
> > > 
> > > During shutdown, we expect to get a bunch of SIGCHLDs for the exiting
> > > children, which are reaped in the outer loop, before the myselect() call
> > > (and after reaping, if there are no more children, we exit). But with
> > > pselect, SIGCHLD is blocked until pselect temporarily unblocks it, so
> > > these signals will always arrive during myselect().  This means that
> > > during shutdown, myselect() will legitimately return -1/EINTR a number
> > > of times, as the child processes exit, and we need to reap them and
> > > decrement counters in order to exit cleanly.
> > > 
> > > With your patch, we no longer reap/decrement/exit on a signal unless
> > > we've also coincidentally exceeded 5 consecutive EINTRs, and have thus
> > > broken out of the myselect() loop -- which doesn't necessarily happen.
> > > 
> > > Understanding the cause, I'm even more sure that my fix for the shutdown
> > > problem is correct: when we're in_shutdown, we /want/ any single EINTR
> > > to send us back around the outer loop so we can reap children and maybe
> > > exit.
> > > 
> > >>> Maybe the answer is to break out of the myselect loop after one
> > >>> iteration if we're in_shutdown, so that we don't block indefinitely
> > >>> waiting for iterations that will never occur.  I think just changing the
> > >>> loop condition from 
> > >>>     while (r == -1)
> > >>> to
> > >>>     while (!in_shutdown && r == -1)
> > >>> might do the trick.  
> > >>
> > >> I'd instead set a non-null, but zero timeout, so select will still clean
> > >> up the signal mask and query for FDs. I pushed an alternative fixup on
> > >> GitHub:
> > >>
> > >> https://github.com/JensErat/cyrus-imapd/commit/bc3bacf70261711a9c810107cddb081df65a552c
> > > 
> > > This is a much bigger change.  It inverts the logic ("wait indefinitely
> > > for something to do, unless schedule says to wake earlier" => "don't
> > > wait, unless schedule says how long to wait").  It also depends on there
> > > always being an upcoming scheduled event, otherwise the process will
> > > spin on the Services[] array and burn CPU for no reason.
> > > 
> > >> Looking at the code, I'm wondering whether we need the test for
> > >> scheduled events at all. As I understand, the child janitor is always
> > >> registered as an event running every 10 seconds, so there should be no
> > >> branch for schedule == NULL. Anyway, testing probably won't hurt, either.
> > > 
> > > The test for scheduled events is the difference between "block until a
> > > signal occurs or an fd becomes ready" and "block until a signal occurs,
> > > an fd becomes ready, or it is time to process a scheduled event".  If
> > > there's no scheduled events, we don't need to wake up in time for one,
> > > so we can just block until other work becomes available.
> > > 
> > > The only time we want to not block is if we know there's a scheduled
> > > event needing immediate processing (due now, or overdue).
> > > 
> > >> If I'm not mistaken, this actually also might have been an issue before
> > >> -- but don't ask me why it wasn't triggered. Probably because of good
> > >> luck and some interrupts (eg. the shutdown signal) being sufficient up
> > >> to now.
> > > 
> > > This wasn't a problem before, because the exiting children would wake up
> > > myselect(), be processed and cleaned up at the top of the next iteration
> > > of the outer loop, and we would only call myselect() again if there were
> > > still children that we were waiting for (who would wake us up again when
> > > they finished).
> > > 
> > > It's a problem with your patch because when we get a SIGCHLD we
> > > immediately myselect() again without cleaning up or checking whether
> > > there's even any more children to wait for.  (It can be moved along by
> > > the operator sending a few more signals, to manually exceed the 5 limit,
> > > but a clean shutdown shouldn't require this.)
> > > 
> > > I think the original timeout behaviour is correct. I think the patch for
> > > dealing with the repeated interrupts looks good, except for the shutdown
> > > problem; and I think the right way to deal with the shutdown problem is
> > > by explicitly testing for that, and not looping myselect() in that case.
> > > 
> > > Cheers,
> > > 
> > > ellie
> > > 
> > > On Fri, Apr 15, 2016, at 07:42 PM, Jens Erat wrote:
> > >> Hi Ellie,
> > >>
> > >>> It is failing our Cassandane tests, because the test suite is unable to
> > >>> cleanly shutdown the master process after each test finishes.
> > >>>
> > >>> Are you able to cleanly shut down the master process with a SIGTERM in
> > >>> your environment?  Can you cleanly shut one down that has very little
> > >>> activity on it?
> > >>
> > >> Indeed. We also had the idled-not-terminating-issue and still applied a
> > >> kill -9 after some time, so did not observe the issue yet.
> > >>
> > >>> I'm having a look at the way tvptr is set.  If the process is
> > >>> in_shutdown, then tvptr remains NULL (which tells myselect() to block
> > >>> indefinitely and never time out).  I'm not sure if this is going to turn
> > >>> out to be significant here.  I'm also not sure why we do this
> > >>> differently for this case.
> > >>>
> > >>> [...]
> > >>>
> > >>> What I think is happening is that, once shutdown is initiated, the only
> > >>> way myselect() can return is if one of the rfds becomes active, or if a
> > >>> signal arrives.  It will no longer time out.  But, once shutdown is
> > >>> initiated, at a certain point rfds are no longer becoming active
> > >>> (because all the children have themselves shut down), and signals are no
> > >>> longer occurring, so myselect() ends up blocked forever.
> > >>
> > >> I don't think we should block at all in this case. The master loop
> > >> should keep running and clean up. This is definitely a bug in my patch.
> > >>
> > >>
> > >>> Maybe the answer is to break out of the myselect loop after one
> > >>> iteration if we're in_shutdown, so that we don't block indefinitely
> > >>> waiting for iterations that will never occur.  I think just changing the
> > >>> loop condition from 
> > >>>     while (r == -1)
> > >>> to
> > >>>     while (!in_shutdown && r == -1)
> > >>> might do the trick.  
> > >>
> > >> I'd instead set a non-null, but zero timeout, so select will still clean
> > >> up the signal mask and query for FDs. I pushed an alternative fixup on
> > >> GitHub:
> > >>
> > >> https://github.com/JensErat/cyrus-imapd/commit/bc3bacf70261711a9c810107cddb081df65a552c
> > >>
> > >> Looking at the code, I'm wondering whether we need the test for
> > >> scheduled events at all. As I understand, the child janitor is always
> > >> registered as an event running every 10 seconds, so there should be no
> > >> branch for schedule == NULL. Anyway, testing probably won't hurt, either.
> > >>
> > >> Because the timeout is not only NULLed, but also set to zero as default,
> > >> we will not block any more iff
> > >>
> > >> - we're in shutdown or
> > >> - no schedule is set up (which should not happen, anyway, but we have a
> > >> safe fallback if the schedule is not completely set up/already torn down
> > >> on shutdown)
> > >>
> > >> If I'm not mistaken, this actually also might have been an issue before
> > >> -- but don't ask me why it wasn't triggered. Probably because of good
> > >> luck and some interrupts (eg. the shutdown signal) being sufficient up
> > >> to now.
> > >>
> > >>> This behaviour also suggests that a reasonable limit for interruptions
> > >>> might be one that scales with the number of child processes (rather than
> > >>> the arbitrary 5), but I don't think this alone would be enough to avoid
> > >>> the problem I just encountered.  If the remaining children all exit at
> > >>> the same time, and get processed by a single myselect(), then there's
> > >>> still n-1 loop iterations before shutdown would occur -- same problem.
> > >>
> > >> I'm not sure about this. This is the maximum number of SIGCHLDs, but we
> > >> might also pick up a SIGTERM along the way. Anyway: It is _very_ unlikely that
> > >> we get just another signal in-between the select calls. At least on our
> > >> setup, we never observed it at all, and this is a rather high-volume
> > >> service! Usually, the signals will arrive (but be blocked) somewhere
> > >> else in the master loop, and multiple signals are cleaned up with a
> > >> single (first) select statement.
> > >>
> > >> Anyway, nothing bad happens if we _sometimes_ pass over the message
> > >> handling, and at least we're now logging a warning after some retries.
> > >> If somebody has severe issues with this, he'll become aware of the reasons
> > >> by tweaking his configuration (for example by setting a more reasonable
> > >> -T parameter) or digging deeper.
> > >>
> > >> Regards,
> > >> Jens
> > >>
> > >> -- 
> > >> Jens Erat
> > >> Universität Konstanz
> > >> Kommunikations-, Informations-, Medienzentrum (KIM)
> > >> Abteilung Basisdienste
> > >> D-78457 Konstanz
> > >> Mail: jens.erat at uni-konstanz.de
> > >>
> > 
> > -- 
> > Jens Erat
> > Universität Konstanz
> > Kommunikations-, Informations-, Medienzentrum (KIM)
> > Abteilung Basisdienste
> > D-78457 Konstanz
> > Mail: jens.erat at uni-konstanz.de
> > 

