Severe issue with high-volume IMAP server stalling, possible patch

ellie timoney ellie at fastmail.com
Tue Apr 26 22:20:00 EDT 2016


Hi Jens,

> your argumentation seems legit.

Thanks :)

> I'm not sure, though whether the master
> should _ever_ wait forever (NULL-pointer). Maybe this should never
> happen, but I'd propose _not_ waiting would be safer than waiting
> forever in some weird cases. And it should never happen but during
> startup and shutdown. Not waiting might result in some CPU cycles
> wasted, but the general code might prove more robust to other minor
> issues.

I can see a plausible argument here both ways: on the one hand, if we
never use a NULL timeout, then we'll never hang like we did during
shutdown in this particular case.  On the other hand, waking up master
periodically when it has no work to do will potentially degrade
performance for the actual services that are trying to run.  

Also, hanging at shutdown in the way we did helped us identify the
problem really quickly and fix it before it made it into a release.

I'm not really sure how frequently master is likely to wake up during
"normal" operation, but I expect it varies a lot by environment and
configuration.  It's hard to estimate where a reasonable compromise
between "don't block indefinitely" and "don't block at all" might lie.

I could see the addition of some kind of hang-notification wakeup,
whereby if master has been asleep for longer than some configurable
duration, it wakes up and logs a warning.  The natural spot to do this,
though, would be as another scheduled event, which wouldn't do anything
in the particular case of "no schedule" that I think you're interested
in.  So to be useful at all it would require bigger changes to the tvptr
calculation.


I've had a look through the history of master.c, and the timeout has
been NULL since the master process was conceived.  Only the addition of
the event scheduler introduced the optional timeout.  Given it's been
working well enough like this basically forever, I don't want to change
it on our stable branch.  I would consider a feature like the above for
inclusion in 3.0, though, if anyone was interested enough to contribute
a patch?

> You have the better understanding of the overall master loop logic and
> what to do best

You flatter me, but I'm figuring this out as I go too :)

> I'm fine with either solutions (as long as it works).

I'll push what we've got (your patch plus my shutdown fix) to the 2.5
branch for inclusion in the next release, and also forward port it to
the master branch...

Though: I've just noticed another possible issue.  We might get stuck in
that inner loop if select keeps raising EAGAIN: the EAGAIN case has no
retry counter, and its "break" statement only breaks out of the switch,
not the inner loop.

My system's documentation for select claims that "some other UNIX
systems" can raise EAGAIN (despite POSIX not specifying this error for
select), and suggests portable programs should retry as if for EINTR in
this case.

I guess EAGAIN didn't occur in your profiling, but maybe your system
can't raise EAGAIN from select either?

The old (prior to patch) behaviour for EAGAIN was to resume the outer
loop from the top, same as EINTR.

Right now I'm not sure if we should try to keep the old behaviour (retry
from the top on EAGAIN), treat it the same as EINTR (so give it a
counter and a limit), or do something new (like just pretend it was a
timeout, like we now do after too many interrupts, so that the
end-of-loop processing can occur).  Any thoughts?

I'll read through the thread again and see if it becomes clear, and
won't push the patch to the 2.5 branch just yet in case this changes it.

Cheers,

ellie

On Tue, Apr 26, 2016, at 08:08 PM, Jens Erat wrote:
> Hi Ellie,
> 
> your argumentation seems legit. I'm not sure, though whether the master
> should _ever_ wait forever (NULL-pointer). Maybe this should never
> happen, but I'd propose _not_ waiting would be safer than waiting
> forever in some weird cases. And it should never happen but during
> startup and shutdown. Not waiting might result in some CPU cycles
> wasted, but the general code might prove more robust to other minor
> issues.
> 
> You have the better understanding of the overall master loop logic and
> what to do best, I'm fine with either solutions (as long as it works).
> 
> Regards,
> Jens
> 
> Am 26.04.2016 um 07:18 schrieb ellie timoney:
> > Hi Jens,
> > 
> >>> I'm having a look at the way tvptr is set.  If the process is
> >>> in_shutdown, then tvptr remains NULL (which tells myselect() to block
> >>> indefinitely and never time out).  I'm not sure if this is going to turn
> >>> out to be significant here.  I'm also not sure why we do this
> >>> differently for this case.
> >>>
> >>> [...]
> >>>
> >>> What I think is happening is that, once shutdown is initiated, the only
> >>> way myselect() can return is if one of the rfds becomes active, or if a
> >>> signal arrives.  It will no longer time out.  But, once shutdown is
> >>> initiated, at a certain point rfds are no longer becoming active
> >>> (because all the children have themselves shut down), and signals are no
> >>> longer occurring, so myselect() ends up blocked forever.
> >>
> >> I don't think we should block at all in this case. The master loop
> >> should keep running and clean up. This is definitely a bug in my patch.
> > 
> > I've read through the old/new code some more, and I don't think this is
> > a bug in your patch.
> > 
> > The logic prior to your patch was that myselect would get a NULL timeout
> > by default (and thus block indefinitely waiting for a signal or ready
> > file descriptors).  This makes sense: we want to block until there is
> > something to do.
> > 
> > If there were a schedule (and we were not shutting down), then we would
> > inspect it and set up a timeout, so that we would also unblock when it
> > was time to process that.  The timeout might be 0,0 if the next event
> > were already overdue, so that it could be processed immediately.  
> > 
> > If we were shutting down, we'd no longer be processing scheduled events,
> > so a scheduled event being due was not cause for a wakeup in this case.
> > 
> > Your patch preserves this behaviour, correctly, I think.
> > 
> > Where it causes problems is that it would now potentially call (and
> > block in) myselect multiple times in a row without doing any other
> > processing in between.  Which seems fine and good for the case we're
> > trying to fix, but turns out to be bad during shutdown.
> > 
> > During shutdown, we expect to get a bunch of SIGCHLDs for the exiting
> > children, which are reaped in the outer loop, before the myselect() call
> > (and after reaping, if there are no more children, we exit). But with
> > pselect, SIGCHLD is blocked until pselect temporarily unblocks it, so
> > these signals will always arrive during myselect().  This means that
> > during shutdown, myselect() will legitimately return -1/EINTR a number
> > of times, as the child processes exit, and we need to reap them and
> > decrement counters in order to exit cleanly.
> > 
> > With your patch, we no longer reap/decrement/exit on a signal unless
> > we've also coincidentally exceeded 5 consecutive EINTRs, and have thus
> > broken out of the myselect() loop -- which doesn't necessarily happen.
> > 
> > Understanding the cause, I'm even more sure that my fix for the shutdown
> > problem is correct: when we're in_shutdown, we /want/ any single EINTR
> > to send us back around the outer loop so we can reap children and maybe
> > exit.
> > 
> >>> Maybe the answer is to break out of the myselect loop after one
> >>> iteration if we're in_shutdown, so that we don't block indefinitely
> >>> waiting for iterations that will never occur.  I think just changing the
> >>> loop condition from 
> >>>     while (r == -1)
> >>> to
> >>>     while (!in_shutdown && r == -1)
> >>> might do the trick.  
> >>
> >> I'd instead set a non-null, but zero timeout, so select will still clean
> >> up the signal mask and query for FDs. I pushed an alternative fixup on
> >> GitHub:
> >>
> >> https://github.com/JensErat/cyrus-imapd/commit/bc3bacf70261711a9c810107cddb081df65a552c
> > 
> > This is a much bigger change.  It inverts the logic ("wait indefinitely
> > for something to do, unless schedule says to wake earlier" => "don't
> > wait, unless schedule says how long to wait").  It also depends on there
> > always being an upcoming scheduled event, otherwise the process will
> > spin on the Services[] array and burn CPU for no reason.
> > 
> >> Looking at the code, I'm wondering whether we need the test for
> >> scheduled events at all. As I understand, the child janitor is always
> >> registered as an event running all 10 seconds, so there should be no
> >> branch for schedule == NULL. Anyway, testing probably won't hurt, either.
> > 
> > The test for scheduled events is the difference between "block until a
> > signal occurs or an fd becomes ready" and "block until a signal occurs,
> > an fd becomes ready, or it is time to process a scheduled event".  If
> > there's no scheduled events, we don't need to wake up in time for one,
> > so we can just block until other work becomes available.
> > 
> > The only time we want to not block is if we know there's a scheduled
> > event needing immediate processing (due now, or overdue).
> > 
> >> If I'm not mistaken, this actually also might have been an issue before
> >> -- but don't ask me why it wasn't triggered. Probably because of good
> >> luck and some interrupts (eg. the shutdown signal) being sufficient up
> >> to now.
> > 
> > This wasn't a problem before, because the exiting children would wake up
> > myselect(), be processed and cleaned up at the top of the next iteration
> > of the outer loop, and we would only call myselect() again if there were
> > still children that we were waiting for (who would wake us up again when
> > they finished).
> > 
> > It's a problem with your patch because when we get a SIGCHLD we
> > immediately myselect() again without cleaning up or checking whether
> > there's even any more children to wait for.  (It can be moved along by
> > the operator sending a few more signals, to manually exceed the 5 limit,
> > but a clean shutdown shouldn't require this.)
> > 
> > I think the original timeout behaviour is correct. I think the patch for
> > dealing with the repeated interrupts looks good, except for the shutdown
> > problem; and I think the right way to deal with the shutdown problem is
> > by explicitly testing for that, and not looping myselect() in that case.
> > 
> > Cheers,
> > 
> > ellie
> > 
> > On Fri, Apr 15, 2016, at 07:42 PM, Jens Erat wrote:
> >> Hi Ellie,
> >>
> >>> It is failing our Cassandane tests, because the test suite is unable to
> >>> cleanly shutdown the master process after each test finishes.
> >>>
> >>> Are you able to cleanly shut down the master process with a SIGTERM in
> >>> your environment?  Can you cleanly shut one down that has very little
> >>> activity on it?
> >>
> >> Indeed. We also had the idled-not-terminating-issue and still applied a
> >> kill -9 after some time, so did not observe the issue yet.
> >>
> >>> I'm having a look at the way tvptr is set.  If the process is
> >>> in_shutdown, then tvptr remains NULL (which tells myselect() to block
> >>> indefinitely and never time out).  I'm not sure if this is going to turn
> >>> out to be significant here.  I'm also not sure why we do this
> >>> differently for this case.
> >>>
> >>> [...]
> >>>
> >>> What I think is happening is that, once shutdown is initiated, the only
> >>> way myselect() can return is if one of the rfds becomes active, or if a
> >>> signal arrives.  It will no longer time out.  But, once shutdown is
> >>> initiated, at a certain point rfds are no longer becoming active
> >>> (because all the children have themselves shut down), and signals are no
> >>> longer occurring, so myselect() ends up blocked forever.
> >>
> >> I don't think we should block at all in this case. The master loop
> >> should keep running and clean up. This is definitely a bug in my patch.
> >>
> >>
> >>> Maybe the answer is to break out of the myselect loop after one
> >>> iteration if we're in_shutdown, so that we don't block indefinitely
> >>> waiting for iterations that will never occur.  I think just changing the
> >>> loop condition from 
> >>>     while (r == -1)
> >>> to
> >>>     while (!in_shutdown && r == -1)
> >>> might do the trick.  
> >>
> >> I'd instead set a non-null, but zero timeout, so select will still clean
> >> up the signal mask and query for FDs. I pushed an alternative fixup on
> >> GitHub:
> >>
> >> https://github.com/JensErat/cyrus-imapd/commit/bc3bacf70261711a9c810107cddb081df65a552c
> >>
> >> Looking at the code, I'm wondering whether we need the test for
> >> scheduled events at all. As I understand, the child janitor is always
> >> registered as an event running all 10 seconds, so there should be no
> >> branch for schedule == NULL. Anyway, testing probably won't hurt, either.
> >>
> >> Because the timeout is not only NULLed, but also set to zero as default,
> >> we will not block any more iff
> >>
> >> - we're in shutdown or
> >> - no schedule is set up (which should not happen, anyway, but we have a
> >> safe fallback if the schedule is not completely set up/already torn down
> >> on shutdown)
> >>
> >> If I'm not mistaken, this actually also might have been an issue before
> >> -- but don't ask me why it wasn't triggered. Probably because of good
> >> luck and some interrupts (eg. the shutdown signal) being sufficient up
> >> to now.
> >>
> >>> This behaviour also suggests that a reasonable limit for interruptions
> >>> might be one that scales with the number of child processes (rather than
> >>> the arbitrary 5), but I don't think this alone would be enough to avoid
> >>> the problem I just encountered.  If the remaining children all exit at
> >>> the same time, and get processed by a single myselect(), then there's
> >>> still n-1 loop iterations before shutdown would occur -- same problem.
> >>
> >> I'm not sure about this. This is the maximum number of SIGCHLDs, but we
> >> also might fetch a SIGTERM on the go. Anyway: It is _very_ unlikely that
> >> we get just another signal in-between the select calls. At least on our
> >> setup, we never observed it at all, and this is a rather high-volume
> >> service! Usually, the signals will arrive (but be blocked) somewhere
> >> else in the master loop, and multiple signals are cleaned up with a
> >> single (first) select statement.
> >>
> >> Anyway, nothing bad happens if we _sometimes_ pass over the message
> >> handling, and at least we're now logging a warning after some retries.
> >> If somebody has severe issues with this, he'll get aware of the reasons
> >> by tweaking his configuration (for example by setting a more reasonable
> >> -T parameter) or digging deeper.
> >>
> >> Regards,
> >> Jens
> >>
> >> -- 
> >> Jens Erat
> >> Universität Konstanz
> >> Kommunikations-, Informations-, Medienzentrum (KIM)
> >> Abteilung Basisdienste
> >> D-78457 Konstanz
> >> Mail: jens.erat at uni-konstanz.de
> >>
> >> Email had 1 attachment:
> >> + smime.p7s
> >>   7k (application/pkcs7-signature)
> 
> -- 
> Jens Erat
> Universität Konstanz
> Kommunikations-, Informations-, Medienzentrum (KIM)
> Abteilung Basisdienste
> D-78457 Konstanz
> Mail: jens.erat at uni-konstanz.de
> 
> Email had 1 attachment:
> + smime.p7s
>   7k (application/pkcs7-signature)


More information about the Cyrus-devel mailing list