Severe issue with high-volume IMAP server stalling, possible patch

ellie timoney ellie at fastmail.com
Tue Apr 26 01:18:33 EDT 2016


Hi Jens,

> > I'm having a look at the way tvptr is set.  If the process is
> > in_shutdown, then tvptr remains NULL (which tells myselect() to block
> > indefinitely and never time out).  I'm not sure if this is going to turn
> > out to be significant here.  I'm also not sure why we do this
> > differently for this case.
> >
> > [...]
> >
> > What I think is happening is that, once shutdown is initiated, the only
> > way myselect() can return is if one of the rfds becomes active, or if a
> > signal arrives.  It will no longer time out.  But, once shutdown is
> > initiated, at a certain point rfds are no longer becoming active
> > (because all the children have themselves shut down), and signals are no
> > longer occurring, so myselect() ends up blocked forever.
> 
> I don't think we should block at all in this case. The master loop
> should keep running and clean up. This is definitely a bug in my patch.

I've read through the old/new code some more, and I don't think this is
a bug in your patch.

The logic prior to your patch was that myselect would get a NULL timeout
by default (and thus block indefinitely waiting for a signal or ready
file descriptors).  This makes sense: we want to block until there is
something to do.

If there were a schedule (and we were not shutting down), then we would
inspect it and set up a timeout, so that we would also unblock when it
was time to process that.  The timeout might be 0,0 if the next event
were already overdue, so that it could be processed immediately.  

If we were shutting down, we'd no longer be processing scheduled events,
so a scheduled event being due was not cause for a wakeup in this case.
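
To make that concrete, here's a rough sketch of the pre-patch timeout
selection as I understand it -- paraphrased, not the literal master.c
code, with illustrative names (next_event_time, now, maxfd, rfds)
alongside the schedule/in_shutdown/myselect names from the discussion:

    struct timeval tv, *tvptr = NULL;   /* NULL => block indefinitely */

    if (schedule && !in_shutdown) {
        /* seconds until the next scheduled event (illustrative) */
        double delay = next_event_time - now;

        if (delay > 0.0) {
            tv.tv_sec  = (time_t) delay;
            tv.tv_usec = (suseconds_t) ((delay - tv.tv_sec) * 1000000);
        }
        else {
            /* next event already overdue: time out immediately (0,0) */
            tv.tv_sec  = 0;
            tv.tv_usec = 0;
        }
        tvptr = &tv;
    }

    r = myselect(maxfd + 1, &rfds, NULL, NULL, tvptr);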

Your patch preserves this behaviour, correctly, I think.

Where it causes problems is that it can now call (and block in)
myselect() multiple times in a row without doing any other processing
in between.  That seems fine for the case we're trying to fix, but it
turns out to be bad during shutdown.

During shutdown, we expect to get a bunch of SIGCHLDs for the exiting
children, which are reaped in the outer loop, before the myselect() call
(and after reaping, if there are no more children, we exit). But with
pselect, SIGCHLD is blocked until pselect temporarily unblocks it, so
these signals will always arrive during myselect().  This means that
during shutdown, myselect() will legitimately return -1/EINTR a number
of times, as the child processes exit, and we need to reap them and
decrement counters in order to exit cleanly.
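
In outline, the shutdown flow is meant to look something like this (a
simplified, hypothetical sketch, not the real master.c loop; the
child_count bookkeeping and the reaping are paraphrased):

    for (;;) {
        int status;

        /* top of the outer loop: reap any children that have exited */
        while (waitpid(-1, &status, WNOHANG) > 0)
            child_count--;

        /* after reaping: nothing left to wait for, so exit cleanly */
        if (in_shutdown && child_count == 0)
            break;

        /* SIGCHLD is only deliverable inside this call, so an exiting
         * child shows up as r == -1 with errno == EINTR */
        r = myselect(maxfd + 1, &rfds, NULL, NULL, tvptr);
    }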

With your patch, we no longer reap/decrement/exit on a signal unless
we've also coincidentally exceeded 5 consecutive EINTRs, and have thus
broken out of the myselect() loop -- which doesn't necessarily happen.

Understanding the cause, I'm even more sure that my fix for the shutdown
problem is correct: when we're in_shutdown, we /want/ any single EINTR
to send us back around the outer loop so we can reap children and maybe
exit.
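
In loop form, that looks something like this (illustrative only -- the
one-line condition change quoted below is the actual suggestion, and
the 5-interruption limit is paraphrased from your patch):

    int interrupted = 0;

    do {
        r = myselect(maxfd + 1, &rfds, NULL, NULL, tvptr);

        /* (real handling of non-EINTR failures omitted) */
        if (r == -1 && errno == EINTR && ++interrupted >= 5)
            break;      /* too many interruptions: stop retrying */
    } while (!in_shutdown && r == -1);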

> > Maybe the answer is to break out of the myselect loop after one
> > iteration if we're in_shutdown, so that we don't block indefinitely
> > waiting for iterations that will never occur.  I think just changing the
> > loop condition from 
> >     while (r == -1)
> > to
> >     while (!in_shutdown && r == -1)
> > might do the trick.  
> 
> I'd instead set a non-null, but zero timeout, so select will still clean
> up the signal mask and query for FDs. I pushed an alternative fixup on
> GitHub:
> 
> https://github.com/JensErat/cyrus-imapd/commit/bc3bacf70261711a9c810107cddb081df65a552c

This is a much bigger change.  It inverts the logic ("wait indefinitely
for something to do, unless schedule says to wake earlier" => "don't
wait, unless schedule says how long to wait").  It also depends on there
always being an upcoming scheduled event, otherwise the process will
spin on the Services[] array and burn CPU for no reason.

> Looking at the code, I'm wondering whether we need the test for
> scheduled events at all. As I understand, the child janitor is always
> registered as an event running every 10 seconds, so there should be no
> branch for schedule == NULL. Anyway, testing probably won't hurt, either.

The test for scheduled events is the difference between "block until a
signal occurs or an fd becomes ready" and "block until a signal occurs,
an fd becomes ready, or it is time to process a scheduled event".  If
there are no scheduled events, we don't need to wake up in time for one,
so we can just block until other work becomes available.

The only time we don't want to block at all is when we know there's a
scheduled event needing immediate processing (due now, or overdue).

> If I'm not mistaken, this actually also might have been an issue before
> -- but don't ask me why it wasn't triggered. Probably because of good
> luck and some interrupts (eg. the shutdown signal) being sufficient up
> to now.

This wasn't a problem before, because the exiting children would wake up
myselect(), be processed and cleaned up at the top of the next iteration
of the outer loop, and we would only call myselect() again if there were
still children that we were waiting for (who would wake us up again when
they finished).

It's a problem with your patch because when we get a SIGCHLD we
immediately myselect() again without cleaning up or checking whether
there are even any more children to wait for.  (It can be moved along by
the operator sending a few more signals, to manually exceed the 5 limit,
but a clean shutdown shouldn't require this.)

I think the original timeout behaviour is correct. I think the patch for
dealing with the repeated interrupts looks good, except for the shutdown
problem; and I think the right way to deal with the shutdown problem is
by explicitly testing for that, and not looping myselect() in that case.

Cheers,

ellie

On Fri, Apr 15, 2016, at 07:42 PM, Jens Erat wrote:
> Hi Ellie,
> 
> > It is failing our Cassandane tests, because the test suite is unable to
> > cleanly shutdown the master process after each test finishes.
> > 
> > Are you able to cleanly shut down the master process with a SIGTERM in
> > your environment?  Can you cleanly shut one down that has very little
> > activity on it?
> 
> Indeed. We also had the idled-not-terminating-issue and still applied a
> kill -9 after some time, so did not observe the issue yet.
> 
> > I'm having a look at the way tvptr is set.  If the process is
> > in_shutdown, then tvptr remains NULL (which tells myselect() to block
> > indefinitely and never time out).  I'm not sure if this is going to turn
> > out to be significant here.  I'm also not sure why we do this
> > differently for this case.
> >
> > [...]
> >
> > What I think is happening is that, once shutdown is initiated, the only
> > way myselect() can return is if one of the rfds becomes active, or if a
> > signal arrives.  It will no longer time out.  But, once shutdown is
> > initiated, at a certain point rfds are no longer becoming active
> > (because all the children have themselves shut down), and signals are no
> > longer occurring, so myselect() ends up blocked forever.
> 
> I don't think we should block at all in this case. The master loop
> should keep running and clean up. This is definitely a bug in my patch.
> 
> 
> > Maybe the answer is to break out of the myselect loop after one
> > iteration if we're in_shutdown, so that we don't block indefinitely
> > waiting for iterations that will never occur.  I think just changing the
> > loop condition from 
> >     while (r == -1)
> > to
> >     while (!in_shutdown && r == -1)
> > might do the trick.  
> 
> I'd instead set a non-null, but zero timeout, so select will still clean
> up the signal mask and query for FDs. I pushed an alternative fixup on
> GitHub:
> 
> https://github.com/JensErat/cyrus-imapd/commit/bc3bacf70261711a9c810107cddb081df65a552c
> 
> Looking at the code, I'm wondering whether we need the test for
> scheduled events at all. As I understand, the child janitor is always
> registered as an event running every 10 seconds, so there should be no
> branch for schedule == NULL. Anyway, testing probably won't hurt, either.
> 
> Because the timeout is not only NULLed, but also set to zero as default,
> we will not block any more iff
> 
> - we're in shutdown or
> - no schedule is set up (which should not happen, anyway, but we have a
> safe fallback if the schedule is not completely set up/already torn down
> on shutdown)
> 
> If I'm not mistaken, this actually also might have been an issue before
> -- but don't ask me why it wasn't triggered. Probably because of good
> luck and some interrupts (eg. the shutdown signal) being sufficient up
> to now.
> 
> > This behaviour also suggests that a reasonable limit for interruptions
> > might be one that scales with the number of child processes (rather than
> > the arbitrary 5), but I don't think this alone would be enough to avoid
> > the problem I just encountered.  If the remaining children all exit at
> > the same time, and get processed by a single myselect(), then there's
> > still n-1 loop iterations before shutdown would occur -- same problem.
> 
> I'm not sure about this. This is the maximum number of SIGCHLDs, but we
> also might fetch a SIGTERM on the go. Anyway: It is _very_ unlikely that
> we get just another signal in-between the select calls. At least on our
> setup, we never observed it at all, and this is a rather high-volume
> service! Usually, the signals will arrive (but be blocked) somewhere
> else in the master loop, and multiple signals are cleaned up with a
> single (first) select statement.
> 
> Anyway, nothing bad happens if we _sometimes_ pass over the message
> handling, and at least we're now logging a warning after some retries.
> If somebody has severe issues with this, he'll get aware of the reasons
> by tweaking his configuration (for example by setting a more reasonable
> -T parameter) or digging deeper.
> 
> Regards,
> Jens
> 
> -- 
> Jens Erat
> Universität Konstanz
> Kommunikations-, Infomations-, Medienzentrum (KIM)
> Abteilung Basisdienste
> D-78457 Konstanz
> Mail: jens.erat at uni-konstanz.de
> 

