Severe issue with high-volume IMAP server stalling, possible patch

Jens Erat jens.erat at uni-konstanz.de
Tue Apr 26 06:08:13 EDT 2016


Hi Ellie,

your reasoning seems sound. I'm not sure, though, whether the master
should _ever_ wait forever (NULL timeout pointer). That situation should
never arise except during startup and shutdown, but if it does come up
in some weird case, I'd argue that _not_ waiting is safer than waiting
forever. Not waiting might waste a few CPU cycles, but the code overall
might prove more robust against other minor issues.
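
Roughly what I have in mind, as a minimal sketch (tvptr, in_shutdown and
schedule as in master.c; the existing computation of tv stays untouched):

    struct timeval tv, zero = { 0, 0 };
    struct timeval *tvptr = &zero;  /* default: poll, never block forever */

    if (!in_shutdown && schedule) {
        /* existing code: derive tv from the next scheduled event */
        tvptr = &tv;
    }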

You have the better understanding of the overall master loop logic and
of what is best to do here; I'm fine with either solution (as long as it
works).

Regards,
Jens

On 26.04.2016 at 07:18, ellie timoney wrote:
> Hi Jens,
> 
>>> I'm having a look at the way tvptr is set.  If the process is
>>> in_shutdown, then tvptr remains NULL (which tells myselect() to block
>>> indefinitely and never time out).  I'm not sure if this is going to turn
>>> out to be significant here.  I'm also not sure why we do this
>>> differently for this case.
>>>
>>> [...]
>>>
>>> What I think is happening is that, once shutdown is initiated, the only
>>> way myselect() can return is if one of the rfds becomes active, or if a
>>> signal arrives.  It will no longer time out.  But, once shutdown is
>>> initiated, at a certain point rfds are no longer becoming active
>>> (because all the children have themselves shut down), and signals are no
>>> longer occurring, so myselect() ends up blocked forever.
>>
>> I don't think we should block at all in this case. The master loop
>> should keep running and clean up. This is definitely a bug in my patch.
> 
> I've read through the old/new code some more, and I don't think this is
> a bug in your patch.
> 
> The logic prior to your patch was that myselect would get a NULL timeout
> by default (and thus block indefinitely waiting for a signal or ready
> file descriptors).  This makes sense: we want to block until there is
> something to do.
> 
> If there were a schedule (and we were not shutting down), then we would
> inspect it and set up a timeout, so that we would also unblock when it
> was time to process that.  The timeout might be 0,0 if the next event
> were already overdue, so that it could be processed immediately.  
> 
> If we were shutting down, we'd no longer be processing scheduled events,
> so a scheduled event being due was not cause for a wakeup in this case.
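> 
> In sketch form (simplified; I'm assuming the field is schedule->mark,
> from memory, so take the exact names with a grain of salt):
> 
>     struct timeval tv, *tvptr = NULL;   /* NULL: block until work arrives */
> 
>     if (schedule && !in_shutdown) {
>         time_t now = time(NULL);
>         /* wake when the next event is due; 0,0 if it's already overdue */
>         tv.tv_sec = (schedule->mark > now) ? schedule->mark - now : 0;
>         tv.tv_usec = 0;
>         tvptr = &tv;
>     }
>     r = myselect(maxfd + 1, &rfds, NULL, NULL, tvptr);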
> 
> Your patch preserves this behaviour, correctly, I think.
> 
> Where it causes problems is that it would now potentially call (and
> block in) myselect multiple times in a row without doing any other
> processing in between.  That seems fine for the case we're trying to
> fix, but it turns out to be bad during shutdown.
> 
> During shutdown, we expect to get a bunch of SIGCHLDs for the exiting
> children, which are reaped in the outer loop, before the myselect() call
> (and after reaping, if there are no more children, we exit). But with
> pselect, SIGCHLD is blocked until pselect temporarily unblocks it, so
> these signals will always arrive during myselect().  This means that
> during shutdown, myselect() will legitimately return -1/EINTR a number
> of times, as the child processes exit, and we need to reap them and
> decrement counters in order to exit cleanly.
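> 
> (For the archives, the pselect pattern in question looks roughly like
> this; since SIGCHLD is blocked everywhere else, it can only be
> delivered inside the call, where it shows up as -1/EINTR:)
> 
>     #include <signal.h>
>     #include <sys/select.h>
> 
>     sigset_t mask, oldmask;
>     sigemptyset(&mask);
>     sigaddset(&mask, SIGCHLD);
>     sigprocmask(SIG_BLOCK, &mask, &oldmask);  /* blocked in the main loop */
> 
>     /* pselect atomically installs oldmask (SIGCHLD unblocked) for the
>      * duration of the call, then restores the blocked mask on return */
>     r = pselect(nfds, &rfds, NULL, NULL, timeout, &oldmask);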
> 
> With your patch, we no longer reap/decrement/exit on a signal unless
> we've also coincidentally exceeded 5 consecutive EINTRs, and have thus
> broken out of the myselect() loop -- which doesn't necessarily happen.
> 
> Understanding the cause, I'm even more sure that my fix for the shutdown
> problem is correct: when we're in_shutdown, we /want/ any single EINTR
> to send us back around the outer loop so we can reap children and maybe
> exit.
> 
>>> Maybe the answer is to break out of the myselect loop after one
>>> iteration if we're in_shutdown, so that we don't block indefinitely
>>> waiting for iterations that will never occur.  I think just changing the
>>> loop condition from 
>>>     while (r == -1)
>>> to
>>>     while (!in_shutdown && r == -1)
>>> might do the trick.  
>>
>> I'd instead set a non-NULL but zero timeout, so select will still clean
>> up the signal mask and query for FDs. I pushed an alternative fixup on
>> GitHub:
>>
>> https://github.com/JensErat/cyrus-imapd/commit/bc3bacf70261711a9c810107cddb081df65a552c
> 
> This is a much bigger change.  It inverts the logic ("wait indefinitely
> for something to do, unless schedule says to wake earlier" => "don't
> wait, unless schedule says how long to wait").  It also depends on there
> always being an upcoming scheduled event, otherwise the process will
> spin on the Services[] array and burn CPU for no reason.
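> 
> Sketch of that hazard: if schedule is NULL, nothing ever raises the
> zero default, so:
> 
>     struct timeval tv = { 0, 0 };   /* "don't wait" default */
>     /* schedule == NULL: tv is never increased */
>     r = myselect(maxfd + 1, &rfds, NULL, NULL, &tv);
>     /* returns immediately on every iteration of the outer loop:
>      * a busy spin over Services[] */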
> 
>> Looking at the code, I'm wondering whether we need the test for
>> scheduled events at all. As I understand it, the child janitor is
>> always registered as an event that runs every 10 seconds, so there
>> should be no branch for schedule == NULL. Anyway, testing probably
>> won't hurt, either.
> 
> The test for scheduled events is the difference between "block until a
> signal occurs or an fd becomes ready" and "block until a signal occurs,
> an fd becomes ready, or it is time to process a scheduled event".  If
> there are no scheduled events, we don't need to wake up in time for one,
> so we can just block until other work becomes available.
> 
> The only time we want to not block is if we know there's a scheduled
> event needing immediate processing (due now, or overdue).
> 
>> If I'm not mistaken, this actually might also have been an issue
>> before -- but don't ask me why it was never triggered. Probably good
>> luck, and some interrupts (e.g. the shutdown signal) being sufficient
>> up to now.
> 
> This wasn't a problem before, because the exiting children would wake up
> myselect(), be processed and cleaned up at the top of the next iteration
> of the outer loop, and we would only call myselect() again if there were
> still children that we were waiting for (who would wake us up again when
> they finished).
> 
> It's a problem with your patch because when we get a SIGCHLD we
> immediately call myselect() again without cleaning up or checking
> whether there are even any more children to wait for.  (It can be moved
> along by the operator sending a few more signals, to manually exceed
> the limit of 5, but a clean shutdown shouldn't require this.)
> 
> I think the original timeout behaviour is correct. I think the patch for
> dealing with the repeated interrupts looks good, except for the shutdown
> problem; and I think the right way to deal with the shutdown problem is
> by explicitly testing for that, and not looping myselect() in that case.
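> 
> Concretely, something like this for the retry loop (a sketch; the
> counter and the limit of 5 are from your patch, the names are assumed):
> 
>     int interrupted = 0;
>     do {
>         r = myselect(maxfd + 1, &rfds, NULL, NULL, tvptr);
>         if (r == -1 && errno == EINTR) {
>             interrupted++;
>             if (interrupted > 5)
>                 syslog(LOG_WARNING, "too many interrupted select calls");
>         }
>     } while (!in_shutdown && r == -1 && errno == EINTR && interrupted <= 5);
>     /* when in_shutdown, a single EINTR drops us straight back to the
>      * outer loop, where children are reaped and master can exit */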
> 
> Cheers,
> 
> ellie
> 
> On Fri, Apr 15, 2016, at 07:42 PM, Jens Erat wrote:
>> Hi Ellie,
>>
>>> It is failing our Cassandane tests, because the test suite is unable to
>>> cleanly shut down the master process after each test finishes.
>>>
>>> Are you able to cleanly shut down the master process with a SIGTERM in
>>> your environment?  Can you cleanly shut one down that has very little
>>> activity on it?
>>
>> Indeed. We also had the idled-not-terminating issue and were still
>> applying a kill -9 after some time, so we have not observed this issue
>> yet.
>>
>>> I'm having a look at the way tvptr is set.  If the process is
>>> in_shutdown, then tvptr remains NULL (which tells myselect() to block
>>> indefinitely and never time out).  I'm not sure if this is going to turn
>>> out to be significant here.  I'm also not sure why we do this
>>> differently for this case.
>>>
>>> [...]
>>>
>>> What I think is happening is that, once shutdown is initiated, the only
>>> way myselect() can return is if one of the rfds becomes active, or if a
>>> signal arrives.  It will no longer time out.  But, once shutdown is
>>> initiated, at a certain point rfds are no longer becoming active
>>> (because all the children have themselves shut down), and signals are no
>>> longer occurring, so myselect() ends up blocked forever.
>>
>> I don't think we should block at all in this case. The master loop
>> should keep running and clean up. This is definitely a bug in my patch.
>>
>>
>>> Maybe the answer is to break out of the myselect loop after one
>>> iteration if we're in_shutdown, so that we don't block indefinitely
>>> waiting for iterations that will never occur.  I think just changing the
>>> loop condition from 
>>>     while (r == -1)
>>> to
>>>     while (!in_shutdown && r == -1)
>>> might do the trick.  
>>
>> I'd instead set a non-NULL but zero timeout, so select will still clean
>> up the signal mask and query for FDs. I pushed an alternative fixup on
>> GitHub:
>>
>> https://github.com/JensErat/cyrus-imapd/commit/bc3bacf70261711a9c810107cddb081df65a552c
>>
>> Looking at the code, I'm wondering whether we need the test for
>> scheduled events at all. As I understand it, the child janitor is
>> always registered as an event that runs every 10 seconds, so there
>> should be no branch for schedule == NULL. Anyway, testing probably
>> won't hurt, either.
>>
>> Because the timeout no longer defaults to NULL but to a zero value, we
>> will no longer block iff
>>
>> - we're in shutdown, or
>> - no schedule is set up (which should not happen anyway, but this gives
>> us a safe fallback if the schedule is not yet completely set up, or has
>> already been torn down during shutdown); see the sketch below.
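>>
>> The whole thing boils down to something like this (a sketch, not the
>> literal commit; names as discussed above):
>>
>>     struct timeval tv = { 0, 0 };   /* default: return immediately */
>>
>>     if (!in_shutdown && schedule) {
>>         time_t now = time(NULL);
>>         if (schedule->mark > now)
>>             tv.tv_sec = schedule->mark - now;
>>     }
>>     /* the pointer is never NULL now: select() always comes back,
>>      * restores the signal mask, and reports any ready fds */
>>     r = myselect(maxfd + 1, &rfds, NULL, NULL, &tv);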
>>
>> If I'm not mistaken, this actually might also have been an issue
>> before -- but don't ask me why it was never triggered. Probably good
>> luck, and some interrupts (e.g. the shutdown signal) being sufficient
>> up to now.
>>
>>> This behaviour also suggests that a reasonable limit for interruptions
>>> might be one that scales with the number of child processes (rather than
>>> the arbitrary 5), but I don't think this alone would be enough to avoid
>>> the problem I just encountered.  If the remaining children all exit at
>>> the same time, and get processed by a single myselect(), then there's
>>> still n-1 loop iterations before shutdown would occur -- same problem.
>>
>> I'm not sure about this. That would be the maximum number of SIGCHLDs,
>> but we might also catch a SIGTERM along the way. Anyway, it is _very_
>> unlikely that we receive yet another signal between the select calls.
>> At least on our setup, we have never observed it at all, and this is a
>> rather high-volume service! Usually the signals arrive (but are
>> blocked) somewhere else in the master loop, and multiple pending
>> signals are cleared by a single (first) select call.
>>
>> Anyway, nothing bad happens if we _sometimes_ skip over the message
>> handling, and at least we now log a warning after some retries. If
>> somebody has severe issues with this, they will become aware of the
>> cause and can tweak their configuration (for example by setting a more
>> reasonable -T parameter) or dig deeper.
>>
>> Regards,
>> Jens
>>
>> -- 
>> Jens Erat
>> Universität Konstanz
>> Kommunikations-, Informations-, Medienzentrum (KIM)
>> Abteilung Basisdienste
>> D-78457 Konstanz
>> Mail: jens.erat at uni-konstanz.de

-- 
Jens Erat
Universität Konstanz
Kommunikations-, Informations-, Medienzentrum (KIM)
Abteilung Basisdienste
D-78457 Konstanz
Mail: jens.erat at uni-konstanz.de
