Severe issue with high-volume IMAP server stalling, possible patch

Sat Feb 6 07:53:23 EST 2016

I agree on the fdset part. I guess the best (because of simple and easy
to understand) solution is what you already proposed: simply go on
`select`ing.

As far as I can see, the only consequences of your original proposal,
putting a small loop around the select statement:

- SIGCHLD will be handled slightly later -- I think that's fine
- SIGHUP will be handled later -- so the administrator has to wait for
ten seconds in the worst case when reloading services. I guess we can
also live with that pretty well.
- Shutdown signals might be late for ten seconds worst case. Again, I
think that's fine.

Or did I miss anything of relevance here?

Regards,
Jens

Am 06.02.2016 um 13:00 schrieb Bron Gondwana:
> I could buy repeating the whole thing for EINTR but not for EAGAIN.  Obviously everything is fine under non-heavy load no matter what.
> 
> I'm pretty sure your patch is wrong, because it will process the fdset that the select was supposed to fill even if the select returned an error.
> 
> You may have convinced me that my patch is wrong as well, and that we have to always check for the in_shutdown and sigquit magic in that tight loop in case it was a signal that needs to be responded to.
> 
> Bron.
> 
> On Sat, Feb 6, 2016, at 20:52, Jens Erat wrote:
>> Hi Bron,
>>
>> thank you for the fast reply.
>>
>>> One thing I would suggest is running something like nginx in front of
>> Cyrus, even if you only need it for authentication checks.
>>
>> We'll discuss this internally.
>>
>>> I believe this is the correct patch. If the select was interrupted
>> then it needs to be re-run, but there is no need to repeat all the other
>> work, so:
>>>
>>> - r = myselect(maxfd, &rfds, NULL, NULL, tvptr);
>>> - if (r == -1 && errno == EAGAIN) continue;
>>> - if (r == -1 && errno == EINTR) continue;
>>> + do {
>>> + r = myselect(maxfd, &rfds, NULL, NULL, tvptr);
>>> + } while (r == -1 && (errno == EAGAIN || errno == EINTR));
>>
>> From my very basic understanding on EAGAIN, this might be the right
>> thing to do. I'm not sure about EINTR, though.
>>
>> Are you sure we want to keep on waiting when being interrupted? I'm
>> definitely not experienced as a systems and C programmer, but as far as
>> I know EINTR will only/usually be thrown if a signal was received. Is
>> receiving a signal not always an indication the master has something to
>> do, and should go on doing whatever this might be instead of waiting for
>> more messages?
>>
>> Regards,
>> Jens
>>
>> PS: I've added some people involved in the CC interested in the
>> conversation but not on the list.
>>
>> Am 06.02.2016 um 04:19 schrieb Bron Gondwana via Cyrus-devel:
>>> Thanks for the report.  I see your point - there's definitely a DOS possibility here.
>>>
>>> One thing I would suggest is running something like nginx in front of Cyrus, even if you only need it for authentication checks.
>>>
>>> I believe this is the correct patch.  If the select was interrupted then it needs to be re-run, but there is no need to repeat all the other work, so:
>>>
>>> -        r = myselect(maxfd, &rfds, NULL, NULL, tvptr);
>>> -        if (r == -1 && errno == EAGAIN) continue;
>>> -        if (r == -1 && errno == EINTR) continue;
>>> +        do {
>>> +            r = myselect(maxfd, &rfds, NULL, NULL, tvptr);
>>> +        } while (r == -1 && (errno == EAGAIN || errno == EINTR));
>>>
>>> I'm quite confident this is correct, so I will roll it out to FastMail's testing servers today.  If they are happy, I'll push it upstream with backports for 2.5 and 2.4.
>>>
>>> Cheers,
>>>
>>> Bron.
>>>
>>> On Sat, Feb 6, 2016, at 03:58, Jens Erat via Cyrus-devel wrote:
>>>> Dear Cyrus maintainers,
>>>>
>>>> the very short version: we probably found an issue resulting in the
>>>> master stalling high-volume IMAP servers. A possible patch is attached,
>>>> but needs some more discussion.
>>>>
>>>> - - -
>>>>
>>>> At the University of Konstanz, we're running a rather large IMAP server
>>>> for about 18k users on some large Oracle/Solaris machines. We don't have
>>>> a murder/cluster. The machine is overprovisioned to last for some years
>>>> and is running blazingly fast during normal operation. We're running on
>>>> Solaris 11 and ZFS, the mailboxes are stored on a network storage
>>>> attached through Fiberchannel (and we don't observe any I/O issues). The
>>>> dual socket machine is equipped with 256GB of memory. We tried both
>>>> compiling with GnuCC 4.8 as well as Sun CC (from Solaris Studio).
>>>>
>>>> Then, from time to time, it completely stalled -- usually at times with
>>>> high fork rate. We observed this issue to happen in the morning when
>>>> lots of people started reading their mails, we experienced it during a
>>>> DDOS attack on our network, which made the firewall drop lots of
>>>> connections (and the mail clients trying to reconnect instantly).
>>>>
>>>> As a result, the mail server was still running fine for everybody still
>>>> holding a connection, but denied service for pretty much all new
>>>> connecting clients. We had to restart the whole Cyrus IMAPd service to
>>>> recover from this issue.
>>>>
>>>>
>>>> We started getting a clue of what's going wrong when we actually added
>>>> debug information and some ctable verifications, as we expected
>>>> something would be wrong with how the active/ready workers are counted.
>>>> In the end, we also took time checkpoints during the master loop to see
>>>> _where_ it is stuck. In the end, it seemed, this was while jumping back
>>>> from the very last command (getting a timestamp in the loop) to the very
>>>> first (also getting a timestamp). The code pretty much looked like:
>>>>
>>>>    struct timeval profiling[11];
>>>>    gettimeofday(&profiling[10], 0); // LOOP DONE
>>>>    for (;;) {
>>>>        gettimeofday(&profiling[0], 0); // INIT
>>>>
>>>>        // Start scheduled tasks
>>>>
>>>>        gettimeofday(&profiling[1], 0); // SCHEDULE
>>>>
>>>>        // Other master loop tasks: spawn, message handling, ...
>>>>
>>>>        gettimeofday(&profiling[9], 0); // SNMP_ALARMS
>>>>
>>>>        syslog(LOG_INFO, "MASTERLOOP_PROFILING:
>>>> %f,%f,%f,%f,%f,%f,%f,%f,%f,%f",
>>>>            timesub(&profiling[0], &profiling[1]),
>>>>            timesub(&profiling[0], &profiling[1]),
>>>>            timesub(&profiling[1], &profiling[2]),
>>>>            timesub(&profiling[2], &profiling[3]),
>>>>            timesub(&profiling[3], &profiling[4]),
>>>>            timesub(&profiling[4], &profiling[5]),
>>>>            timesub(&profiling[5], &profiling[6]),
>>>>            timesub(&profiling[6], &profiling[7]),
>>>>            timesub(&profiling[7], &profiling[8]),
>>>>            timesub(&profiling[8], &profiling[9]));
>>>>
>>>>        gettimeofday(&profiling[10], 0); // LOOP DONE
>>>>    }
>>>>
>>>> The results really puzzled us. What might be the reason jumping back
>>>> from the end of a loop to the beginning took multiple minutes? In the
>>>> end, by adding another log line in the _beginning_ of the loop we
>>>> realized that the loop was indeed running very often -- but simply did
>>>> never complete. `profiling[10]` did never change when the server was stuck.
>>>>
>>>> Knowing this the problem was obvious: when `select`ing the sockets and
>>>> waiting for messages, the server got interrupted all the time. This
>>>> results in the whole loop starting over from scratch, passing over
>>>> message handling and thus accounting of available worker daemons, too
>>>> few of them getting spawned.
>>>>
>>>>         r = myselect(maxfd, &rfds, NULL, NULL, tvptr);
>>>>         if (r == -1 && errno == EAGAIN) continue;
>>>>         if (r == -1 && errno == EINTR) continue;
>>>>         if (r == -1) {
>>>>             /* uh oh */
>>>>             fatalf(1, "select failed: %m");
>>>>         }
>>>>
>>>> This is a common pattern, if you want to `sleep` or `select`: try to do
>>>> it, and if you get interrupted or informed to try again, `continue` and
>>>> do again. But usually, this is enclosed in a tiny loop.
>>>>
>>>> On a test machine, we were not able to completely reproduce the issue,
>>>> but have very well been able to observe the message handling part of the
>>>> loop not running for a dozen or more master loop runs. A small patch
>>>> replacing this with going on with the current loop instead of starting
>>>> over from scratch completely resolved the issue (for `master/master.c`
>>>> in Cyrus IMAP 2.5.7 release tar):
>>>>
>>>> 2481,2483c2481
>>>> < 	if (r == -1 && errno == EAGAIN) continue;
>>>> < 	if (r == -1 && errno == EINTR) continue;
>>>> < 	if (r == -1) {
>>>> ---
>>>>> 	if (r == -1 && !(errno == EAGAIN || errno == EINTR)) {
>>>>
>>>> This ran fine on our test setup, but we're still scared patching one of
>>>> the most central flows of logic in a very relevant service.
>>>>
>>>> What is your opinion:
>>>>
>>>> - Message handling was sometimes stuck for minutes. I guess we can agree
>>>> this should never happen on a high volume server.
>>>> - Might this be related to the issues we observed?
>>>> - Are there any consequences to the subsequent code (message handling
>>>> and reaping children) if we do _not_ start from scratch here?
>>>>
>>>> If we get a "this looks fine, and you won't horribly mess up" from the
>>>> developers, we will have a try and patch the production machine as proposed.
>>>>
>>>> Kind regards from Lake Constance, Germany,
>>>> Jens Erat
>>>>
>>>>
>>>> -- 
>>>> Jens Erat
>>>> Universität Konstanz
>>>> Kommunikations-, Infomations-, Medienzentrum (KIM)
>>>> Abteilung Basisdienste
>>>> D-78457 Konstanz
>>>> Mail: jens.erat at uni-konstanz.de
>>>>
>>>> Email had 1 attachment:
>>>> + smime.p7s
>>>>   7k (application/pkcs7-signature)
>>>
>>>
>>
>> -- 
>> Jens Erat
>> Universität Konstanz
>> Kommunikations-, Infomations-, Medienzentrum (KIM)
>> Abteilung Basisdienste
>> D-78457 Konstanz
>> Mail: jens.erat at uni-konstanz.de
>>
>> Email had 1 attachment:
>> + smime.p7s
>>   7k (application/pkcs7-signature)
> 
> 

-- 
Jens Erat
Universität Konstanz
Kommunikations-, Infomations-, Medienzentrum (KIM)
Abteilung Basisdienste
D-78457 Konstanz
Mail: jens.erat at uni-konstanz.de

-------------- next part --------------
A non-text attachment was scrubbed...
Name: smime.p7s
Type: application/pkcs7-signature
Size: 4913 bytes
Desc: S/MIME Cryptographic Signature
URL: <http://lists.andrew.cmu.edu/pipermail/cyrus-devel/attachments/20160206/4b34e2cd/attachment-0001.p7s>