Severe issue with high-volume IMAP server stalling, possible patch

Jens Erat jens.erat at uni-konstanz.de
Fri Feb 5 11:58:02 EST 2016


Dear Cyrus maintainers,

the very short version: we probably found an issue that causes the
master process to stall on high-volume IMAP servers. A possible patch
is included below, but it needs some more discussion.

- - -

At the University of Konstanz, we're running a rather large IMAP server
for about 18k users on some large Oracle/Solaris machines. We don't have
a murder/cluster. The machine is overprovisioned to last for some years
and runs blazingly fast during normal operation. We're running Solaris
11 with ZFS; the mailboxes are stored on network storage attached
through Fibre Channel (and we don't observe any I/O issues). The
dual-socket machine is equipped with 256 GB of memory. We tried
compiling with both GCC 4.8 and Sun CC (from Solaris Studio).

Then, from time to time, the server completely stalled -- usually at
times with a high fork rate. We observed the issue in the morning, when
lots of people started reading their mail, and we experienced it during
a DDoS attack on our network, which made the firewall drop lots of
connections (with the mail clients instantly trying to reconnect).

As a result, the mail server was still running fine for everybody still
holding a connection, but denied service to pretty much all newly
connecting clients. We had to restart the whole Cyrus IMAPd service to
recover from this issue.


We started getting a clue of what was going wrong when we added debug
output and some verifications of the ctable (master's child table), as
we suspected something was wrong with how the active/ready workers are
counted. We also took time checkpoints during the master loop to see
_where_ it got stuck. It appeared to be stuck while jumping back from
the very last statement of the loop (taking a timestamp) to the very
first one (also taking a timestamp). The code looked pretty much like
this:

   struct timeval profiling[11];
   gettimeofday(&profiling[10], 0); // LOOP DONE
   for (;;) {
       gettimeofday(&profiling[0], 0); // INIT

       // Start scheduled tasks

       gettimeofday(&profiling[1], 0); // SCHEDULE

       // Other master loop tasks: spawn, message handling, ...

       gettimeofday(&profiling[9], 0); // SNMP_ALARMS

       syslog(LOG_INFO,
              "MASTERLOOP_PROFILING: %f,%f,%f,%f,%f,%f,%f,%f,%f,%f",
              timesub(&profiling[10], &profiling[0]), // jump back from previous LOOP DONE
              timesub(&profiling[0], &profiling[1]),
              timesub(&profiling[1], &profiling[2]),
              timesub(&profiling[2], &profiling[3]),
              timesub(&profiling[3], &profiling[4]),
              timesub(&profiling[4], &profiling[5]),
              timesub(&profiling[5], &profiling[6]),
              timesub(&profiling[6], &profiling[7]),
              timesub(&profiling[7], &profiling[8]),
              timesub(&profiling[8], &profiling[9]));

       gettimeofday(&profiling[10], 0); // LOOP DONE
   }
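
(`timesub` is not shown above; a minimal sketch of a matching helper,
assuming it simply returns the elapsed time between two timestamps in
seconds:)

    // Assumed profiling helper, not part of Cyrus:
    // elapsed time "end minus start" in seconds.
    static double timesub(const struct timeval *start,
                          const struct timeval *end)
    {
        return (double)(end->tv_sec  - start->tv_sec)
             + (double)(end->tv_usec - start->tv_usec) / 1000000.0;
    }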

The results really puzzled us. Why would jumping back from the end of a
loop to its beginning take multiple minutes? In the end, by adding
another log line at the _beginning_ of the loop, we realized that the
loop was indeed entered very often -- but simply never completed.
`profiling[10]` never changed while the server was stuck.

Knowing this, the problem was obvious: while `select`ing on the sockets
and waiting for messages, the server got interrupted all the time. Each
interruption makes the whole loop start over from scratch, skipping
message handling and thus the accounting of available worker daemons,
so too few of them get spawned.

        r = myselect(maxfd, &rfds, NULL, NULL, tvptr);
        if (r == -1 && errno == EAGAIN) continue; /* restarts the whole master loop */
        if (r == -1 && errno == EINTR) continue;  /* likewise */
        if (r == -1) {
            /* uh oh */
            fatalf(1, "select failed: %m");
        }

This is a common pattern for `sleep` or `select`: try the call, and if
you get interrupted or told to try again, retry. But usually the retry
is enclosed in a tiny loop around the call itself.
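
A minimal sketch of that idiom (plain C, not the actual Cyrus code; the
fd set and timeout setup are omitted):

    #include <errno.h>
    #include <stdio.h>
    #include <stdlib.h>
    #include <sys/select.h>

    /* Retry only the interrupted call instead of restarting the
     * caller's whole loop. */
    int r;
    do {
        r = select(maxfd, &rfds, NULL, NULL, tvptr);
    } while (r == -1 && (errno == EINTR || errno == EAGAIN));
    if (r == -1) {
        perror("select");
        exit(EXIT_FAILURE);
    }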

On a test machine we were not able to fully reproduce the issue, but we
could very well observe the message-handling part of the loop not
running for a dozen or more master loop iterations. A small patch that
continues with the current iteration instead of starting over from
scratch completely resolved the issue (against `master/master.c` from
the Cyrus IMAP 2.5.7 release tarball):

2481,2483c2481
< 	if (r == -1 && errno == EAGAIN) continue;
< 	if (r == -1 && errno == EINTR) continue;
< 	if (r == -1) {
---
> 	if (r == -1 && !(errno == EAGAIN || errno == EINTR)) {
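
Applied to the snippet quoted above, the error handling would then read
roughly like this (sketch):

        r = myselect(maxfd, &rfds, NULL, NULL, tvptr);
        if (r == -1 && !(errno == EAGAIN || errno == EINTR)) {
            /* uh oh */
            fatalf(1, "select failed: %m");
        }
        /* on EAGAIN/EINTR we now fall through to the message handling
         * and child accounting below instead of restarting the loop */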

This ran fine in our test setup, but we're still wary of patching one
of the most central pieces of logic in a very relevant service.

What is your opinion:

- Message handling was sometimes stuck for minutes. I guess we can agree
this should never happen on a high-volume server.
- Might this be related to the issues we observed?
- Are there any consequences for the subsequent code (message handling
and reaping children) if we do _not_ start from scratch here?

If we get a "this looks fine, and you won't horribly mess things up"
from the developers, we will give it a try and patch the production
machine as proposed.

Kind regards from Lake Constance, Germany,
Jens Erat


-- 
Jens Erat
Universität Konstanz
Kommunikations-, Informations-, Medienzentrum (KIM)
Abteilung Basisdienste
D-78457 Konstanz
Mail: jens.erat at uni-konstanz.de
