Severe issue with high-volume IMAP server stalling, possible patch

Fri Feb 5 22:19:42 EST 2016

Thanks for the report.  I see your point: there is definitely a DoS possibility here.

One thing I would suggest is running something like nginx in front of Cyrus, even if you only need it for authentication checks.

I believe this is the correct patch.  If the select was interrupted then it needs to be re-run, but there is no need to repeat all the other work, so:

-        r = myselect(maxfd, &rfds, NULL, NULL, tvptr);
-        if (r == -1 && errno == EAGAIN) continue;
-        if (r == -1 && errno == EINTR) continue;
+        do {
+            r = myselect(maxfd, &rfds, NULL, NULL, tvptr);
+        } while (r == -1 && (errno == EAGAIN || errno == EINTR));
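
For anyone who wants to try the pattern outside of master.c, here is a minimal standalone sketch.  It is not the Cyrus code: select_retry, select_fn and flaky_select are hypothetical names made up for this example, and flaky_select only simulates interruptions so the retry can be observed without real signals.  Unlike the patch above, the sketch passes a fresh copy of the timeout on every attempt, because on some platforms (Linux, for example) the timeout value is undefined after a failed select().

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <sys/select.h>

/* Signature shared by select() and the stand-in below. */
typedef int (*select_fn)(int, fd_set *, fd_set *, fd_set *, struct timeval *);

/* Hypothetical helper (not Cyrus code): retry only the select call while it
 * fails with EINTR or EAGAIN; any other result goes back to the caller. */
static int select_retry(select_fn sel, int nfds, fd_set *rfds,
                        const struct timeval *timeout)
{
    int r;

    assert(sel != NULL);
    do {
        /* Fresh copy per attempt: the full timeout restarts after each
         * interruption, which is a simplification in this sketch. */
        struct timeval tv, *tvptr = NULL;
        if (timeout) {
            tv = *timeout;
            tvptr = &tv;
        }
        r = sel(nfds, rfds, NULL, NULL, tvptr);
    } while (r == -1 && (errno == EINTR || errno == EAGAIN));

    return r;
}

/* Number of simulated interruptions still to be delivered. */
static int interruptions_left = 3;

/* Stand-in for an interrupted select(): fails with EINTR three times,
 * then delegates to the real select(). */
static int flaky_select(int nfds, fd_set *r, fd_set *w, fd_set *e,
                        struct timeval *tv)
{
    if (interruptions_left > 0) {
        interruptions_left--;
        errno = EINTR;
        return -1;
    }
    return select(nfds, r, w, e, tv);
}
```

Calling select_retry(flaky_select, ...) goes through the simulated interruptions inside the small loop and only then returns, so the code after the call runs once per real result instead of being skipped.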

I'm quite confident this is correct, so I will roll it out to FastMail's testing servers today.  If they are happy, I'll push it upstream with backports for 2.5 and 2.4.

Cheers,

Bron.

On Sat, Feb 6, 2016, at 03:58, Jens Erat via Cyrus-devel wrote:
> Dear Cyrus maintainers,
> 
> the very short version: we probably found an issue resulting in the
> master stalling high-volume IMAP servers. A possible patch is attached,
> but needs some more discussion.
> 
> - - -
> 
> At the University of Konstanz, we're running a rather large IMAP server
> for about 18k users on some large Oracle/Solaris machines. We don't have
> a murder/cluster. The machine is overprovisioned to last for some years
> and is running blazingly fast during normal operation. We're running on
> Solaris 11 and ZFS; the mailboxes are stored on network storage
> attached through Fibre Channel (and we don't observe any I/O issues). The
> dual-socket machine is equipped with 256 GB of memory. We tried compiling
> with both GCC 4.8 and Sun CC (from Solaris Studio).
> 
> From time to time, however, it stalled completely -- usually at times
> with a high fork rate. We observed this in the morning, when lots of
> people started reading their mail, and also during a DDoS attack on our
> network, which made the firewall drop lots of connections (with the mail
> clients trying to reconnect instantly).
> 
> As a result, the mail server was still running fine for everybody still
> holding a connection, but denied service to pretty much all newly
> connecting clients. We had to restart the whole Cyrus IMAPd service to
> recover from this issue.
> 
> 
> We started getting a clue of what was going wrong when we added debug
> information and some ctable verifications, as we suspected something
> was wrong with how the active/ready workers are counted. Finally, we
> also took time checkpoints during the master loop to see _where_ it was
> stuck. It seemed to be stuck while jumping back from the very last
> command (getting a timestamp in the loop) to the very first (also
> getting a timestamp). The code looked roughly like this:
> 
>    struct timeval profiling[11];
>    gettimeofday(&profiling[10], 0); // LOOP DONE
>    for (;;) {
>        gettimeofday(&profiling[0], 0); // INIT
> 
>        // Start scheduled tasks
> 
>        gettimeofday(&profiling[1], 0); // SCHEDULE
> 
>        // Other master loop tasks: spawn, message handling, ...
> 
>        gettimeofday(&profiling[9], 0); // SNMP_ALARMS
> 
>        syslog(LOG_INFO,
>            "MASTERLOOP_PROFILING: %f,%f,%f,%f,%f,%f,%f,%f,%f,%f",
>            timesub(&profiling[10], &profiling[0]),
>            timesub(&profiling[0], &profiling[1]),
>            timesub(&profiling[1], &profiling[2]),
>            timesub(&profiling[2], &profiling[3]),
>            timesub(&profiling[3], &profiling[4]),
>            timesub(&profiling[4], &profiling[5]),
>            timesub(&profiling[5], &profiling[6]),
>            timesub(&profiling[6], &profiling[7]),
>            timesub(&profiling[7], &profiling[8]),
>            timesub(&profiling[8], &profiling[9]));
> 
>        gettimeofday(&profiling[10], 0); // LOOP DONE
>    }
> 
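
The timesub() helper is not shown in the snippet above.  A minimal sketch of what such a helper might look like follows; the argument order (start first, end second) and the result in seconds as a double are assumptions based on how it is called and on the %f format, not the original implementation.

```c
#include <sys/time.h>

/* Assumed helper, not taken from the original message: seconds elapsed
 * from *start to *end, returned as a double to match the %f format. */
static double timesub(const struct timeval *start, const struct timeval *end)
{
    return (double)(end->tv_sec - start->tv_sec)
         + (double)(end->tv_usec - start->tv_usec) / 1e6;
}
```

With a helper like this, each value in the log line is the time spent between two consecutive checkpoints.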
> The results really puzzled us. Why would jumping back from the end of a
> loop to the beginning take multiple minutes? In the end, by adding
> another log line at the _beginning_ of the loop, we realized that the
> loop was indeed running very often -- but simply never completed.
> `profiling[10]` never changed while the server was stuck.
> 
> Knowing this, the problem was obvious: when `select`ing the sockets and
> waiting for messages, the server got interrupted all the time. Each
> interruption restarted the whole loop from scratch, skipping message
> handling and thus the accounting of available worker daemons, so too
> few of them got spawned.
> 
>         r = myselect(maxfd, &rfds, NULL, NULL, tvptr);
>         if (r == -1 && errno == EAGAIN) continue;
>         if (r == -1 && errno == EINTR) continue;
>         if (r == -1) {
>             /* uh oh */
>             fatalf(1, "select failed: %m");
>         }
> 
> This is a common pattern if you want to `sleep` or `select`: try to do
> it, and if you get interrupted or are told to try again, `continue` and
> do it again. But usually, this is enclosed in a tiny loop.
> 
> On a test machine, we were not able to completely reproduce the issue,
> but we were able to observe the message handling part of the loop not
> running for a dozen or more master loop runs. A small patch that carries
> on with the current loop instead of starting over from scratch
> completely resolved the issue (for `master/master.c` in the Cyrus IMAP
> 2.5.7 release tarball):
> 
> 2481,2483c2481
> < 	if (r == -1 && errno == EAGAIN) continue;
> < 	if (r == -1 && errno == EINTR) continue;
> < 	if (r == -1) {
> ---
> > 	if (r == -1 && !(errno == EAGAIN || errno == EINTR)) {
> 
> This ran fine on our test setup, but we're still wary of patching one of
> the most central flows of logic in a very relevant service.
> 
> What is your opinion:
> 
> - Message handling was sometimes stuck for minutes. I guess we can agree
> this should never happen on a high volume server.
> - Might this be related to the issues we observed?
> - Are there any consequences to the subsequent code (message handling
> and reaping children) if we do _not_ start from scratch here?
> 
> If we get a "this looks fine, and you won't horribly mess up" from the
> developers, we will give it a try and patch the production machine as
> proposed.
> 
> Kind regards from Lake Constance, Germany,
> Jens Erat
> 
> 
> -- 
> Jens Erat
> Universität Konstanz
> Kommunikations-, Informations-, Medienzentrum (KIM)
> Abteilung Basisdienste
> D-78457 Konstanz
> Mail: jens.erat at uni-konstanz.de
> 

-- 
  Bron Gondwana
  brong at fastmail.fm