negative pop3d worker count

Jules Agee julesa at pcf.com
Thu Apr 7 18:35:00 EDT 2005


Wow, I'm surprised this hasn't been fixed upstream yet... I had this 
problem about a year ago.

The master process runs in a loop, checking it's communication pipes 
with all the different types of child processes it spawns, keeping track 
of the child processes, and distributing work to them. The child 
processes send messages to the master like "I'm ready to handle a new 
connection" or "I'm busy" etc. via pipes.

What was happening to me, and looks like is happening to you, is that 
the master gets a sigchld signal, and executes the reap_child() process 
*while there are still messages from the child in the pipe*. Then when 
it goes to check the pipe, it finds this "busy" or "ready" message from 
the dead child process. But because it already executed reap_child(), it 
no longer has a record of this process in its table of children. So it 
assumes the child process is so brand-new that it isn't in the child 
table yet (thus the "not registered yet" log message) and creates a new 
entry in the child table with status UNKNOWN.

Contributing to the problem was the fact that the master process was 
only dealing with one message from each service type each time it 
looped. From master.c:

	    if (FD_ISSET(x, &rfds)) {
		r = read(x, &msg, sizeof(msg));
		if (r != sizeof(msg)) {
		    syslog(LOG_ERR, "got incorrectly sized response from child: %x", i);
		    continue;
		}
		
		process_msg(i, &msg);
	    }

Henrique gave me a patch which fixed this behavior, replacing the above 
segment with the code between #####:

#####
           if (FD_ISSET(x, &rfds)) {
                 while ((r = read_msg(x, &msg)) == 0)
                     process_msg(&Services[i], &msg);

                 if (r == 2) {
                     syslog(LOG_ERR,
                         "got incorrectly sized response from child: 
%x", i);
                     continue;
                 }
                 if (r < 0) {
                     syslog(LOG_ERR,
                         "error while receiving message from child %x: 
%m", i);
                     continue;
                 }
             }
#####

And adding this new function to master.c:

int read_msg(int fd, struct notify_message *msg)
{
         ssize_t r;
         size_t off = 0;
         int s = sizeof(*msg);

         while (s > 0) {
             do
                 r = read(fd, msg + off, s);
             while ((r == -1) && (errno == EINTR));
             if (r <= 0) break;
             s -= r;
             off += r;
         }
         if ( ((r == 0) && (off == 0)) ||
              ((r == -1) && (errno == EAGAIN)) )
             return 1;
         if (r == -1) return -1;
         if (s != 0) return 2;
         return 0;
}




Wolfgang Powisch wrote:
> Hi,
> 
> Some days ago I had a strange problem with pop3d, suddenly not accepting
> connections anymore. pop3d was set with maxchilds=60 in cyrus.conf and
> we seamed to have really reached 60 simultanous pop3 connections at this
> point.
>  From the logs, the master process complaine about UNKNOWN states and
> a negative number of pop3d worker processes, and this negtive number was
> increasing continuously.
> Some minutes after restarting cyrus completely, the same again.
> Then I increased pop3 maxchilds to 100, restarted cyrus and no problem
> anymore.
> 
> I'm using cyrus-imapd-2.2.12 with perdition as pop3/imap frontend.
> What should be the expected behaviour when reaching the maxchilds limit.
> What's the reason for the negative worker number ?
> 
> 
> 
> Part of the logs:
> 
> Apr  5 15:23:19 imap01 master[18697]: about to exec 
> /usr/lib/cyrus-imapd/pop3d
> Apr  5 15:23:19 imap01 master[4065]: service pop3 pid 17896: while 
> trying to process message 0x3: not registered yet
> Apr  5 15:23:20 imap01 master[4065]: service pop3 pid 17896 in UNKNOWN 
> state: processing message 0x3
> Apr  5 15:23:21 imap01 master[4065]: service pop3 pid 17896 in UNKNOWN 
> state: now in BUSY state and serving connection
> Apr  5 15:23:22 imap01 master[4065]: process 17383 exited, status 0
> Apr  5 15:23:23 imap01 master[4065]: process 18647 exited, status 0
> Apr  5 15:23:25 imap01 master[4065]: process 18650 exited, status 0
> Apr  5 15:23:26 imap01 master[4065]: process 18697 exited, status 0
> Apr  5 15:23:27 imap01 master[4065]: process 17734 exited, status 0
> Apr  5 15:23:28 imap01 master[18768]: about to exec 
> /usr/lib/cyrus-imapd/pop3d
> Apr  5 15:23:28 imap01 master[4065]: pop3 has -1 workers?!?
> 
> 
> 
> regards,
> 

---
Cyrus Home Page: http://asg.web.cmu.edu/cyrus
Cyrus Wiki/FAQ: http://cyruswiki.andrew.cmu.edu
List Archives/Info: http://asg.web.cmu.edu/cyrus/mailing-list.html




More information about the Info-cyrus mailing list