Experiment to test TCP keepalive for pop3d proxies

Gary Mills mills at cc.umanitoba.ca
Tue Jun 1 14:05:35 EDT 2010


On Fri, May 28, 2010 at 03:49:41PM -0400, Wesley Craig wrote:
> On 28 May 2010, at 12:42, Gary Mills wrote:
> > 0805e4ee proxy_check_input (815d168, 81a7228, 819e520, 81a3d60, 81a7700, 0) + 5e
> 
> That last argument to proxy_check_input()?  It's the timeout.   
> Setting it to 0 means "don't time out".  I'm sure the theory is that  
> the underlying select() will return when the backend's poptimeout  
> happens, and the connection is closed.  It would be good to know why  
> that's not happening as expected.  Of course, the fact that bitpipe()  
> isn't checking the return value of prot_flush() is also a bug.

Yes, the timeout is set to zero in the pop3d.c file.  However, the
idle timeout actually works when I test it.  In one window, I do this:

    $ telnet setup01 pop3
    Trying 130.179.16.64...
    Connected to setup01.cc.umanitoba.ca.
    Escape character is '^]'.
    +OK testing.umanitoba.ca Cyrus POP3 Murder v2.3.8 server ready
    user gmills
    +OK Name is a valid mailbox
    pass XXXXXX
    +OK Mailbox locked and ready
    /* wait for the timeout */
    -ERR [SYS/PERM] Fatal error: Lost connection to input stream
    Connection to setup01.cc.umanitoba.ca closed by foreign host.

Sure enough, on the server the new pop3d process exits after
20 minutes.  While it's waiting, the stack trace looks like this:

    # pstack 13804
    13804:  pop3d
     feb1a465 pollsys  (8042da0, 2, 8042e60, 0)
     feac3b8a pselect  (d, 8042eb4, feb90318, feb90318, 8042e60, 0) + 18e
     feac3e80 select   (d, 8042eb4, 0, 0, 8042ea8, 0) + 82
     0808981b prot_select (8189548, ffffffff, 8043f94, 0, 8042ea8, 0) + 44b
     0805e4ee proxy_check_input (8189548, 8145a30, 8145aa8, 814d718, 814d308, 0) + 5e
     0805dd74 bitpipe  (8145c38, 0, feb921ec, 0, 8044fed, 8044fed) + c4
     0805acb7 cmdloop  (8135594, 8138980, 14, 2, 31203133, 312e3033) + 27
     0805aa53 service_main (1, 8142a50, 8047db8) + 473
     08062c13 main     (1, 8047db0, 8047db8, feffb818) + a83
     08059bbd _start   (1, 8047e58, 0, 8047e5e, 8047e69, 8047e7c) + 7d

It stays in the pollsys() system call the entire time, but finally
returns with a zero return code, which means the timeout expired with
no descriptor ready.  The process then writes that error message to
FD 1, has a little dialogue with the back end, and then terminates.
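
For illustration only (this is not Cyrus code): a minimal Python sketch
of the same pattern, where select() returns with nothing ready once its
timeout expires.  The helper name wait_for_input is hypothetical, the
socket pair merely stands in for a client and a proxy, and the
0.2-second timeout stands in for the 20-minute idle timeout.

```python
import select
import socket


def wait_for_input(sock, timeout):
    """Return True if sock becomes readable within timeout seconds.

    Passing timeout=None blocks until input arrives, the way select()
    does when it is given a null timeout pointer.
    """
    readable, _, _ = select.select([sock], [], [], timeout)
    return bool(readable)


# Hypothetical stand-ins for the two ends of a client connection.
client, proxy = socket.socketpair()

# Nothing has been sent, so select() gives up after the timeout.
idle = wait_for_input(proxy, 0.2)

# Once the client sends something, select() returns immediately.
client.sendall(b"NOOP\r\n")
ready = wait_for_input(proxy, 0.2)

client.close()
proxy.close()
print(idle, ready)
```

The point of the sketch is only that a process waiting in select() or
poll() has a way out when the peer goes quiet, whereas one blocked in a
plain read() does not.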

The processes I saw before were not stuck in pollsys(), however.  They
were stuck in a read() from FD 0.  The idle timeout didn't work on
those, but TCP keepalive does catch them.  They had a very short stack
trace, like this:

    # pstack 12708
    12708:  pop3d -s
     feb1a5c5 read     (0, 817faf0, b)
     fec2dfaf sock_read () + 3f

I don't know why the stack trace is so short with these.
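
On the keepalive that catches the processes blocked in read(): as a
hedged sketch, assuming an ordinary TCP socket and not taken from the
Cyrus source, keepalive is switched on per socket with the SO_KEEPALIVE
option.  The helper name enable_keepalive is hypothetical.  The probe
timing is governed by system settings, and the default idle period is
often long (two hours on many systems), so a dead peer may not be
noticed quickly.

```python
import socket


def enable_keepalive(sock):
    """Turn on TCP keepalive probes and return the resulting option value."""
    sock.setsockopt(socket.SOL_SOCKET, socket.SO_KEEPALIVE, 1)
    return sock.getsockopt(socket.SOL_SOCKET, socket.SO_KEEPALIVE)


sock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)

# Keepalive is off by default on a freshly created socket.
before = sock.getsockopt(socket.SOL_SOCKET, socket.SO_KEEPALIVE)

# A nonzero value afterwards means the kernel will probe the idle peer.
after = enable_keepalive(sock)

sock.close()
print(before, after)
```

With the option set, the kernel probes an idle connection, and a read()
that is blocked on a vanished peer eventually fails with an error
instead of waiting forever.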

-- 
-Gary Mills-        -Unix Group-        -Computer and Network Services-

