Experiment to test TCP keepalive for pop3d proxies

Wesley Craig wes at umich.edu
Tue Jun 1 14:34:58 EDT 2010


On 01 Jun 2010, at 14:05, Gary Mills wrote:
> Yes, the timeout is set to zero in the pop3d.c file.  However, the
> idle timeout actually works when I test it.  In one window, I do this:
>
>     $ telnet setup01 pop3
>     Trying 130.179.16.64...
>     Connected to setup01.cc.umanitoba.ca.
>     Escape character is '^]'.
>     +OK testing.umanitoba.ca Cyrus POP3 Murder v2.3.8 server ready
>     user gmills
>     +OK Name is a valid mailbox
>     pass XXXXXX
>     +OK Mailbox locked and ready
>     /* wait for the timeout */
>     -ERR [SYS/PERM] Fatal error: Lost connection to input stream
>     Connection to setup01.cc.umanitoba.ca closed by foreign host.
>
> Sure enough, on the server the new pop3d process exits after
> 20 minutes.  While it's waiting, the stack trace looks like this:
>
>     # pstack 13804
>     13804:  pop3d
>      feb1a465 pollsys  (8042da0, 2, 8042e60, 0)
>      feac3b8a pselect  (d, 8042eb4, feb90318, feb90318, 8042e60, 0) + 18e
>      feac3e80 select   (d, 8042eb4, 0, 0, 8042ea8, 0) + 82
>      0808981b prot_select (8189548, ffffffff, 8043f94, 0, 8042ea8, 0) + 44b
>      0805e4ee proxy_check_input (8189548, 8145a30, 8145aa8, 814d718, 814d308, 0) + 5e
>      0805dd74 bitpipe  (8145c38, 0, feb921ec, 0, 8044fed, 8044fed) + c4
>      0805acb7 cmdloop  (8135594, 8138980, 14, 2, 31203133, 312e3033) + 27
>      0805aa53 service_main (1, 8142a50, 8047db8) + 473
>      08062c13 main     (1, 8047db0, 8047db8, feffb818) + a83
>      08059bbd _start   (1, 8047e58, 0, 8047e5e, 8047e69, 8047e7c) + 7d
>
> It stays in the pollsys system call the entire time but finally
> returns with a zero return code.  The process then writes that error
> message to FD 1, has a little dialogue with the back end, and then
> terminates.

Could we get an strace (or the equivalent for your platform) of the
above happening?  With select() waiting forever, the poptimeout can't
be (directly) causing select() to return.  If poptimeout is set on the
backend, that would explain the behavior, since the backend would then
be the one dropping the idle connection.
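
For reference, here is a minimal sketch of the select() semantics I
have in mind.  It is not Cyrus code: wait_readable() and its
"negative means no timeout" convention are made up for illustration.
With a NULL timeout, select() blocks until the descriptor is ready (or
a signal arrives); only a non-NULL timeval lets it return 0 on its own.

```c
#include <stdio.h>
#include <sys/select.h>
#include <unistd.h>

/*
 * Hypothetical helper (not part of Cyrus): wait until fd is readable.
 * timeout_sec < 0 means "no timeout", i.e. select() gets a NULL timeval
 * and can only return once the descriptor is ready or an error occurs.
 * Returns 1 if readable, 0 if the timeout expired, -1 on error
 * (including EINTR, which a real caller would probably retry).
 */
static int wait_readable(int fd, int timeout_sec)
{
    fd_set rset;
    struct timeval tv;
    struct timeval *tvp = NULL;   /* NULL == wait forever */
    int r;

    FD_ZERO(&rset);
    FD_SET(fd, &rset);

    if (timeout_sec >= 0) {
        tv.tv_sec = timeout_sec;
        tv.tv_usec = 0;
        tvp = &tv;
    }

    r = select(fd + 1, &rset, NULL, NULL, tvp);
    if (r < 0)
        return -1;
    return r > 0 ? 1 : 0;
}
```

So if the proxy really passes no timeout, a return from select() has to
be triggered by something arriving (or closing) on one of the
descriptors, not by a timer in the proxy itself.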

> The ones I saw before were not stuck in pollsys() however.  They were
> stuck in a read() from FD 0.  The timeout didn't work on those, but
> the TCP keepalive does get them.  They had a very short stack trace,
> like this:
>
>     # pstack 12708
>     12708:  pop3d -s
>      feb1a5c5 read     (0, 817faf0, b)
>      fec2dfaf sock_read () + 3f
>
> I don't know why the stack trace is so short with these.

Being stuck in read() reflects, I think, a bug in prot.  In particular,
prot ought to be using non-blocking I/O when there are timeouts.
Instead, it relies on select() alone.  But select() only tells the
process that "something" has happened on the descriptor.  It does not
guarantee that a subsequent read() will return immediately.
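
Roughly what I mean, as a sketch only: this is not the prot API, and
read_with_timeout() and its return codes are invented for the example.
The descriptor is put into non-blocking mode, so read() can never hang;
select() is used purely to wait, with a timeout, until it is worth
trying again.

```c
#include <errno.h>
#include <fcntl.h>
#include <sys/select.h>
#include <sys/types.h>
#include <unistd.h>

/*
 * Hypothetical helper (not the Cyrus prot layer).
 * Returns the number of bytes read (0 means EOF), -1 on error,
 * or -2 if nothing became readable within timeout_sec seconds.
 * Note: the timeout restarts each time select() is re-entered.
 */
static ssize_t read_with_timeout(int fd, void *buf, size_t len,
                                 int timeout_sec)
{
    int flags = fcntl(fd, F_GETFL, 0);

    /* Non-blocking mode: read() fails with EAGAIN instead of hanging. */
    if (flags < 0 || fcntl(fd, F_SETFL, flags | O_NONBLOCK) < 0)
        return -1;

    for (;;) {
        fd_set rset;
        struct timeval tv;
        int r;
        ssize_t n = read(fd, buf, len);

        if (n >= 0)
            return n;                 /* data, or EOF when n == 0 */
        if (errno == EINTR)
            continue;                 /* interrupted, just retry */
        if (errno != EAGAIN && errno != EWOULDBLOCK)
            return -1;                /* real error */

        /* Nothing available yet: wait, but only up to the timeout. */
        FD_ZERO(&rset);
        FD_SET(fd, &rset);
        tv.tv_sec = timeout_sec;
        tv.tv_usec = 0;

        r = select(fd + 1, &rset, NULL, NULL, &tv);
        if (r == 0)
            return -2;                /* timed out */
        if (r < 0 && errno != EINTR)
            return -1;
        /* Readable (or interrupted): loop and try read() again. */
    }
}
```

Even if select() reports the descriptor readable and the data then
turns out not to be there, the read() above returns EAGAIN and the loop
goes back to waiting, instead of blocking with no timeout at all.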

To get a better stack trace, I'd recommend using a full debugger
rather than pstack.  Also, you might have better luck using gcore
first.

:wes

