Experiment to test TCP keepalive for pop3d proxies

Thu May 27 20:52:18 EDT 2010

Since you mention it...

I took a look at a random frontend, and found 27 or 33 pop processes  
from two days ago.  I used gdb to get stack traces from 3 samples,  
all looked like this:

(gdb) where
#0  0x008007a2 in _dl_sysinfo_int80 () from /lib/ld-linux.so.2
#1  0x008d6ff3 in __read_nocancel () from /lib/tls/libc.so.6
#2  0x0806cb77 in prot_fill (s=0x8ecc148) at prot.c:470
#3  0x0806d924 in prot_fgets (buf=0xbff48160 "", size=2047, s=0x8ecc148)
     at prot.c:1186
#4  0x0804f57e in backend_connect (ret_backend=0x0,
     server=0x81045a0 "some.server", prot=0x80fad20,
     userid=0xbff49cb0 "someuser", cb=0x0, auth_status=0xbff48a40)
     at backend.c:477
#5  0x0804c8df in openinbox () at pop3d.c:1635
#6  0x0804d6d9 in cmdloop () at pop3d.c:1227
#7  0x0804e6ad in service_main (argc=2, argv=0x8e6e008, envp=0xbff4ebf8)
     at pop3d.c:579
#8  0x08052374 in main (argc=4, argv=0xbff4ebe4, envp=0xbff4ebf8)
     at service.c:540
#9  0x0082ee93 in __libc_start_main () from /lib/tls/libc.so.6
#10 0x0804ba81 in ?? ()
(gdb)

In other words, they were all waiting in backend_connect() for the  
backend server.  That's not what's going on in your case, tho.

Looking at the code in backend_connect(), it's pretty clear that no  
timeout is set when retrieving the banner.  That's a bug, and it  
impacts *every* tool that uses backend_connect() to communicate  
within the cluster.  It may not be your problem, but it's definitely  
*a* problem.  A simple:

	prot_settimeout( ret->in, 360 );

right after:

	ret->in = prot_new(sock, 0);

would probably do the trick (totally untested, to be sure).

For your problem, pop3d calls:

	prot_settimeout(popd_in, popd_timeout);

just below where you've inserted the KEEPALIVE.  What do you have
poptimeout set to?  I wouldn't be surprised by a bug in prot, BTW.
I'm pretty sure I've seen a case where select() is used to implement
the timeout, but once there's *some* input, read() is then called in
blocking mode (wrong!).
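
To make the select()/read() concern concrete, here is a minimal sketch of a
timed read that waits for readiness and then issues exactly one read(), so
a caller that needs more data has to wait again with a fresh timeout rather
than fall into an unbounded blocking read.  This is not Cyrus code:
read_with_timeout() and its -2 timeout return value are hypothetical names
and conventions chosen for illustration, and it uses poll() rather than the
select() mentioned above.

```c
#include <errno.h>
#include <poll.h>
#include <unistd.h>

/*
 * Hypothetical helper, not part of Cyrus: wait up to timeout_ms for fd to
 * become readable, then perform a single read().
 *
 * Returns the number of bytes read (0 on EOF), -1 on error, or -2 if the
 * timeout expired with nothing to read.
 */
ssize_t read_with_timeout(int fd, void *buf, size_t len, int timeout_ms)
{
    struct pollfd pfd = { .fd = fd, .events = POLLIN };
    int rc;

    /* Retry if a signal interrupts the wait.  Note that this restarts the
     * full timeout; a stricter version would track a deadline. */
    do {
        rc = poll(&pfd, 1, timeout_ms);
    } while (rc < 0 && errno == EINTR);

    if (rc < 0)
        return -1;      /* poll() failed */
    if (rc == 0)
        return -2;      /* timed out, nothing readable */

    /* One read per successful wait: it returns whatever is available
     * (or 0 on EOF) instead of blocking until len bytes arrive. */
    return read(fd, buf, len);
}
```

A caller that wants a whole line (a banner, say) would loop on this helper
until it sees the newline, treating -2 as "peer went quiet, give up".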

In any case, if you can get a traceback with gdb for some hung  
pop3d's, I'm sure we can pinpoint the issue.

:wes

On 27 May 2010, at 17:52, Gary Mills wrote:
> Ever since I can remember, our Cyrus installation had a problem with
> pop3d processes accumulating on the murder front end server.  This
> didn't happen with imapd processes or with pop3d on the back end.  A
> couple of weeks ago, I counted 423 pop3d processes on the front end
> but only 37 on the back end.  Some of them were months old.  All had
> an established TCP connection from a client.  Here's a typical stack
> trace:
>
>     # pstack 12708
>     12708:  pop3d -s
>      feb1a5c5 read     (0, 817faf0, b)
>      fec2dfaf sock_read () + 3f
>
> POP3 timeouts were enabled on both front and back ends, but it seemed
> not to work on the front end.  We're still running cyrus-imapd-2.3.8.
> It's possible that this problem is fixed in the current version,
> cyrus-imapd-2.3.16.
>
> In any case, I wanted to try enabling TCP keepalive to see if it had
> any effect on the problem.  This only required a few lines of code:
>
>     --- pop3d.c-nokeep      Wed Apr 11 10:49:59 2007
>     +++ pop3d.c     Mon May 17 18:17:22 2010
>     @@ -494,6 +494,12 @@
>             if (getsockname(0, (struct sockaddr *)&popd_localaddr, &salen) == 0) {
>                 popd_haveaddr = 1;
>             }
>     +       /* Set keepalive option */
>     +       {
>     +         int oval = 1;
>     +         (void)setsockopt(0, SOL_SOCKET, SO_KEEPALIVE, (const void *)&oval,
>     +                    sizeof(oval));
>     +       }
>          }
>
>          /* other params should be filled in */
>
> A complete installation would include a configuration setting to
> enable or disable TCP keepalive, along with ways to set keepalive
> values that exist in many operating systems.  This was just a test,
> but it was quite impressive.  `pop3d' processes no longer accumulated
> on the front end, but were similar in number to the ones on the back
> end.  The cause must have been clients that disappeared without
> closing their TCP connections.  The TCP keepalive mechanism now does
> this for them, after about half an hour of idleness.
>
> Does anyone know if this problem has been solved by a timeout in
> later Cyrus versions?  That's actually a better solution.  It does
> only seem to happen when pop3d runs on a murder front end, relaying
> connections to a back end.  If it hasn't been solved, I'll proceed
> with the keepalive solution.  Otherwise, I'll plan for an upgrade.
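
As a rough illustration of the "ways to set keepalive values" mentioned in
the quoted message, the sketch below enables SO_KEEPALIVE and then applies
per-socket tuning where the platform offers it.  Assumptions: the helper
name enable_keepalive() and its parameters are hypothetical, and the
TCP_KEEPIDLE, TCP_KEEPINTVL and TCP_KEEPCNT options are the Linux names;
other systems may spell them differently or only offer system-wide
settings, which is why each one is guarded by #ifdef.

```c
#include <netinet/in.h>
#include <netinet/tcp.h>
#include <sys/socket.h>

/*
 * Hypothetical helper: turn on TCP keepalive for fd and, where the
 * platform supports it, override the system-wide defaults.
 *
 *   idle_secs     - idle time before the first probe is sent
 *   interval_secs - time between unanswered probes
 *   probes        - unanswered probes before the connection is dropped
 *
 * Returns 0 on success, -1 if any setsockopt() call fails.
 */
int enable_keepalive(int fd, int idle_secs, int interval_secs, int probes)
{
    int on = 1;

    if (setsockopt(fd, SOL_SOCKET, SO_KEEPALIVE, &on, sizeof(on)) < 0)
        return -1;

#ifdef TCP_KEEPIDLE
    if (setsockopt(fd, IPPROTO_TCP, TCP_KEEPIDLE,
                   &idle_secs, sizeof(idle_secs)) < 0)
        return -1;
#endif
#ifdef TCP_KEEPINTVL
    if (setsockopt(fd, IPPROTO_TCP, TCP_KEEPINTVL,
                   &interval_secs, sizeof(interval_secs)) < 0)
        return -1;
#endif
#ifdef TCP_KEEPCNT
    if (setsockopt(fd, IPPROTO_TCP, TCP_KEEPCNT,
                   &probes, sizeof(probes)) < 0)
        return -1;
#endif

    /* Silence unused-parameter warnings where the options are absent. */
    (void)idle_secs; (void)interval_secs; (void)probes;
    return 0;
}
```

In a service like pop3d the call would be made on the client socket
(descriptor 0 in the patch above), for example
enable_keepalive(0, 600, 60, 5), with the three values coming from
configuration rather than being hard-coded.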