Experiment to test TCP keepalive for pop3d proxies
Wesley Craig
wes at umich.edu
Thu May 27 20:52:18 EDT 2010
Since you mention it...
I took a look at a random frontend, and found 27 or 33 pop processes
from two days ago. I used gdb to get stack traces from 3 samples,
all looked like this:
(gdb) where
#0 0x008007a2 in _dl_sysinfo_int80 () from /lib/ld-linux.so.2
#1 0x008d6ff3 in __read_nocancel () from /lib/tls/libc.so.6
#2 0x0806cb77 in prot_fill (s=0x8ecc148) at prot.c:470
#3 0x0806d924 in prot_fgets (buf=0xbff48160 "", size=2047, s=0x8ecc148)
at prot.c:1186
#4 0x0804f57e in backend_connect (ret_backend=0x0,
server=0x81045a0 "some.server", prot=0x80fad20,
userid=0xbff49cb0 "someuser", cb=0x0, auth_status=0xbff48a40)
at backend.c:477
#5 0x0804c8df in openinbox () at pop3d.c:1635
#6 0x0804d6d9 in cmdloop () at pop3d.c:1227
#7 0x0804e6ad in service_main (argc=2, argv=0x8e6e008, envp=0xbff4ebf8)
at pop3d.c:579
#8 0x08052374 in main (argc=4, argv=0xbff4ebe4, envp=0xbff4ebf8)
at service.c:540
#9 0x0082ee93 in __libc_start_main () from /lib/tls/libc.so.6
#10 0x0804ba81 in ?? ()
(gdb)
In other words, they were all waiting in backend_connect() for the
backend server. That's not what's going on in your case, tho.
Looking at the code in backend_connect(), it's pretty clear that no
timeout is set when retrieving the banner. That's a bug, and it
impacts *every* tool that uses backend_connect() to communicate
within the cluster. It may not be your problem, but it's definitely
*a* problem. A simple:
prot_settimeout( ret->in, 360 );
right after:
ret->in = prot_new(sock, 0);
would probably do the trick (totally untested, to be sure).
For your problem, pop3d calls:
prot_settimeout(popd_in, popd_timeout);
just below where you've inserted the KEEPALIVE. What do you have
poptimeout set to? I wouldn't be surprised by a bug in prot, BTW.
I'm pretty sure I've seen a case where select() is used to implement
the timeout, but once there's *some* input, read() is called in
blocking mode (wrong!).
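To illustrate the pitfall, here is a minimal standalone sketch of a timed read. It is not the Cyrus prot code; the helper name read_with_timeout and its return codes are made up for this example. It waits with poll() and then issues exactly one read(), so a peer that sends a few bytes and then goes quiet cannot hang the caller:

```c
#include <assert.h>
#include <errno.h>
#include <poll.h>
#include <unistd.h>

/* Hypothetical helper, not part of Cyrus: read at most `len` bytes from
 * `fd`, waiting no longer than `timeout_ms` for data to arrive.
 * Returns the byte count (> 0), 0 on EOF, -1 on error, -2 on timeout. */
ssize_t read_with_timeout(int fd, void *buf, size_t len, int timeout_ms)
{
    struct pollfd pfd;
    int ready;
    ssize_t n;

    assert(buf != NULL);

    pfd.fd = fd;
    pfd.events = POLLIN;
    pfd.revents = 0;

    /* Wait for readability. Retrying after EINTR restarts the full
     * timeout, which is acceptable for a sketch. */
    do {
        ready = poll(&pfd, 1, timeout_ms);
    } while (ready < 0 && errno == EINTR);

    if (ready < 0)
        return -1;              /* poll() failed */
    if (ready == 0)
        return -2;              /* nothing arrived in time */

    /* Exactly one read(): it returns whatever is available now.
     * Looping here until `len` bytes arrive would reintroduce an
     * unbounded wait, which is the bug described above. */
    do {
        n = read(fd, buf, len);
    } while (n < 0 && errno == EINTR);

    return n;
}
```

A caller that needs a complete line would call this in a loop and track the remaining time budget itself.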
In any case, if you can get a traceback with gdb for some hung
pop3d's, I'm sure we can pinpoint the issue.
:wes
On 27 May 2010, at 17:52, Gary Mills wrote:
> Ever since I can remember, our Cyrus installation had a problem with
> pop3d processes accumulating on the murder front end server. This
> didn't happen with imapd processes or with pop3d on the back end. A
> couple of weeks ago, I counted 423 pop3d processes on the front end
> but only 37 on the back end. Some of them were months old. All had
> an established TCP connection from a client. Here's a typical stack
> trace:
>
> # pstack 12708
> 12708: pop3d -s
> feb1a5c5 read (0, 817faf0, b)
> fec2dfaf sock_read () + 3f
>
> POP3 timeouts were enabled on both front and back ends, but it seemed
> not to work on the front end. We're still running cyrus-imapd-2.3.8.
> It's possible that this problem is fixed in the current version,
> cyrus-imapd-2.3.16.
>
> In any case, I wanted to try enabling TCP keepalive to see if it had
> any effect on the problem. This only required a few lines of code:
>
> --- pop3d.c-nokeep Wed Apr 11 10:49:59 2007
> +++ pop3d.c Mon May 17 18:17:22 2010
> @@ -494,6 +494,12 @@
> if (getsockname(0, (struct sockaddr *)&popd_localaddr,
> &salen) == 0) {
> popd_haveaddr = 1;
> }
> + /* Set keepalive option */
> + {
> + int oval = 1;
> + (void)setsockopt(0, SOL_SOCKET, SO_KEEPALIVE,
> + (const void *)&oval, sizeof(oval));
> + }
> }
>
> /* other params should be filled in */
>
> A complete installation would include a configuration setting to
> enable or disable TCP keepalive, along with ways to set keepalive
> values that exist in many operating systems. This was just a test,
> but it was quite impressive. `pop3d' processes no longer accumulated
> on the front end, but were similar in number to the ones on the back
> end. The cause must have been clients that disappeared without
> closing their TCP connections. The TCP keepalive mechanism now does
> this for them, after about half an hour of idleness.
>
> Does anyone know if this problem has been solved by a timeout in
> later Cyrus versions? That's actually a better solution. It does
> only seem to happen when pop3d runs on a murder front end, relaying
> connections to a back end. If it hasn't been solved, I'll proceed
> with the keepalive solution. Otherwise, I'll plan for an upgrade.
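As a rough sketch of what a configurable keepalive could look like (not taken from any Cyrus release): the helper name enable_keepalive and its parameters are hypothetical, and the TCP_KEEPIDLE, TCP_KEEPINTVL and TCP_KEEPCNT options are applied only where the platform's headers define them. A value of zero or less leaves the system default in place.

```c
#include <assert.h>
#include <netinet/in.h>
#include <netinet/tcp.h>
#include <sys/socket.h>

/* Hypothetical helper: turn on SO_KEEPALIVE for `fd` and, where the OS
 * exposes them, set the idle time, probe interval and probe count.
 * Returns 0 on success, -1 if any setsockopt() call fails. */
int enable_keepalive(int fd, int idle_secs, int interval_secs, int probes)
{
    int on = 1;

    assert(fd >= 0);

    if (setsockopt(fd, SOL_SOCKET, SO_KEEPALIVE, &on, sizeof(on)) < 0)
        return -1;

#ifdef TCP_KEEPIDLE
    /* Seconds of idleness before the first probe is sent. */
    if (idle_secs > 0 &&
        setsockopt(fd, IPPROTO_TCP, TCP_KEEPIDLE,
                   &idle_secs, sizeof(idle_secs)) < 0)
        return -1;
#else
    (void)idle_secs;
#endif

#ifdef TCP_KEEPINTVL
    /* Seconds between unanswered probes. */
    if (interval_secs > 0 &&
        setsockopt(fd, IPPROTO_TCP, TCP_KEEPINTVL,
                   &interval_secs, sizeof(interval_secs)) < 0)
        return -1;
#else
    (void)interval_secs;
#endif

#ifdef TCP_KEEPCNT
    /* Unanswered probes allowed before the connection is dropped. */
    if (probes > 0 &&
        setsockopt(fd, IPPROTO_TCP, TCP_KEEPCNT,
                   &probes, sizeof(probes)) < 0)
        return -1;
#else
    (void)probes;
#endif

    return 0;
}
```

In pop3d this would be called on descriptor 0 at the point where the patch above calls setsockopt(), with the three values read from whatever configuration options are eventually added.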