One more attempt: stuck processes

Fri Nov 16 07:54:24 EST 2007

On Nov 16, 2007 12:36 PM, Sebastian Hagedorn <Hagedorn at uni-koeln.de> wrote:
> --On 16. November 2007 11:27:09 +0100 Sebastian Hagedorn
> <hagedorn at uni-koeln.de> wrote:
>
> >> 1) Since it only happens on dialup connections, could it be that the
> >> dialin router at the providers end sends TCP/RST when a client hangs up
> >> and those packets are filtered somewhere, maybe on your firewall?
> >
> > OK, let's run with that one.
> >
> > a) We don't really have a firewall, we only use ACLs on the Cisco
> > routers. You can't even filter TCP/RST there.
> >
> > b) Even *if* a TCP/RST had been dropped, lost or whatever, the server
> > *still* should timeout eventually!
>
> I just had a discussion with a colleague regarding this. He made two
> observations:
>
> 1. In the absence of the SO_KEEPALIVE option it is entirely possible that a
> TCP connection remains ESTABLISHED even when the other side has gone.

I said that the socket should time out, but this is true only when the
protocol (TCP here) requires a response (usually an ACK) or during
connection establishment. Otherwise the connection stays open
indefinitely until something happens. A router doing NAT can drop a
connection that has been idle for too long, because it has to maintain
a NAT table and clean it up from time to time. This is where keepalive
becomes useful.
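
To illustrate what SO_KEEPALIVE does at the socket level, here is a
minimal Python sketch (not Cyrus code). The helper name
enable_keepalive and the timer values are assumptions chosen for the
example. The TCP_KEEP* options are platform-specific, so they are set
only where the socket module exposes them. Without tuning, Linux
typically waits two hours before sending the first probe.

```python
import socket

def enable_keepalive(sock, idle=600, interval=60, probes=5):
    """Turn on TCP keepalive; tune the timers where the platform allows it."""
    # Ask the kernel to probe the peer when the connection is idle.
    sock.setsockopt(socket.SOL_SOCKET, socket.SO_KEEPALIVE, 1)
    # Hypothetical timer values: first probe after `idle` seconds, then one
    # probe every `interval` seconds, give up after `probes` failures.
    for name, value in (("TCP_KEEPIDLE", idle),
                        ("TCP_KEEPINTVL", interval),
                        ("TCP_KEEPCNT", probes)):
        if hasattr(socket, name):  # not every platform defines these
            sock.setsockopt(socket.IPPROTO_TCP, getattr(socket, name), value)

sock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
enable_keepalive(sock)
# A non-zero value means keepalive is switched on for this socket.
print(sock.getsockopt(socket.SOL_SOCKET, socket.SO_KEEPALIVE) != 0)
sock.close()
```

With this enabled, a peer that vanished without sending FIN or RST is
eventually detected by the kernel, and a blocked read returns an error
instead of waiting forever.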

>
> This may not be a solution to this particular problem, but it made me
> wonder why Cyrus does *not* use SO_KEEPALIVE. Is there a downside to it?

Cyrus already has a built-in timeout, so it seems a little
contradictory to actively keep the connection alive until Cyrus drops
it itself! It is the client's job to actively maintain the connection,
if it wants it!
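
As a rough illustration of a server-side inactivity timer, here is a
minimal Python sketch. It is not how Cyrus implements its timeout; the
function name read_with_idle_timeout and the 0.2 second value are
assumptions for the example. The idea is only that the server bounds
every read, so a silent client cannot keep a process blocked forever.

```python
import socket

def read_with_idle_timeout(conn, idle_seconds, bufsize=4096):
    """Return received bytes, b"" if the peer closed, or None on idle timeout."""
    conn.settimeout(idle_seconds)  # bound how long recv() may block
    try:
        return conn.recv(bufsize)
    except socket.timeout:
        # The client stayed silent for too long; the caller can log out.
        return None

# A connected socket pair stands in for a server/client connection.
server, client = socket.socketpair()
print(read_with_idle_timeout(server, 0.2))  # client is silent
client.sendall(b"NOOP\r\n")
print(read_with_idle_timeout(server, 0.2))  # client sent a command
server.close()
client.close()
```

The first call returns None because nothing arrives within the idle
period; the second returns the bytes the client sent. A real server
would close the connection in the None case.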

>
> 2. The stack trace looks garbled:
>
> (gdb) bt
> #0  0x0079f41e in __read_nocancel () from /lib/tls/libc.so.6
> #1  0x00d0b2f7 in BIO_new_socket () from /lib/libcrypto.so.4
> #2  0x00d092b2 in BIO_read () from /lib/libcrypto.so.4
> #3  0x005dae13 in ssl23_read_bytes () from /lib/libssl.so.4
> #4  0x005d9c51 in ssl23_get_client_hello () from /lib/libssl.so.4
> #5  0x005d9712 in ssl23_accept () from /lib/libssl.so.4
> #6  0x005ddc9a in SSL_accept () from /lib/libssl.so.4
> #7  0x08052cb3 in shut_down ()
> #8  0x0804e513 in shut_down ()
> #9  0x0804d58c in ?? ()
> #10 0x00000001 in ?? ()
> #11 0x082ee848 in ?? ()
> #12 0x00000000 in ?? ()
>
> He suggested that the trace is unreliable. Perhaps a bug in RHEL 3's
> version of OpenSSL messes up the stack. That would also explain why nobody
> else seems to have this problem.
>
> I think I will try one more approach: I reverted cyrus.conf to not use "-U
> 1" anymore, so that processes should be reused. I will strace one of the
> pop3d processes in the hope that it gets stuck. That way I should be able
> to see where things go wrong. If the process terminates normally I will try
> with another one. If that doesn't go anywhere, I guess I'll drop this

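You could try replacing imapd with a home-made wrapper script,
something like the following (run it in the directory that holds the
Cyrus binaries; if imapd_ cannot be found via PATH, use its full path):

mv imapd imapd_
# Single quotes keep $$ and "$@" literal, so they expand when the wrapper runs
echo '#!/bin/sh' > imapd
echo 'exec strace -o /tmp/imapd.$$ imapd_ "$@"' >> imapd
chmod a+x imapd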

> investigation. We will upgrade to RHEL 5 some time next year, so hopefully
> that will bring new bugs :-)

Sorry, but I don't understand what you are complaining about!
Is it because the IMAP or POP client is losing its connection and
this disturbs the user, or just because you are getting some sleeping
processes? Or both :-)

Do you have a "timeout" option in your imapd.conf to force the
IMAP/POP server to autologout?

Regards.

Alain
>
> --
>      .:.Sebastian Hagedorn - RZKR-R1 (Gebäude 52), Zimmer 18.:.
> Zentrum für angewandte Informatik - Universitätsweiter Service RRZK
> .:.Universität zu Köln / Cologne University - ✆ +49-221-478-5587.:.
>                    .:.:.:.Skype: shagedorn.:.:.:.

-- 
Alain Spineux
aspineux gmail com
May the sources be with you