timsieved and lmtpd proxies get stuck after backend failure

Janne Peltonen janne.peltonen at helsinki.fi
Mon Jan 23 02:04:29 EST 2012


Hi!

Last Friday, one of the nodes in our Cyrus cluster got stuck, with apparently
only some parts of the network layer still up (it answered ping and carried
the RedHat cluster token around). The Cyrus services were down for ~20 minutes
before being started on another node.

What surprised me was the behaviour of the timsieved and lmtpd proxies on our
Murder frontends. When the backend failed, the proxies with an open connection
to it got stuck, too. And so many of them had been created that the limit of
lmtpd / timsieved processes was reached. (I'm still not sure how that
happened, since we certainly didn't have that many simultaneous Sieve sessions
going on at the time. LMTP sessions I could almost believe; the amount of
email traffic here is considerable.)

However, the proxies remained stuck. On Friday I did some investigation, and
apparently they were stuck on a read on the TCP socket. As I couldn't think of
anything else to do, I killed the lmtpd proxies (normally, that is, with
signal 15), and that got the LMTP service running again (the Cyrus master on
the frontend was able to create new lmtpd processes again). But I only noticed
the stuck Sieve processes today; they had been stuck on their sockets since
Friday.

I wonder why the read apparently never times out?
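As far as I understand it, a plain blocking read() on a TCP socket never times
out by itself; if the peer disappears without sending a FIN or RST (as seems
to have happened here, with only part of the network layer up), the read just
waits forever unless a receive timeout or keepalive is set explicitly. I don't
know the Cyrus prot layer well enough to say whether or where it does this,
but the kind of thing I mean is something like the sketch below
(set_read_timeout is just a hypothetical helper, not Cyrus code):

    #include <sys/socket.h>
    #include <sys/time.h>
    #include <stdio.h>

    /* Give an already-connected TCP socket a receive timeout, so a
     * blocking read() fails with EAGAIN/EWOULDBLOCK after 'seconds'
     * instead of hanging indefinitely when the peer goes silent. */
    static int set_read_timeout(int fd, int seconds)
    {
        struct timeval tv = { .tv_sec = seconds, .tv_usec = 0 };

        if (setsockopt(fd, SOL_SOCKET, SO_RCVTIMEO, &tv, sizeof tv) < 0) {
            perror("setsockopt(SO_RCVTIMEO)");
            return -1;
        }
        return 0;
    }

The same effect could presumably be had with poll()/select() timeouts around
the read, or with SO_KEEPALIVE to detect the dead peer eventually; I'm just
guessing at what the proxies would need here.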

I'm sorry I cannot provide any more exact data than this. My first priority was
to get our Cyrus installation up and running.


--Janne
-- 
Janne Peltonen <janne.peltonen at helsinki.fi> PGP Key ID: 0x9CFAC88B
Please consider membership of the Hospitality Club (http://www.hospitalityclub.org)
