[RFC PATCH v2] imapd.c: imapoptions: implement idle timeout

Andy Dorman adorman at ironicdesign.com
Tue Sep 20 14:20:29 EDT 2016


On 09/20/2016 08:14 AM, Andy Dorman wrote:
> On 09/19/2016 09:02 PM, ellie timoney via Cyrus-devel wrote:
>> I've been looking at tcp_keepalive a bit lately and I'm wondering how it
>> interacts with this?
>>
>> It's my understanding that, in most cases, tcp_keepalive will do the job
>> of detecting clients that have dropped out, and allow us to close the
>> connection on our end.  Since we're generally either waiting for a
>> command from the client, or producing and about to send output to the
>> client, this works -- because if tcp_keepalive detects that the client
>> isn't there, reads and writes to the socket will start failing.
>>
>> But during the IDLE state, we only read from the client socket if select
>> reports it as having data ready for reading (presumably containing
>> "DONE"), and we only write to the client socket if there is activity on
>> the selected mailbox.
>>
>> If the client's connection has dropped out, no data will ever appear on
>> the socket, so select will never flag it as readable, so we will never
>> try to read from it, so we will never receive the read error even though
>> tcp_keepalive detected the dropout.  And if this client was idling with
>> a low-activity mailbox selected (such as Drafts or Sent), it might be a
>> very long time before any activity prompts us to write to the socket, so
>> we also don't receive the write error.  And so even though the socket
>> itself knows there's no connection anymore thanks to tcp_keepalive, we
>> don't know that, because we haven't tried to interact with it.  And so
>> the connection/process doesn't get cleaned up.
>>
>> And so I think this patch is meant to provide extra protection against
>> this case.  tcp_keepalive is fine generally, but idling clients can slip
>> through the cracks in certain circumstances, so let's fill those cracks.
>> Does that sound right?
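
For anyone following along, here is a rough sketch of how I picture a
select()-based idle timeout closing that crack.  It is only a sketch,
not the actual patch, and client_fd, notify_fd and idle_timeout_secs
are placeholder names:

    /* Minimal sketch, not the real cmd_idle(): wait for the client's
     * DONE or for mailbox activity, but give up after
     * idle_timeout_secs of total silence. */
    #include <sys/select.h>

    int idle_wait(int client_fd, int notify_fd, long idle_timeout_secs)
    {
        fd_set rfds;
        struct timeval tv;
        int maxfd = (client_fd > notify_fd) ? client_fd : notify_fd;
        int r;

        FD_ZERO(&rfds);
        FD_SET(client_fd, &rfds);   /* "DONE" -- or nothing, ever */
        FD_SET(notify_fd, &rfds);   /* mailbox activity notification */

        tv.tv_sec = idle_timeout_secs;
        tv.tv_usec = 0;

        /* With a NULL timeout a vanished client never wakes us up.
         * With a timeout, r == 0 means nobody spoke for the whole
         * period and we can tear the connection down ourselves. */
        r = select(maxfd + 1, &rfds, NULL, NULL,
                   idle_timeout_secs ? &tv : NULL);
        if (r == 0)
            return -1;              /* timed out: caller drops the client */
        return r;                   /* < 0 error, > 0 readable fd(s) */
    }
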
>>
>> In writing this, I wonder what happens if a client initiates IDLE
>> without having first selected a mailbox.  To my reading, RFC 2177
>> implies that this is sort of pointless, but doesn't make an explicit
>> statement about it one way or another.  I don't know what Cyrus actually
>> does in this case -- there's something to investigate -- but I guess if
>> there's a crack there, the imapidletimeout patch will fill that too.
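
For reference, the shape of the exchange in RFC 2177 is roughly the
following -- its example does a SELECT first, but as far as I can see
the text never explicitly requires one:

    C: a1 SELECT INBOX
    S: ... untagged responses ...
    S: a1 OK SELECT completed
    C: a2 IDLE
    S: + idling
    ...time passes; new mail arrives...
    S: * 4 EXISTS
    C: DONE
    S: a2 OK IDLE terminated
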
>>
>> Any thoughts?
>>
>> Cheers,
>>
>> ellie
>>
>> On Wed, Sep 14, 2016, at 05:11 PM, Thomas Jarosch wrote:
>>> Hi Ellie,
>>>
>>> On Monday, 12. September 2016 11:35:45 ellie timoney wrote:
>>> [clock jumps]
>>>> Or does it?  The man page says it's "not affected by discontinuous
>>>> jumps in the system time (e.g., if the system administrator manually
>>>> changes the clock)" -- great -- "but is affected by the incremental
>>>> adjustments performed by adjtime(3) and NTP".  Which sounds to me like
>>>> NTP might still be an issue?  (But: I have no real-world experience of
>>>> this, I'm just reading man pages here.)
>>>
>>> Good point. I'm not sure; we haven't encountered an
>>> issue in a long time. The event itself is rather rare these days.
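
On the clock question: if NTP adjustments ever did turn out to matter,
measuring the idle period against CLOCK_MONOTONIC rather than
wall-clock time would at least rule out the big discontinuous jumps.
Again only a rough sketch -- I don't know whether the patch does
anything like this:

    /* Sketch: track idle duration with CLOCK_MONOTONIC, which ignores
     * manual clock changes (incremental NTP slewing can still nudge
     * it, per clock_gettime(2), but not by hours or days). */
    #include <time.h>

    static struct timespec idle_started;

    void idle_start(void)
    {
        clock_gettime(CLOCK_MONOTONIC, &idle_started);
    }

    int idle_expired(long timeout_secs)
    {
        struct timespec now;

        if (!timeout_secs) return 0;        /* 0 == feature disabled */
        clock_gettime(CLOCK_MONOTONIC, &now);
        return (now.tv_sec - idle_started.tv_sec) >= timeout_secs;
    }
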
>>>
>>>>> Would it make sense to enable the timeout by default?
>>>>> In the current version of the patch it's disabled (value 0).
>>>>
>>>> I'm interested in hearing thoughts on this, particularly with regard to
>>>> what a reasonable default timeout might be.  Though I like the "no
>>>> behaviour change unless you change configuration" aspect of defaulting
>>>> to 0.
>>>
>>> We'll push out the three-day default value next week.
>>> I can report back in a month how good or bad the results are.
>>>
>>> Cheers,
>>> Thomas
>>>
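
On the default-value question above: for anyone wanting to try the
same thing, I assume the configuration ends up looking roughly like
the line below.  The option name comes from this thread, but I am
guessing at the units (minutes, like the existing "timeout" option),
so check the patch's imapoptions entry rather than trusting me:

    # imapd.conf -- hypothetical example, units assumed to be minutes
    # 0 (the patch's default) leaves current behaviour unchanged
    imapidletimeout: 4320    # roughly three days, if minutes
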
>
> Ellie, I agree a "crack" exists that idled processes may be slipping
> through, but so far I have little data to prove it.
>
> Empirically I have one server with two clients (I have moved everyone
> else to other servers to decrease the number of variables), and the
> process count in the Idle state for those two clients grows apparently
> without bound (at least I haven't found an upper limit yet).  I have
> been raising the threshold at which I am alerted for "excess imapd
> processes" and it is up to 100 processes now.  After about 24 hours
> these two clients reach that point, and every process in
> /var/run/cyrus/proc/ is attributed to them like this:
>
> imap  hermione.ironicdesign.com [192.168.0.17]  b2b at cogift.co
> cogift.co!user.b2b  Idle
>
> hermione is our nginx load balancer on an internal network.
>
> As far as we have been able to tell, no other client has this problem.
>
> Another data point: these are very low-traffic accounts (6 emails for
> one and 0 for the other in the last week).
>
> I am going to contact the owner of these two accounts today and ask her
> what client she is using and how often she has it set up to check email.
>

OK, the client that appears to be leaving these abandoned imapd IDLE 
processes behind is an old (possibly more than five years old) 
BlackBerry.  Our client has it set to check all her mailboxes every 
10-15 minutes.  I have no idea how reliable this BB's connectivity is, 
but given how active she is I would not be surprised to hear that it 
regularly loses connectivity while she is driving (which she does a 
lot).

Unless anyone has another suggestion, I plan to use Wireshark to 
capture port 143 packets to these two addresses (given the age of the 
BB I doubt it speaks the TLS versions we will accept, and we no longer 
accept SSL, so port 993 is a no-go for it).  Given that their email 
traffic is so light (fewer than one message a day) I should be able to 
capture plenty of connections that carry no mail.
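
Concretely I am thinking of something along these lines, run on 
whichever box actually sees the client's TCP connection (the interface 
name, capture file and CLIENT_IP are placeholders; behind nginx every 
connection appears to come from hermione, so the host filter may have 
to move to the frontend instead):

    # capture full IMAP packets for later analysis in wireshark
    tcpdump -i eth0 -s 0 -w imap-idle.pcap 'tcp port 143 and host CLIENT_IP'
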

FWIW, I have captured and analyzed lots of packets before, but never 
IMAP...so if anyone has any hints about what to look for, feel free to 
speak up.  ;-)

-- 
Andy Dorman


