Possible issues in master.c (centry_add hiding entries, snmp_timeout applied when it shouldn't)
ellie timoney
ellie at fastmail.com
Sun Apr 3 22:38:49 EDT 2016
> ## snmp_timeout might be run when it shouldn't
>
> First of all, I have no experience at all with SNMP and what I observed
> is simply wrong or irrelevant. But reading the docs while putting
> together our `select` patch, I noted that `snmp_timeout` should only be
> run if `select` timed out (from `man 3 snmp_timeout`):
>
> > If select times out, snmp_timeout should be called to see if the
> timeout was intended for SNMP.
>
> The master loop calls `snmp_timeout()` unconditionally, though:
>
> r = myselect(maxfd, &rfds, NULL, NULL, tvptr);
> if (r == -1 && errno == EAGAIN) continue;
> if (r == -1 && errno == EINTR) continue;
> if (r == -1) {
> /* uh oh */
> fatalf(1, "select failed: %m");
> }
>
> #if defined(HAVE_UCDSNMP) || defined(HAVE_NETSNMP)
> /* check for SNMP queries */
> snmp_read(&rfds);
> snmp_timeout();
> #endif
>
> I'd propose to check for an empty `rfds`-set or zero return value before
> running `snmp_timeout` (not sure which one would be more appropriate).
I also don't know thing one about SNMP, and don't seem to have the devel
package installed, so I'm basing this solely on man pages found by
Google...
Background: myselect() is a thin wrapper around select()/pselect(), each
of which return the number of ready file descriptors (which is zero if
it timed out: none are ready), or -1 if there's an error. We fatal
earlier if it returned -1, so by the time we call snmp_read() and
snmp_timeout(), we know that r is either zero (timeout) or a count of
ready descriptors.
As far as I can tell, snmp_read() is only useful if there are ready
descriptors, but it's not a problem if there aren't (it just won't find
anything to do).
And, as far as I can tell, snmp_timeout() is only useful after a timeout
has occurred, but it's also not a problem if one hasn't (it just
probably won't find anything to do).
If my understanding is correct, by calling them unconditionally we're
possibly wasting time/energy, but not really hurting anything.
We could replace these two lines with something like:
if (r) snmp_read(&rfds);
else snmp_timeout();
... and maybe that would be faster. Or, maybe snmp_read() and
snmp_timeout()s checks are already pretty well optimised, and making
these calls conditional will throw off branch prediction and make things
slower, I dunno
On Fri, Apr 1, 2016, at 07:02 PM, Jens Erat via Cyrus-devel wrote:
> Dear Cyrus developers,
>
> when searching for the issue with our mail setup crashing all few days,
> I've been reading the Cyrus master code for quite some time and found
> two possible issues. By the way, I'll send out the promised feedback on
> our proposed patch within the next week or two; we're now running the
> patch without modified `-T` and everything seems fine so far.
>
>
> ## centry_add hides older entries
>
> This seems like a minor issue as it might result in broken state in rare
> cases, which repairs itself after "hiding" process is terminated again
> (which might take days or even weeks, though).
>
> `centry_add` does simply add a new entry to the ctable, but does not
> verify whether it already exists:
>
> /* add a centry to the global table of all
> * centries, using the given pid as the key */
> static void centry_add(struct centry *c, pid_t p)
> {
> c->pid = p;
> c->next = ctable[p % child_table_size];
> ctable[p % child_table_size] = c;
> }
>
> While debugging, we added a `centry_find` call before adding and indeed
> observed sporadic events when entries have been added, but not cleaned
> up yet. Mostly they occurred when Cyrus was stuck because of the
> `select` interrupting the master loop, but very rarely this also
> happened during normal operation. I must admit we had a situation with
> heavily increased probability for for such issues:
>
> - Solaris limited to 30k PIDs by default, while we're running 10-12k
> processes during normal operation
> - `-T 60` joined with a high prefork resulted in imapd processes
> terminated frequently
> - Single active, rather large machine serving a large number of users,
> lots of them using a webmail system with very short IMAP sessions
> - Because of the `select` bug, we often didn't have the child janitor
> cleaning up the ctable for seconds and minutes
>
> Generally, this might happen on every setup when having bad luck.
>
> To fix this issue and completely prevent such hiding of older processes,
> I'd propose to run `centry_find` before adding a new entry, and run the
> child janitor if needed.
>
>
> ## snmp_timeout might be run when it shouldn't
>
> First of all, I have no experience at all with SNMP and what I observed
> is simply wrong or irrelevant. But reading the docs while putting
> together our `select` patch, I noted that `snmp_timeout` should only be
> run if `select` timed out (from `man 3 snmp_timeout`):
>
> > If select times out, snmp_timeout should be called to see if the
> timeout was intended for SNMP.
>
> The master loop calls `snmp_timeout()` unconditionally, though:
>
> r = myselect(maxfd, &rfds, NULL, NULL, tvptr);
> if (r == -1 && errno == EAGAIN) continue;
> if (r == -1 && errno == EINTR) continue;
> if (r == -1) {
> /* uh oh */
> fatalf(1, "select failed: %m");
> }
>
> #if defined(HAVE_UCDSNMP) || defined(HAVE_NETSNMP)
> /* check for SNMP queries */
> snmp_read(&rfds);
> snmp_timeout();
> #endif
>
> I'd propose to check for an empty `rfds`-set or zero return value before
> running `snmp_timeout` (not sure which one would be more appropriate).
>
> Regards,
> Jens
>
> --
> Jens Erat
> Universität Konstanz
> Kommunikations-, Infomations-, Medienzentrum (KIM)
> Abteilung Basisdienste
> D-78457 Konstanz
> Mail: jens.erat at uni-konstanz.de
>
> Email had 1 attachment:
> + smime.p7s
> 7k (application/pkcs7-signature)
More information about the Cyrus-devel
mailing list