followup: stuck lmtpd processes
John Wade
jwade at oakton.edu
Wed Sep 24 05:46:38 EDT 2003
Oops. Sorry, my mailbox went temporarily over quota, and delivery
of the original thread was deferred until after I had read and responded
to the followup. It looks like the locking mechanism is working
correctly here and the bug is really in the network timeout (or in the
implementation that allows a network call after the append setup is called).
The patch I wrote still might help you since it would prevent an
individual user's problem from taking down the mail system. The user's
mailbox would remain inaccessible, but the lmtpd processes attempting
delivery would exit with errors and mail delivery as a whole would
proceed. I still believe that a system that uses locking with no
timeout mechanism is inherently fragile. A single programming error can
lead to a cascade failure that takes down the entire mail system as more
and more processes hang up trying to get the lock.
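
To illustrate the general idea of a lock with a timeout, here is a minimal
sketch in Python. It is not the patch mentioned above and not Cyrus code: the
function name lock_with_timeout, its parameters, and the choice of flock()
with polling are all assumptions made for illustration. A process that cannot
get the lock before the deadline gives up and reports failure instead of
hanging forever.

```python
import errno
import fcntl
import os
import time


def lock_with_timeout(fd, timeout=5.0, poll_interval=0.05):
    """Try to take an exclusive flock() on fd, giving up after `timeout` seconds.

    Hypothetical helper for illustration only.
    Returns True if the lock was acquired, False if the deadline passed.
    """
    deadline = time.monotonic() + timeout
    while True:
        try:
            # LOCK_NB makes flock() fail immediately instead of blocking.
            fcntl.flock(fd, fcntl.LOCK_EX | fcntl.LOCK_NB)
            return True
        except OSError as exc:
            # EAGAIN/EACCES mean someone else holds the lock; anything
            # else is a real error and should not be swallowed.
            if exc.errno not in (errno.EAGAIN, errno.EACCES):
                raise
        if time.monotonic() >= deadline:
            return False
        time.sleep(poll_interval)
```

A delivery process using something like this could log an error and exit
when the call returns False, so one stuck mailbox does not pile up waiters.
As Andy points out in the quoted message, it does nothing about the process
that is holding the lock in the first place.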
Just my two cents,
John
Andrew Morgan wrote:
>On Tue, 23 Sep 2003, John Wade wrote:
>
>
>
>>Hi Andrew,
>>
>>I was the one who wrote the message you found. I finally came to the
>>conclusion that the flat file locking mechanism is somewhat broken in
>>Cyrus, but I was never a good enough C programmer to pin down what was
>>happening. (The mmap stuff makes it really tricky to debug.) I
>>wanted to blame it on the Linux kernel, but I know that others have
>>experienced the same problems in Solaris.
>>
>>I finally gave up and wrote a locking timeout patch for 2.0.16. See
>>http://www.oakton.edu/~jwade/cyrus/ for the patch and full details.
>>
>>A number of other folks have tried this patch successfully on 2.0.16 and
>>2.1.x, and I know it has resolved our problem.
>>
>>If you can solve the particular bug that causes this, more power to you;
>>if not, my workaround resolves a number of possible deadlock issues.
>>
>>Enjoy,
>>John
>>
>>
>
>Hey John,
>
>Thanks for that message. If you've read a little further in your
>info-cyrus messages, you'll see that I apparently have hit upon a
>different bug than the one you found (I think). Your page was
>instrumental in helping me track down the source of the problem though.
>
>It turns out I had an imaps process that hung onto the lock on the user's
>quota file. Apparently it obtained the lock, then went off to read from
>the network connection and never came back.
>
>I think your patch would fix the problem where a lot of processes are
>contending for a lock (by making them retry), but it wouldn't help if a
>single process keeps the lock indefinitely. Ideally it should not be
>possible for a process to get hung while it is holding the lock, but that
>will require some careful programming in this particular case. In the
>meantime, I'll have to keep an eye on the system.
>
>Thanks again for your debugging clues...
>
> Andy
>
>
>
>