[RFC PATCH] Fix index lock breakage in mailbox_release_resources()

Thomas Jarosch thomas.jarosch at intra2net.com
Mon Jul 20 03:28:08 EDT 2015


Hi Bron,

On Sunday, 19. July 2015 10:05:12 you wrote:
> On Sun, Jul 19, 2015, at 07:06 AM, Thomas Jarosch wrote:
> > mailbox_release_resources() closes the file descriptor
> > to the index. This implicitly releases any lock on the file.
> > 
> > Unfortunately the function does not reset the index_locktype state.
> > Therefore we didn't notice that a mailbox rename
> > operated on an unlocked index of the new mailbox. Yikes.
> 
> How did you discover this?  I'm surprised we haven't run into this
> already.

currently cyrus-imapd 2.4.18 deadlocks about once a week.
Similar symptoms as described here:
http://lists.andrew.cmu.edu/pipermail/info-cyrus/2013-May/036951.html

So I'm now going through the locking code. Of course I can't reproduce
it on my development machine and I tried really hard.

The only pattern I can see is that it happens one machines
with very large mailboxes. They run "quota -f" nightly,
ipurge to delete trashed mails older than 30 days
and the cyr_getdeleted tool I posted recently.

-> the more cyrus tools we run in parallel, the worse
the problem seems.

> > Note: mailbox_index_lock() implies a mailbox_read_index_header().
> > But it also does more work than just read the index header,
> > so there might be side effects.
> 
> I changed to mailbox_index_lock_internal() since we don't want to abort
> on a DELETED mailbox.

ok

> > @@ -4370,7 +4371,7 @@ static int mailbox_reconstruct_create(const char
> > *name, struct mailbox **mbptr)> 
> >      /* Attempt to open index */
> >      r = mailbox_open_index(mailbox);
> > 
> > -    if (!r) r = mailbox_read_index_header(mailbox);
> > +    if (!r) r = mailbox_lock_index(mailbox, LOCK_EXCLUSIVE);
> > 
> >      if (r) {
> >  	
> >  	printf("%s: failed to read index header\n", mailbox->name);
> >  	syslog(LOG_ERR, "failed to read index header for %s", mailbox-
>name);
> 
> I'm not convinced by this one - the whole point with reconstruct is that
> you're working with a potentially broken mailbox, and you want to check
> each step more simply to see if it works.  I've left this one in place.

if you back out that change, it won't work anymore: 
mailbox_read_index_header() will bail out because the index is not locked. 
I've tested that code path by faking the return code of mailbox_open() so it 
executes the reconstruct_create() code path.


Right now I've interfaced the locking code with helgrind's lock validation 
mechanism and annotated all lock_xx() calls. That works more or less.
Since kernel 3.14.x, there's a userspace version of "lockdep" called 
liblockdep. I'll switch to that soon as it supports "tryLock()" style calls 
as we do with lock_nonblocking().

The tricky part is that cyrus is a multi-process tool instead of a multi-
threaded one, so the current lock validation tools are more or less useless.
I came up with a workaround: Dump the lock state information in lock_xxx()
to a file, probably JSON. A "daemon" tool will later on read that "lock 
transation file" and feed the information in the lock validation tool / 
library. Hopefully this works out, I have a feeling we have a classic
ABBA deadlock somewhere in the code.

Cheers,
Thomas



More information about the Cyrus-devel mailing list