IMPORTANT - bugfix sync_append_commit index breakage

Fri Aug 31 19:14:45 EDT 2007

On Fri, Aug 31, 2007 at 05:51:35PM +0100, David Carter wrote:
> On Fri, 31 Aug 2007, Ken Murchison wrote:
>
>> I think your patch makes sense, but I'm not sure when "The cyrus index 
>> file format has this clever "ignore" junk at the end until the exists 
>> count changes trick" means.  I know we leave junk in the cache file as a 
>> result of delayed expunge which gets cleaned up later, but I'm pretty sure 
>> the the index file is always tightly packed.  The is extra space in the 
>> index header, but there shouldn't be any between the index records.
>
> I think that the danger is that if sync_server gets shut down uncleanly 
> (which I know was happening to Fastmail a lot at one point) then then you 
> can end up with a bogus entry at the end of a cyrus.index file which is not 
> overwritten by the next sync_append_commit() on that mailbox.

"Uncleanly" means any time it gets a sigterm in the middle of a
delivery at the wrong point.  On a busy server with a lot of IO
this tends to be about every shutdown for a couple of mailboxes
(as a datapoint, we failed over 5 slots on one server for about
2 hours while we upgraded the kernel and did some tests - we had
about 15 of these according to our backup system (which does
MD5 UUID tests on every message and tells us if any files don't
match any index record)

> The race condition is that the exists count in the header can only be 
> updated after the index record has been written.
>
> An explicit seek using mailbox->exists is definitely more robust, although 
> it probably doesn't help if power fails halfway through the fsync() on the 
> cyrus.index file after both updates have been made (data=journal maybe?)

Consider the following mailbox - exists: 3, uids 12, 13, 14

Master         Replica
E: 3           E: 3
12             12
13             13
14             14
(15)

The (15) there is a phantom record as we shut them down for a
"controlled" failover.

Replica        Master
E: 3           E: 3
12             12
13             13
14             14
(15)

There's another delivery:

Replica        Master
E: 4           E: 4
12             12
13             13
14             14
(15)           15
15

And another couple:

Replica        Master
E: 6           E: 6
12             12
13             13
14             14
(15)           15
15             16
16             17
17             

We fail back:

Master         Replica
E: 6           E: 6
12             12
13             13
14             14
(15)           15
15             16
16             17
17             

And have a delivery (which uses index offset):

Master         Replica
E: 7           E: 7
12             12
13             13
14             14
(15)           15
15             16
16             17
18             18

17. < NO INDEX ENTRY

The user deletes message 15:

Master         Replica
E: 6           E: 6
12             12
13             13
14             14
15             16
16             17
18             18
EXPUNGE:
(15)           15

17. < NO INDEX ENTRY

cyr_expire runs:

Master         Replica
E: 6           E: 6
12             12
13             13
14             14
15 < NO FILE   16
16             17
18             18

As you can see, this isn't trivial at a site of our size, it's 
a pretty big mess for the affected mailbox - especially if 
it's uncaught until cyr_expire runs and nukes the underlying
rfc822 file due to duplicate entries in the index and expunge 
files.  I think reconstruct can fix most of it except the
"uid in both index and expunge" - so I fix that by removing the
record from the expunge file first (using my Cyrus::IndexFile
perl module)

Presumably, if there was a multi-append you could even get out
of order UIDs in a pathological enough series of failovers.

Bron.

*IMPORTANT* - bugfix sync_append_commit index breakage

IMPORTANT - bugfix sync_append_commit index breakage