*IMPORTANT* - bugfix sync_append_commit index breakage
Bron Gondwana
brong at fastmail.fm
Fri Aug 31 19:14:45 EDT 2007
On Fri, Aug 31, 2007 at 05:51:35PM +0100, David Carter wrote:
> On Fri, 31 Aug 2007, Ken Murchison wrote:
>
>> I think your patch makes sense, but I'm not sure when "The cyrus index
>> file format has this clever "ignore" junk at the end until the exists
>> count changes trick" means. I know we leave junk in the cache file as a
>> result of delayed expunge which gets cleaned up later, but I'm pretty sure
>> the the index file is always tightly packed. The is extra space in the
>> index header, but there shouldn't be any between the index records.
>
> I think that the danger is that if sync_server gets shut down uncleanly
> (which I know was happening to Fastmail a lot at one point) then then you
> can end up with a bogus entry at the end of a cyrus.index file which is not
> overwritten by the next sync_append_commit() on that mailbox.
"Uncleanly" means any time it gets a sigterm in the middle of a
delivery at the wrong point. On a busy server with a lot of IO
this tends to be about every shutdown for a couple of mailboxes
(as a datapoint, we failed over 5 slots on one server for about
2 hours while we upgraded the kernel and did some tests - we had
about 15 of these according to our backup system (which does
MD5 UUID tests on every message and tells us if any files don't
match any index record)
> The race condition is that the exists count in the header can only be
> updated after the index record has been written.
>
> An explicit seek using mailbox->exists is definitely more robust, although
> it probably doesn't help if power fails halfway through the fsync() on the
> cyrus.index file after both updates have been made (data=journal maybe?)
Consider the following mailbox - exists: 3, uids 12, 13, 14
Master Replica
E: 3 E: 3
12 12
13 13
14 14
(15)
The (15) there is a phantom record as we shut them down for a
"controlled" failover.
Replica Master
E: 3 E: 3
12 12
13 13
14 14
(15)
There's another delivery:
Replica Master
E: 4 E: 4
12 12
13 13
14 14
(15) 15
15
And another couple:
Replica Master
E: 6 E: 6
12 12
13 13
14 14
(15) 15
15 16
16 17
17
We fail back:
Master Replica
E: 6 E: 6
12 12
13 13
14 14
(15) 15
15 16
16 17
17
And have a delivery (which uses index offset):
Master Replica
E: 7 E: 7
12 12
13 13
14 14
(15) 15
15 16
16 17
18 18
17. < NO INDEX ENTRY
The user deletes message 15:
Master Replica
E: 6 E: 6
12 12
13 13
14 14
15 16
16 17
18 18
EXPUNGE:
(15) 15
17. < NO INDEX ENTRY
cyr_expire runs:
Master Replica
E: 6 E: 6
12 12
13 13
14 14
15 < NO FILE 16
16 17
18 18
As you can see, this isn't trivial at a site of our size, it's
a pretty big mess for the affected mailbox - especially if
it's uncaught until cyr_expire runs and nukes the underlying
rfc822 file due to duplicate entries in the index and expunge
files. I think reconstruct can fix most of it except the
"uid in both index and expunge" - so I fix that by removing the
record from the expunge file first (using my Cyrus::IndexFile
perl module)
Presumably, if there was a multi-append you could even get out
of order UIDs in a pathological enough series of failovers.
Bron.
More information about the Cyrus-devel
mailing list