cyrus.expunge/cyrus.cache mismatch on replica
Bron Gondwana
brong at fastmail.fm
Mon Sep 24 02:59:18 EDT 2007
We find all the fun bugs :)
Picture if you will...
sync_client finds a message missing on the replica during a mailboxes event and
causes a sync_combine_commit to be called on the replica to absorb the new
messages.
* cyrus.index is re-written with new cache_offset values for each message
* cyrus.cache is re-written with just the records in cyrus.index
* cyrus.expunge is left untouched: ***ERROR, BOGUS CACHE FILE OFFSETS***
Along comes cyr_expire and reads those offsets, causing either random breakage
(thankfully relatively unlikely) or segfaults (before I wrote a patch to notice
and log instead).
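
To make the "notice and log" idea concrete, here is a minimal sketch of a
bounds check on a stored cache offset. It is not the actual patch: the
struct, the function names and the min_record_len parameter are all
hypothetical, and the real Cyrus record layout is more involved.

```c
#include <assert.h>
#include <stddef.h>
#include <stdio.h>

/* Hypothetical, simplified expunge record; not the real Cyrus layout. */
struct expunge_record {
    unsigned long uid;
    size_t cache_offset;  /* offset into cyrus.cache when the record was written */
};

/* Return 1 if at least min_record_len bytes lie between the offset and
 * the end of the cache file, 0 otherwise. min_record_len is an assumed
 * lower bound on the size of one cache record. */
static int cache_offset_is_sane(const struct expunge_record *rec,
                                size_t cache_size, size_t min_record_len)
{
    assert(rec != NULL);
    if (rec->cache_offset >= cache_size)
        return 0;
    return cache_size - rec->cache_offset >= min_record_len;
}

/* Log every record whose offset fails the check instead of reading it.
 * Returns the number of records that were flagged. */
static size_t count_bogus_offsets(const struct expunge_record *recs, size_t n,
                                  size_t cache_size, size_t min_record_len)
{
    size_t bogus = 0;
    for (size_t i = 0; i < n; i++) {
        if (!cache_offset_is_sane(&recs[i], cache_size, min_record_len)) {
            fprintf(stderr,
                    "uid %lu: cache offset %zu outside cache of %zu bytes\n",
                    recs[i].uid, recs[i].cache_offset, cache_size);
            bogus++;
        }
    }
    return bogus;
}
```

A check like this only catches offsets that fall off the end of the file.
An offset that lands inside the rewritten cache but on the wrong record
still passes, which is the "random breakage" case.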
Well, this week we got about 30 new occurrences. I thought that was a bit rich, so
I spent today investigating and found the above. I'm not sure why
sync_combine_commit was called, and hopefully my better largeappend patch will
cause it to be called less often, but still...
Do you guys suggest anything? Ideas I've come up with include:
a) cyrus.expunge cache offsets are always bogus - if you're undeleting then
just go calculate yourself a new cache record.
b) sync_server could read cyrus.expunge and copy the cache records across,
then rewrite it same as it does for the index in combine_commit.
c) (I'd love this, but it's a lot of work) get rid of cyrus.expunge. Leave
the records in cyrus.index in order. You'd need some "expunged" count so
you would reply: "EXISTS %d", (mailbox->exists - mailbox->expunged), and
any IMAP command that indexed by MSGNO would have to walk the mailbox
counting valid messages rather than just seek to an offset - but everyone
sane already uses UID, and you already have to do that for UID.
Obviously, would need an "expunged" flag in each index record too.
Why do I like this?
- seen file updates wouldn't remove the status for the
deleted messages until they were actually expunged.
- Sync client could copy expunged messages just like that.
This increases failover reliability and means there isn't a gap
where they may never get copied if they're created and
expunged between syncs.
- Expunge would be even cheaper in the immediate response,
it's just a flag update.
- We wind up re-writing the entire index and cache files at
cyr_expire time anyway, so it's no more expensive there.
Less, actually, because there's one fewer file to write out.
- One index file that you can audit against the data files
on disk.
- Everything in order all the time rather than unsorted
cyrus.expunge.
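
A rough sketch of what option (c) could look like, to show how small the
bookkeeping is. Everything here is hypothetical: the struct layouts and
function names are made up for illustration and are not taken from the
Cyrus source. It assumes records stay in the index in UID order and carry
an "expunged" flag.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical index record: only the fields this sketch needs. */
struct index_record {
    unsigned long uid;
    int expunged;          /* 1 once the message has been expunged */
};

/* Hypothetical mailbox header plus in-memory index. */
struct mailbox {
    struct index_record *records;
    size_t exists;         /* total records in the index, expunged or not */
    size_t expunged;       /* how many of those carry the expunged flag */
};

/* The count a client should see in the EXISTS response. */
static size_t visible_exists(const struct mailbox *mb)
{
    return mb->exists - mb->expunged;
}

/* Expunge is just a flag update. Returns 1 if the flag was newly set. */
static int expunge_record(struct mailbox *mb, size_t i)
{
    if (i >= mb->exists || mb->records[i].expunged)
        return 0;
    mb->records[i].expunged = 1;
    mb->expunged++;
    return 1;
}

/* Resolve a 1-based MSGNO by walking the index and counting only
 * records that are still visible. Returns NULL if out of range. */
static const struct index_record *find_by_msgno(const struct mailbox *mb,
                                                size_t msgno)
{
    size_t seen = 0;
    for (size_t i = 0; i < mb->exists; i++) {
        if (mb->records[i].expunged)
            continue;
        if (++seen == msgno)
            return &mb->records[i];
    }
    return NULL;
}
```

The MSGNO lookup becomes a linear walk instead of a seek, which is the
cost mentioned above. The expunged records stay in place, so the sync
client and any audit against the files on disk see them in order.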
Bron.
--
Bron Gondwana
brong at fastmail.fm