cyrus.expunge/cyrus.cache mismatch on replica

Bron Gondwana brong at fastmail.fm
Mon Sep 24 02:59:18 EDT 2007


We find all the fun bugs :)


Picture if you will...

sync_client finds a message missing on the replica during a mailboxes event and
causes a sync_combine_commit to be called on the replica to absorb the new
messages.

* cyrus.index is re-written with new cache_offset values for each message
* cyrus.cache is re-written with just the records in cyrus.index
* cyrus.expunge is not re-written, so its cache_offset values are now
  ***ERROR, BOGUS CACHE FILE OFFSETS***


Along comes cyr_expire and reads those offsets, causing either random breakage
(thankfully relatively unlikely) or segfaults (before I wrote a patch to notice
and log instead).
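
For illustration, a guard of that kind could look roughly like the following.
This is a minimal sketch, not the actual patch: cache_offset_ok and
CACHE_MIN_RECORD are hypothetical names, and the real Cyrus structures and
logging calls differ. The only idea shown is "check the offset against the
size of the cache file before reading, and log instead of crashing".

```c
#include <stddef.h>
#include <stdio.h>

/* Hypothetical minimum size of a cache record, for illustration only. */
#define CACHE_MIN_RECORD 4

/* Return 1 if a record starting at `offset` can lie inside a cache file of
 * `cache_size` bytes; otherwise log the problem and return 0. */
static int cache_offset_ok(size_t cache_size, size_t offset, unsigned long uid)
{
    if (offset >= cache_size || cache_size - offset < CACHE_MIN_RECORD) {
        fprintf(stderr,
                "bogus cache offset %zu for uid %lu (cache size %zu)\n",
                offset, uid, cache_size);
        return 0;
    }
    return 1;
}
```

A check like this only catches offsets that fall outside the file.  An offset
that lands inside the new cyrus.cache but in the middle of some other record
would still pass, which is the "random breakage" case.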

Well, this week we got about 30 new occurrences.  I thought that was a bit rich, so
I spent today investigating and found the above.  I'm not sure why
sync_combine_commit was called, and hopefully my better largeappend patch will
cause it to be called less often, but still...

Do you guys suggest anything?  Ideas I've come up with include:

a) cyrus.expunge cache offsets are always bogus - if you're undeleting then
   just go calculate yourself a new cache record.

b) sync_server could read cyrus.expunge and copy the cache records across,
   then rewrite it same as it does for the index in combine_commit.

c) (I'd love this, but it's a lot of work) get rid of cyrus.expunge.  Leave
   the records in cyrus.index in order.  You'd need some "expunged" count so
   you would reply: "EXISTS %d", (mailbox->exists - mailbox->expunged), and
   any IMAP command that indexed by MSGNO would have to walk the mailbox
   counting valid messages rather than just seek to an offset - but everyone
   sane already uses UID, and you already have to do that for UID.
   Obviously, would need an "expunged" flag in each index record too.

   Why do I like this?
    - seen file updates wouldn't remove the status for the
      deleted messages until they were actually expunged.
    - Sync client could copy expunged messages just like that.
      That would increase failover reliability and close the gap
      where messages may never get copied if they're created and
      expunged between two syncs.
    - Expunge would be even cheaper in the immediate response,
      it's just a flag update.
    - We wind up re-writing the entire index and cache files at
      cyr_expire time anyway, so it's no more expensive there.
      Less actually because there's one fewer file to write out.
    - One index file that you can audit against the data files
      on disk.
    - Everything in order all the time rather than unsorted
      cyrus.expunge.
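
To make (c) concrete, here is a rough sketch of the bookkeeping it implies.
The struct layouts and the function names (visible_exists, find_by_msgno,
expunge_record) are made up for illustration and are much simpler than the
real cyrus.index format; only mailbox->exists and mailbox->expunged come from
the description above.

```c
#include <stddef.h>

/* Hypothetical, simplified index record; not the real cyrus.index layout. */
struct index_record {
    unsigned long uid;
    int expunged;                 /* per-record "expunged" flag */
};

struct mailbox {
    struct index_record *records; /* kept in UID order, expunged ones included */
    size_t exists;                /* total records in the index */
    size_t expunged;              /* how many of them have the flag set */
};

/* The number reported to the client in the EXISTS response. */
static size_t visible_exists(const struct mailbox *mailbox)
{
    return mailbox->exists - mailbox->expunged;
}

/* Map a 1-based MSGNO to a record by walking the index and counting only
 * records that are not expunged.  Returns NULL if msgno is out of range. */
static const struct index_record *find_by_msgno(const struct mailbox *mailbox,
                                                size_t msgno)
{
    size_t seen = 0;
    for (size_t i = 0; i < mailbox->exists; i++) {
        if (mailbox->records[i].expunged)
            continue;
        if (++seen == msgno)
            return &mailbox->records[i];
    }
    return NULL;
}

/* Expunge becomes a flag update plus a counter bump; nothing is moved. */
static void expunge_record(struct mailbox *mailbox, size_t i)
{
    if (i < mailbox->exists && !mailbox->records[i].expunged) {
        mailbox->records[i].expunged = 1;
        mailbox->expunged++;
    }
}
```

The MSGNO lookup is linear in the size of the index here, which is the cost
mentioned above; the records themselves never move until cyr_expire
re-writes the files.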

Bron.
-- 
  Bron Gondwana
  brong at fastmail.fm


