unexpunge segfault part 2

Bron Gondwana brong at fastmail.fm
Tue May 5 20:20:39 EDT 2009

On Tue, May 05, 2009 at 07:08:41PM -0400, Zing Zing Shishak wrote:
> Bron Gondwana wrote:
> > On Tue, May 05, 2009 at 01:18:46PM -0400, Zing wrote:
> >> I'm also seeing a segfault (i've seen bus error also) in unexpunge -l when
> >> I set an expire annotation on a mailbox and run cyr_expire.  I'm running
> >> cyrus 2.3.14 + the ipurge patch from Bron on f10 (x86_64), but that
> >> doesn't help (i didn't think it would):
> > 
> > You want the "disable EXPUNGE_FORCE" patch I committed to CVS yesterday :)
> oh, perfect timing. :) That seems to do the trick as a workaround.  thanks.

Cool :)  It seems the most sensible approach - NEVER delete the files on
disk completely unless doing a cyr_expire run.

> > (I've also got patches that turn that crash into a syslogged error instead,
> > but they don't actually solve the corruption)
> good to know.  i can test out any patches if people want to try to solve
> the corruption...

The really interesting ones aren't all production tested yet - a complete
rewrite of all cache accesses to go through a single codepath is in
production at FastMail, but the delayed cache loading isn't.

Delayed cache loading is nice, because if you select a mailbox and
never make a query that actually _needs_ the cache, it doesn't get
opened or statted or anything.  Reduces IO.
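
A rough sketch of the idea in Python, just to illustrate it.  This is not
the Cyrus code: the LazyCache class, its method names and the file contents
are all made up for the example.  The point is only that selecting the
mailbox creates the object without touching the file, and every access
goes through one method that loads the file on first use.

```python
import os
import tempfile


class LazyCache:
    """Hypothetical stand-in for a mailbox cache file; not Cyrus code."""

    def __init__(self, path):
        self.path = path
        self._data = None      # nothing opened or statted yet
        self.opens = 0         # counts how often we actually hit the disk

    def _load(self):
        # Open the file only the first time something needs it.
        if self._data is None:
            with open(self.path, "rb") as f:
                self._data = f.read()
            self.opens += 1
        return self._data

    def record(self, offset, length):
        # Every cache access goes through this one codepath.
        return self._load()[offset:offset + length]


def demo():
    with tempfile.TemporaryDirectory() as d:
        path = os.path.join(d, "cyrus.cache")
        with open(path, "wb") as f:
            f.write(b"header-fields-for-message-1")
        cache = LazyCache(path)          # "select": no IO happens here
        before = cache.opens
        first = cache.record(0, 6)       # first query that needs the cache
        cache.record(7, 6)               # later queries reuse the loaded data
        return before, first, cache.opens
```

A session that never calls record() never opens the file at all, which is
where the IO saving would come from.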

So anyway - I should get back to work on that soon.  First I need
to figure out what missing sync_log commands are needed to make
CONDSTORE replication reliable.  I've just enabled CONDSTORE for
a sacrificial few thousand users to see what happens :)  Including
me of course!

The new Thunderbird beta supports using it, so I want it on!

> As I was searching the dev archives, a post by James E. Blair last year
> seemed to have an analysis:
> http://lists.andrew.cmu.edu/pipermail/cyrus-devel/2008-September/000935.html

Yes, now that is interesting.  I think I skimmed over it at the time,
but it raises some good points.  The 200GB virtual file size - sounds
like the "exists" field in the cyrus.expunge file got some totally bogus
value.  Index files get written at an offset calculated by exists rather
than by actual file size so that a failed append doesn't break anything.
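
To make that concrete, here is a small Python sketch of writing at an
offset derived from a record count.  The header and record sizes are
assumptions picked for illustration, not the real on-disk layout, and the
function names are hypothetical.

```python
import io

HEADER_SIZE = 96   # assumed sizes for illustration, not the real layout
RECORD_SIZE = 60


def record_offset(exists):
    # Where record number `exists` (0-based) starts in the file.
    return HEADER_SIZE + exists * RECORD_SIZE


def append_record(f, exists, record):
    """Write one record at the offset implied by `exists`; return the new count."""
    if len(record) != RECORD_SIZE:
        raise ValueError("record has the wrong size")
    f.seek(record_offset(exists))
    f.write(record)
    return exists + 1


def demo():
    f = io.BytesIO(b"\0" * HEADER_SIZE)
    exists = append_record(f, 0, b"A" * RECORD_SIZE)

    # Simulate a failed append: some bytes land on disk, but the count
    # is never bumped.
    f.seek(record_offset(exists))
    f.write(b"junk")

    # The next append seeks by count, so it simply overwrites the junk.
    exists = append_record(f, exists, b"B" * RECORD_SIZE)
    return exists, f.getvalue()
```

The flip side is that a bogus count sends the write far past the real end
of the file.  With the assumed sizes above, record_offset(3_000_000_000)
comes out at roughly 180GB, which on most filesystems would produce a
huge sparse file rather than an error.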

I haven't done any work at solving that issue.  The whole expunge codepath,
despite having been cleaned up a couple of times over the years, could still
do with some more TLC!

