seen db

Bron Gondwana brong at fastmail.fm
Wed Jun 11 10:43:57 EDT 2008


On Wed, 11 Jun 2008 15:07:02 +0200, "Rudy Gevaert" <Rudy.Gevaert at UGent.be> said:
> Bron Gondwana wrote:
> 
> > Try a 2.6.20 kernel, just for an interesting datapoint.  We changed
> > back to 2.6.20 (64 bit still) and haven't seen a corrupted seen file
> > since.
> 
> I hope to try that still today.
> 
> I'm now running on 2.6.24-2, 32bit.  I have cleaned up the users that 
> were having a corrupted mailbox on replica.  Surprisingly I can count 
> them on both hands.
> 
> So now I'm again running with rolling replication and I'm doing a 
> sync_client session for each user.  When that is finished I'll try to 
> downgrade the kernel.
> 
> Btw, I tested my sarge-> etch upgrade in a xen virtual machine, 64bit 
> kernel + 32 bit userspace.  But this was 2.6.18.
> 
> I'm still wondering if I should run 2.6.20 in 32bit or 64bit...

It's been fine for us as 64bit for a while now.

Though note - 64bit allows a much larger process address space, which
allows broken cache files to REALLY SCREW WITH YOU.  Bah.  We have 4GB
core dumps being written into our cores directory - and let me tell you,
while something is dumping core it uses some trick which totally
starves all other IO on the same device.  It's as if the dump had been
ioniced right up to the top priority.  Ouch.

The cause - mailbox_cache_size hits a bogus "length" field and returns
something like 1.7GB as the size of the record.  This then causes an
xrealloc to "size * 2", or 3.4GB.

At least, that's the cause for one mailbox that's been causing us fun.
In a second I'll gdb that awfully large core and figure out which mailbox
is the culprit.  One reconstruct later....
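
To make the arithmetic above concrete, here's a rough sketch of how one
bad length field turns into a multi-gigabyte allocation request.  This is
not the Cyrus code: read_be32 and doubled_request are made-up names, and
the 4-byte big-endian length layout is an assumption for illustration.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Decode a 4-byte big-endian length field (assumed layout, illustration only). */
static uint32_t read_be32(const unsigned char *p)
{
    return ((uint32_t)p[0] << 24) | ((uint32_t)p[1] << 16) |
           ((uint32_t)p[2] << 8)  |  (uint32_t)p[3];
}

/* Hypothetical helper: the number of bytes a "size * 2" growth policy would
 * request after trusting the length field.  Computed in 64 bits so the
 * doubling itself cannot wrap around. */
static uint64_t doubled_request(const unsigned char *len_field)
{
    assert(len_field != NULL);
    return (uint64_t)read_be32(len_field) * 2;
}
```

Feed it the four bytes 0x65 0x53 0xF1 0x00 (1,700,000,000 decimal) and the
request comes out at 3,400,000,000 bytes.  A 32bit process would most
likely have that allocation refused; a 64bit one can be handed the memory.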

> >>> Oh - can you tell me.  Did the file checkpoint sometime not too long before it
> >>> got corrupted?
> >> The cases I saw it did.
> > 
> > Ditto here.  Interesting.  They also had quite long records, but
> > I don't know how common that is.  Lots of little bits of seen
> > spread around the space.
> 
> I'm not sure how I would see that?  I'm not familiar with the internals 
> of skiplist.

I find they show up pretty well as ^@^@^@^@^@^@ in less.  The skiplist format
doesn't have many all-zero blocks otherwise.  Lots of other special characters
show up for binary bits.
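
If you'd rather not eyeball it in less, the same check can be roughed out
in code.  This is only a sketch: longest_zero_run is a made-up helper, and
how long a run has to be before it counts as suspicious is left to you.
A long run is a hint, not proof of corruption.

```c
#include <assert.h>
#include <stddef.h>

/* Return the length of the longest run of NUL bytes in buf[0..len). */
static size_t longest_zero_run(const unsigned char *buf, size_t len)
{
    size_t best = 0, cur = 0;

    assert(buf != NULL || len == 0);
    for (size_t i = 0; i < len; i++) {
        if (buf[i] == 0) {
            /* Extend the current run and remember the longest seen so far. */
            if (++cur > best)
                best = cur;
        } else {
            cur = 0;   /* any non-zero byte ends the run */
        }
    }
    return best;
}
```

Read the seen file into a buffer, call this on it, and compare the result
against whatever block size you consider abnormal.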

Sadly, I can pretty much read a hexdump of a skiplist.  Sad because that's a
lot of braincells that could be doing something useful like absorbing alcohol.


I've written a little patch for the mailbox_cache_size issue that returns 0
if the result ever looks like it's going negative or more than 100 million
bytes.  Then sync_support is patched to treat a zero cache size as "say we
failed to reserve this message".  It will do for now...
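
The shape of that check is roughly as follows.  This is a sketch, not the
patch itself: sane_cache_size, reserve_ok and MAX_SANE_CACHE_SIZE are
names invented here, and the real functions take different arguments.

```c
#include <stdint.h>

/* Upper bound from the description above: 100 million bytes. */
#define MAX_SANE_CACHE_SIZE 100000000L

/* Hypothetical guard: return 0 when the computed record size is negative
 * or implausibly large, otherwise pass the size through unchanged. */
static long sane_cache_size(int64_t computed)
{
    if (computed < 0 || computed > MAX_SANE_CACHE_SIZE)
        return 0;
    return (long)computed;
}

/* Hypothetical caller side: a zero size is treated as "could not reserve
 * this message", so nothing is ever allocated for a bogus length. */
static int reserve_ok(int64_t computed)
{
    return sane_cache_size(computed) != 0;
}
```

The tradeoff is that a record whose size really is zero gets reported as a
failure too, which is acceptable for a stopgap.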

Bron ( also found a theoretical bug in the skiplist code and patched it today,
       but I might fix the whole function before I submit it upstream.
       I say theoretical because I don't see that the codepath gets exercised
       unless you already have a corrupt file, so meh )
-- 
  Bron Gondwana
  brong at fastmail.fm


