FastMail Cyrus Patches for upstream

Bron Gondwana brong at
Thu Aug 27 18:55:31 EDT 2009

On Thu, Aug 27, 2009 at 01:58:27PM -0400, Brian Awood wrote:
> On Wednesday 26 August 2009 @ 21:42, Bron Gondwana wrote:
> >
> > I have a fix for that which avoids the hilarity.  It's in that
> > queue, and it makes all cache accesses go through one API which
> > does bounds checking.
> I updated my clone of your git repository and took a look at the 
> changes.  Having an API to access cache information will definitely 
> be an improvement over using macros all over the place.  :)  We 
> really appreciate your contributions and usually have at least a 
> couple of FastMail patches in our code base.
> Anyway, I hope you didn't/don't take my comments the wrong way.  I was 
> just hoping that if the next release would be bumping the index 
> revision, we could also add the cache record lengths at the same 
> time.  We'd be willing to contribute a patch for it.  However, if it 
> looks like no one agrees with our assessment we won't expend the 
> effort, since there's no way we could maintain our own revision of 
> the meta files.

Feedback is great.  I really appreciate it.  I originally actually
would have agreed with you that the cache record length should be in
the index record (possibly replacing one of header_size or content_offset,
since they're always the same!) - but I don't think the argument for
having it holds water.

a) it's a pain to update.  For cache_crc I've special-cased "0" so we
don't have to read every single cache record.  You'd have to do the same
here, or have an "options" value like pop3_uidl so that a mailbox could
be "upgraded" during a process_records run or something.  Don't worry,
I've already considered this for my "get shared seen back into the
index file rather than in a bloody huge non-findable skiplist file that
can't be replicated to a new server easily".  It's doable.
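
The cache_crc special-case in (a) can be sketched roughly like this; the struct and field names are illustrative placeholders, not Cyrus's actual definitions:

```c
#include <stdint.h>

/* Minimal sketch (not Cyrus's real structures) of the "0 means
 * never computed" convention: legacy records carry cache_crc == 0,
 * so an upgrade doesn't have to read and checksum every cache
 * record up front.  A record can be "upgraded" lazily the next
 * time it is rewritten. */
struct index_record {
    uint32_t cache_offset;
    uint32_t cache_crc;   /* 0 == legacy record, CRC never computed */
};

static int cache_crc_ok(const struct index_record *rec, uint32_t actual_crc)
{
    if (rec->cache_crc == 0)
        return 1;         /* legacy record: nothing to compare against */
    return rec->cache_crc == actual_crc;
}
```

New and rewritten records get a real CRC; old ones keep passing until they are rewritten, so no bulk upgrade pass over the whole cache file is needed.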


b) it's unnecessary.  If the record is not broken, then parsing out the
length only takes 10 operations, and it's not like we have to do it that
often.
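
That calculation can be sketched as a walk over the record's length-prefixed items.  The item count, on-disk byte order, and absence of padding here are assumptions for illustration, not taken from the Cyrus source:

```c
#include <stdint.h>
#include <stddef.h>
#include <string.h>
#include <arpa/inet.h>

#define NUM_CACHE_ITEMS 10   /* assumed count of length-prefixed items */

/* Recover a cache record's total length by skipping each of its
 * items in turn: one bounds-checked "read size, advance" step per
 * item.  Returns -1 if any item would run past the end of the
 * mapped file. */
static int64_t cache_record_len(const uint8_t *base, size_t mapsize,
                                size_t offset)
{
    size_t p = offset;
    for (int i = 0; i < NUM_CACHE_ITEMS; i++) {
        uint32_t itemlen;
        if (p + 4 > mapsize)
            return -1;                 /* size word runs off the map */
        memcpy(&itemlen, base + p, sizeof itemlen);
        itemlen = ntohl(itemlen);      /* assume big-endian on disk */
        p += 4;
        if (itemlen > mapsize - p)
            return -1;                 /* item data runs off the map */
        p += itemlen;
    }
    return (int64_t)(p - offset);
}
```

Because every step is bounds-checked against the mapped size, a garbage offset yields -1 instead of a garbage length.
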
> Re-reading what I wrote below, I realize that I made it sound like our 
> cache offsets are out of bounds of the file, but that isn't the case.  
> The offsets we see are completely valid and point to the correct 
> cache records.  

So there shouldn't be any problem then...

> > > Initially, it seems like index might be corrupted with a bogus
> > > offset for message 110, which just happens to be within range,
> > > but it isn't. The record it points to is valid and matches the
> > > message on disk.  So it seems like there's just a bug somewhere
> > > causing the cache file to be written out of order, but I think
> > > there are possibly other issues being caused by calculating the
> > > cache record length each time it's needed.  One example might be
> > > that we periodically find a cache file that is unnecessarily
> > > large for the number of messages that are present (sometimes to
> > > the point of not being able to mmap it).
> >
> > No - the calculation of record length each time it's needed is only
> > a problem when the pointer into the cache file is garbage. In which
> > case getting the correct length bit of garbage is still going to
> > ruin your day later on when it tries to parse the internal
> > structure of the cache record and gets all confused.
> I understand your point, but it assumes the cache records are in 
> sequential order which, unfortunately, isn't always the case.  
> Conversely, if you have a bogus offset, using it to calculate the 
> record length will be bogus too.  I could create some examples to 
> demonstrate what I'm trying to explain, if anyone is interested.

Ahh - there were a couple of bugs.  One that wrote everything between
two records to the file, and another that truncated incorrectly on
failed append, leaving all sorts of extra junk in cache files.  They
should both be fixed in 2.3.14 I think.  Though cache files could still
contain junk on disk until a rewrite gets the whole thing!
> > The only thing that makes it work OK is pointing to the correct
> > location.
> >
> > There's a bug in process_records as used by ipurge even in the
> > current release (I think - I pushed the fix to CVS a while back),
> > whereby it used an incompatible set of flags.  Basically you can't
> > do an immediate delete EVER when delayed delete is present, or the
> > cache offsets get screwed. The easy fix was to make ipurge a
> > delayed delete.  I'm not sure that unexpunge is fixed at all...
> > because I don't use it.  But I think it's OK now.
> I remember seeing your post about that, we aren't currently using 
> ipurge, though we give users access to the unexpunge functionality 
> via a webapp.  

Unexpunge is the only thing I can think of that might give out of
order cache records.  Though I _think_ it may actually be totally
broken with respect to cache records at the moment, so I would
recommend a reconstruct after any unexpunge.

> > > It seems like all these would be non-issues if the length of each
> > > cache record was stored along with the offset in the index file
> > > rather than trying to calculate it based on the offsets.  Does
> > > anyone know if there is a reason the cache record length isn't
> > > stored along with the offset?
> >
> > No, they'd still be issues.  The correct fix is to ensure there's
> > no code path where legitimate use creates invalid cache pointers. 
> > The CRC checking is orthogonal to that except that it will make it
> > even more clear when (if, say if) we mess it up.
> The point I'm trying to make is that the offsets are legitimate, it's 
> just that they aren't in the order the code assumes them to be.  Take 
> for example if a user expunges every other message, or all of the 
> most recent messages except for the first few.  Then, when they copy 
> the remaining messages to another mailbox, the code ends up copying 
> all of the cache records, instead of just the records of the remaining 
> messages.  I'm not sure yet how we end up with offsets for higher 
> UIDs which are smaller than offsets for lower UIDs, but when you 
> calculate the record length with two of these valid offsets, the 
> length is definitely not valid!  Passing a negative length to write() 
> is a fail. ;)

Ok - so that's a bug that needs fixing, if it's still present in CVS.

I _thought_ I had fixed it.  Fixing this bug doesn't require having
the record length in the index file though.  It can be done just by
reading the record itself - and we now have a nice easy API for that,
which returns the length and does CRC and internal structure checks
at the same time!
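
A toy illustration of why the offset-difference shortcut bites when records are out of order (hypothetical names, not Cyrus code):

```c
#include <stdint.h>

/* Deriving a record's length as "next record's offset minus this
 * record's offset" assumes offsets increase with UID.  When they
 * don't, the difference goes negative (or, in 32-bit unsigned
 * arithmetic, wraps to a huge bogus size). */
static int64_t len_by_offset_diff(uint32_t this_off, uint32_t next_off)
{
    /* Subtract in a signed 64-bit type so the bug is visible. */
    return (int64_t)next_off - (int64_t)this_off;
}
```

With in-order offsets 1000 and 1500 this yields 500; with the out-of-order pair 5000 and 1200 it yields -3800, exactly the sort of value that fails when handed to write().  Reading the record itself avoids the ordering assumption entirely.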

> I suspect you might start to detect at least some of these cases in 
> the form of apparently incorrect cache record checksums.

Definitely.  If anything is going odd, chances are the record checksums
will be wrong!

