Skiplists to the Nth degree

Tue Jan 1 22:33:50 EST 2008

Replying to myself yet again - I guess all you other lucky
sods are on holiday or something :)

On Mon, Dec 31, 2007 at 02:10:49PM +1100, Bron Gondwana wrote:
> Ho hum - there's also this little bit of trouble with the 
> SAFE_TO_APPEND code that I discovered reading the code again.
> I don't know if this can explain everything, but it is still
> worth fixing.
> 
> Basically, if you hand a pointer to a null transaction to one
> of the read functions (myfetch, myforeach) then it locks the
> file in write mode and begins a new transaction... but it
> doesn't check SAFE_TO_APPEND.
> 
> Later you do a write, and it still doesn't check SAFE_TO_APPEND
> because it already has a transaction.
> 
> This patch fixes the problem by moving the SAFE_TO_APPEND check
> into newtxn so it gets called from all codepaths that start a
> new transaction.
> 
> Bron.
> 
> http://cyrus.brong.fastmail.fm/patches/cyrus-skiplist-safelock-2.3.11.diff

Speaking of "SAFE_TO_APPEND" - it contains two bugs.  The first
happily protecting the second.

And the whole lot found while looking at a third.

Ok - bug the first.  A logic bug.  If you have a transaction that 
looks like this:

...
COMMIT - 4 bytes (0x000000FF)
DELETE - 4 bytes (0x00000004)
<ptr>  - 4 bytes (0xDEADBEEF)
COMMIT - 4 bytes (0x000000FF)

then that won't look like the pattern:

<-1>   - 4 bytes (0xFFFFFFFF)
COMMIT - 4 bytes (0x000000FF)

which is what SAFE_TO_APPEND is allegedly looking at for the end
of a file.

This was somewhat mitigated by bug the second in this nice complex
little statement:

 || (db->map_size != db->logstart &&
    *((bit32 *)(db->map_base + db->map_size - 8)) != htonl(-1) &&
    *((bit32 *)(db->map_base + db->map_size - 4)) != htonl(COMMIT)));

Which only returns true if BOTH expressions are true.  The stack
of negatives and the complexity of the statement here meant a piece
of logic was lost.  That should be an || on the second last line
to make this work (plus appropriate parentheses)

Or maybe you could just rewrite the whole messy thing as separate
statements and improve the clarity, especially since you really
want to handle the DELETE case once you fix this second bug.

Finally, the third issue which caused me to investigate all this.
After a checkpoint, the first process to write to the skiplist
often seems to do a recovery.

Actually, it's pretty much every time that the first process
already had this DB open, because write_lock might open a new
file, but doesn't re-read the logstart value, so it's looking
for a COMMIT record rather than a final <-1>.

And gosh - now that I think about it, imagine if that new record
also just happened to have a higher level and hence required
the header to be re-written to add a new curlevel.  You'd wind up
with an incorrect logstart value.  I guess that I've found the
underlying cause of incorrect logstart values!

Ok - I've update the "safelock" patch referenced above with the
fixes for all these.  Basically, every time a read_lock or
write_lock is made, the entire header is re-read.  Should be
pretty cheap really (even for foreach statements)

Bron.
-------------- next part --------------
A non-text attachment was scrubbed...
Name: cyrus-skiplist-safelock-2.3.11.diff
Type: text/x-diff
Size: 4967 bytes
Desc: not available
Url : http://lists.andrew.cmu.edu/pipermail/cyrus-devel/attachments/20080102/4f3c71b2/attachment.bin