skiplist2 file format - and a question

Bron Gondwana brong at fastmail.fm
Sat Oct 1 10:28:58 EDT 2011


Hi all,

I've been playing around with a file format for skiplist2.  I've mostly 
nailed down the details now, but I want an opinion:

Do you see a case where the alignment of the "const char *" for value is 
significant?

The answer to this will determine the amount of padding required per record.


Now - first the justifications:

a) 64 bit pointers, https://bugzilla.cyrusimap.org/show_bug.cgi?id=3494
b) reduce IO for random searches
c) checksums to detect file corruption

And a constraint: don't waste too much space per record.

So the first thing is to consider which fields don't really need to take 32
bits.  I did chop up the header slightly, but then added more space back by
allowing the file to be up to 24 skip levels deep due to the new 64 bit 
nature.

/* offsets of header files */
enum {
     OFFSET_HEADER = 0,
     OFFSET_VERSION = 16,
     OFFSET_VERSION_MINOR = 18,
     OFFSET_NUM_RECORDS = 20,
     OFFSET_LOGSTART = 24,
     OFFSET_LASTRECOVERY = 32,
     OFFSET_FLAGS = 40,
     OFFSET_CRC32 = 44
};

So 16 bits each for VERSION and VERSION MINOR (currently at 2 and 1 
respectively, previously 1 and 2)
Still only space for 32 bits worth of records, so we're limited at 4 
billion records per skiplist.  I can live with that.

Timestamp and logstart are 64 bit now.

There's space to keep some "flags" - not sure what to put there, but one 
thing I was thinking was "sort by level before key on repack" - which 
would make random key searches more linear, but make "foreach" more 
random.  Oh,
and the magic got 4 characters shorter.

OK - now the fields themselves.  First the "type"s.

#define DUMMY 0
#define COMMIT 1
#define ADD 2
#define DELETE 4
/* delete + add */
#define REPLACE 6

These go into 8 bits.  REPLACE is an ADD with an extra field for the 
offset being deleted.

The header of a record is 64 bits like so:

     /* split this into pieces */
     record->vallen = ntohl(*((uint32_t *)(db->map_base + 
record->offset    )));
     record->keylen = ntohs(*((uint16_t *)(db->map_base + record->offset 
+ 4)));
     record->level =       (*((char *)    (db->map_base + record->offset 
+ 6)));
     record->type =        (*((char *)    (db->map_base + record->offset 
+ 7)));

But with overflow checks to allow larger values and keys:

     /* check for value overflow */
     if (record->vallen == UINT32_MAX) {
         if (offset + record->len + 8 > db->map_size)
             goto badsize;
         record->vallen = ntohtll(*((uint64_t *)(db->map_base + 
record->offset + record->len)));
         record->len += 8;
     }

     /* check for key overflow */
     if (record->keylen == UINT16_MAX) {
         if (offset + record->len + 8 > db->map_size)
             goto badsize;
         record->keylen = ntohtll(*((uint64_t *)(db->map_base + 
record->offset + record->len)));
         record->len += 8;
     }

Then after comes the DELETE pointer if it's a delete record:

     /* check for delete pointer */
     if (record->type == DELETE || record->type == REPLACE) {
         if (offset + record->len + 8 > db->map_size)
             goto badsize;
         record->deloffset = ntohtll(*((uint64_t *)(db->map_base + 
record->offset + record->len)));
         record->len += 8;
     }

And the skip pointers are BEFORE the key and value:

     for (i = 0; i < record->level; i++) {
         if (offset + record->len + 8 > db->map_size)
             goto badsize;
         record->offsets[i] = ntohtll(*((uint64_t *)(db->map_base + 
record->offset + record->len)));
         record->len += 8;
     }

What this means is that you can follow skip pointers without parsing 
past the key and value.  Indeed, if you have a really long key and 
you're doing strcmp() you may never need to read any more than the first 
few characters of the key, so the following pages won't ever be mapped 
in if this is not the key you are looking for.

Finally the key and value are accessed:

     record->key = db->map_base + record->offset + record->len;
     record->len += record->keylen; /* XXX: unaligned value - is the 
tradeoff good? */

     record->val = db->map_base + record->offset + record->len;
     record->len += record->vallen;

And the entire record is rounded up to a multiple of 8 bytes.

     /* round up the final figure */
     record->len = roundup(record->len);

     /* and make sure the whole rest fits */
     if (offset + record->len > db->map_size)
 >-------goto badsize;


You will notice that all this code uses structures rather than direct 
pointer accesses, and carefully checks at every point that it fits 
within the mmap.  This is basically a copy and paste of read_record() 
from my current dev tree.

So far, this is space-wise equivalent to the old format on average.  The 
old format was:

chars   field
4       TYPE
4       KEYLEN
%4      KEY
4       VALLEN
%4      VAL
4*level PTR
4       -1 (pointer list finished)

Now PROB is 0.5, so the "average" record will have two pointers, so that 
is a total overhead of 6*4 == 24 bytes per message, plus two 4 byte 
roundings.

The new format so far is:

8       HEADER
8*level PTR
%8      KEY+VAL

With two ptrs, that is 3*8 == 24 bytes! plus a single 8 byte rounding.

So the average size is the same.  I am planning to add another 8 bytes 
however, as two CRC32s.  One over the HEADER and PTR part, and a 
separate one over the key and value.  This is so that the first CRC32 
can be re-calculate every rewrite, and the second only when the record 
is created.  So the final format would be:

8       HEADER
8       VALLENEXT?
8       KEYLENEXT?
8       DELPTR?
8*level NEXTPTR
4       CRC32_HEAD
4       CRC32_VAL
%8      key+val

Or if people decide we need two roundings, then we pad the keylen out as 
well, for an average extra cost of 4 bytes per record.

Can anyone see any glaring stupidities in this?


Oh yeah - "dummy" is just a record like everything else, it's the first 
one.  The "level" counter for the database is of course now just the 
"level" field in the dummy record.

And a commit record is, on disk, 8 bytes:  \0\0\0\0\0\0\0\1 - unlikely 
to occur randomly in data!

Bron.



More information about the Cyrus-devel mailing list