skiplist2 file format - and a question

Sat Oct 1 13:45:24 EDT 2011

On Sat, Oct 01, 2011 at 02:03:32PM -0300, Henrique de Moraes Holschuh wrote:
> On Sat, 01 Oct 2011, Bron Gondwana wrote:
> > Do you see a case where the alignment of the "const char *" for
> > value is significant?
> 
> Yes, it is.  How much depends on the arch, some will just be slowed down in
> some specific cases (x86-64), others will actually bus-fault, so the
> compiler will have to change an unaligned access into two aligned access
> plus magic which translates to even slower access to unaligned data.

What I mean is, given this data:

key = "abcdefghi", keylen = 9
value = "the quick brown fox jumped over the lazy dog", keylen = 44

There are two possible on-disk representations (as bytes):

"abcdefghithe quick brown fox jumped over the lazy dogXXX", len = 56
"abcdefghiXXXXXXXthe quick brown fox jumped over the lazy dogXXXX", len = 64

Where "X" is padding.  The difference in the second case is that the value
is aligned on 64 bit boundaries as well.  I'm assuming this only really
matters if you're actually packing larger binaries into the value field.

> I know at least PPC and SPARC64 "dislike" unaligned access.

For a char *?  What if you have things like we do now with cyrus.cache
where the leading and trailing '(' ')' get "removed" by just passing
(ptr + 1, len - 1) to the next processing function?  Because we're talking
about character strings here.

> However, anything high-performance (x86-64 included) does memory access on
> cacheline-sized units anyway, so you should align data structures inside the
> mmap()'d file based not only on field-type alignment (to avoid giving the
> compiler even more reason to output crappy code), but also align groups of
> fields to cacheline boundaries.
> 
> I'd recommend using 64 bytes as the cacheline size to optimize for.

Every record will be 64 bit aligned - the only question is whether
there is value in aligning the value string inside that record.

> > Still only space for 32 bits worth of records, so we're limited at 4
> > billion records per skiplist.  I can live with that.
> 
> If you have space left until the next record considering cacheline record
> alignment, maybe you could add an extra 32bits of _zero_ padding to that it
> is actually ready to be used as a 64 bits field...

We could change the count field to be 64 bits actually, and put the version
at the end next to the CRC32.

> > Can anyone see any glaring stupidities in this?
> 
> I found none.
> 
> Still, can you make a synthetic workload simulator for this?  You'd be able
> to easily check the effect of various record and field padding strategies,
> as well as use hardware profilling to ask the CPU about cache misses, etc.

I suspect any CPU issue

> > Oh yeah - "dummy" is just a record like everything else, it's the
> > first one.  The "level" counter for the database is of course now
> > just the "level" field in the dummy record.
> 
> Fill some of the dummy record with constant data so that it can be the
> magic and file header, maybe?

No, dummy can't start at 'zero' because that clashes with the "end of
records" marker.

> But _do_ make sure the important part of the file will align well with the
> cacheline boundaries, or performance will tank.

Everything will align at 64 bit boundaries.

> > And a commit record is, on disk, 8 bytes:  \0\0\0\0\0\0\0\1 -
> > unlikely to occur randomly in data!
> 
> What is the scenario you're trying to protect from?

Basically a partial write leaving the DB in an inconsistent state which
a later load can't detect from.  Here's the skiplist 1 version of the
SAFE_TO_APPEND function:

static int SAFE_TO_APPEND(struct db *db)
{
    /* check it's a multiple of 4 */
    if (db->map_size % 4) return 1;

    /* is it the beginning of the log? */
    if (db->map_size == db->logstart) {
        if (*((uint32_t *)(db->map_base + db->map_size - 4)) != htonl(-1)) {
            return 1;
        }
    }

    /* in the middle of the log somewhere */
    else {
        if (*((uint32_t *)(db->map_base + db->map_size - 4)) != htonl(COMMIT)) {
            return 1;
        }

        /* if it's not an end of a record or a delete */
        if (!((*((uint32_t *)(db->map_base + db->map_size - 8)) == htonl(-1)) ||
              (*((uint32_t *)(db->map_base + db->map_size -12)) == htonl(DELETE)))) {
            return 1;
        }
    }

    return 0;
}

In skiplist2, there's no such distinction, we ALWAYS end with a COMMIT after
either creating the file, or repacking it - so we don't need to check
against logstart, and we don't special-case DELETE, so it's just:

/* is this a complete file? */
static int SAFE_TO_APPEND(struct db *db)
{
    /* check it's bigger than the header */
    if (db->map_size <= HEADER_SIZE)
>-------return 1;

    /* and it must be multiple of 8 bytes */
    if (db->map_size % 8)
>-------return 1;

    /* and MUST end with a commit */
    if (*((uint64_t *)(db->map_base + db->map_size - 8)) != COMMIT)
>-------return 1;

    return 0;
}

(yes, the reversed boolean logic is evil)

The thing is - you could THEORETICALLY have 0x0000000000000001 as a
value within data, so a write that ends there could be considered a
commit.  There's no way around that except putting a guard value on
the end of every record though - and even then your guard value
could theoretically appear inside a record too.  So it's only a
heuristic guess that we're valid.

Bron ( aka the reiserfsck problem when you have a loopback reiserfs
       inside your reiserfs - yo dawg, I heard you liked skiplists so
       I stored a skiplist inside your skiplist, or something )