skiplist2 file format - and a question

Henrique de Moraes Holschuh hmh at debian.org
Sat Oct 1 16:31:44 EDT 2011


On Sat, 01 Oct 2011, Bron Gondwana wrote:
> On Sat, Oct 01, 2011 at 02:03:32PM -0300, Henrique de Moraes Holschuh wrote:
> > On Sat, 01 Oct 2011, Bron Gondwana wrote:
> > > Do you see a case where the alignment of the "const char *" for
> > > value is significant?
> > 
> > Yes, it is.  How much depends on the arch: some will just be slowed down
> > in specific cases (x86-64), others will actually bus-fault, so the
> > compiler has to turn an unaligned access into two aligned accesses plus
> > fix-up magic, which translates to even slower access to unaligned data.
> 
> What I mean is, given this data:
> 
> key = "abcdefghi", keylen = 9
> value = "the quick brown fox jumped over the lazy dog", vallen = 44
> 
> There are two possible on-disk representations (as bytes):
> 
> "abcdefghithe quick brown fox jumped over the lazy dogXXX", len = 56
> "abcdefghiXXXXXXXthe quick brown fox jumped over the lazy dogXXXX", len = 64

I thought you meant the alignment of the (char *) pointer itself, not of the
char data it references.

If you're going to access "value" every time you access "abcdefghi", and you
pack them together, you will get more of "value" hot in the cache, as the
CPU will load into L3->L2->L1 a block of cacheline size (which is 64 BYTES
on modern Intel Xeon, for example).

So, the real question is: how much of "value" are you going to be accessing
most of the time, and will the addition of the padding cause it to require a
second cacheline fetch?

Aligned *(char *) access _can_ be faster for string operations: you're not
accessing a single *(char *), you are accessing a number of characters
starting at (char *), and either the microcode or libc may well decide to
optimize that into 32-bit or 64-bit compares when possible.

So, if the padding will not cause it to spill into a new cacheline, do the
padding.
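The spill test above can be written down directly.  This is a hypothetical
sketch (the helper names and the 8-byte value alignment are my assumptions,
not anything from the skiplist2 code), assuming a 64-byte cacheline and a
record that starts on a cacheline boundary:

```c
#include <stddef.h>

/* Hypothetical helpers, not from the skiplist2 code. */
#define CACHELINE 64

/* Round x up to the next multiple of align (align must be a power of two). */
static size_t round_up(size_t x, size_t align)
{
    return (x + align - 1) & ~(align - 1);
}

/* Cachelines touched by a record of len bytes that starts on a
 * cacheline boundary. */
static size_t cachelines_used(size_t len)
{
    return round_up(len, CACHELINE) / CACHELINE;
}

/* Does padding the value to an 8-byte boundary after the key cost an
 * extra cacheline fetch, compared to packing key and value tightly? */
static int padding_spills(size_t keylen, size_t vallen)
{
    size_t packed = keylen + vallen;
    size_t padded = round_up(keylen, 8) + vallen;
    return cachelines_used(padded) > cachelines_used(packed);
}
```

With the example record above (keylen 9, vallen 44) the padded form is 60
bytes and still fits in one cacheline, so the padding is free; grow the
value a little and it spills into a second line.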

> > However, anything high-performance (x86-64 included) does memory access on
> > cacheline-sized units anyway, so you should align data structures inside the
> > mmap()'d file based not only on field-type alignment (to avoid giving the
> > compiler even more reason to output crappy code), but also align groups of
> > fields to cacheline boundaries.
> > 
> > I'd recommend using 64 bytes as the cacheline size to optimize for.
> 
> Every record will be 64 bit aligned - the only question is whether
> there is value in aligning the value string inside that record.

64 *bytes* is a much better choice for record alignment, unless the record
is smaller than 32 bytes.  In that case, you'd align the record to 32 bytes,
and align groups of records that are likely to be accessed in sequence at 64
bytes.

This is valid for a processor with a cacheline size of 64 bytes.
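That alignment rule can be sketched as a one-liner (record_stride is a
hypothetical helper of mine, again assuming a 64-byte cacheline):

```c
#include <stddef.h>

/* Hypothetical helper: on-disk stride for a record of reclen bytes.
 * Records of 32 bytes or more get a full 64-byte cacheline (or a
 * multiple of it); smaller records are aligned to 32 bytes so that
 * two of them can share one cacheline. */
static size_t record_stride(size_t reclen)
{
    size_t align = (reclen < 32) ? 32 : 64;
    return (reclen + align - 1) & ~(align - 1);
}
```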

> > > Still only space for 32 bits worth of records, so we're limited at 4
> > > billion records per skiplist.  I can live with that.
> > 
> > If you have space left until the next record considering cacheline record
> > alignment, maybe you could add an extra 32 bits of _zero_ padding so that
> > it is actually ready to be used as a 64-bit field...
> 
> We could change the count field to be 64 bits actually, and put the version
> at the end next to the CRC32.

Would it be better for data locality?  If it is not going to be worse, IMHO
it is probably best to just go 64-bit already.

> > Still, can you make a synthetic workload simulator for this?  You'd be able
> > to easily check the effect of various record and field padding strategies,
> > as well as use hardware profiling to ask the CPU about cache misses, etc.
> 
> I suspect any CPU issue

It is the best way to know for sure, and if it is self-contained, you can
easily request test runs on different CPUs.

> > But _do_ make sure the important part of the file will align well with the
> > cacheline boundaries, or performance will tank.
> 
> Everything will align at 64 bit boundaries.

That's the ideal alignment for field access, but you want one extra
alignment: you want all fields that you're going to access close in time
(or, if they're small enough, groups of records that you're going to access
close in time) to be inside as few cachelines as possible.  That means
cacheline-size alignment is also important.

Cachelines are large: a typical performance x86-64 part has a cacheline size
of 64 *bytes*, not bits.  All memory access (DRAM, L3, L2 and L1) is done in
entire cachelines at a time.

I'm sure you know all that, but since there was at least one 64-byte/64-bit
confusion, I'd rather write it down for future reference in case anyone
reads this thread in the archives.
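In C terms, the "group hot fields into one cacheline" point can be
illustrated with C11 alignment.  The struct below is purely hypothetical
(it is NOT the skiplist2 record layout); it just shows eight 64-bit fields
filling exactly one 64-byte cacheline:

```c
#include <stdint.h>

/* Hypothetical record header, not the actual skiplist2 layout: eight
 * 64-bit fields packed so the whole header fills exactly one 64-byte
 * cacheline.  Touching any one field pulls the whole header into L1. */
struct rec_header {
    _Alignas(64) uint64_t type;   /* record type tag */
    uint64_t keylen;              /* length of the key */
    uint64_t vallen;              /* length of the value */
    uint64_t next[5];             /* skip-pointer offsets (illustrative) */
};
```

For the mmap()'d file itself the alignment comes from the record offsets
you choose rather than from the compiler, but the arithmetic is the same:
keep each hot group inside one 64-byte-aligned span.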

> > What is the scenario you're trying to protect from?
> 
> Basically a partial write leaving the DB in an inconsistent state which
> a later load can't detect.  Here's the skiplist 1 version of the
> SAFE_TO_APPEND function:

...

> In skiplist2, there's no such distinction, we ALWAYS end with a COMMIT after
> either creating the file, or repacking it - so we don't need to check
> against logstart, and we don't special-case DELETE, so it's just:
> 
> /* is this a complete file? */
> static int SAFE_TO_APPEND(struct db *db)
> {
>     /* check it's bigger than the header */
>     if (db->map_size <= HEADER_SIZE)
>         return 1;
> 
>     /* and it must be multiple of 8 bytes */
>     if (db->map_size % 8)
>         return 1;
> 
>     /* and MUST end with a commit */
>     if (*((uint64_t *)(db->map_base + db->map_size - 8)) != COMMIT)
>         return 1;

You might want to validate the checksum/CRC/hash/whatever of the last block,
as well.
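A minimal sketch of such a tail check, assuming the commit block carries a
CRC-32 over the bytes it covers.  I'm using the standard reflected CRC-32
(zlib/PNG polynomial 0xEDB88320) here; the actual skiplist2 checksum
algorithm and field layout may well differ:

```c
#include <stddef.h>
#include <stdint.h>

/* Bitwise reflected CRC-32 (polynomial 0xEDB88320), as used by zlib. */
static uint32_t crc32_sw(const unsigned char *buf, size_t len)
{
    uint32_t crc = 0xFFFFFFFFu;
    for (size_t i = 0; i < len; i++) {
        crc ^= buf[i];
        for (int b = 0; b < 8; b++)
            crc = (crc >> 1) ^ ((crc & 1) ? 0xEDB88320u : 0);
    }
    return ~crc;
}

/* Hypothetical tail check: does the stored CRC match the bytes of the
 * last block?  The offsets and field layout are illustrative. */
static int tail_crc_ok(const unsigned char *block, size_t len,
                       uint32_t stored_crc)
{
    return crc32_sw(block, len) == stored_crc;
}
```

SAFE_TO_APPEND would then call something like tail_crc_ok() on the final
block before trusting the trailing COMMIT.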

Notice that if cacheline alignment is being used for the record, and the
record fits inside a single cacheline, you will have the entire record
already on L1 cache if you try to read _any_ of the record's fields, so
checksumming/CRC-checking/hashing it can happen entirely from L1 cache.

-- 
  "One disk to rule them all, One disk to find them. One disk to bring
  them all and in the darkness grind them. In the Land of Redmond
  where the shadows lie." -- The Silicon Valley Tarot
  Henrique Holschuh


More information about the Cyrus-devel mailing list