A "Threeskip" database format (because there can never be too many skips)

Bron Gondwana brong at fastmailteam.com
Sun Jun 28 09:55:28 EDT 2020


I've been thinking about this one for a very long time. It's still not going to be as excellent as zeroskip, but it has the advantage of still being a single file, and it fixes two of the three defects of twoskip.

Those three defects are:
1) poor write locality - updates happen throughout the file
2) long repack freeze gives unpredictable performance
3) no MVCC reads - writers block all readers.

Threeskip will fix two of those! Numbers 2 and 3.

For comparison, here's the twoskip ondisk format:

* HEADER: 64 bytes
* magic: 20 bytes: "4 bytes same as skiplist" "twoskip file\0\0\0\0"
* version: 4 bytes
* generation: 8 bytes
* num_records: 8 bytes
* repack_size: 8 bytes
* current_size: 8 bytes
* flags: 4 bytes
* crc32: 4 bytes
*
* RECORDS:
* type 1 byte
* level: 1 byte
* keylen: 2 bytes
* vallen: 4 bytes
* <optionally: 64 bit keylen if keylen == UINT16_MAX>
* <optionally: 64 bit vallen if vallen == UINT32_MAX>
* ptrs: 8 bytes * (level+1)
* crc32_head: 4 bytes
* crc32_tail: 4 bytes
* key: (keylen bytes)
* val: (vallen bytes)
* padding: enough zeros to round up to an 8 byte multiple
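
For illustration, here's that 64-byte header sketched as a C struct. This is only a picture of the field layout above; the real code presumably serialises the fields individually, and on-disk byte order isn't shown:

#include <stdint.h>

struct twoskip_header {
    char     magic[20];     /* 4-byte skiplist magic + "twoskip file\0\0\0\0" */
    uint32_t version;
    uint64_t generation;
    uint64_t num_records;
    uint64_t repack_size;
    uint64_t current_size;
    uint32_t flags;
    uint32_t crc32;         /* header checksum */
};                          /* 20+4+8+8+8+8+4+4 = 64 bytes */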


Threeskip is almost exactly the same, with three record types (and of course with 'threeskip file\0\0\0' in the header):

RECORD
NEWVALUE
DELETE

RECORD:
* type 1 byte
* level: 1 byte
* keylen: 2 bytes
* vallen: 4 bytes
* <optionally: 64 bit keylen if keylen == UINT16_MAX>
* <optionally: 64 bit vallen if vallen == UINT32_MAX>
* latestptr: 8 bytes (new)
* ptrs: 8 bytes * (level+1)
* crc32_head: 4 bytes
* crc32_tail: 4 bytes
* key: (keylen bytes)
* val: (vallen bytes)
* padding: enough zeros to round up to an 8 byte multiple

NEWVALUE:
* type: 1 byte
* padding: 3 bytes
* vallen: 4 bytes
* <optionally: 64 bit vallen if vallen == UINT32_MAX>
* prevptr: 8 bytes
* crc32_head: 4 bytes
* crc32_tail: 4 bytes
* val: (vallen bytes)
* padding: enough zeros to round up to an 8 byte multiple +1

DELETE:
* type: 1 byte
* ZEROS: 7 bytes
* prevptr: 8 bytes
* crc32_head: 4 bytes
* crc32_tail: 4 bytes (ZERO)
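
For illustration, a rough sketch of the fixed-width parts of those three record types as C structs. Field names follow the lists above; the variable-length parts are only indicated in comments, and the real code would read field by field rather than overlaying structs (the optional 64-bit lengths shift everything after them):

#include <stdint.h>

struct ts_record {            /* RECORD */
    uint8_t  type;
    uint8_t  level;
    uint16_t keylen;          /* UINT16_MAX => 64-bit keylen follows */
    uint32_t vallen;          /* UINT32_MAX => 64-bit vallen follows */
    uint64_t latestptr;       /* newest NEWVALUE/DELETE for this key (new) */
    /* uint64_t ptrs[level + 1];          forward pointers */
    /* uint32_t crc32_head, crc32_tail;                    */
    /* key, val, zero padding to an 8-byte boundary        */
};

struct ts_newvalue {          /* NEWVALUE */
    uint8_t  type;
    uint8_t  pad[3];
    uint32_t vallen;          /* UINT32_MAX => 64-bit vallen follows */
    uint64_t prevptr;         /* previous value (or the base RECORD) */
    uint32_t crc32_head;
    uint32_t crc32_tail;
    /* val, then zero padding to an 8-byte multiple + 1    */
};

struct ts_delete {            /* DELETE */
    uint8_t  type;
    uint8_t  zeros[7];
    uint64_t prevptr;
    uint32_t crc32_head;
    uint32_t crc32_tail;      /* always zero */
};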

This gives us MVCC! When you start a read transaction, the transaction includes the length of the file at the time of transaction start. The reader holds the same filehandle open, meaning it can keep reading even after a repack. If it reads past the transaction length, it just moves on to the next record. If it follows the latestptr past the transaction length, it just follows the prevptr back until it's at the latest value within the transaction.

(DELETE already had a prevptr, but it was less powerful)
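
A rough sketch of that lookup. The accessor names are placeholders, and it assumes a zero latestptr means the record has never been overwritten:

#include <stdint.h>

struct db;                                         /* opaque handle */
uint64_t fetch_latestptr(struct db *, uint64_t);   /* hypothetical  */
uint64_t fetch_prevptr(struct db *, uint64_t);     /* hypothetical  */

/* Resolve which version of a record a read transaction should see,
 * given "snap_end", the file length captured at transaction start. */
uint64_t visible_value(struct db *db, uint64_t recordoff, uint64_t snap_end)
{
    uint64_t off = fetch_latestptr(db, recordoff);
    if (!off)
        return recordoff;          /* never overwritten */

    /* follow prevptrs back until the value lies inside the snapshot */
    while (off >= snap_end)
        off = fetch_prevptr(db, off);

    return off;    /* newest NEWVALUE/DELETE visible to this transaction */
}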

*DUAL LOCKING*

Alongside this comes a second aid to MVCC. We use two different write locks, which can be implemented as ranged file locks within the header. Lock number 1 is held for the entire transaction and keeps out any other writer. Lock number 2 is held ONLY for the duration of writing a single record and updating the related pointers. Readers take a shared lock on number 2 only, which guarantees them clean reads.
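
A minimal sketch of that scheme using fcntl() byte-range locks over two bytes of the header (the particular byte offsets here are just an assumption for illustration):

#include <fcntl.h>

#define LOCK1_OFFSET 0   /* transaction lock: one writer at a time     */
#define LOCK2_OFFSET 1   /* record lock: covers a single record write  */

static int range_lock(int fd, int type, off_t start)
{
    struct flock fl = {
        .l_type   = type,      /* F_RDLCK, F_WRLCK or F_UNLCK */
        .l_whence = SEEK_SET,
        .l_start  = start,
        .l_len    = 1,
    };
    return fcntl(fd, F_SETLKW, &fl);
}

/* writer: lock 1 for the whole transaction, lock 2 only around the
 * moment each record and its pointers hit the file */
void write_transaction(int fd)
{
    range_lock(fd, F_WRLCK, LOCK1_OFFSET);

    /* for each record in the transaction: */
    range_lock(fd, F_WRLCK, LOCK2_OFFSET);
    /* ... append the record, update latestptr / skip pointers ... */
    range_lock(fd, F_UNLCK, LOCK2_OFFSET);

    range_lock(fd, F_UNLCK, LOCK1_OFFSET);
}

/* reader: shared lock on the record-lock byte while reading, so it
 * never observes a half-written record or pointer update */
void read_with_clean_view(int fd)
{
    range_lock(fd, F_RDLCK, LOCK2_OFFSET);
    /* ... read records / capture the snapshot length ... */
    range_lock(fd, F_UNLCK, LOCK2_OFFSET);
}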

*NON-BLOCKING REPACK*

The repack process can take an MVCC read and stream those records to a new file $DB.new (on which it holds an exclusive lock to avoid racing repacks), and then replay the log from there. Even the log replay can be done with read locks while the log length is over a certain amount (though this can lead to starvation, so it should only be repeated a limited number of times). The final fsync and rename need to be done with a write lock on the source file, of course.

When reading the log, if we hit a NEWVALUE or DELETE, we follow the prevptrs back to find the key it references.
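
Something like this, with hypothetical accessors:

#include <stdint.h>

struct db;
int         is_base_record(struct db *, uint64_t);   /* hypothetical */
uint64_t    prevptr_of(struct db *, uint64_t);       /* hypothetical */
const char *key_of(struct db *, uint64_t);           /* hypothetical */

/* a NEWVALUE or DELETE carries no key, so chase prevptr back to the
 * base RECORD, which does */
const char *key_for(struct db *db, uint64_t off)
{
    while (!is_base_record(db, off))
        off = prevptr_of(db, off);
    return key_of(db, off);
}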

It might be necessary to have a "repackd" which does the actual repacking, or some other way for a process to say "I don't mind waiting on the repack", because otherwise the repack freeze will still be visible to the current client, even if other readers and writers can continue.
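
In outline the repack would look something like this; every function here apart from fsync() is a placeholder, and the threshold and round limit are arbitrary:

#include <stdint.h>
#include <unistd.h>

struct db;
int      open_locked_new_file(struct db *);                       /* all placeholders */
uint64_t begin_read_txn(struct db *);
void     end_read_txn(struct db *);
void     stream_records(struct db *, int newfd, uint64_t snap_end);
uint64_t log_length_since(struct db *, uint64_t snap_end);
uint64_t replay_log_from(struct db *, int newfd, uint64_t snap_end);
void     take_write_lock(struct db *);
void     release_write_lock(struct db *);
void     rename_new_over_old(struct db *);

#define REPLAY_THRESHOLD 4096   /* arbitrary */
#define MAX_ROUNDS       3      /* bound the read-locked replays */

void repack(struct db *db)
{
    int newfd = open_locked_new_file(db);   /* exclusive lock on $DB.new */

    /* MVCC read: stream every record visible at "snap_end" */
    uint64_t snap_end = begin_read_txn(db);
    stream_records(db, newfd, snap_end);
    end_read_txn(db);

    /* catch up under read locks while the outstanding log is long;
     * bounded so a busy writer can't starve the repack forever */
    int rounds = 0;
    while (log_length_since(db, snap_end) > REPLAY_THRESHOLD
           && rounds++ < MAX_ROUNDS)
        snap_end = replay_log_from(db, newfd, snap_end);

    /* final catch-up, fsync and rename happen under the write lock */
    take_write_lock(db);
    replay_log_from(db, newfd, snap_end);
    fsync(newfd);
    rename_new_over_old(db);
    release_write_lock(db);
}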

*STORAGE COST COMPARISON*

This adds an extra 8 bytes of overhead to every RECORD. The DELETE record is the same size. The NEWVALUE record no longer carries the cost of an extra copy of the key. So threeskip will use slightly more storage on newly repacked databases.

(The base-level average overhead of a twoskip record is 40 bytes; this makes it 48.)

*COMPUTATIONAL AND IO COST COMPARISON*

Calculating "repack_size" will be a little more costly, but not too bad.

Replacing values will be much cheaper because we don't need to update pointers. In theory we could even "row lock" or "gap lock" if we wanted other concurrency levels.
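
A sketch of a replace, with hypothetical helpers, to show what "no pointer updates" means in practice: one appended NEWVALUE plus a single 8-byte latestptr rewrite, instead of relinking forward pointers at every level.

#include <stdint.h>
#include <stddef.h>

struct db;
uint64_t current_value_offset(struct db *, uint64_t);                   /* hypothetical */
uint64_t append_newvalue(struct db *, const char *, size_t, uint64_t);  /* hypothetical */
void     set_latestptr(struct db *, uint64_t, uint64_t);                /* hypothetical */

void replace_value(struct db *db, uint64_t recordoff,
                   const char *val, size_t vallen)
{
    uint64_t prev = current_value_offset(db, recordoff);   /* old newest value  */
    uint64_t off  = append_newvalue(db, val, vallen, prev); /* appended at EOF  */
    set_latestptr(db, recordoff, off);                      /* one in-place 8-byte write */
}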

*CRC ALGORITHM*

Since we have a choice with a new format, clearly we'd go with something fast! CRC32c with hardware acceleration all the way.

*CRASH SAFETY AND THE OTHER TWOSKIP BENEFITS*

All still there just as much as ever. Of course, write locality is helped slightly (overwrite and delete only touch the current record) but it's still the issue that will make zeroskip a better DB format.

Bron.

--
 Bron Gondwana, CEO, Fastmail Pty Ltd
 brong at fastmailteam.com
