On the topic of nested skiplist transactions

Bron Gondwana brong at fastmail.fm
Tue Sep 2 02:29:33 EDT 2008

I've been thinking a bit about transactions and skiplists these past few
days, since sending the patch that refactors the skiplist locking code.

Basically the issue is this:  any piece of code that MIGHT be used
within a transaction currently needs a way to pass that transaction 
down through all the layers.  This messes up the API quite considerably
and means it needs to keep track of whether it was called within a
transaction so it knows if it's supposed to commit or not.

The alternative is nested transactions, so that you have:

int a()
  struct txn *t;
  DB->fetch(db, key, keylen, &val, &vallen, &t);
  // ...
  DB->store(db, key, keylen, val, vallen, &t);
  DB->commit(db, t); // bloody transaction pointer indirection level
                     // changes in the API
  return 0;

int b()
  DB->fetch(db, key2, key2len, &val2, &val2len, NULL);
  return 0;

That currently works now thanks to the "hacky" transaction promotion

The problem is when you want to do:

int b()
  struct txn *t2;
  DB->fetch(db, key2, key2len, &val2, &val2len, &t2);
  // ...
  DB->store(db, key2, key2len, &val2, &val2len, &t2);
  DB->commit(db, t2);
  return 0;

Suddenly you have transaction nesting.

NOW - this _could_ work.

How it would work.

"struct db" would keep a linked list of transactions rather than just a
current_txn pointer.

  struct txn **tidptr;

if (tidptr == NULL) {
  CASE: transactions on the list 
    - read needs no lock.
    - write needs a new transaction on the list that will autocommit at
      the end of the function
  CASE: no transactions on the list
    - read creates a readlock only
    - write needs a new transaction on the list that will autocommit at
      the end of the function
else if (*tidptr == NULL) {
  - add a new transaction to the list
  - do all operations inside the new transaction
  - *tidptr = newtxn;
else {
  CASE: transactions on the list
	  - assert(*tidptr == list_top->tid);
	CASE: no transactions
	  - abort.

OK - now what happens with locks?

  CASE: transactions on the list
    - don't do any more locking on create
  CASE: no transactions
	  - write lock on create

  CASE: this is the last item on the list
	  - unlock after commit
  CASE: there are further transactions
	  - don't unlock


Each transaction has a base pointer, that points to the end of the
skiplist before this transaction started.  When you rollback, you start
from this point and read each record that was appended.  When you hit
the LAST record you reverse its actions, then truncate to that point and
loop until there's nothing left on the end of the file.

With nested transactions, you rollback to the start of THIS transaction
only.  Let me draw a picture:

  STORE 'abc' '123'
	STORE 'foo' '3456'
	DELETE 'abc'
	  DELETE 'foo'
		STORE 'foo' '789'
	STORE 'bar' 'beer, woot!'
	  DELETE 'bar'
		STORE 'bar' 'girly drink'

foo: 789
bar: beer, woot!

Actual on disk data is:

STORE 'abc' '123'
STORE 'foo' '3456'
DELETE 'abc'
DELETE 'foo'
STORE 'foo' '789'
STORE 'bar' 'beer, woot!'

Notice, we don't actually generate a COMMIT record to disk for commits
where there are further transactions, all we do is update the "end of
transaction pointer" on the parent transaction when we commit, rather
than running the rollback logic to remove the unwanted records.  This
means that a crash will never commit the partial transactions.

Sooo... why do we want this (those of you who have followed along this

Basically - it means you can do a self-contained unit of work without
caring whether it's being called in an outer transaction or not.  Just
start your own "transaction" and call commit at the end.  It still won't
be "committed" and accessible to other users until the outer transaction
makes its commit, but it means you don't have to pass around transaction
pointers and worry about return codes and mapping the abort all the way
out.  Your part of the code can appear to have never happened without
needing to roll back the outer transaction and start over.

It's not actually complex implementation-wise, and it shouldn't break
anything currently in existance because trying this now gets you an
assertion failure.

It will also "unhacky" the way that transient non-transactional reads
are rolled into the wrapping transaction.

Thanks for reading!  Sorry it got so long...


More information about the Cyrus-devel mailing list