Cyrus database and file usage data

Sun Jun 12 10:05:13 EDT 2016

On Fri, Jun 10, 2016, at 10:15, Bron Gondwana via Cyrus-devel wrote:
> On Fri, Jun 10, 2016, at 06:40, Thomas Jarosch via Cyrus-devel wrote:
> > Hi Bron,
> > 
> > Am 08.06.2016 um 08:22 schrieb Bron Gondwana via Cyrus-devel:
> > > *THE PLAN[tm]*
> > >  
> > > For JMAP support, I'm going to discard the existing conversations DB and
> > > create a sqlite database per user which contains everything of value. 
> > 
> > one thing to watch out for with sqlite:
> > 
> > It doesn't scale easily with multiple processes accessing the same DB.
> > The write-lock timeout is short by default and a "modifying"
> > query might error out.
>
> Yeah, I know - which is why I've been locking around it with an exclusive
> file lock so only one process can hit it at a time.
> 
> You'd think that would ruin performance, but I haven't actually had too
> much trouble.  The conversations DB is already a per-user exclusive lock
> whenever you've got any mailbox open right now.
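
As a rough illustration of the locking described in the quoted text, here is a minimal Python sketch that takes an exclusive file lock before touching a per-user sqlite database. This is not the Cyrus code: the sidecar ".lock" file, the helper names exclusive_user_db and bump_counter, and the counter table are all assumptions made for the example, and fcntl.flock is Unix-only.

```python
import fcntl
import os
import sqlite3
from contextlib import contextmanager


@contextmanager
def exclusive_user_db(db_path):
    """Open db_path while holding an exclusive lock on a sidecar lock file."""
    # Hypothetical convention: the lock lives next to the database file.
    lock_fd = os.open(db_path + ".lock", os.O_CREAT | os.O_RDWR, 0o600)
    try:
        # Blocks until no other process holds the lock, so sqlite never
        # sees two writers and its own short write-lock timeout never fires.
        fcntl.flock(lock_fd, fcntl.LOCK_EX)
        conn = sqlite3.connect(db_path)
        try:
            yield conn
            conn.commit()  # only reached if the caller's block did not raise
        finally:
            conn.close()
    finally:
        fcntl.flock(lock_fd, fcntl.LOCK_UN)
        os.close(lock_fd)


def bump_counter(db_path):
    """Example read-modify-write that is safe across processes."""
    with exclusive_user_db(db_path) as conn:
        conn.execute("CREATE TABLE IF NOT EXISTS counter (n INTEGER)")
        row = conn.execute("SELECT n FROM counter").fetchone()
        if row is None:
            conn.execute("INSERT INTO counter VALUES (1)")
            return 1
        conn.execute("UPDATE counter SET n = n + 1")
        return row[0] + 1
```

Every process that goes through exclusive_user_db is serialised per user, which matches the existing per-user exclusive lock on the conversations DB in spirit.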

The more I think about this, the more I'm worried that it's a half-arsed solution.

I already knew it was a stopgap on the way to a fully stateless server.  To be
able to synchronously "backup" to another server, we need to cheaply sync
the state to some central server.

Which basically means log shipping.  You can do that pretty quickly with the
skiplist/twoskip format by just saving the end of the file each time, and having
the "restore from hard crash" process be a recovery2 on the file - walking the
file and applying every change while ignoring the pointers entirely.
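
To make the log-shipping idea concrete, here is a minimal Python sketch of an append-only log where each commit ships only the bytes added since the last shipment, and recovery replays every record in order. The record layout (two big-endian lengths, then key and value) and the DELETE marker are made up for the example and are not the skiplist/twoskip on-disk format; a real file would also carry pointers, which a replay like this simply ignores.

```python
import io
import struct

HEADER = struct.Struct(">II")   # key length, value length (hypothetical layout)
DELETE = 0xFFFFFFFF             # hypothetical marker: record deletes the key


def append_record(log, key, value):
    """Append one change to the end of the log; value=None means delete."""
    log.seek(0, io.SEEK_END)
    if value is None:
        log.write(HEADER.pack(len(key), DELETE) + key)
    else:
        log.write(HEADER.pack(len(key), len(value)) + key + value)


def ship_tail(log, shipped_upto):
    """Return the bytes written since the last shipment, plus the new offset."""
    log.seek(shipped_upto)
    tail = log.read()
    return tail, shipped_upto + len(tail)


def replay(raw):
    """Rebuild the state by walking the log and applying every change."""
    state = {}
    pos = 0
    while pos + HEADER.size <= len(raw):
        klen, vlen = HEADER.unpack_from(raw, pos)
        pos += HEADER.size
        if pos + klen > len(raw):
            break                       # torn record at the end of the file
        key = raw[pos:pos + klen]
        pos += klen
        if vlen == DELETE:
            state.pop(key, None)
            continue
        if pos + vlen > len(raw):
            break                       # torn record at the end of the file
        state[key] = raw[pos:pos + vlen]
        pos += vlen
    return state
```

The replica only ever appends the shipped tails, so after a hard crash it can run replay over whatever bytes it has and end up with the last fully shipped state.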

But mixing that in with sqlite3 is tricky, and it's even trickier if you want to
change to another backend.

Sqlite "INTEGER" types also cost at least 8 bytes each, so you're already spending
a lot of space or you're still packing bitfields to store flag information.

So I think I'm going to throw away everything I've done so far, and go back to
basics:

* 1 database file per user (or per top-level shared folder name for non-user folders)
* 1 mailboxes database file for the server
* 1 temporary data file for the server (aka: delivery.db, tls_cache, etc) - these don't need to be durable

* optional: writeback to object storage for EVERYTHING on commit, so that you never lose data in any server crash situation

Let's break this back a little bit:

1 database file per user:

- actually this is probably a couple of files, because there are at least three very distinct classes of data:
  * cache data
  * emails
  * index data (including annotations)
- multiple cache files per user - probably not even per-mailbox, but just for the entire user, with a repack strategy which keeps things in the order they're likely to be requested by clients.

It would be really nice to require indexes, but actually with a key-value format that allows prefix scans (cyrusdb_foreach) you can implement indexes very easily.  Sure it's more work than just writing SQL, but with transactions it's just as reliable if the code is good.  We'll be reconstructing those files in audit mode enough to be sure of that I hope :)
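
A minimal Python sketch of what "indexes on top of prefix scans" could look like. The OrderedKV class is a toy stand-in for a cyrusdb backend, its foreach plays the role of cyrusdb_foreach, and its commit applies a batch in one call as a stand-in for a transaction. The key layout ("M/<uid>" for records, "I/flag/<flag>/<uid>" for the index) is purely an assumption for illustration.

```python
import bisect


class OrderedKV:
    """Toy ordered key-value store with prefix scans."""

    def __init__(self):
        self._keys = []   # sorted list of keys
        self._data = {}

    def get(self, key, default=None):
        return self._data.get(key, default)

    def commit(self, puts=(), deletes=()):
        """Apply a batch of changes together (stand-in for a transaction)."""
        for key in deletes:
            if key in self._data:
                del self._data[key]
                self._keys.pop(bisect.bisect_left(self._keys, key))
        for key, value in puts:
            if key not in self._data:
                bisect.insort(self._keys, key)
            self._data[key] = value

    def foreach(self, prefix):
        """Yield (key, value) for every key starting with prefix, in key order."""
        i = bisect.bisect_left(self._keys, prefix)
        while i < len(self._keys) and self._keys[i].startswith(prefix):
            yield self._keys[i], self._data[self._keys[i]]
            i += 1


def index_key(flag, uid):
    # Hypothetical index layout: one empty-valued key per (flag, uid) pair.
    return b"I/flag/" + flag + b"/%08d" % uid


def set_flags(db, uid, flags):
    """Store a message's flags and keep the by-flag index in step with it."""
    rec_key = b"M/%08d" % uid
    old = set(db.get(rec_key, b"").split())
    new = set(flags)
    db.commit(
        puts=[(rec_key, b" ".join(sorted(new)))]
        + [(index_key(f, uid), b"") for f in new - old],
        deletes=[index_key(f, uid) for f in old - new],
    )


def uids_with_flag(db, flag):
    """Answer 'which messages have this flag' with a single prefix scan."""
    prefix = b"I/flag/" + flag + b"/"
    return [int(key[len(prefix):]) for key, _ in db.foreach(prefix)]
```

Because the record and its index keys change in the same commit, an audit pass can always rebuild the "I/" keys from the "M/" records and compare.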

...

If twoskip is too slow (possible), then I've been quite interested in looking at rocksdb (http://rocksdb.org/) as an embedded engine that has really good performance, prefix scanning, and a good community around it.  It's also quite compatible with object storage because all but the level0 "hot" databases are read-only, so you can store them as objects once and then not need to scan them again.

An alternative there is multi-level databases in the same way we have the search tiers - with offline repack and substituting a new database with identical contents (minus dead records) atomically in the way that we do it with search.  This eliminates the stop-the-world repacks that occasionally hit us with both cyrus.index/cyrus.cache and all the twoskip/skiplist databases, because repack can be done in the background to new read-only files, with all writes happening to a small level0 database.
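
A minimal in-memory Python sketch of the tiered arrangement: all writes go to a small level0, older tiers are treated as read-only, and repack builds a new tier without dead records and swaps it in with a single assignment. The TieredDB class and its method names are hypothetical; in a real system the tiers would be files and the swap would be an atomic rename or similar.

```python
class TieredDB:
    """Toy multi-level store: writable level0 on top of read-only tiers."""

    def __init__(self):
        self.level0 = {}   # small writable tier; a value of None marks a dead record
        self.tiers = []    # read-only tiers, newest first

    def put(self, key, value):
        self.level0[key] = value

    def delete(self, key):
        self.level0[key] = None   # tombstone hides older values in lower tiers

    def get(self, key):
        # Newest tier wins: check level0 first, then each read-only tier.
        for tier in [self.level0] + self.tiers:
            if key in tier:
                return tier[key]
        return None

    def freeze_level0(self):
        """Turn the current level0 into a read-only tier and start a fresh one."""
        self.tiers.insert(0, self.level0)
        self.level0 = {}

    def repack(self):
        """Merge the read-only tiers into one tier without dead records."""
        snapshot = list(self.tiers)
        merged = {}
        for tier in reversed(snapshot):      # oldest first, so newer values win
            merged.update(tier)
        merged = {k: v for k, v in merged.items() if v is not None}
        # Keep any tiers frozen after the snapshot was taken, then swap in
        # the merged tier in one assignment. level0 is never touched, so
        # writers are not blocked while the merge runs.
        newer = self.tiers[:len(self.tiers) - len(snapshot)]
        self.tiers = newer + [merged]
```

Dropping tombstones is safe here only because the merge covers every older tier; a partial merge would have to keep them.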

We already kinda do this in-memory for cyrus.index now, with a hash that gets looked up for every read.

And that's about where my thinking is :)  It's more work now, but it gets us to a fully object-storage-backable system a lot faster.  We could then have replication mainly be used to trigger a pull from object storage and heating of the same files so that failover was clean.

...

I still want a global mailboxes state database, which would be a distributed database rather than the current murder arrangement.  This is in ADDITION to the per-machine mailboxes.db, and would be read-only, along with a locking service which pinned each user/top-level-shared to a single machine in the cluster and a way to transfer individual locks or bulk blocks of locks between machines on failover.  Something like etcd/consul seems the right choice here.  This is definitely phase2, I'm just keeping it in mind as I design this change.

It is a massive change to the on-disk data formats!  We'd be left with basically:

* key value stores
* cache format (multiple fixed-length binary items per file with file number + offset addressing)
* rfc822 messages (either stick with one-file-per-message or do some MIX style multiple-per-file - this can be independent)

By making every database a key-value store (including the DAV databases - I would subsume them into the userdb) there's only the two data formats to even care about backing up - and there are tons of distributed key-value stores that could already be plugged in directly through the cyrusdb interface if you wanted to!

Bron.

-- 
  Bron Gondwana
  brong at fastmail.fm