cyrus.index version 14 and cyrus.cache upgrades

Bron Gondwana brong at fastmail.fm
Sun Apr 3 08:21:47 EDT 2016


(This is a discussion piece for tomorrow's meeting, which IS happening
at the regular 10pm Melbourne time.  That's now ANOTHER hour later for
everyone because of the timezone changes.  I haven't written any code
yet.)

On top of Robert's work to support libicu for charset conversion and
pick up all the rest of the character sets it supports, we need to make
some cache format changes.

I also have a user at FastMail with a 3.8 million message "Deleted
Messages" folder, and I can't keep manually splitting giant folders for
people just because their cyrus.cache file gets over 4 gig.

So I'm proposing the following changes:

cyrus.index version 14:

cyrus.index header:

* LAST_APPEND_DATE: 32 bit => 64 bit time_t
* POP3_LAST_LOGIN: 32 bit => 64 bit time_t
* LEAKED_CACHE: remove
* FIRST_EXPUNGED: 32 bit => 64 bit time_t
* LAST_REPACK_TIME: 32 bit => 64 bit time_t
* HEADER_FILE_CRC: remove
* RECENT_TIME: 32 bit => 64 bit time_t
* POP3_SHOW_AFTER: 32 bit => 64 bit time_t
* add UNIQUEID: 40 characters (enough space for a 36-character uuidgen
  UUID or whatever)
* add a bunch of space for non-fixed-width quotaroot and flag names.

By doing this, we no longer have a separate cyrus.header and
cyrus.index. We only have ONE file in which facts are stored (except
cyrus.annotations, but I have plans for that too).

If the non-fixed data gets too big then we create a new file called
cyrus.indexoverflow which contains just the non-fixed data.  This is
another 99%/1% case.  In 99% of cases we won't create enough non-fixed
data (e.g. long flag names) to fill the space.  If we fix the header
size at 2048 bytes, we save space in the common case of an almost empty
mailbox, while still working for huge mailboxes.

A mailbox options flag says whether the non-fixed data has to be read
from the cyrus.indexoverflow file.
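
A minimal sketch of that 99%/1% decision, in Python just to keep it
short.  The 2048 comes from above; FIXED_FIELDS_SIZE and
OPT_INDEX_OVERFLOW are made-up names and values for illustration, not
anything that exists in Cyrus today.

```python
HEADER_SIZE = 2048            # fixed header size floated in this proposal
FIXED_FIELDS_SIZE = 256       # HYPOTHETICAL: bytes used by the fixed-width fields
OPT_INDEX_OVERFLOW = 1 << 0   # HYPOTHETICAL: bit in the mailbox options field


def place_nonfixed(data: bytes, options: int):
    """Decide where the non-fixed data (quotaroot, flag names) lives.

    Returns (in_header_bytes, overflow_bytes_or_None, new_options).
    """
    room = HEADER_SIZE - FIXED_FIELDS_SIZE
    if len(data) <= room:
        # Common case: it fits in the header, zero-fill the rest and
        # make sure the overflow flag is clear.
        return data.ljust(room, b"\0"), None, options & ~OPT_INDEX_OVERFLOW
    # Rare case: leave the in-header area empty, put everything in
    # cyrus.indexoverflow and set the flag so readers know to look there.
    return b"\0" * room, data, options | OPT_INDEX_OVERFLOW
```

Readers would check the flag first and only open the overflow file when
it is set, so the common case stays at one file.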

ACL is no longer stored in this file.  It's not a property of the
mailbox in any meaningful way - it belongs out in mailboxes.db and the
next layer up (eventually).

mailboxname probably will get stored in the mailbox later, when we store
on disk by uniqueid, but that's another yak to shave.


cyrus.index record:

* INTERNALDATE: change 32 bit => 64 bit time_t
* GMTIME: change 32 bit => 64 bit time_t
* SENTDATE: remove (moved to cache)
* HEADER_SIZE: remove (moved to cache)
* LAST_UPDATED: change 32 bit => 64 bit time_t
* CONTENT_LINES: remove (moved to cache)
* CACHE_CRC: remove (moved to cache)
* CACHE_VERSION: remove (moved to cache)
* Add: CACHE_FILE_NUMBER (32 bit)

Basically I want to remove from cyrus.index everything that's derived
from the message, except GMTIME.  cyrus.index is about remembering
FACTS about the mailbox which aren't available anywhere else.  It's
very important data.

cyrus.cache is all re-creatable from the raw messages.

The reason to keep GMTIME is that it's quite common to SORT by sent
date, and making that possible without loading cache is a worthwhile
optimisation.

...

cyrus.cache format changes:

1) there's a section in the unstructured data for CACHEACTIVE,
   which contains a list of (NUM VERSION FLAGS SIZE DIRTYBYTES) -
   probably binary encoded to save space as b32 b16 b16 b32 b32 =>
   128 bits per file.

   e.g. (3 5 0 1894322 1647)
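
   As a rough sketch, the b32 b16 b16 b32 b32 layout maps directly onto
   a 16 byte struct.  Big-endian byte order is an assumption here, and
   the helper names are invented for the example.

```python
import struct

# NUM (b32), VERSION (b16), FLAGS (b16), SIZE (b32), DIRTYBYTES (b32).
# ASSUMPTION: big-endian, no alignment padding.
CACHEACTIVE_ENTRY = struct.Struct(">IHHII")


def pack_cacheactive(entries):
    """Encode a list of (num, version, flags, size, dirtybytes) tuples."""
    return b"".join(CACHEACTIVE_ENTRY.pack(*entry) for entry in entries)


def unpack_cacheactive(blob):
    """Decode the CACHEACTIVE section back into a list of tuples."""
    return list(CACHEACTIVE_ENTRY.iter_unpack(blob))
```

   Packing the example entry (3 5 0 1894322 1647) gives 16 bytes, i.e.
   128 bits per file.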

2) each cyrus.cache file starts with the NUM VERSION FLAGS triple, and
   maybe even the SIZE and DIRTYBYTES as well; it wouldn't hurt to
   update them after appending new records.

3) each cyrus.cache record has structure:
   * CACHE_ITEM_LEN 32 bit
   * CACHE_VERSION 32 bit
   * SENTDATE 64 bit time_t
   * HEADER_SIZE 32 bit
   * CONTENT_LINES 32 bit
   * (existing fields with their individual structure)
   * <pad to multiple of 8 bytes>
   * CACHE_ITEM_CRC32 32 bit
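
   A sketch of how one record could be packed and checked.  Several
   things are assumptions because the list above doesn't pin them down:
   big-endian byte order, CACHE_ITEM_LEN counting the whole record
   including the CRC, the padding applying to everything before the
   CRC, and the CRC covering everything before it.  The existing fields
   are treated as an opaque byte string.

```python
import struct
import zlib

# CACHE_ITEM_LEN, CACHE_VERSION, SENTDATE (64 bit), HEADER_SIZE, CONTENT_LINES
RECORD_HEAD = struct.Struct(">IIqII")   # 24 bytes, ASSUMED big-endian


def pack_cache_record(version, sentdate, header_size, content_lines, fields):
    """Build one cache record around the already-encoded existing fields."""
    body_len = RECORD_HEAD.size + len(fields)
    pad = -body_len % 8                  # pad up to a multiple of 8 bytes
    item_len = body_len + pad + 4        # ASSUMPTION: length includes the CRC
    head = RECORD_HEAD.pack(item_len, version, sentdate,
                            header_size, content_lines)
    payload = head + fields + b"\0" * pad
    crc = zlib.crc32(payload) & 0xFFFFFFFF
    return payload + struct.pack(">I", crc)


def check_cache_record(record):
    """Return True if the length field and trailing CRC32 both match."""
    if len(record) < RECORD_HEAD.size + 4:
        return False
    (item_len,) = struct.unpack_from(">I", record, 0)
    if item_len != len(record):
        return False
    (stored,) = struct.unpack_from(">I", record, item_len - 4)
    return stored == (zlib.crc32(record[:-4]) & 0xFFFFFFFF)
```

   Having the CRC inside the record means a cache record can be
   validated on its own, without going back to cyrus.index.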


On disk the file names are cyrus.cache.N, e.g. cyrus.cache.3

New records are always added to the FIRST active cache file that matches
the criteria of the record, i.e. if the record is ARCHIVED then it goes
to the first cache file with the ARCHIVE bit set.

If a cache file gets too big (compile time option, probably 100
megabytes or so) then a new file with the next unused number gets
created and added to the start of the list.
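
Roughly, picking the file for an append could look like the sketch
below.  The CacheFile class, the FLAG_ARCHIVE value and the "highest
active number plus one" way of finding an unused number are all
assumptions for illustration; "matches the criteria" is reduced to an
exact match on FLAGS.

```python
from dataclasses import dataclass

MAX_CACHE_SIZE = 100 * 1024 * 1024   # the "100 megabytes or so" limit
FLAG_ARCHIVE = 1 << 0                # HYPOTHETICAL bit value


@dataclass
class CacheFile:
    num: int
    version: int
    flags: int
    size: int
    dirtybytes: int


def pick_cache_file(active, want_flags, record_len, version):
    """Return the cache file a new record should be appended to.

    `active` is the CACHEACTIVE list, newest file first.  If the first
    matching file is missing or would grow past the limit, a new file is
    created and put at the start of the list.
    """
    for cf in active:
        if cf.flags == want_flags:
            if cf.size + record_len <= MAX_CACHE_SIZE:
                return cf
            break                        # first match is full: roll over
    # ASSUMPTION: "next unused number" is one past the highest active number.
    next_num = max((cf.num for cf in active), default=0) + 1
    fresh = CacheFile(next_num, version, want_flags, 0, 0)
    active.insert(0, fresh)
    return fresh
```

Because new files go to the start of the list, the first match is always
the newest file of that kind, so older full files are never appended to
again.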

During cyr_expire, if a cache file is more than a configured amount
"dirty" then its records get copied to a newer file and their associated
index records are updated to the new locations.  Once the old file is
unreferenced, it can be safely deleted.
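
A sketch of the bookkeeping side of that, using DIRTYBYTES and SIZE from
the CACHEACTIVE entry.  The 50% threshold, the function names and the
dict-shaped index records are assumptions for the example, and the
actual copying of record bytes between files is left out.

```python
DIRTY_THRESHOLD = 0.5   # HYPOTHETICAL: the "configured amount"


def needs_rewrite(size, dirtybytes, threshold=DIRTY_THRESHOLD):
    """True if more than `threshold` of the cache file is dead space."""
    return size > 0 and dirtybytes / size > threshold


def repoint_records(index_records, old_num, new_num, new_size=0):
    """Move live index records from cache file old_num to new_num.

    Each record is a dict with cache_file, cache_offset and cache_len.
    Records are laid out back to back starting at new_size, and the new
    size of the target file is returned.
    """
    offset = new_size
    for rec in index_records:
        if rec["cache_file"] != old_num:
            continue
        rec["cache_file"] = new_num
        rec["cache_offset"] = offset
        offset += rec["cache_len"]
    return offset


def is_unreferenced(index_records, num):
    """True once no index record points at cache file `num`."""
    return all(rec["cache_file"] != num for rec in index_records)
```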

During a normal repack, if most records are being kept, then the
cyrus.cache files will be untouched, saving on IO.

.....

This is all backwards compatible.  Earlier cyrus.index versions will
write just a single cache file.  The upgrade and downgrade facilities
will still work, and convert just fine.  All the existing reading code
will stay.

I'll convert Robert's cache format change code to also be able to write
the old style (or "unknown" if the charset isn't one of the ones with a
numeric code) values for old cache files.

Woohoo.  No more 64 bit nastiness, reduced cache IO in the common case,
and a savings of 4096 bytes (one file) per mailbox from the super-hot
index location in the common case.

Bron.


-- 
  Bron Gondwana
  brong at fastmail.fm

