cyrus.index version 14 and cyrus.cache upgrades
Bron Gondwana
brong at fastmail.fm
Mon Apr 4 06:18:59 EDT 2016
Update: the meeting is happening at 9pm Melbourne time! I see I set it to always be at the same UTC time.
On Sun, Apr 3, 2016, at 22:21, Bron Gondwana via Cyrus-devel wrote:
> (this is a discussion piece for talking about in tomorrow's meeting,
> which IS happening at the regular 10pm Melbourne time - that's now
> ANOTHER hour later for everyone due to timezones changing. I
> haven't written any code yet)
>
> On top of Robert's work to support libicu for charset conversion and
> pick up all the rest of the character sets it supports, we need to make
> some cache format changes.
>
> I also have a user at FastMail with a 3.8 million message "Deleted
> Messages" folder, and I can't keep manually splitting giant folders for
> people just because their cyrus.cache file gets over 4 gig.
>
> So I'm proposing the following changes:
>
> cyrus.index version 14:
>
> cyrus.index header:
>
> * LAST_APPEND_DATE: 32 bit => 64 bit time_t
> * POP3_LAST_LOGIN: 32 bit => 64 bit time_t
> * LEAKED_CACHE: remove
> * FIRST_EXPUNGED: 32 bit => 64 bit time_t
> * LAST_REPACK_TIME: 32 bit => 64 bit time_t
> * HEADER_FILE_CRC: remove
> * RECENT_TIME: 32 bit => 64 bit time_t
> * POP3_SHOW_AFTER: 32 bit => 64 bit time_t
> * add UNIQUEID: 40 characters (enough space for a uuidgen UUID or
> whatever)
> * add a bunch of space for un-fixed-width quotaroot and flag names.
>
> By doing this, we no longer have a separate cyrus.header and
> cyrus.index. We only have ONE file in which facts are stored (except
> cyrus.annotations, but I have plans for that too).
>
> If the non-fixed data gets too big then we create a new file called
> cyrus.indexoverflow which contains just the non-fixed data. This is
> another 99%/1% case. In 99% of cases there won't be enough non-fixed
> data (e.g. long flag names) to fill the space. If we fix the header size
> at 2048 bytes,
> we save in the common case of an almost empty mailbox, while still
> working for huge mailboxes.
>
> There's a mailbox options flag to say to read from the
> indexoverflow file.
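
A minimal sketch of the overflow decision described above, assuming a 2048-byte header. The size of the fixed fields (`fixed_len`), the option bit `OPT_INDEX_OVERFLOW` and the function name are all hypothetical; the proposal does not define them.

```python
HEADER_SIZE = 2048          # proposed fixed header size
OPT_INDEX_OVERFLOW = 0x1    # hypothetical mailbox option bit


def place_variable_data(fixed_len, variable, options=0):
    """Decide where the non-fixed data (quotaroot, flag names) lives.

    Returns (in_header_bytes, overflow_bytes_or_None, new_options).
    """
    room = HEADER_SIZE - fixed_len
    if len(variable) <= room:
        # Common case: everything fits, pad the header out to full size.
        in_header = variable + b"\0" * (room - len(variable))
        return in_header, None, options & ~OPT_INDEX_OVERFLOW
    # Rare case: all non-fixed data moves to cyrus.indexoverflow and the
    # option bit tells readers to look there.
    return b"\0" * room, variable, options | OPT_INDEX_OVERFLOW
```

Moving all of the non-fixed data at once, rather than splitting it across the two files, follows the wording "contains just the non-fixed data".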
>
> ACL is no longer stored in this file. It's not a property of the
> mailbox in any meaningful way - it belongs out in mailboxes.db and the
> next layer up (eventually).
>
> mailboxname probably will get stored in the mailbox later, when we store
> on disk by uniqueid, but that's another yak to shave.
>
>
> cyrus.index record:
>
> * INTERNALDATE: change 32 bit => 64 bit time_t
> * GMTIME: change 32 bit => 64 bit time_t
> * SENTDATE: remove (moved to cache)
> * HEADER_SIZE: remove (moved to cache)
> * LAST_UPDATED: change 32 bit => 64 bit time_t
> * CONTENT_LINES: remove (moved to cache)
> * CACHE_CRC: remove (moved to cache)
> * CACHE_VERSION: remove (moved to cache)
> * Add: CACHE_FILE_NUMBER (32 bit)
>
> Basically I want to remove from cyrus.index everything that's derived
> from the message, except GMTIME. cyrus.index is about remembering FACTS
> about the mailbox which aren't available anywhere else. It's very
> important data.
>
> cyrus.cache is all re-creatable from the raw messages.
>
> The reason to keep gmtime is that it's quite common to SORT by sent
> date, and making that possible without loading cache is a worthwhile
> optimisation.
>
> ...
>
> cyrus.cache format changes:
>
> 1) there's a section in the unstructured data for CACHEACTIVE,
> which contains a list of (NUM VERSION FLAGS SIZE DIRTYBYTES) -
> probably binary encoded to save space as b32 b16 b16 b32 b32 =>
> 128 bits per file.
>
> e.g. (3 5 0 1894322 1647)
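
The 128-bit entry described above can be sketched with Python's `struct` module. Big-endian (network) byte order is an assumption here, as are the names `CACHEACTIVE_ENTRY`, `pack_entry` and `unpack_entries`; only the field order and widths come from the proposal.

```python
import struct

# NUM (32 bit), VERSION (16), FLAGS (16), SIZE (32), DIRTYBYTES (32).
# ">" selects big-endian with no alignment padding (byte order assumed).
CACHEACTIVE_ENTRY = struct.Struct(">IHHII")


def pack_entry(num, version, flags, size, dirtybytes):
    """Encode one CACHEACTIVE entry as 16 bytes."""
    return CACHEACTIVE_ENTRY.pack(num, version, flags, size, dirtybytes)


def unpack_entries(buf):
    """Decode a concatenated run of entries back into tuples."""
    return list(CACHEACTIVE_ENTRY.iter_unpack(buf))


entry = pack_entry(3, 5, 0, 1894322, 1647)
```

A 32-bit SIZE is enough as long as each cache file stays under 4 GB, which the per-file size cap proposed later would ensure.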
>
> 2) each cyrus.cache file starts with the NUM VERSION FLAGS triple, and
> maybe even the SIZE and DIRTYBYTES as well, it wouldn't hurt to
> update them after appending new records.
>
> 3) each cyrus.cache record has structure:
> * CACHE_ITEM_LEN 32 bit
> * CACHE_VERSION 32 bit
> * SENTDATE 64 bit time_t
> * HEADER_SIZE 32 bit
> * CONTENT_LINES 32 bit
> * (existing fields with their individual structure)
> * <pad to multiple of 8 bytes>
> * CACHE_ITEM_CRC32 32 bit
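
As a rough illustration of the record layout above, here is a sketch that builds and verifies one record. Several details are assumptions, not part of the proposal: big-endian byte order, CACHE_ITEM_LEN counting the whole record including the CRC, the CRC covering everything before it, and zero bytes as padding. The existing per-field structure is treated as an opaque byte string.

```python
import struct
import zlib

# CACHE_ITEM_LEN, CACHE_VERSION, SENTDATE (64 bit), HEADER_SIZE, CONTENT_LINES
RECORD_HEADER = struct.Struct(">IIqII")  # 24 bytes, byte order assumed


def build_record(cache_version, sentdate, header_size, content_lines, fields):
    """Assemble header + existing fields + padding + CRC32."""
    unpadded = RECORD_HEADER.size + len(fields)
    pad = -unpadded % 8                 # pad up to a multiple of 8 bytes
    item_len = unpadded + pad + 4       # assumption: length includes the CRC
    body = RECORD_HEADER.pack(item_len, cache_version, sentdate,
                              header_size, content_lines)
    body += fields + b"\0" * pad
    crc = zlib.crc32(body) & 0xFFFFFFFF
    return body + struct.pack(">I", crc)


def check_record(record):
    """Return True if the length field and trailing CRC32 both match."""
    (item_len,) = struct.unpack_from(">I", record, 0)
    if item_len != len(record):
        return False
    (stored,) = struct.unpack(">I", record[-4:])
    return zlib.crc32(record[:-4]) & 0xFFFFFFFF == stored
```

Keeping the version and CRC inside each record means a record can be validated without consulting cyrus.index, which fits the goal of making the cache fully re-creatable.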
>
>
> On disk the file names are cyrus.cache.N, e.g. cyrus.cache.3
>
> New records are always added to the FIRST active cache file that matches
> the criteria of the record, i.e. if it's ARCHIVED then the first cache
> file with the ARCHIVE bit set.
>
> If a cache file gets too big (compile time option, probably 100
> megabytes or so) then a new file with the next unused number gets
> created and added to the start of the list.
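
The selection and rollover rules above could look roughly like the following. This is a sketch under assumptions: `CacheFile`, `pick_cache_file` and the `ARCHIVE` bit value are hypothetical names, "matches the criteria" is simplified to exact equality of FLAGS, and "next unused number" is taken to mean one more than the highest number in the active list.

```python
from dataclasses import dataclass

ARCHIVE = 0x1                          # hypothetical flag bit
MAX_CACHE_SIZE = 100 * 1024 * 1024     # "100 megabytes or so"


@dataclass
class CacheFile:
    num: int
    version: int
    flags: int
    size: int
    dirtybytes: int = 0


def pick_cache_file(active, want_flags, record_len, version,
                    max_size=MAX_CACHE_SIZE):
    """Return the cache file a new record should be appended to."""
    for cf in active:
        if cf.flags == want_flags:
            if cf.size + record_len <= max_size:
                return cf
            break  # first match is full: roll over to a new file
    next_num = max((cf.num for cf in active), default=0) + 1
    new = CacheFile(next_num, version, want_flags, 0)
    active.insert(0, new)  # new files go to the start of the list
    return new
```

Because a new file is inserted at the front, it becomes the first match for later appends with the same flags, so older files only shrink in relevance over time.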
>
> During cyr_expire, if a cache file is more than a configured amount
> "dirty" then the records get copied to a newer file and their associated
> index records updated to the new locations. Once the old file is
> unreferenced, it can be safely deleted.
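
One way to read "more than a configured amount dirty" is as a fraction of DIRTYBYTES over SIZE. The sketch below uses that interpretation; the threshold value, the function names and the use of a ratio rather than an absolute byte count are all assumptions. Entries use the (NUM VERSION FLAGS SIZE DIRTYBYTES) tuple form from the CACHEACTIVE list.

```python
def dirty_fraction(size, dirtybytes):
    """Fraction of a cache file occupied by unreferenced records."""
    return dirtybytes / size if size else 0.0


def files_to_repack(entries, threshold):
    """Return the NUMs of cache files whose dirty fraction exceeds threshold.

    entries: iterable of (num, version, flags, size, dirtybytes) tuples.
    """
    return [e[0] for e in entries if dirty_fraction(e[3], e[4]) > threshold]


entries = [(3, 5, 0, 1894322, 1647), (2, 5, 0, 1000, 900)]
candidates = files_to_repack(entries, threshold=0.5)
```

With these example entries, only file 2 qualifies, since file 3 is less than 0.1% dirty.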
>
> During a normal repack, if most records are being kept, then the
> cyrus.cache files will be untouched, saving on IO.
>
> .....
>
> This is all backwards compatible. Earlier cyrus.index versions will
> write just a single cache file. The upgrade and downgrade facilities
> will still work, and convert just fine. All the existing reading code
> will stay.
>
> I'll convert Robert's cache format change code to also be able to write
> the old style (or "unknown" if the charset isn't one of the ones with a
> numeric code) values for old cache files.
>
> Woohoo. No more 64 bit nastiness, reduced cache IO in the common case,
> and a savings of 4096 bytes (one file) per mailbox from the super-hot
> index location in the common case.
>
> Bron.
>
>
> --
> Bron Gondwana
> brong at fastmail.fm
--
Bron Gondwana
brong at fastmail.fm