Cyrus database and file usage data

Wed Jun 8 02:22:51 EDT 2016

I'm publishing this to give an idea of the relative sizes of the data
on our systems.  It should help us decide how to change the internal
data structures for JMAP support, and how to simplify the internal
databases to better support object storage and distributed architectures.

Here's what exists on my mail server for my user:

*config directory (ssd):*

./sieve/domain/f/fastmail.fm/b/brong/websieve.bc
./sieve/domain/f/fastmail.fm/b/brong/defaultbc
./sieve/domain/f/fastmail.fm/b/brong/websieve.script

./domain/f/fastmail.fm/user/b/brong.xapianactive
./domain/f/fastmail.fm/user/b/brong.seen
./domain/f/fastmail.fm/user/b/brong.sub
./domain/f/fastmail.fm/user/b/brong.dav-journal
./domain/f/fastmail.fm/user/b/brong.conversations
./domain/f/fastmail.fm/user/b/brong.counters
./domain/f/fastmail.fm/user/b/brong.dav

./domain/f/fastmail.fm/quota/b/user.brong

*search partition (disk: raid1):*

./domain/f/fastmail.fm/b/user/brong/xapian.55/flintlock
./domain/f/fastmail.fm/b/user/brong/xapian.55/record.baseB
./domain/f/fastmail.fm/b/user/brong/xapian.55/iamchert
./domain/f/fastmail.fm/b/user/brong/xapian.55/position.baseB
./domain/f/fastmail.fm/b/user/brong/xapian.55/postlist.baseA
./domain/f/fastmail.fm/b/user/brong/xapian.55/position.DB
./domain/f/fastmail.fm/b/user/brong/xapian.55/cyrus.indexed.db
./domain/f/fastmail.fm/b/user/brong/xapian.55/postlist.DB
./domain/f/fastmail.fm/b/user/brong/xapian.55/termlist.baseA
./domain/f/fastmail.fm/b/user/brong/xapian.55/termlist.baseB
./domain/f/fastmail.fm/b/user/brong/xapian.55/postlist.baseB
./domain/f/fastmail.fm/b/user/brong/xapian.55/record.baseA
./domain/f/fastmail.fm/b/user/brong/xapian.55/termlist.DB
./domain/f/fastmail.fm/b/user/brong/xapian.55/record.DB
./domain/f/fastmail.fm/b/user/brong/xapian.55/position.baseA
[...]

*hot data partition (ssd):*

./domain/f/fastmail.fm/b/user/brong/cyrus.annotations
./domain/f/fastmail.fm/b/user/brong/cyrus.cache
./domain/f/fastmail.fm/b/user/brong/cyrus.header
./domain/f/fastmail.fm/b/user/brong/cyrus.index
./domain/f/fastmail.fm/b/user/brong/1.
[...]

*Long term data partition (disk: raid6):*

./domain/f/fastmail.fm/b/user/brong/cyrus.cache
./domain/f/fastmail.fm/b/user/brong/1.
[...]

*There is also data related to me in the following global files
on the ssd:*

./annotations.db (calendar metadata and specialuse ~= 30 entries)
./caldav_alarm.sqlite3 (none right now, one entry per future event
with an alarm)
./mailboxes.db (one entry per mailbox plus $RACL entries for shared
mailboxes ~= 200 entries)

================

*metadata disk usage:*

lrwxrwxrwx  1 cyrus mail   11 Feb 15  2014 defaultbc -> websieve.bc
-rw-------  1 cyrus mail  16K May 19 20:24 websieve.bc
-rw-------  1 cyrus mail  13K May 19 20:24 websieve.script
-rw------- 1 cyrus mail 335M Jun  7 19:55 ./domain/f/fastmail.fm/user/b/brong.conversations
-rw------- 1 cyrus mail   88 Jun  7 19:55 ./domain/f/fastmail.fm/user/b/brong.counters
-rw------- 1 cyrus mail 1.1M Jun  7 19:39 ./domain/f/fastmail.fm/user/b/brong.dav
-rw------- 1 cyrus mail  52K Jun  7 08:50 ./domain/f/fastmail.fm/user/b/brong.seen
-rw------- 1 cyrus mail 1.2K Apr 13 22:20 ./domain/f/fastmail.fm/user/b/brong.sub
-rw------- 1 cyrus mail   41 Jun  7 01:48 ./domain/f/fastmail.fm/user/b/brong.xapianactive
-rw------- 1 cyrus mail   63 Jun  7 19:57 ./domain/f/fastmail.fm/quota/b/user.brong

counters, seen, sub and xapianactive are all tiny.  They are all flat
files except seen, which is twoskip.

dav is moderately sized at 1.1M, and is an sqlite database.

conversations is massive.  It's a twoskip database.

sieve script is pretty tiny and it's a flat file/single blob.

*My entire search archive:*
4.0G    search-archive/domain/f/fastmail.fm/b/user/brong/
766M    search/domain/f/fastmail.fm/b/user/brong/

That's all Xapian Chert files plus a small twoskip file per index
(similar format to the seen db):
-rw------- 1 cyrus mail 3.8K Jun  7 01:48 search/./domain/f/fastmail.fm/b/user/brong/xapian.55/cyrus.indexed.db

*SSD:*
Total usage: 893M
cyrus.index: 71M
cyrus.cache: 64M (only has cache for spool files on the SSD)
cyrus.header: 504K
cyrus.annotations: 376M (includes previews for EVERYTHING)
emails: 382M

*Disk:*
Total usage: 13G
cyrus.cache: 1.1G
emails: 12G

*Entire shared files on global ssd:*
1.6M    annotations.db
320K    caldav_alarm.sqlite3
26M     mailboxes.db
*(but barely any of that is mine)*

The most surprising thing standing out to me is the massive amount of
space used by my search archives.  This isn't normal, because for my
entire store, we have:

791G  /mnt/i30t01 (entire disk usage from df -h)
4.2G /mnt/ssd30/sloti30t01/store23/conf (all the non-spool per-user and
  shared files)
15G /mnt/ssd30/sloti30t01/store23/spool (this week's SSD spool)
33G /mnt/i30search/sloti30t01/store23/search
77G /mnt/i30search/sloti30t01/store23/search-archive

Which means that my search-archive (4.0G of 77G) is more than 1/20 of
the total search-archive usage across the 752 users on that store.  For
the average user I would expect search to be closer to 15% of their
total spool size.

Anyway, here's my chart:

By far the bulk is the actual email, which is good.  These are raw
RFC822 blobs.  The next biggest thing is cyrus.cache, which accounts for
about 10% of the size of the emails.  cyrus.annotations is 3%,
conversations 2%, cyrus.index 0.5% and the rest is basically nothing.
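
As a rough check of those proportions, here is a small Python sketch that
recomputes each file's size relative to the raw email, using the rounded
figures quoted above (SSD and disk added together).  The dictionary and
function names are made up for illustration, and because the inputs are
already rounded the results are only approximate.

```python
# Sizes in MB, taken from the rounded "SSD" and "Disk" figures above.
SIZES_MB = {
    "emails": 12 * 1024 + 382,         # 12G on disk + 382M on SSD
    "cyrus.cache": 1.1 * 1024 + 64,    # 1.1G on disk + 64M on SSD
    "cyrus.annotations": 376,
    "conversations": 335,
    "cyrus.index": 71,
}


def share_of_email(name):
    """Size of one item as a percentage of the raw email size."""
    return 100.0 * SIZES_MB[name] / SIZES_MB["emails"]


for name in SIZES_MB:
    if name != "emails":
        print(f"{name:20s} {share_of_email(name):5.1f}% of email size")
```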

*Data Churn*

rfc822 files are written once and never changed - they are immutable
after creation.

search and cache are both "cache" really - they're generated from the
RFC822 files and are disposable.

cyrus.annotations is an interesting thing, because in theory it can
store things that aren't cache (I use it for the last-alarm-time in
caldav alarm now), but in practice at FastMail it's mostly a preview
that's generated from the message file at delivery, aka - cache.  It
would be good to split these two things out and store the preview in the
cache to keep cyrus.annotations small.

Conversations is entirely generated from the message files, and is
basically cache.  You can recreate it with ctl_conversationsdb -R.

Which means that the next thing that's actually "real user-generated
information" is cyrus.index at 0.5% or 71M.  Let's look inside
cyrus.index:

#define OFFSET_UID 0
#define OFFSET_INTERNALDATE 4
#define OFFSET_SENTDATE 8
#define OFFSET_SIZE 12
#define OFFSET_HEADER_SIZE 16
#define OFFSET_GMTIME 20
#define OFFSET_CACHE_OFFSET 24
#define OFFSET_LAST_UPDATED 28
#define OFFSET_SYSTEM_FLAGS 32
#define OFFSET_USER_FLAGS 36
#define OFFSET_CONTENT_LINES 52 /* added for nntpd */
#define OFFSET_CACHE_VERSION 56
#define OFFSET_MESSAGE_GUID 60
#define OFFSET_MODSEQ 80 /* CONDSTORE (64-bit modseq) */
#define OFFSET_THRID 88       /* conversation id, added in v13 */
#define OFFSET_CACHE_CRC 96 /* CRC32 of cache record */
#define OFFSET_RECORD_CRC 100

each record is 104 bytes long, of which:

UID is the key
INTERNALDATE is somewhat primary (we also store it as mtime on the spool
file - it's immutable at least in theory)
SENTDATE is cache (taken from rfc822 header)
SIZE is cache (taken from rfc822 file size)
HEADER_SIZE is cache
GMTIME is cache
CACHE_OFFSET is mutable, but it's just a pointer
LAST_UPDATED is primary, but not user visible - it's internal
recordkeeping
SYSTEM_FLAGS is primary and important user facts (plus some
recordkeeping, the bitspace is partitioned)
USER_FLAGS is primary and important user facts
CONTENT_LINES is cache
CACHE_VERSION is internal recordkeeping
MESSAGE_GUID is cache/sanity checking - it's the sha1sum of the
rfc822 file
MODSEQ is primary / recordkeeping.
THRID is primary / recordkeeping but immutable (allocated at delivery
based on conversationsdb)
CACHE_CRC is recordkeeping
RECORD_CRC is recordkeeping

So of 104 bytes, only (4 + 4 + 16 + 8 + 8) => 40 bytes per message are
really important: UID, SYSTEM_FLAGS, USER_FLAGS, MODSEQ and THRID.
We're down to about 0.2% of the data now, or maybe 30MB for my entire
account (40/104 of the 71M cyrus.index).  Nice.
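
The byte counts can be checked mechanically.  The following Python sketch
derives each field's width from the offsets in the #define list (a field
ends where the next one starts, and the last one ends at the 104-byte
record size), then sums the fields classed as primary above.  The helper
and variable names are illustrative only and are not taken from the
Cyrus source.

```python
# Offsets copied from the #define list above.
OFFSETS = [
    ("UID", 0), ("INTERNALDATE", 4), ("SENTDATE", 8), ("SIZE", 12),
    ("HEADER_SIZE", 16), ("GMTIME", 20), ("CACHE_OFFSET", 24),
    ("LAST_UPDATED", 28), ("SYSTEM_FLAGS", 32), ("USER_FLAGS", 36),
    ("CONTENT_LINES", 52), ("CACHE_VERSION", 56), ("MESSAGE_GUID", 60),
    ("MODSEQ", 80), ("THRID", 88), ("CACHE_CRC", 96), ("RECORD_CRC", 100),
]
RECORD_SIZE = 104


def field_widths(offsets, record_size):
    """Width of a field = offset of the next field minus its own offset."""
    widths = {}
    for i, (name, start) in enumerate(offsets):
        end = offsets[i + 1][1] if i + 1 < len(offsets) else record_size
        widths[name] = end - start
    return widths


WIDTHS = field_widths(OFFSETS, RECORD_SIZE)

# The fields described above as the really important user facts.
PRIMARY = ["UID", "SYSTEM_FLAGS", "USER_FLAGS", "MODSEQ", "THRID"]
primary_bytes = sum(WIDTHS[name] for name in PRIMARY)

print(f"{primary_bytes} of {RECORD_SIZE} bytes are primary "
      f"({100.0 * primary_bytes / RECORD_SIZE:.0f}%)")
```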

============

Let's look inside the conversations database, because that's a high
churn random access file.  Remember it used 335MB.  A dump as a flat
text file is 149MB, which I segmented into three datasets:

149M    all.txt
82M    byconv.txt
20K    byfolder.txt
67M    msgids.txt

the folder counts are noise.  The "msgids" file is everything starting
with '<' and looks like this:

<!&!aaaaaaaaaaauaaaaaaaaagayjud6695iv/kkipqn4imbanltncjhprtfudq2lhcbs8e-
bacqa//8aabaaaabwur+oaqihrjg9v4un3qk2aqaaaaa=@spamreducer.eu>    0
d0ca2202f453ad68 1460600647
<!&!aaaaaaaaaaauaaaaaaaaaloc/y79rhhcoseanz1scukbamdcnoymubjpiolnqodg6d8-
aaaaavv8aabaaaabstmhn7+e+tyjbrspmov1raqaaaaa=@gunasekera.com.au> 0
d69877672a8c1a15 1460600464
<!&!aaaaaaaaaaauaaaaaaaaaloc/y79rhhcoseanz1scukbamdcnoymubjpiolnqodg6d8-
aaaaavv8aabaaaadxvsnegyytqbeoardfdwptaqaaaaa=@gunasekera.com.au> 0
d69877672a8c1a15 1460600464

(that's msgid, version, thrid, last-seen)

This file gets cleaned out of records more than 6 months old.
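
To make the record shape concrete, here is a minimal Python sketch that
parses one line of the msgids dump into its four fields and applies an
age cutoff.  It assumes the dump is whitespace-separated as shown above,
that thrid is hexadecimal and that last-seen is epoch seconds; the
six-month constant (183 days) and all names are illustrative guesses,
not the values or code Cyrus uses.

```python
from typing import NamedTuple


class MsgidRecord(NamedTuple):
    msgid: str
    version: int
    thrid: int       # 64-bit conversation id, printed as hex in the dump
    last_seen: int   # assumed to be seconds since the epoch


def parse_msgid_line(line):
    """Split from the right so the msgid itself is kept in one piece."""
    msgid, version, thrid, last_seen = line.rsplit(None, 3)
    return MsgidRecord(msgid, int(version), int(thrid, 16), int(last_seen))


SIX_MONTHS = 183 * 24 * 3600  # assumption: "6 months" taken as 183 days


def is_expired(record, now):
    """True if the record is old enough to be cleaned out."""
    return now - record.last_seen > SIX_MONTHS


# Hypothetical short msgid; the other fields are from the first sample line.
rec = parse_msgid_line("<example@spamreducer.eu>    0 d0ca2202f453ad68 1460600647")
print(rec.msgid, hex(rec.thrid), is_expired(rec, now=1465367000))
```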

The byconv file is all the keys starting with 'B'.

B00000741084435a6       0 (295339132740792 1 1 0 (0 0 0 0 0 0) ((78
295339132740792 1 1 0)) (("Subversion Commits" NIL fm-cvs list.krot.org
1276694410 1)) "*brong-bah-needstocreateanobject" 3731)
B0000087576508d8a       0 (295339132715691 1 1 0 (0 0 0 0 0 0) ((100
295339132715691 1 1 0)) ((NIL NIL sales softlayer.com 1384495901 1))
SoftLayerDutchHoldingsB.V.BillingAdvanceNotification 4343)
B000011630568958c       0 (295339132740628 1 1 0 (0 0 1 0 0 0) ((90
295339132740628 1 1 0)) (("Matt Rosser" NIL notification+[...]
facebookmail.com 1383195915 1)) Afternoonteam, 8568)
B00004e266b006ec2       0 (295339134881905 2 2 2 (0 0 2 0 0 0) ((59
295339134881905 2 2 2)) (("Dilger, Andreas" NIL andreas.dilger intel.com
1447465224 2)) fixe2fsck-fDdirectorytruncation 54816)
B00004f3a24fd8403       0 (295339132807848 4 4 0 (0 0 4 0 0 0) ((57
295339132807848 4 4 0)) (("Timo Sirainen" NIL tss iki.fi 1329786670 2)
(Robin NIL dovecot r.paypc.com 1329446783 1) (NIL NIL manuel.bertrand
gmail.com 1328617214 1)) "Possiblebrokenindexer(lucene/solr)?" 16883)
[...]

For every thrid this is a mapping to modseq, counts, per-flag
counts, per-folder counts, a complete list of senders in the
conversation, a compressed version of the subject, and the sum of the
sizes of all the messages.

Obviously any time a counted flag changes on any message, the
conversations database needs to update for it.

*THE PLAN[tm]*

For JMAP support, I'm going to discard the existing conversations DB and
create a sqlite database per user which contains everything of value.
I'm also going to subsume everything else (or at least the facts about
it) into this database, which will be called 'brong.userdb' and be
stored on the meta partition.

msgids: there's no need to keep the whole thing.  Normalise it, hash it,
store a 64 bit hash as the key in a table.  The metadata won't change -
thrid, timestamp.

Potentially this will be used to replace duplicate.db as well, since
it's basically a duplicate of the data.  I didn't mention duplicate.db
above, but it's yet another twoskip file (87MB for everyone) containing
a map from messageid to delivery information.  It's disposable, so
stored on tmpfs.
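
A minimal Python sketch of the "normalise it, hash it" step follows.
The normalisation rule (trim whitespace and angle brackets, lowercase)
and the choice of SHA-1 truncated to 8 bytes are assumptions made for
illustration; nothing above fixes either choice.  The key is returned as
a signed value because an SQLite INTEGER is a signed 64-bit number.

```python
import hashlib


def normalise_msgid(msgid):
    """Assumed normalisation: trim whitespace and <>, then lowercase."""
    return msgid.strip().strip("<>").lower()


def msgid_key(msgid):
    """64-bit key for a message-id, signed so it fits an SQLite INTEGER."""
    digest = hashlib.sha1(normalise_msgid(msgid).encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big", signed=True)


# Two spellings of the same (hypothetical) message-id give the same key.
print(msgid_key("<Foo.123@Example.com>") == msgid_key(" <foo.123@example.com> "))
```

A 64-bit hash can collide in principle, so a real table would need to
decide whether an occasional false match is acceptable for this data.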

folders:
id INTEGER PRIMARY KEY,  -- similar to the $FOLDER_NAMES in conversations.db
uniqueid TEXT (UNIQUE INDEX)
mboxname TEXT (INDEX)
deletedmodseq INTEGER -- 0 for not deleted
UNIQUE INDEX (mboxname, deletedmodseq)
-- F-key fields and statuscache fields:
modseq INTEGER (INDEX)
countsmodseq INTEGER (INDEX)
exists INTEGER
unseen INTEGER
convmodseq INTEGER
convexists INTEGER
convunseen INTEGER

messages:
folderid INTEGER
uid INTEGER
PRIMARY KEY (folderid, uid)
guid TEXT (INDEX)
thrid INTEGER (INDEX)
system_flags INTEGER (bitmask)

flagnames: (from all the cyrus.headers)
id INTEGER PRIMARY KEY
flag TEXT

messageflags: (many to many)
folderid INTEGER
uid INTEGER
flagid INTEGER

...
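
One way the tables listed so far could be written as real SQLite DDL is
sketched below, using Python's sqlite3 module.  The column names come
from the outline above; the NOT NULL constraints, the REFERENCES
clauses, the UNIQUE on flag, the primary key on messageflags, the index
names and the open_userdb() helper are all assumptions added for
illustration.  Note that "exists" has to be quoted because EXISTS is an
SQL keyword.

```python
import sqlite3

SCHEMA = """
CREATE TABLE folders (
    id            INTEGER PRIMARY KEY,
    uniqueid      TEXT NOT NULL UNIQUE,
    mboxname      TEXT NOT NULL,
    deletedmodseq INTEGER NOT NULL DEFAULT 0,  -- 0 for not deleted
    -- F-key fields and statuscache fields:
    modseq        INTEGER,
    countsmodseq  INTEGER,
    "exists"      INTEGER,
    unseen        INTEGER,
    convmodseq    INTEGER,
    convexists    INTEGER,
    convunseen    INTEGER,
    UNIQUE (mboxname, deletedmodseq)
);
CREATE INDEX folders_mboxname ON folders (mboxname);
CREATE INDEX folders_modseq ON folders (modseq);
CREATE INDEX folders_countsmodseq ON folders (countsmodseq);

CREATE TABLE messages (
    folderid     INTEGER NOT NULL REFERENCES folders (id),
    uid          INTEGER NOT NULL,
    guid         TEXT,
    thrid        INTEGER,
    system_flags INTEGER NOT NULL DEFAULT 0,   -- bitmask
    PRIMARY KEY (folderid, uid)
);
CREATE INDEX messages_guid ON messages (guid);
CREATE INDEX messages_thrid ON messages (thrid);

CREATE TABLE flagnames (
    id   INTEGER PRIMARY KEY,
    flag TEXT NOT NULL UNIQUE
);

CREATE TABLE messageflags (
    folderid INTEGER NOT NULL,
    uid      INTEGER NOT NULL,
    flagid   INTEGER NOT NULL REFERENCES flagnames (id),
    PRIMARY KEY (folderid, uid, flagid)
);
"""


def open_userdb(path=":memory:"):
    """Create the sketch schema in a new database and return the connection."""
    db = sqlite3.connect(path)
    db.executescript(SCHEMA)
    return db


db = open_userdb()
db.execute("INSERT INTO folders (id, uniqueid, mboxname) "
           "VALUES (1, 'abc123', 'user.brong')")
db.executemany(
    "INSERT INTO messages (folderid, uid, guid, thrid) VALUES (?, ?, ?, ?)",
    [(1, 1, "guid-a", 42), (1, 2, "guid-b", 42), (1, 3, "guid-c", 7)],
)
thread_size = db.execute(
    "SELECT COUNT(*) FROM messages WHERE thrid = ?", (42,)
).fetchone()[0]
print("messages in thread 42:", thread_size)
```

Since SQLite integers are signed 64-bit, a thrid with the top bit set
would need to be stored in a signed or textual form.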

Anyway, this will be landing soon as YET ANOTHER DATABASE that can be
enabled and will be created from mailbox_append_index_record and
mailbox_update_index_record.  To start, it will only be used for JMAP,
but maybe the performance will be good enough to move everything from
cyrus.header and cyrus.index files into this one database and remove
them entirely.  Even if not, this might become the primary source of
truth, with the cyrus.header and cyrus.index files existing purely as
cached copies of the data in here.

"messages" is still quite denormalised at the moment to be similar to
struct index_record.  All the bits which are identical between
messages (aka, gmtime) could be stored keyed by GUID instead, and if
we were willing to enforce shared flags between instances, we could
even store everything keyed by GUID and have a table that just mapped
folderid/uid to guid.
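
The fully GUID-keyed variant could look roughly like the Python/SQLite
sketch below: one row per distinct message, plus a thin table mapping
folderid/uid to guid.  The table names, the column selection and the
flag bit value are hypothetical, and the sketch assumes flags really are
shared between all instances of a message.

```python
import sqlite3

GUID_SCHEMA = """
CREATE TABLE guids (
    guid         TEXT PRIMARY KEY,
    thrid        INTEGER,
    gmtime       INTEGER,
    system_flags INTEGER NOT NULL DEFAULT 0   -- shared by every instance
);
CREATE TABLE instances (
    folderid INTEGER NOT NULL,
    uid      INTEGER NOT NULL,
    guid     TEXT NOT NULL REFERENCES guids (guid),
    PRIMARY KEY (folderid, uid)
);
CREATE INDEX instances_guid ON instances (guid);
"""

EXAMPLE_FLAG = 0x10  # illustrative bit only, not a real Cyrus constant


def demo():
    """Store one message in two folders, set a flag once, read both copies."""
    db = sqlite3.connect(":memory:")
    db.executescript(GUID_SCHEMA)
    guid = "a" * 40  # placeholder for a sha1 hex digest
    db.execute("INSERT INTO guids (guid, thrid) VALUES (?, ?)", (guid, 1))
    db.executemany("INSERT INTO instances VALUES (?, ?, ?)",
                   [(1, 10, guid), (2, 7, guid)])
    # A single update changes the flag for every folder the message is in.
    db.execute("UPDATE guids SET system_flags = system_flags | ? "
               "WHERE guid = ?", (EXAMPLE_FLAG, guid))
    return db.execute(
        "SELECT i.folderid, i.uid, g.system_flags "
        "FROM instances i JOIN guids g ON g.guid = i.guid "
        "ORDER BY i.folderid"
    ).fetchall()


print(demo())
```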

Bron.

--
Bron Gondwana
brong at fastmail.fm
-------------- next part --------------
A non-text attachment was scrubbed...
Name: cyrusdata.png
Type: image/png
Size: 21432 bytes
Desc: not available
URL: <http://lists.andrew.cmu.edu/pipermail/cyrus-devel/attachments/20160608/7f6a616a/attachment-0001.png>