What's in the conversations DB, and plans for conversation splitting

Bron Gondwana brong at fastmail.fm
Thu Feb 9 00:02:08 EST 2017

I've been trying to rebuild conversations databases at FastMail to add the new pre-sorted thread data, and I'm hitting problems with one user who has a 50,000 message conversation.  Yes, really.

We've talked about splitting conversations at a reasonable size limit, 100, 250, 512, whatever.  Heaps less than 50k.

I am using thrid and thread as the terminology rather than cid and conversation, since we are planning to migrate to that naming to be compatible with X-GM-THRID and friends.

Here's what we have now in a user.conversations database file:

$COUNTED_FLAGS  \Draft \Flagged $IsMailingList $IsNotification $HasAttachment $HasTD
$FOLDER_NAMES (brong.net!user.brong brong.net!user.brong.#addressbooks.Default ...)
<1854861326.738774.1447379931850.javamail.app at ltx1-app10519.prod.linkedin.com>  0 6fb7d5120b8f9034 1480422704
<1854883027.206388843.1338533743634.javamail.cboxp at ednabay.apple.com> 0 f9bcee0472ff4195 1480422709
<1855265052.15228760.1482500694926.javamail.root at ninus.ocn.ne.jp> 0 eda2e001f73a4211 1482500704
<1855785634.142624.1477413929314.javamail.app at ela4-app8372.prod.linkedin.com> 0 806410893c2c8322 1480422756
<1856. at mail.brong.net>  0 5ead2028d08d72a7 1480422707
<1856. at mail.brong.net>  0 5ead2028d08d72a7 1480422730
<1856079392.76211482248409804.javamail.gilthunderhead_svc at natthundprodapp>  0 f584cc80149c7c6b 1482248425
B00014f0fa1c61ce2 0 (415658 2 2 0 (0 0 0 0 0 0) ((63 410424 1 1 0) (65 415658 1 1 0)) ((=?utf-8?Q?TECH4U.COM.AU?= NIL TECH4U.COM.AU tech4u.com.au 1405567846 2)) 4-BayN54LMicroServer$269|GTX7704GBGamingGraphicsCard$479 226322 ((00014f0fa1c61ce21f770dc217003c49945534bb 2 1405567870 3857938598)))
B00016bdc28aa5245 0 (410448 1 1 0 (0 0 0 0 0 0) ((35 410448 1 1 0)) ((root NIL root pushme-pullyou.brong.net 1081144803 1)) IAMREPORT 2558 ((00016bdc28aa5245918448904c56ee64ea94aade 1 1081144803 4293330183)))
B000191314cae6c63 0 (410434 1 1 0 (0 0 0 0 0 0) ((27 410434 1 1 0)) (("NZMB Diplomacy Judge" NIL judge gem.win.co.nz 1085955619 1)) newbies2-S1902MPressfromEtoF 2482 ((000191314cae6c63a1aa90f99cf9d12bf93daa30 1 1085955665 753666634)))
Baa150752ffefb9da 0 (410439 1 1 0 (0 0 0 0 0 0) ((15 410439 1 1 0)) (("Bron Gondwana" NIL brong h-r-s.com 1073614242 1)) GoCRF2.1differencesfromGoCRF2.0 7731 ((aa150752ffefb9da6937f06f0eed2f7d752a5ebb 1 1073614242 3871373670)))
Baa150a2e47254b77 0 (415662 2 2 0 (0 0 0 0 0 0) ((63 410424 1 1 0) (65 415662 1 1 0)) (("The Economist" NIL TheEconomist execnews.eu 1413923577 2)) ExecutiveSubscriptionPlan:12weeksarenowonly$15 41388 ((aa150a2e47254b77bc18aa68f31ddfa109d7a97c 2 1413923585 1048183371)))
Bfffe3c76241b60e7 0 (410450 1 1 0 (0 0 1 0 0 0) ((49 410450 1 1 0)) (("Glenn Satchell" NIL Glenn.Satchell uniq.com.au 1257130670 1)) SAGE-AUNameChangeSurvey 4756 ((fffe3c76241b60e70467d613ca7c215eb36d3d07 1 1257130712 933063352)))
Bfffe7e672838ff33 0 (410439 1 1 0 (0 0 1 0 0 0) ((15 410439 1 1 0)) (("Martin Schulze" NIL joey infodrom.org 1084909593 1)) DebianWeeklyNews-May18th,2004 15255 ((fffe7e672838ff335904d710a0899213e19c6100 1 1084911167 530699332)))
Fbrong.net!user.brong 0 (477061 8867 8)
Fbrong.net!user.brong.#addressbooks 0 (83829 0 0)
Fbrong.net!user.brong.#addressbooks.Default 0 (476964 894 894)

So that's my current conversations database.  Let's look at each field in more detail:

'$' keys (variables)

$COUNTED_FLAGS => offsets into the counters in the B keys (see later)
$FOLDER_NAMES => mappings from folder numbers in the B keys (see later)

'<' keys (message ids)

Key is Message-ID (from Message-Id, In-Reply-To, References and X-ME-Message-Id headers)

Value is: version thrid timestamp, space separated atoms

version is always 0
thrid is the rest of the B key (absent leading 'B'), a 64 bit value hex encoded
timestamp is the unix time_t as decimal of the internaldate of the latest message seen referencing this message-id.
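
As a rough illustration (Python, not the Cyrus code), here is one way to split a '<' value into its three atoms and get back to the matching 'B' key.  The names MsgIdRecord and parse_msgid_value are made up for this sketch, and the 16-hex-digit check is an assumption based on thrid being a 64 bit value.

```python
from typing import NamedTuple


class MsgIdRecord(NamedTuple):
    version: int
    thrid: int      # 64 bit thread id
    timestamp: int  # unix time_t of the latest referencing message


def parse_msgid_value(value: str) -> MsgIdRecord:
    """Split a '<' record value into (version, thrid, timestamp)."""
    version, thrid_hex, timestamp = value.split()
    if len(thrid_hex) != 16:
        raise ValueError(f"expected 16 hex digits, got {thrid_hex!r}")
    return MsgIdRecord(int(version), int(thrid_hex, 16), int(timestamp))


rec = parse_msgid_value("0 6fb7d5120b8f9034 1480422704")
thread_key = f"B{rec.thrid:016x}"  # key of the thread record for this message-id
print(thread_key)
```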

'B' keys (threads)

One key per thread.  The key is 'B' followed by the hex encoded 64 bit thread id, which matches the 64 bit value in the cyrus.index for each message in the thread.

The value is quite detailed, so let's look at the TECH4U one:

B00014f0fa1c61ce2 0 (415658 2 2 0 (0 0 0 0 0 0) ((63 410424 1 1 0) (65 415658 1 1 0)) ((=?utf-8?Q?TECH4U.COM.AU?= NIL TECH4U.COM.AU tech4u.com.au 1405567846 2)) 4-BayN54LMicroServer$269|GTX7704GBGamingGraphicsCard$479 226322 ((00014f0fa1c61ce21f770dc217003c49945534bb 2 1405567870 3857938598)))


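The leading 0 is the version, as for the '<' keys.  The fields of the list that follows are, in order:

HIGHESTMODSEQ - the highest modseq of any message in this thread (including expunges)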
COUNT - the total count of messages in this thread (including expunged messages)
EXISTS - the total count of unexpunged messages in this thread across all folders.
UNSEEN - the total number of messages in this thread which do not have the \Seen flag set for the owner (system_flags)

FLAGS: for each item in $COUNTED_FLAGS, in that order, the count of unexpunged messages which have that flag set (user_flags or system_flags).  $COUNTED_FLAGS is set at database creation, and the only way to change it is to rebuild the entire database.
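
To make the positional mapping concrete, here is a small Python sketch that pairs the $COUNTED_FLAGS names with the FLAGS counters of one thread.  The helper name name_flag_counts is hypothetical; the flag list and the counters are taken from the dump above (the Bfffe3c76241b60e7 record).

```python
def name_flag_counts(counted_flags_value, counts):
    """Pair each name in $COUNTED_FLAGS with the counter at the same offset."""
    names = counted_flags_value.split()
    if len(names) != len(counts):
        raise ValueError("FLAGS must have one counter per counted flag")
    return dict(zip(names, counts))


COUNTED_FLAGS = r"\Draft \Flagged $IsMailingList $IsNotification $HasAttachment $HasTD"
flag_counts = [0, 0, 1, 0, 0, 0]  # FLAGS field of Bfffe3c76241b60e7

named = name_flag_counts(COUNTED_FLAGS, flag_counts)
set_flags = [name for name, n in named.items() if n > 0]
print(set_flags)
```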

FOLDERS: for each folder a list containing (NUMBER HIGHESTMODSEQ COUNT EXISTS UNSEEN). You can see that the TECH4U email has messages in two folders, number 63 and 65, and in each case there is a single message which still exists. NUMBER is an offset into the $FOLDER_NAMES list of this folder, so '0' is my INBOX, brong.net!user.brong, and so on. HIGHESTMODSEQ is the highest modseq of the messages in each of those folders, so you can see that the more recent message is the one in folder 65. The next three number fields are the same as for the entire conversation, but only for the counts in this one particular folder.
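
A sketch of resolving NUMBER against $FOLDER_NAMES, in Python.  Only the first two folder names are visible in the dump above, so the FOLDERS value here is invented for illustration (it is not a record from my database), and describe_folders is a hypothetical helper.

```python
# First two entries of $FOLDER_NAMES as shown above; the rest is elided there.
folder_names = [
    "brong.net!user.brong",
    "brong.net!user.brong.#addressbooks.Default",
]

# Hypothetical FOLDERS value: (NUMBER HIGHESTMODSEQ COUNT EXISTS UNSEEN)
folders = [(0, 500, 3, 2, 1), (1, 420, 1, 1, 0)]


def describe_folders(folder_names, folders):
    """Map each folder name to its per-folder counters for one thread."""
    out = {}
    for number, modseq, count, exists, unseen in folders:
        out[folder_names[number]] = {
            "modseq": modseq,
            "count": count,
            "exists": exists,
            "unseen": unseen,
        }
    return out


by_name = describe_folders(folder_names, folders)
# The folder holding the most recently changed message has the highest modseq.
newest_number = max(folders, key=lambda f: f[1])[0]
print(folder_names[newest_number])
```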

SENDERS: for each sender which has been mentioned in the conversation, the name, route, mailbox and domain (see the address structure in the IMAP ENVELOPE definition), followed by the timestamp of the latest internaldate of a message with that sender, and a count of the total number of messages with that sender.  This is used for FastMail XCONV commands, but will not be used for JMAP.

SUBJECT: a normalised version of the subject of every message in this thread.  The subject is used for every match except that done from X-ME-Message-Id, which bypasses normal subject checks.

SIZE: the sum in bytes of the sizes of all the unexpunged messages in this entire thread, across all folders.

THREAD: a list of all the messages in this thread (GUID EXISTS INTERNALDATE MSGID) where GUID is the digest.sha1 value of the message itself, EXISTS is the total number of records across all folders with this particular GUID, INTERNALDATE is the maximum of the internaldates of those messages, and MSGID is a 32 bit crc32 of the Message-Id header.  Optionally there is a 5th field INREPLYTO which is the 32 bit crc32 of the In-Reply-To header, but is only present on drafts.  This is used for the JMAP thread sorting algorithm, and is stored in pre-sorted order for fast getThreads.
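
Putting the field list together: a minimal Python sketch that reads a 'B' value into named fields.  This is not the Cyrus dlist parser, just enough of a parenthesised-list reader to handle the records shown above (atoms, quoted strings, nested lists).  The names TOKEN, parse_dlist, FIELDS and parse_thread_value are all made up here, and values are left as strings rather than converted.

```python
import re

# "(" or ")" or a double-quoted string or a run of non-space, non-paren chars
TOKEN = re.compile(r'\(|\)|"(?:[^"\\]|\\.)*"|[^\s()]+')

FIELDS = ("modseq", "count", "exists", "unseen", "flags",
          "folders", "senders", "subject", "size", "thread")


def parse_dlist(text):
    """Parse one parenthesised list into nested Python lists of strings."""
    tokens = TOKEN.findall(text)
    pos = 0

    def read():
        nonlocal pos
        tok = tokens[pos]
        pos += 1
        if tok == "(":
            items = []
            while tokens[pos] != ")":
                items.append(read())
            pos += 1  # consume the closing ")"
            return items
        if tok.startswith('"'):
            return tok[1:-1]  # strip the surrounding quotes
        return tok

    return read()


def parse_thread_value(value):
    """Split 'VERSION (list...)' and name the list items in order."""
    version, _, rest = value.partition(" ")
    record = dict(zip(FIELDS, parse_dlist(rest)))
    record["version"] = int(version)
    return record


value = ("0 (410448 1 1 0 (0 0 0 0 0 0) ((35 410448 1 1 0)) "
         "((root NIL root pushme-pullyou.brong.net 1081144803 1)) IAMREPORT 2558 "
         "((00016bdc28aa5245918448904c56ee64ea94aade 1 1081144803 4293330183)))")
record = parse_thread_value(value)
print(record["subject"], record["size"], record["thread"][0][0])
```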

'F' keys (folder metadata)

Some simple metadata for each folder.  Let's look at my INBOX:

Fbrong.net!user.brong 0 (477061 8867 8)

VERSION (always 0) followed by a dlist of (HIGHESTMODSEQ EXISTS UNSEEN)

HIGHESTMODSEQ - the highest modseq of any conversation which is present in this folder (including expunges)
EXISTS - the number of conversations which have a non-expunged message within this folder
UNSEEN - the number of conversations which have both a non-expunged message within this folder, and have a non-zero unseen count. NOTE: it is NOT necessary for the unseen message to be present in the folder, just that there is an unseen message somewhere in the conversation, and that the conversation is also in this folder.
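
The UNSEEN rule is easy to get wrong, so here is a Python sketch that recomputes EXISTS and UNSEEN for one folder from a set of threads.  It only illustrates the semantics described above; it is not how the counters are actually maintained, and the thread data and the folder_counts helper are invented for the example.

```python
def folder_counts(threads, folder_number):
    """Return (EXISTS, UNSEEN) for a folder, per the definitions above."""
    exists = unseen = 0
    for thread in threads:
        # The thread is "in" the folder if it has a non-expunged message there.
        in_folder = any(
            number == folder_number and folder_exists > 0
            for number, folder_exists in thread["folders"]
        )
        if not in_folder:
            continue
        exists += 1
        # Thread-wide unseen count: the unseen message may live elsewhere.
        if thread["unseen"] > 0:
            unseen += 1
    return exists, unseen


# Hypothetical threads: "folders" holds (NUMBER, EXISTS) pairs per folder.
threads = [
    {"unseen": 1, "folders": [(0, 1), (5, 1)]},  # the unseen message is in folder 5
    {"unseen": 0, "folders": [(0, 2)]},
    {"unseen": 2, "folders": [(0, 0), (5, 2)]},  # only expunged messages left in folder 0
]

print(folder_counts(threads, 0))
print(folder_counts(threads, 5))
```

The first thread counts as unseen in folder 0 even though its unseen message sits in folder 5, which is the case the NOTE describes.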

'G' keys (guid mappings)

These keys have no value at all; all the data is stored in the key.

Key is:

'G' GUID ':' NUMBER ':' UID [ '[' PARTSPEC ']' ]

So for a message GUID there is no trailing partspec.  NUMBER is a folder number per the $FOLDER_NAMES above, and UID is the value of the UID field in cyrus.index for the non-expunged record with this GUID (digest.sha1).  This is equivalent to Gmail's X-GM-MSGID, but I'm not going to reuse the term MSGID because it's insanely overloaded in the email space.
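
A Python sketch of taking a 'G' key apart.  It assumes NUMBER and UID are written in decimal and that the GUID is the 40 hex digit sha1; the sample key uses a GUID and folder number from the dump above, but its UID and the PARTSPEC "1.2" are invented.  G_KEY, GuidKey and parse_guid_key are hypothetical names.

```python
import re
from typing import NamedTuple, Optional

# 'G' GUID ':' NUMBER ':' UID [ '[' PARTSPEC ']' ]
G_KEY = re.compile(r"^G([0-9a-f]{40}):(\d+):(\d+)(?:\[([^\]]*)\])?$")


class GuidKey(NamedTuple):
    guid: str
    folder_number: int
    uid: int
    partspec: Optional[str]  # None for a whole-message GUID


def parse_guid_key(key: str) -> GuidKey:
    """Split a 'G' key into its components."""
    m = G_KEY.match(key)
    if m is None:
        raise ValueError(f"not a G key: {key!r}")
    guid, number, uid, partspec = m.groups()
    return GuidKey(guid, int(number), int(uid), partspec)


whole = parse_guid_key("Gaa150752ffefb9da6937f06f0eed2f7d752a5ebb:15:123")
part = parse_guid_key("Gaa150752ffefb9da6937f06f0eed2f7d752a5ebb:15:123[1.2]")
print(whole.folder_number, whole.uid, part.partspec)
```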

You can have multiple emails with the same GUID, for example let's find the GUIDs of that TECH4U email (note in the THREAD part of the 'B' key it has just a single GUID which is present twice...)


And there we have it.  UID 51202 in folder 63 and UID 51201 in folder 65 (yes, this is me creating a giant folder and copying a ton of old archived rubbish into it, then copying the whole lot to another folder for testing purposes.  Every email in those two folders has two copies)
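
Going the other way, this Python sketch builds the two 'G' keys implied by those numbers, using the key format given earlier.  It assumes NUMBER and UID are written in decimal; guid_key is a hypothetical helper, not a Cyrus function.

```python
def guid_key(guid, folder_number, uid, partspec=None):
    """Format a 'G' key: 'G' GUID ':' NUMBER ':' UID [ '[' PARTSPEC ']' ]."""
    key = f"G{guid}:{folder_number}:{uid}"
    if partspec is not None:
        key += f"[{partspec}]"
    return key


# GUID from the THREAD part of the TECH4U 'B' record above
TECH4U_GUID = "00014f0fa1c61ce21f770dc217003c49945534bb"

keys = [
    guid_key(TECH4U_GUID, 63, 51202),
    guid_key(TECH4U_GUID, 65, 51201),
]
for key in keys:
    print(key)
```

Two keys for one GUID lines up with the EXISTS value of 2 in the THREAD entry for that message.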


And that's the structure of the conversations DB as it exists now.  I will follow up to this email with a description of how I want to change this to support the additional features we want while not losing everything else.

  Bron Gondwana
  brong at fastmail.fm
