<!DOCTYPE html><html><head><title></title><style type="text/css">p.MsoNormal,p.MsoNoSpacing{margin:0}</style></head><body><div style="font-family:Arial;">On Sat, May 25, 2019, at 22:19, Dilyan Palauzov wrote:<br></div><blockquote type="cite" id="qt"><div style="font-family:Arial;">Hello Bron,<br></div><div style="font-family:Arial;"><br></div><div style="font-family:Arial;">For me it is still not absolutely clear how things work with the  <br></div><div style="font-family:Arial;">Xapian search backend.<br></div><div style="font-family:Arial;"><br></div><div style="font-family:Arial;">Has search_batchsize any impact during compacting?  Does this setting  <br></div><div style="font-family:Arial;">say how many new messages have to arrive, before indexing them  <br></div><div style="font-family:Arial;">together in Xapian?<br></div></blockquote><div style="font-family:Arial;"><br></div><div style="font-family:Arial;">No, search_batchsize just means that when you're indexing a brand new mailbox with millions of emails, it will put that many emails at a time into a single transactional write to the search index.  During compacting, this value is not used.<br></div><div style="font-family:Arial;"><br></div><blockquote type="cite" id="qt"><div style="font-family:Arial;">What are the use-cases to call "squatter -F" (In compact mode, filter  <br></div><div style="font-family:Arial;">the resulting database to only include messages which are not expunged  <br></div><div style="font-family:Arial;">in mailboxes with existing name/uidvalidity.) and "squatter -X"  <br></div><div style="font-family:Arial;">(Reindex all messages before compacting.  
This mode reads all the  <br></div><div style="font-family:Arial;">lists of messages indexed by the listed tiers, and re-indexes them  <br></div><div style="font-family:Arial;">into a temporary database before compacting that into place)?<br></div></blockquote><div style="font-family:Arial;"><br></div><div style="font-family:Arial;">-F is useful to run occasionally so that your search indexes don't grow forever.  When emails are expunged, their matching terms aren't removed from the xapian indexes, so the database will be bigger than necessary and when you search for a term which is in deleted emails, it will cause extra IO and conversations DB lookups on the document id.<br></div><div style="font-family:Arial;"><br></div><blockquote type="cite" id="qt"><div style="font-family:Arial;">Why shall one keep index of deleted and expunged messages and how to  <br></div><div style="font-family:Arial;">delete references from messages that are both expunged and expired  <br></div><div style="font-family:Arial;">(after cyr_expire -X0, so removed from the hard disk), but keep the  <br></div><div style="font-family:Arial;">index to messages that are still on the hard disk, but the user  <br></div><div style="font-family:Arial;">expunged (double-deleted) them.<br></div></blockquote><div style="font-family:Arial;"><br></div><div style="font-family:Arial;">I'm not sure I understand your question here.  Deleting from xapian databases is slow, and particularly with the compacted form, it's designed to be efficient if you don't write to it.  
Finally, since we're de-duplicating by GUID, you would need to do a conversations db lookup for every deleted email to check the refcount before cleaning up the associated record.<br></div><div style="font-family:Arial;"><br></div><blockquote type="cite" id="qt"><div style="font-family:Arial;">How does re-compacting (as in  <br></div><div style="font-family:Arial;">https://fastmail.blog/2014/12/01/email-search-system/) differ from  <br></div><div style="font-family:Arial;">re-indexing (as in the manual page of master/squatter)?<br></div></blockquote><div style="font-family:Arial;"><br></div><div style="font-family:Arial;">"re-compacting" - just means combining multiple databases together into a single compacted database - so the terms in all the source databases are compacted together into a destination database.  I used "re-compacting" because the databases are already all compacted, so it's just combining them rather than gaining the initial space saving of the first compact.<br></div><div style="font-family:Arial;"><br></div><div style="font-family:Arial;">"re-indexing" involves parsing the email again and creating terms from the source document.  When you "reindex" a set of xapian directories, the squatter reads the cyrus.indexed.db for each of the source directories to know which emails it claims to cover, and reads each of those emails in order to index them again.<br></div><div style="font-family:Arial;"><br></div><blockquote type="cite" id="qt"><div style="font-family:Arial;">What gets indexed?  For a mailbox receiving only reports (dkim, dmarc,  <br></div><div style="font-family:Arial;">mta-sts, arf, mtatls), some of which are archived (zip, gzip) the  <br></div><div style="font-family:Arial;">Xapian index increases very fast.<br></div></blockquote><div style="font-family:Arial;"><br></div><div style="font-family:Arial;">This would be because these emails often contain unique identifiers, which do indeed take a lot of space.  
We have had lots of debates over what exactly should be indexed - for example, should you index sha1 values (e.g. git commit identifiers)?  They're completely random, and hence all 40 characters need to be indexed each time!  But - it's very handy to be able to search your email for a known identifier and see where it was referenced... so we decided to include them.<br></div><div style="font-family:Arial;"><br></div><div style="font-family:Arial;">We try not to index GPG parts or other opaque blobs where nobody will be interested in searching for the phrase.  Likewise we don't index MIME boundaries, because they're substructure, not something a user would know to search for.<br></div><div style="font-family:Arial;"><br></div><div style="font-family:Arial;">We have a work in progress on the master branch to index attachments using an external tool to extract text from the attachment where possible, which will increase index sizes even more if enabled!<br></div><div style="font-family:Arial;"><br></div><blockquote type="cite" id="qt"><div style="font-family:Arial;">How can I remove a tier, that contains no data, but is mentioned in  <br></div><div style="font-family:Arial;">the .xapianactive files?<br></div></blockquote><div style="font-family:Arial;"><br></div><div style="font-family:Arial;">If you run a compact which includes that tier as a source and not as a destination, then it should remove that tier from every .xapianactive file, at which point you can remove it from your imapd.conf.<br></div><div style="font-family:Arial;"><br></div><blockquote type="cite" id="qt"><div style="font-family:Arial;">How can I rename a tier?<br></div></blockquote><div style="font-family:Arial;"><br></div><div style="font-family:Arial;">The whole point of tier names not being paths on disk is so you can change the disk path without having to rename the tier.  
Tier names are IDs, so you're not supposed to rename them.<br></div><div style="font-family:Arial;"><br></div><div style="font-family:Arial;">Having said that, you could add a new tier, compact everything across to that tier, then remove the old tier.<br></div><div style="font-family:Arial;"><br></div><blockquote type="cite" id="qt"><div style="font-family:Arial;">How can I efficiently prepend a new tier in the .xapianactive file?   <br></div><div style="font-family:Arial;">“squatter -t X -z Y -o” does add to the .xapianactive files the  <br></div><div style="font-family:Arial;">defaultsearchtier, but first has to duplicate with rsync all existing  <br></div><div style="font-family:Arial;">files.  This is not efficient, as big files have to be copied.<br></div></blockquote><div style="font-family:Arial;"><br></div><div style="font-family:Arial;">I'm afraid that's what we have right now.  Again, tiers are supposed to be set up at the start and not fiddled with afterwards, so the system isn't designed to allow you to quickly add a new tier.<br></div><div style="font-family:Arial;"><br></div><blockquote type="cite" id="qt"><div style="font-family:Arial;">> What it does under the hood is creates a new database and copy all  <br></div><div style="font-family:Arial;">> the documents over from the source databases, then compress the end  <br></div><div style="font-family:Arial;">> result into the most compact and fastest xapian format which is  <br></div><div style="font-family:Arial;">> designed to never write again. 
This compressed file is then stored  <br></div><div style="font-family:Arial;">> into the target database name, and in an exclusively locked  <br></div><div style="font-family:Arial;">> operation the new database is moved into place and the old tiers are  <br></div><div style="font-family:Arial;">> removed from the xapianactive, such that all new searches look into  <br></div><div style="font-family:Arial;">> the single destination database instead of the multiple source  <br></div><div style="font-family:Arial;">> databases.<br></div><div style="font-family:Arial;"><br></div><div style="font-family:Arial;">I do not get this.  The amount of tiers to check does not reduce after  <br></div><div style="font-family:Arial;">doing merging and with three tiers the amount of databases is most of  <br></div><div style="font-family:Arial;">the time three.<br></div></blockquote><div style="font-family:Arial;"><br></div><div style="font-family:Arial;">Not if you're compacting frequently.  We do the following:<br></div><div style="font-family:Arial;"><br></div><div style="font-family:Arial;">* hourly<br></div><div style="font-family:Arial;">  - check if tmpfs is > 50% full - quit if not.<br></div><div style="font-family:Arial;">  - run squatter -a -o -t temp -z data<br></div><div style="font-family:Arial;">* daily<br></div><div style="font-family:Arial;">  - regardless of tmpfs size, compact everything on temp and meta down to data<br></div><div style="font-family:Arial;">  - squatter -a -t temp,meta -z data<br></div><div style="font-family:Arial;">* weekly on Sunday - re-compact all data partitions together<br></div><div style="font-family:Arial;">  - squatter -a -t temp,meta,data -z data<br></div><div style="font-family:Arial;">* And finally, once per week once the re-compact is done, check if we need to filter and recompact the archive, if so:<br></div><div style="font-family:Arial;">  - squatter -a -t data,archive -z archive -F<br></div><div 
style="font-family:Arial;"><br></div><div style="font-family:Arial;">Since today is Monday, most users will have two data databases, so the xapianactive might be something like:<br></div><div style="font-family:Arial;">  temp:66 data:52 data:51 archive:2<br></div><div style="font-family:Arial;"><br></div><div style="font-family:Arial;">Later in the week, it might be:<br></div><div style="font-family:Arial;">  temp:70 data:66 data:55 data:54 data:53 data:52 data:51 archive:2<br></div><div style="font-family:Arial;"><br></div><div style="font-family:Arial;">And then maybe it will re-compact on Sunday and the user will have<br></div><div style="font-family:Arial;">  temp:74 archive:3<br></div><div style="font-family:Arial;"><div style="font-family:Arial;"><br></div></div><blockquote type="cite" id="qt"><div style="font-family:Arial;">What happens, if squatter is terminated during Xapian-compacting,  <br></div><div style="font-family:Arial;">apart from leaving temporary files?  Will rerunning it, just start  <br></div><div style="font-family:Arial;">from beginning?<br></div></blockquote><div style="font-family:Arial;"><br></div><div style="font-family:Arial;">The source databases will still be in the .xapianactive file, so yes - a new compact run will take those same source databases and start again.<br></div><div style="font-family:Arial;"><br></div><blockquote type="cite" id="qt"><div style="font-family:Arial;">Is the idea to have three tiers like this:<br></div><div style="font-family:Arial;"><br></div><div style="font-family:Arial;">At run time, new messages are indexed by Xapian in squatter-rolling  <br></div><div style="font-family:Arial;">mode on tmpfs/RAM, say on tier T1.<br></div></blockquote><div style="font-family:Arial;"><br></div><div style="font-family:Arial;">That's certainly what we do, since indexing is too IO-intensive otherwise.  
<br></div><div style="font-family:Arial;"><br></div><blockquote type="cite" id="qt"><div style="font-family:Arial;">Regularly, the RAM database is compacted to hard disk (tier T2), say  <br></div><div style="font-family:Arial;">T1 and T2 are merged into T2.  The database on the hard disk is  <br></div><div style="font-family:Arial;">read-only and search in it is accelerated, as the database is “compact”.<br></div></blockquote><div style="font-family:Arial;"><br></div><div style="font-family:Arial;">As above - during the week we don't even merge T2 back together, we compact from T1 to a single small database on T2 - leading to multiple databases on T2 existing at once.<br></div><div style="font-family:Arial;"><br></div><blockquote type="cite" id="qt"><div style="font-family:Arial;">Only if two compactions happen in parallel of the same sources or  <br></div><div style="font-family:Arial;">destination, the merge fails and is skipped for that user.  The merge  <br></div><div style="font-family:Arial;">is retried whenever merging T1 and T2 is rescheduled.<br></div></blockquote><div style="font-family:Arial;"><br></div><div style="font-family:Arial;">Yes - though that's pretty rare on our systems because we use a lock around the cron task, so the only time this would happen is if you ran a manual compaction at the same time as the cron job.<br></div><div style="font-family:Arial;"><br></div><blockquote type="cite" id="qt"><div style="font-family:Arial;">As the databases in T2 get bigger, merging T1 and T2 takes more and  <br></div><div style="font-family:Arial;">more time.  So one more Xapian tier is created, T3.  Less regularly,  <br></div><div style="font-family:Arial;">T2 and T3 are merged into T3.  This process takes a while.  But  <br></div><div style="font-family:Arial;">afterwards, T2 is again small, so merging T1 and T2 into T2 is fast.<br></div></blockquote><div style="font-family:Arial;"><br></div><div style="font-family:Arial;">Yes, that's what we do.  
This is also the time that we filter the DB, so the T3 database only contains emails which were still alive at the time of compaction.<br></div><div style="font-family:Arial;"><br></div><blockquote type="cite" id="qt"><div style="font-family:Arial;">How many tiers make sense, apart from having one more for power-off events?<br></div></blockquote><div style="font-family:Arial;"><br></div><div style="font-family:Arial;">Having another one for power-off events doesn't make heaps of sense unless you have a fast disk.  That's kind of what our "meta" partition is: an SSD RAID1 that's faster than the "data" partition, which is a SATA spinning RAID1 set.<br></div><div style="font-family:Arial;"><br></div><div style="font-family:Arial;">When we power off a server, we run a task to compact all the temp partitions down - it used to be to meta, but we found that compacting straight to data was plenty fast, so we just do that now!<br></div><div style="font-family:Arial;"><br></div><div style="font-family:Arial;">If you power off a server without copying the indexes off tmpfs, they are of course lost.  This means that you need to run squatter -i on the server after reboot to index all the recent messages again!  So we always run a squatter -i after a crash or power outage before bringing that server back into production.<br></div><div style="font-family:Arial;"><br></div><div style="font-family:Arial;">Cheers,<br></div><div style="font-family:Arial;"><br></div><div style="font-family:Arial;">Bron.<br></div><div style="font-family:Arial;"><br></div><div style="font-family:Arial;">--<br></div><div id="sig56629417"><div class="signature">  Bron Gondwana, CEO, FastMail Pty Ltd<br></div><div class="signature">  brong@fastmailteam.com<br></div><div class="signature"><br></div></div><div style="font-family:Arial;"><br></div></body></html>