Prepending Xapian Tiers

Dilyan Palauzov Dilyan.Palauzov at aegee.org
Sat May 25 08:18:12 EDT 2019


Hello Bron,

It is still not entirely clear to me how the Xapian search backend
works.

Does search_batchsize have any impact during compacting?  Does this
setting say how many new messages have to arrive before they are
indexed together in Xapian?

What are the use cases for calling "squatter -F" (In compact mode, filter
the resulting database to only include messages which are not expunged  
in mailboxes with existing name/uidvalidity.) and "squatter -X"  
(Reindex all messages before compacting.  This mode reads all the  
lists of messages indexed by the listed tiers, and re-indexes them  
into a temporary database before compacting that into place)?

Why should one keep the index of deleted and expunged messages, and
how can one delete references to messages that are both expunged and
expired (after cyr_expire -X0, so removed from the hard disk), while
keeping the index for messages that are still on the hard disk although
the user has expunged (double-deleted) them?

How does re-compacting (as in  
https://fastmail.blog/2014/12/01/email-search-system/) differ from  
re-indexing (as in the manual page of master/squatter)?

What gets indexed?  For a mailbox receiving only reports (dkim, dmarc,
mta-sts, arf, mtatls), some of which are compressed (zip, gzip), the
Xapian index grows very fast.

How can I remove a tier that contains no data but is mentioned in
the .xapianactive files?

How can I rename a tier?

How can I efficiently prepend a new tier in the .xapianactive file?
“squatter -t X -z Y -o” does add the defaultsearchtier to the
.xapianactive files, but first it has to duplicate all existing files
with rsync.  This is not efficient, as big files have to be copied.

> What it does under the hood is creates a new database and copy all  
> the documents over from the source databases, then compress the end  
> result into the most compact and fastest xapian format which is  
> designed to never write again. This compressed file is then stored  
> into the target database name, and in an exclusively locked  
> operation the new database is moved into place and the old tiers are  
> removed from the xapianactive, such that all new searches look into  
> the single destination database instead of the multiple source  
> databases.

I do not get this.  The number of tiers to check does not decrease
after merging, and with three tiers the number of databases is three
most of the time.
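
To make the question concrete, here is a toy model of how I read the
quoted description.  It is only a sketch: the function name, the
"tier:generation" entry format (taken from the squatter log messages
quoted further down), the newest-first ordering and the rule for
picking the next generation number are my assumptions, not the Cyrus
implementation.

```python
def compact(active, sources, dest_tier, default_tier):
    """Toy model of one compaction over a .xapianactive list.

    active       -- list of "tier:generation" strings, first entry writable
    sources      -- set of entries selected for merging
    dest_tier    -- tier that receives the merged database
    default_tier -- tier used for a fresh writable entry (assumed behaviour)
    """
    def tier(entry):
        return entry.rsplit(":", 1)[0]

    def gen(entry):
        return int(entry.rsplit(":", 1)[1])

    def next_gen(name, entries):
        gens = [gen(e) for e in entries if tier(e) == name]
        return max(gens) + 1 if gens else 0

    merged = f"{dest_tier}:{next_gen(dest_tier, active)}"

    # Replace all selected sources by the single merged entry.
    result = []
    placed = False
    for entry in active:
        if entry in sources:
            if not placed:
                result.append(merged)
                placed = True
        else:
            result.append(entry)

    # If the first (writable) entry was compacted away, a new writable
    # entry is prepended in the default tier.
    if active and active[0] in sources:
        fresh = f"{default_tier}:{next_gen(default_tier, active + [merged])}"
        result.insert(0, fresh)
    return result
```

In this model, compacting T1:3 and T2:1 into T2 from the list
T1:3, T2:1, T3:0 yields T1:4, T2:2, T3:0, so three databases remain.
The list only gets shorter when a tier holds several databases, for
example T1:3, T2:1, T2:0, T3:0 compacted from T2 and T3 into T3 gives
T1:3, T3:1.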

What happens if squatter is terminated during Xapian compacting,
apart from leaving temporary files behind?  Will rerunning it just
start from the beginning?
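
For the leftover temporary files, a minimal sketch that only lists
directories whose name ends in ".NEW" (the suffix reported further down
in this thread) below a search partition.  The function name and the
example path are hypothetical, and nothing is deleted.

```python
import os


def find_leftover_new_dirs(root):
    """Return the sorted paths of directories below root ending in '.NEW'."""
    leftovers = []
    for dirpath, dirnames, _filenames in os.walk(root):
        for name in dirnames:
            if name.endswith(".NEW"):
                leftovers.append(os.path.join(dirpath, name))
        # Do not descend into the leftover directories themselves.
        dirnames[:] = [d for d in dirnames if not d.endswith(".NEW")]
    return sorted(leftovers)


if __name__ == "__main__":
    # Hypothetical search partition path; adjust to the local imapd.conf.
    for path in find_leftover_new_dirs("/var/spool/cyrus/search"):
        print(path)
```

This should only be run while no squatter is compacting, since a
running compaction presumably uses such a directory as well.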

Is the idea to have three tiers like this:

At run time, new messages are indexed by Xapian in squatter-rolling
mode on tmpfs/RAM, say on tier T1.

Regularly, the RAM database is compacted to hard disk (tier T2), say
T1 and T2 are merged into T2.  The database on the hard disk is
read-only and searching it is faster, as the database is “compact”.

Only if two compactions of the same sources or destination run in
parallel does the merge fail and get skipped for that user.  The merge
is retried whenever merging T1 and T2 is rescheduled.

As the databases in T2 get bigger, merging T1 and T2 takes more and
more time.  So one more Xapian tier is created, T3.  Less regularly,
T2 and T3 are merged into T3.  This process takes a while.  But
afterwards, T2 is small again, so merging T1 and T2 into T2 is fast.
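
A small simulation of the scheme as I understand it.  This is a sketch
under assumptions: one T1+T2 merge per day, the number of documents
copied as a stand-in for merge time, and all names and numbers made up
for illustration.

```python
def simulate(days, new_per_day, t2_to_t3_every):
    """Return (daily_cost, final_sizes) for a toy three-tier setup.

    daily_cost[i] is the number of documents copied by the T1+T2 -> T2
    merge on day i+1; final_sizes is the tuple (T1, T2, T3).
    """
    t1 = t2 = t3 = 0
    daily_cost = []
    for day in range(1, days + 1):
        t1 += new_per_day            # rolling squatter indexes into T1
        daily_cost.append(t1 + t2)   # frequent merge rewrites T1 and T2
        t2 += t1
        t1 = 0
        if day % t2_to_t3_every == 0:
            t3 += t2                 # infrequent, slow merge into T3
            t2 = 0
    return daily_cost, (t1, t2, t3)


if __name__ == "__main__":
    with_t3, _ = simulate(14, 100, 7)
    without_t3, _ = simulate(14, 100, 15)  # T2 -> T3 never happens
    print(max(with_t3), max(without_t3))
```

With 100 new messages per day and a weekly T2+T3 merge, the daily merge
never copies more than 700 documents in this model, while without T3 it
grows to 1400 after two weeks and keeps growing.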

How many tiers make sense, apart from having one more for power-off events?

Regards
   Дилян
----- Message from Bron Gondwana <brong at fastmailteam.com> ---------
    Date: Tue, 21 May 2019 21:46:42 +1000
    From: Bron Gondwana <brong at fastmailteam.com>
Subject: Re: Prepending Xapian Tiers
      To: Дилян Палаузов <Dilyan.Palauzov at aegee.org>
      Cc: Cyrus Devel <cyrus-devel at lists.andrew.cmu.edu>


> On Tue, May 21, 2019, at 18:41, Dilyan Palauzov wrote:
>> Hello,
>>
>> thanks, Bron, for your answer.
>>
>> I gave it a try.
>>
>> squatter does not remove .NEW directories when aborted (SIGINT), the
>> directories have to be removed manually
>
> https://github.com/cyrusimap/cyrus-imapd/issues/2765
>
>>
>> squatter -t X -z X -o recognizes, when the directory structure behind
>> tier X exists, that nothing has to be done, prints “Skipping X for
>> user.ABC, only one” and quits, without updating the .xapianactive files.
>
> yeah right, that won't work. Glad to know :)
>
>> squatter -t Y -z Y -o, when the directory structure behind tier Y
>> does not exist, prints “compressing Y:1,Y:0 to Y:2 for user... (active
>> Y:1,Y:0)”. As far as I remember this has not updated the xapianactive
>> files.
>
> Yeah right, it won't add a new target unless you are compressing the  
> current first item in xapianactive.
>
>> squatter -t X -z Y -o does add to the .xapianactive files the
>> defaultsearchtier, but first has to duplicate with rsync all existing
>> files. This takes a while… But at the end did what I wanted.
>> Afterwards the directory structure for the new tier was not created.
>> The directory structure was created once I started all the cyrus
>> processes again.
>
> That makes sense. We don't create a directory structure until a  
> document gets created in there.
>
>> squatter -t X -z Y -o emits the message “undefined search partition
>> X,Ysearchpartition-default” and then “compressing X:0,X,Y:0 to Y:2 for
>> ... (active Y:0,X:0,X,Y:0,Y:1)”.
>
> That sounds like a sanity checking failure! Good catch:
>
> https://github.com/cyrusimap/cyrus-imapd/issues/2764
>
>> Does squatter -t X -z Y append X to Y, or it deletes Y and copies X to
>> Y? In the latter case, is there any (performance) difference between
>> "squatter -t X,Y -z Y" and “squatter -t Y,X -z Y”?
>
> There's no difference in what order you add items to -t. -t is a  
> comma separated list of selectors for source items. You can even  
> explicitly say:
>
> squatter -t X:0,X:2,Y:45 -z Y and it will compact just those three  
> sources into a new target in Y.
>
> What it does under the hood is creates a new database and copy all  
> the documents over from the source databases, then compress the end  
> result into the most compact and fastest xapian format which is  
> designed to never write again. This compressed file is then stored  
> into the target database name, and in an exclusively locked  
> operation the new database is moved into place and the old tiers are  
> removed from the xapianactive, such that all new searches look into  
> the single destination database instead of the multiple source  
> databases.
>
>> Can one xapian tier store a document, and another tier store the
>> information, that the address of the document has changed?
>
> It doesn't work like that. The addresses of the documents never  
> change (they are the sha1 of the document contents, and Cyrus  
> documents are all immutable). The xapian engine searches across the  
> full set of databases listed in xapianactive in order to find  
> document ids, then maps them through the conversations.db file to  
> find the actual emails. A copy/move of an email updates the  
> conversations.db lookups, so the next search will find the new  
> location without anything changing in xapian.
>
> the cyrus.indexed.db file is just a convenience to allow rolling  
> squatter to avoid having to re-scan records that it knows are  
> already indexed.
>
> Bron.
>
>> Regards
>>  Дилян
>>
>> ----- Message from Bron Gondwana <brong at fastmailteam.com> ---------
>>  Date: Mon, 20 May 2019 18:52:07 +1000
>>  From: Bron Gondwana <brong at fastmailteam.com>
>> Subject: Re: Prepending Xapian Tiers
>>  To: Cyrus Devel <cyrus-devel at lists.andrew.cmu.edu>
>>
>>
>> > On Fri, May 17, 2019, at 23:52, Дилян Палаузов wrote:
>> >> Hello,
>> >>
>> >> I set up a Cyrus system with one tier. I think it works. The
>> >> .xapianactive files contain 'tiername: 0'.
>> >>
>> >> How can I insert a second tier?
>> >
>> > I have never tried this on a live server! Clearly the right thing to
>> > do is to build a cassandane search which implements doing this so
>> > that we can make sure it works.
>> >
>> >> Adding a XYZsearchpartition-default to imapd.conf, together with
>> >> defaultsearchtier: XYZ does not utilize the new directory: it stays
>> >> empty and the .xapianactive files do not get updated to mention the
>> >> new tier.
>> >
>> > That looks like it should work. I assume you have restarted your
>> > cyrus since making the change? I'm not certain that a rolling
>> > squatter will discover a new config in the way that imapd does.
>> >
>> > Also - you'll need to run squatter in compact mode in order to add a
>> > new xapianactive entry. The simplest could be:
>> >
>> > squatter -z tiername -t tiername -o
>> >
>> > I believe that given your current setup, this will just copy the
>> > entry from tiername:0 to tiername:1 and also create XYZ:0 in the
>> > xapianactive file at the same time.
>> >
>> >> Besides, if a message is MOVEd over IMAP, is any optimization
>> >> utilized, to avoid reindexing the message, but just change the
>> >> address of the document?
>> >
>> > Yes, both XAPINDEXED mode where the GUID is read from xapian, and
>> > CONVINDEXED mode where the GUID is looked up via user.conversations
>> > and then mapped into the cyrus.indexed.db files in each xapian tier
>> > allow Xapian to skip reindexing when a message is already indexed.
>> > This works for both MOVE and for re-uploading of an identical
>> > message file via IMAP.
>> >
>> > Cheers,
>> >
>> > Bron.
>> >
>> > --
>> > Bron Gondwana, CEO, FastMail Pty Ltd
>> > brong at fastmailteam.com
>>
>>
>> ----- End message from Bron Gondwana <brong at fastmailteam.com> -----
>>
>>
>>
>
> --
>  Bron Gondwana, CEO, FastMail Pty Ltd
>  brong at fastmailteam.com


----- End message from Bron Gondwana <brong at fastmailteam.com> -----



