Removing email from Xapian tier databases

Bron Gondwana brong at fastmailteam.com
Mon Feb 11 09:19:51 EST 2019


It's definitely safe to have one squatter writing in rolling mode while another is repacking. I wouldn't run multiple repacks in parallel, as they can end up doing duplicate work (though the end result should always be correct and safe).

Here's what we run:

# Any time the disk gets over 50%, compress -o single down to data
13 * * * * [% INCLUDE cronjob c='/home/mod_perl/hm/scripts/xapian_compact.pl -a -o -d 50 temp data' %]
# Copy the temporary search databases down to data during the week
43 1 * * 1,2,3,4,5,6 [% INCLUDE cronjob c='/home/mod_perl/hm/scripts/xapian_compact.pl -a temp,meta data' %]
# Sundays repack the entire data directory
43 1 * * 0 [% INCLUDE cronjob c='/home/mod_perl/hm/scripts/xapian_compact.pl -a temp,meta,data data' %]
# Late on Sundays, pack any oversized data directories down to archive
0 15 * * 0 [% INCLUDE cronjob c='/home/mod_perl/hm/scripts/xapian_archive.pl -a' %]

And here's the interesting logic. In xapian_compact.pl:

    # With -d, skip this run unless the search disk or /run/cyrus
    # is at least $Opts{d} percent full.
    if ($Opts{d}) {
        my $Path = $Slot->SearchPath();
        my $Usage = df($Path);
        my $RunUsage = df("/run/cyrus");
        return Process::Status->new(0) if ($Usage->{per} < $Opts{d} and $RunUsage->{per} < $Opts{d});
    }

    # Build the squatter command line: compact the $src tiers into $dest.
    my @args = (-z => $dest, -t => $src);
    push @args, '-v' if $Opts{v};
    push @args, '-o' if $Opts{o};
    push @args, '-F' if $Opts{F};
    push @args, '-X' if $Opts{X};
    push @args, ('-T' => $Opts{T}) if $Opts{T};
    push @args, ('-u' => $Opts{u}) if $Opts{u};
    my %RunOpts = (
        PrintOutput => 1,
    );
    $RunOpts{Nice} = 1 unless $Opts{N};
    $RunOpts{Daemon} = 1 if $Opts{D};

    $0 = "xapian_compact: $SN";
    $Slot->RunCommand(\%RunOpts, 'squatter', @args);
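
Read on its own, the -d check above skips the run only when both filesystems are below the threshold. The following is a minimal Python sketch of that condition, not the script itself: the function names are hypothetical, and percent_used is only an approximation of whatever the df() helper in the Perl code reports as "per".

```python
import shutil


def percent_used(path):
    """Approximate percentage of the filesystem holding `path` that is in use."""
    usage = shutil.disk_usage(path)
    return usage.used * 100 / usage.total


def should_compact(search_pct, run_pct, threshold):
    """Mirror the early return: skip only when BOTH values are below the threshold."""
    return not (search_pct < threshold and run_pct < threshold)


# With the cron entry's "-d 50": nothing happens while both disks are under 50%.
print(should_compact(40, 40, 50))  # False
# Either disk reaching the threshold is enough to trigger the compaction.
print(should_compact(60, 10, 50))  # True
```

The point of the gate is that the hourly cron job is cheap when there is disk headroom and only does real work once space gets tight.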

And in xapian_archive.pl:

my $Percent = $Opts{P} || 20;
[...]

    foreach my $user (sort keys %$DataUsage) {
        my $au = $ArchiveUsage->{$user} || 1;
        my $du = $DataUsage->{$user} || 1;
        if ($du < 5000) {
            print "Too small $user ($du)\n";
            next;
        }
        my $This = int($du * 100 / $au);
        if ($This < $Percent) {
            print "Not enough dirty $user: ($du, $au)\n";
            next;
        }
        print "Recompacting $user: ($du, $au)\n";
        my @args = (-z => 'archive', -t => 'data,archive');
[...]
 
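In summary: repack data down to archive once a user's data tier is at least 1/5 (20% by default, adjustable with -P) of the size of their existing archive, skipping users whose data usage is below 5000. Each of these scripts is a wrapper around squatter to help it run automatically.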
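
To make the thresholds concrete, here is a small Python restatement of the per-user decision in the xapian_archive.pl excerpt. It is a sketch only: the function name and return strings are hypothetical, and the usage numbers are in whatever unit the script's usage tables use, which the excerpt does not show.

```python
def archive_decision(data_usage, archive_usage, percent=20, min_data=5000):
    """Decide whether a user's data tier should be repacked into archive."""
    du = data_usage or 1      # same fallback as "|| 1" in the Perl
    au = archive_usage or 1
    if du < min_data:
        return "too small"
    # Integer percentage of data relative to archive, truncated like Perl's int().
    if int(du * 100 / au) < percent:
        return "not enough dirty"
    return "recompact"


print(archive_decision(4000, 100000))   # too small
print(archive_decision(10000, 100000))  # not enough dirty (10%)
print(archive_decision(20000, 100000))  # recompact (20%)
print(archive_decision(6000, None))     # recompact (no archive yet)
```

Note that a user with no archive at all is always repacked once their data tier passes the minimum size, because the missing archive usage falls back to 1.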

Bron.


On Mon, Feb 11, 2019, at 21:55, Egoitz Aurrekoetxea wrote:
> I've now noticed that, to move data between Xapian databases, you need to run something like:
> 
> sudo -u cyrus /usr/cyrus/bin/squatter -C /usr/local/etc/imapd.conf -v -z archive -t temp,meta,data,archive -u user/egoitz at sarenet.es
> 
> Perhaps it would be better to run:
> 
> sudo -u cyrus /usr/cyrus/bin/squatter -C /usr/local/etc/imapd.conf -F -v -z archive -t temp,meta,data,archive -u user/egoitz at sarenet.es
> 
> But then, would having two squatter processes running at the same time, one in rolling mode and one moving/repacking data, be an issue?
> 
> Thanks mates!!
> 
> ---
>  
> sarenet
> *Egoitz Aurrekoetxea*
> Systems Department
> 944 209 470
> Parque Tecnológico. Edificio 103
> 48170 Zamudio (Bizkaia)
> egoitz at sarenet.es
> www.sarenet.es
> 
> Before printing this email, consider whether it is necessary to do so.
> 


> On 11-02-2019 11:22, Egoitz Aurrekoetxea wrote:
>> Hi Bron,
>> 
>> So it would be worthwhile to run it once a day, for instance in the EVENTS section of cyrus.conf:
>> 
>> repack_xapian cmd="squatter -F" at=0200
>> 
>> Is it necessary to stop the other rolling squatter we run, defined in the same cyrus.conf as:
>> 
>> START {
>>  # do not delete this entry!
>>  recover cmd="ctl_cyrusdb -r"
>> 
>>  squatter cmd="squatter -R"
>> }
>> 
>> Thank you so much for all the clarifications mate :) really :)
>> 
>> Cheers!


>> On 11-02-2019 10:23, Bron Gondwana wrote:
>>> Conversations.db is an index over lots of interesting bits of the message, but the key part that's used by Xapian is the mapping from G key (aka: GUID, aka: sha1 of the message RFC822 data) to individual email. It's used for deduplication and for mapping from results to messages.
>>>  
>>> The data in conversations.db is added and removed in real time as messages are appended and updated in the cyrus.index.
>>>  
>>> The data in the xapian databases on the other hand is append only - so you can wind up with hits that no longer map to existing emails. The way to solve that is with a xapian repack that filters messages - which can be done using the -F flag to squatter.
>>>  
>>> Cheers,
>>>  
>>> Bron.
>>>  
>>> On Sat, Feb 9, 2019, at 23:04, Egoitz Aurrekoetxea wrote:
>>>> Good morning,
>>>> 
>>>> As far as I understand, for Xapian to work you first create its conversations database. Later you create database(s) for each mailbox that Xapian can search in. You can move data between them, and new mail gets indexed, for instance by squatter in rolling mode. That's OK, and understood I think. I was wondering: what happens when mail indexed in the archive database is removed and no longer exists in the mailbox? Does squatter's rolling mode manage that too?
>>>> 
>>>> By the way, if mail gets indexed in the tier databases (for instance at Fastmail in temp, meta, data, archive...), what is the role or function of the conversations databases you create with ctl_conversationsdb -b -r?
>>>> 
>>>> Cheers!


>>>> ----
>>>> Cyrus Home Page: http://www.cyrusimap.org/
>>>> List Archives/Info: http://lists.andrew.cmu.edu/pipermail/info-cyrus/
>>>> To Unsubscribe:
>>>> https://lists.andrew.cmu.edu/mailman/listinfo/info-cyrus
>>>  
>>> --
>>>  Bron Gondwana, CEO, FastMail Pty Ltd
>>>  brong at fastmailteam.com
>>>  
>>>  
>>> 

--
 Bron Gondwana, CEO, FastMail Pty Ltd
 brong at fastmailteam.com
