squatter lower limits

Robert Stepanek rsto at fastmailteam.com
Fri Sep 15 05:45:02 EDT 2017


Hi,

I haven't worked with the squat backend, but I highly recommend
switching to the latest stable 3.0 branch and the Xapian backend, if you
can.

The Xapian backend does not discriminate between short and long strings,
provides stemming and multi-tiered search databases, among other
improvements. Note that the squat backend and the squatter tool share
the same name, but have really become two separate things, i.e. you'd
still use the squatter executable to index Xapian-backed search databases.
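
To picture what the SQUAT_WORD_SIZE limit quoted below would mean in
practice, here is a minimal sketch in C. It is not the Cyrus code:
choose_search_path and the search_path enum are made-up names for
illustration, and the fall-back to a full scan is an assumption based on
the timings you describe, not something I have checked in the source.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Same value as the SQUAT_WORD_SIZE define quoted from squat.h. */
#define SQUAT_WORD_SIZE 4

/* Hypothetical enum, not part of the Cyrus API. */
enum search_path {
    SEARCH_VIA_INDEX,   /* query is long enough for an index lookup */
    SEARCH_FULL_SCAN    /* assumed fall-back: read the messages themselves */
};

/* Hypothetical helper: decide how a query would be served. */
static enum search_path choose_search_path(const char *query)
{
    assert(query != NULL);

    /* A SQUAT "word" is a run of SQUAT_WORD_SIZE arbitrary bytes, so a
     * shorter query cannot be looked up in the index at all. */
    if (strlen(query) < (size_t)SQUAT_WORD_SIZE)
        return SEARCH_FULL_SCAN;

    return SEARCH_VIA_INDEX;
}
```

With this sketch a 3-letter query such as "abc" ends up on the slow
path, while anything of 4 bytes or more can use the index, which would
be consistent with the difference you measured.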

Bron wrote up a description of the search setup in 2014:
https://blog.fastmail.com/2014/12/01/email-search-system/

Cheers,
Robert


On Fri, Sep 15, 2017, at 11:35, Paolo Cravero wrote:
> Hello.
> 
> While looking to do low-level disk usage optimization, some simple
> performance tests relied on full-text searches (2.4 branch). Metadata
> always resides on local disks, while messages are on slower hardware.
> 
> I noticed that full-text searches with short strings take much longer
> than searches with longer ones. For example, a FT search on 3 letters
> takes more than 60 seconds, while a 9-letter string on the same corpus
> takes about 20 seconds. These tests have been repeated over and over
> again to exclude disk caching being the culprit: reversing the search
> order (longer first) has no impact.
> 
> So I opened up the cyrus source code and looked for search-related code.
> As I understand it, squatter is not used if the search string is shorter
> than 4 symbols. From squat.h it's quite clear:
> 
> /*
> Don't change this unless you're SURE you know what you're doing.
> Its only effect on the API is that searches for strings that are
> shorter than SQUAT_WORD_SIZE are not allowed.
> In SQUAT, a 'word' simply refers to a string of SQUAT_WORD_SIZE
> arbitrary bytes.
> */
> 
> #define SQUAT_WORD_SIZE 4
> 
> So, a question for whoever knows the squatter implementation in Cyrus:
> is this lower limit applied to all searches? Body, subject, address(es)?
> 
> And does this lower bound still apply to the 3.0 branch and the new
> Xapian indexing engine?
> 
> Leaving aside low-level disk compression or optimization, a client might
> not handle long search times well if no data arrives on the IMAP channel,
> and drop the connection (or a network device could do it). So, if
> searching for short strings means reading all raw message files, I should
> warn users through the client interface of possible failures since the
> mail corpus keeps growing and growing and growing. That's until we
> upgrade to 3.0, if that helps.
> 
> Thanks,
> 
> Paolo

