squatter lower limits

Paolo Cravero paolo.cravero at csi.it
Fri Sep 15 05:35:21 EDT 2017


Hello.

While looking into low-level disk usage optimization, I ran some simple performance tests based on full-text searches (2.4 branch). Metadata always resides on local disks, while messages are on slower hardware.

I noticed that full-text searches for short strings take much longer than searches for longer strings. For example, a full-text search on 3 letters takes more than 60 seconds, while a 9-letter string on the same corpus takes about 20 seconds. I repeated these tests many times to rule out disk caching as the culprit: reversing the search order (longer string first) has no impact.

So I opened up the cyrus source code and looked for search-related code. As I understand it, squatter is not used if the search string is shorter than 4 symbols. From squat.h it's quite clear:

/*
Don't change this unless you're SURE you know what you're doing.
Its only effect on the API is that searches for strings that are
shorter than SQUAT_WORD_SIZE are not allowed.
In SQUAT, a 'word' simply refers to a string of SQUAT_WORD_SIZE
arbitrary bytes.
*/

#define SQUAT_WORD_SIZE 4
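
To make my reading of that comment concrete, here is a minimal sketch of the length check it seems to imply. The helper name squat_can_search is hypothetical and does not come from the Cyrus source; the only thing taken from squat.h is the value of SQUAT_WORD_SIZE. I am also assuming that the length is counted in bytes, as the comment says.

```c
#include <stddef.h>
#include <string.h>

/* Value quoted from squat.h. */
#define SQUAT_WORD_SIZE 4

/*
 * Hypothetical helper, NOT part of the Cyrus source.
 * Returns 1 if the search string is at least SQUAT_WORD_SIZE bytes long,
 * i.e. long enough for a SQUAT index lookup according to the comment in
 * squat.h. Returns 0 for shorter strings and for NULL.
 */
static int squat_can_search(const char *needle)
{
    if (needle == NULL)
        return 0;
    return strlen(needle) >= (size_t)SQUAT_WORD_SIZE;
}
```

Under this assumption a 3-letter string is rejected by the check while a 9-letter string passes it, which would match the timings I measured if the short search falls back to reading the raw messages.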

So, a question for anyone who knows the squatter implementation in Cyrus: is this lower limit applied to all searches? Body, subject, addresses?

And does this lower bound still apply to the 3.0 branch and the new Xapian indexing engine?

Leaving aside low-level disk compression or optimization, a client might not cope well with long search times during which no data arrives on the IMAP channel, and might drop the connection (or a network device could do so). So, if searching for short strings means reading all raw message files, I should warn users through the client interface of possible failures, since the mail corpus keeps growing. That is, until we upgrade to 3.0, if that helps.

Thanks,

Paolo

