Truncated text during Xapian indexing
rsto at fastmailteam.com
Thu Feb 15 10:12:23 EST 2018
On Thu, Feb 15, 2018, at 13:08, Sebastian Hagedorn wrote:
> Is the setting "search_skipdiacrit" in imapd.conf honored during the
> indexing or is that only relevant while searching? Given your comment
> regarding search normalization above I take it Umlaut characters are not
> considered diacriticals? It's not a huge issue, but as a German university
> it would be nice for our users if a search could distinguish between
> "hatte" and "hätte", as an example.
Cyrus does treat umlauts as diacritics (I glossed over that in my previous comment because of the default settings). The search_skipdiacrit setting applies to both indexing and search.
As an example, let's append two emails to a mailbox. The body of message 1 contains the German verb "gären". Message 2 contains the verb "garen" (for the non-German speakers: these verbs mean two different things).
With search_skipdiacrit set to true (the default), this is what lands in the Xapian database:
[...] Zgaren garen
and searches for "garen" and "gären" will both match both messages.
With search_skipdiacrit set to false, however, we get
[...] Zgaren Zgären garen gären
and searches for "garen" and "gären" will only match the respective messages.
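To make the behaviour above concrete, here is a minimal Python sketch of the idea. It is not Cyrus code: the functions strip_diacritics, index_terms and search are hypothetical names, the diacritic removal is done with Unicode decomposition as an assumption about how such normalization can work, and stemming (the Z terms) is left out entirely.

```python
import unicodedata

def strip_diacritics(text: str) -> str:
    # Decompose characters (e.g. "ä" becomes "a" + combining diaeresis),
    # then drop the combining marks.
    decomposed = unicodedata.normalize("NFKD", text)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))

def index_terms(words, skip_diacrit=True):
    # Hypothetical term generation: lowercase, optionally strip diacritics.
    terms = set()
    for word in words:
        term = word.lower()
        if skip_diacrit:
            term = strip_diacritics(term)
        terms.add(term)
    return terms

# Message 1 contains "gären", message 2 contains "garen".
messages = {1: ["gären"], 2: ["garen"]}

def search(query, skip_diacrit):
    # The same normalization is applied to the query and to the indexed text.
    wanted = index_terms([query], skip_diacrit)
    return sorted(
        msg_id
        for msg_id, words in messages.items()
        if wanted & index_terms(words, skip_diacrit)
    )

print(search("gären", True))   # both messages match
print(search("gären", False))  # only message 1 matches
```

The point of the sketch is only that the same normalization has to be applied on both sides, which is why the setting affects indexing as well as search.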
I uploaded a new test to Cassandane that demonstrates this (the subject_isutf8 test case might also be of interest). I'd only deactivate search_skipdiacrit if you are sure that your users will benefit from it. If in doubt, I would rather err on the safe side and return false positives by skipping diacritics (the default).
There's more to say about the Z prefixes, which mark stemmed terms: Cyrus currently uses the English stemmer for all text, so for non-English text the stemmed terms are typically identical to the original, unstemmed words. While this might seem odd, it's the best we can do without proper language detection for both indexing and search. I implemented multi-language stem support in an experimental feature branch, but haven't yet resolved the issues around fingerprinting search queries. There's an open issue to track this.
> Just out of curiosity, how is the mapping between a Xapian docid and a
> message file on disk achieved? I played around with xapian-delve and the
> Perl example simplesearch.pl. When I search a term, I get a list of
> docid's, but how do I know which message that is?
In 3.x, Cyrus search stores an internal unique message id, called guid, as docid in Xapian. The guid currently is a SHA-1 hash of the raw message, which allows deduplication and avoids re-indexing messages that have already been seen. The conversations.db of a user maps this guid to a list of mailbox:UID pairs.
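As a rough illustration of that scheme, the following Python sketch computes a SHA-1 digest over the raw message bytes and keeps a guid to mailbox:UID mapping. The dictionary is only a hypothetical in-memory stand-in for conversations.db, and message_guid and record_message are made-up names, not Cyrus functions.

```python
import hashlib

def message_guid(raw_message: bytes) -> str:
    # The guid is described above as a SHA-1 hash of the raw message.
    return hashlib.sha1(raw_message).hexdigest()

# Hypothetical stand-in for conversations.db: guid -> list of (mailbox, uid).
guid_to_locations = {}

def record_message(raw_message: bytes, mailbox: str, uid: int):
    guid = message_guid(raw_message)
    # If the guid is already known, the content was indexed before
    # and would not need to be indexed again.
    already_seen = guid in guid_to_locations
    guid_to_locations.setdefault(guid, []).append((mailbox, uid))
    return guid, already_seen

raw = b"Subject: test\r\n\r\ngaren\r\n"
guid, seen = record_message(raw, "INBOX", 1)
guid_copy, seen_copy = record_message(raw, "Archive", 7)

print(guid == guid_copy, seen, seen_copy)
print(guid_to_locations[guid])
```

Two copies of the same message in different mailboxes end up under one guid, which is what makes the docid returned by Xapian resolvable to several mailbox:UID pairs.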
Off the top of my head, there currently isn't an "official" way in Cyrus to retrieve the mailbox:UID list for a given guid outside the Cyrus process. Depending on your use case, you could either 1.) build your custom mapper on imap/conversations.h, 2.) use cvt_cyrusdb to dump the contents of a conversations.db into plain text, or 3.) use the JMAP layer to fetch the JMAP-formatted message or the raw message blob by id. For JMAP email, use the guid and prefix it with 'M' in an Email/get method. For blobs, use 'G' as prefix. Both are "unofficial": we might change the JMAP id scheme in future releases. But I guess this isn't going to happen any time soon, if ever.
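For option 3, a request body could be put together roughly as in the Python sketch below. Treat it as an assumption-laden illustration: the helper names, the account id "alice" and the call id "c0" are hypothetical, the capability URNs are those of the published JMAP specifications and may not match what a given Cyrus version expects, and the id prefixes are simply the unofficial scheme described above. The sketch only builds the JSON and does not contact a server.

```python
import json

def jmap_email_id(guid: str) -> str:
    # Unofficial scheme described above: 'M' + guid for JMAP emails.
    return "M" + guid

def jmap_blob_id(guid: str) -> str:
    # Unofficial scheme described above: 'G' + guid for raw message blobs.
    return "G" + guid

def email_get_request(account_id: str, guid: str) -> dict:
    # Builds the request body for a single Email/get method call.
    return {
        "using": ["urn:ietf:params:jmap:core", "urn:ietf:params:jmap:mail"],
        "methodCalls": [
            ["Email/get",
             {"accountId": account_id, "ids": [jmap_email_id(guid)]},
             "c0"],
        ],
    }

# Example guid: the SHA-1 digest of an empty byte string, used as a placeholder.
example_guid = "da39a3ee5e6b4b0d3255bfef95601890afd80709"
request = email_get_request("alice", example_guid)
print(json.dumps(request, indent=2))
```

The resulting JSON would be POSTed to the server's JMAP endpoint with suitable authentication; the endpoint URL depends on the installation and is deliberately left out here.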
Hope it helps,