Truncated text during Xapian indexing
Hagedorn at uni-koeln.de
Thu Feb 15 07:08:27 EST 2018
--On 15 February 2018 at 11:20:32 +0100 Robert Stepanek
<rsto at fastmailteam.com> wrote:
> On Thu, Feb 15, 2018, at 10:44, Sebastian Hagedorn wrote:
>> ^Simon^: Is that the first 4Mb of the text/html and/or text/plain parts,
>> or first 4Mb of the entire message body, ignoring any mime parts?
> This limit defines the maximum byte length per MIME body-part of type
> "text". The byte length is calculated after decoding (e.g.
> quoted-printable), conversion to UTF-8 and search text normalisation
> (e.g. stripping HTML tags, replacing Umlaut characters with their ASCII
> counterparts, etc.). Actually, it also applies to any other
> search-indexed fields, such as subjects, headers, etc., but in practice
> it is only relevant for mail bodies.
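To make the order of operations described above concrete, here is a minimal Python sketch. It is not Cyrus code: the function names, the default charset, and the 4 MB figure (taken from the question quoted above) are assumptions for illustration, the tag stripping is deliberately crude, and diacritic folding is left out for brevity. The point is only that the limit is applied to the UTF-8 bytes that remain after decoding and normalisation, not to the raw MIME part.

```python
import quopri
import re

# Assumed limit for illustration; the real value comes from the server configuration.
MAX_BYTES = 4 * 1024 * 1024


def strip_tags(text):
    """Crude HTML tag removal, standing in for real search text normalisation."""
    return re.sub(r"<[^>]+>", " ", text)


def indexable_text(raw_part, charset="iso-8859-1", limit=MAX_BYTES):
    """Return the text of one MIME text part that would fit under the limit.

    Steps: decode quoted-printable, decode the charset, normalise,
    re-encode as UTF-8, then cut at `limit` bytes.
    """
    decoded = quopri.decodestring(raw_part)
    text = decoded.decode(charset, errors="replace")
    text = " ".join(strip_tags(text).split())
    utf8 = text.encode("utf-8")
    # errors="ignore" drops a multi-byte character split by the cut.
    return utf8[:limit].decode("utf-8", errors="ignore")
```

For example, `indexable_text(b"<p>h=E4tte</p>")` yields the six UTF-8 bytes of "hätte", so a part's raw size and its counted size can differ in either direction.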
Thanks. I suppose in practice that is good enough™️
While we're at it, maybe you can answer some other questions regarding
Is the setting "search_skipdiacrit" in imapd.conf honored during the
indexing or is that only relevant while searching? Given your comment
regarding search normalization above I take it Umlaut characters are not
considered diacriticals? It's not a huge issue, but as a German university
it would be nice for our users if a search could distinguish between
"hatte" and "hätte", as an example.
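To illustrate why this matters, here is a small Python sketch of diacritic folding. It makes no claim about how Cyrus or Xapian implement "search_skipdiacrit"; `fold_diacritics`, `index_key` and the `skip_diacritics` parameter are hypothetical names, and Unicode NFKD decomposition is just one common way to do the folding.

```python
import unicodedata


def fold_diacritics(term):
    """Decompose characters (NFKD) and drop the combining marks."""
    decomposed = unicodedata.normalize("NFKD", term)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


def index_key(term, skip_diacritics):
    """Hypothetical key under which a term would be stored or looked up."""
    term = term.lower()
    return fold_diacritics(term) if skip_diacritics else term


# With folding, both words collapse to the same key and cannot be told apart.
print(index_key("hätte", True) == index_key("hatte", True))
# Without folding, the keys differ, so a search could distinguish them.
print(index_key("hätte", False) == index_key("hatte", False))
```

If folding already happens at indexing time, switching it off only at search time would not bring the distinction back, which is why the question of when the setting is honored matters.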
Just out of curiosity, how is the mapping between a Xapian docid and a
message file on disk achieved? I played around with xapian-delve and the
Perl example simplesearch.pl. When I search for a term, I get a list of
docids, but how do I know which message each one refers to?
.:.Sebastian Hagedorn - Weyertal 121 (Gebäude 133), Zimmer 2.02.:.
.:.Regionales Rechenzentrum (RRZK).:.
.:.Universität zu Köln / Cologne University - ✆ +49-221-470-89578.:.