Truncated text during Xapian indexing

Sebastian Hagedorn Hagedorn at
Thu Feb 15 07:08:27 EST 2018

--On 15. Februar 2018 um 11:20:32 +0100 Robert Stepanek 
<rsto at> wrote:

> On Thu, Feb 15, 2018, at 10:44, Sebastian Hagedorn wrote:
>> ^Simon^: Is that the first 4Mb of the text/html and/or text/plain parts,
>> or first 4Mb of the entire message body, ignoring any mime parts?
> This limit defines the maximum byte length per MIME body-part of type
> "text". The byte length is calculated after decoding (e.g.
> quoted-printable), conversion to UTF-8 and search text normalisation
> (e.g. stripping HTML tags, replacing Umlaut characters with their ASCII
> counterparts, etc.). Actually, it also applies to any other
> search-indexed fields, such as subjects, headers, etc., but in practice
> it is only relevant for mail bodies.
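
The order of operations described above (decode, convert to UTF-8, normalise, then apply the byte limit) can be sketched roughly as follows. This is only an illustration in Python, not the Cyrus implementation: the function name normalise_part, the regex-based tag stripping and the default charset are assumptions made for the example, and MAX_BYTES simply mirrors the 4 MB figure from this thread.

```python
import quopri
import re

# Assumed value: the 4 MB limit discussed in this thread.
MAX_BYTES = 4 * 1024 * 1024


def normalise_part(raw: bytes, charset: str = "iso-8859-1",
                   limit: int = MAX_BYTES) -> bytes:
    """Hypothetical sketch: return the bytes that would count towards the limit."""
    decoded = quopri.decodestring(raw)                # undo quoted-printable
    text = decoded.decode(charset, errors="replace")  # charset -> Unicode
    text = re.sub(r"<[^>]+>", " ", text)              # crude HTML tag stripping
    text = " ".join(text.split())                     # collapse whitespace
    data = text.encode("utf-8")                       # length is measured in UTF-8
    # Truncate after normalisation; a plain slice may cut a multi-byte character.
    return data[:limit]


print(len(normalise_part(b"<p>K=F6ln</p>")))
```

The point of the sketch is that markup and transfer encoding do not count towards the limit, only the normalised text does.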

Thanks. I suppose in practice that is good enough™️

While we're at it, maybe you can answer some other questions:

Is the setting "search_skipdiacrit" in imapd.conf honored during the 
indexing or is that only relevant while searching? Given your comment 
regarding search normalization above I take it Umlaut characters are not 
considered diacriticals? It's not a huge issue, but as a German university 
it would be nice for our users if a search could distinguish between 
"hatte" and "hätte", as an example.
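
To illustrate why the two words collapse once diacritics are removed, here is a small Python sketch. It uses Unicode decomposition and is an assumption for illustration only; I don't know whether Cyrus folds characters the same way, and strip_diacritics is a made-up helper name.

```python
import unicodedata


def strip_diacritics(text: str) -> str:
    """Hypothetical helper: drop combining marks after canonical decomposition."""
    decomposed = unicodedata.normalize("NFD", text)  # "ä" -> "a" + U+0308
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


# Both words map to the same index term once the diaeresis is removed.
print(strip_diacritics("hätte") == strip_diacritics("hatte"))  # True
```

If folding like this happens at indexing time, the distinction is already gone from the index and cannot be restored at search time.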

Just out of curiosity, how is the mapping between a Xapian docid and a 
message file on disk achieved? I played around with xapian-delve and the 
Perl example. When I search for a term, I get a list of docids, but how
do I know which message each one refers to?

    .:.Sebastian Hagedorn - Weyertal 121 (Gebäude 133), Zimmer 2.02.:.
                 .:.Regionales Rechenzentrum (RRZK).:.
   .:.Universität zu Köln / Cologne University - ✆ +49-221-470-89578.:.
