RFC: Charset Conversion Routines
Bron Gondwana
brong at fastmail.fm
Mon Feb 23 22:17:26 EST 2009
I'm in the process of rewriting lib/mkchartable.c
and lib/charset.c, with the eventual goal being a more
flexible charset conversion API that can be used to
make sieve rules match on the decoded values, among
other funky things.
It turns out to be quite a lot of changes. My initial
work in progress is up here:
http://github.com/brong/cyrus-imapd/commit/863b5b51dd27f184fa00de4ec5a6aca3308fc30e
As you can see, it's quite a bit of code.
Anyway - I'd like some feedback on a couple of things:
a) It's going to use a little more CPU this way. Instead
of having one table that converts _directly_ from the
source charset to UTF-8 in search-canonical form, it
first converts to Unicode characters (16 bit), then a
second table converts each of those into a stream of zero
to 15 characters (yes, something expands to 15 separate
codepoints; no, I don't want to know what it is!).
Finally, a third pass converts the character codepoints
to UTF-8.
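
For illustration, the three passes could be sketched roughly as below.
This is not the code from the commit: the function names are made up,
the source charset is assumed to be ISO-8859-1, and the canonicalisation
rules (lowercase ASCII, expand U+00DF to "ss", drop the soft hyphen) are
toy stand-ins for the real generated tables.

```c
#include <stddef.h>
#include <stdint.h>

#define MAX_EXPAND 15   /* longest expansion mentioned above */

/* Pass 1 (toy): an ISO-8859-1 byte is already its own code point. */
static uint16_t latin1_to_ucs2(unsigned char byte)
{
    return (uint16_t)byte;
}

/* Pass 2 (toy rules, NOT the real table): map one code point to
 * zero or more canonical code points.  Returns how many were written. */
static size_t canon_expand(uint16_t cp, uint16_t out[MAX_EXPAND])
{
    if (cp == 0x00AD)               /* soft hyphen: expands to nothing */
        return 0;
    if (cp == 0x00DF) {             /* sharp s: expands to two code points */
        out[0] = 's';
        out[1] = 's';
        return 2;
    }
    if (cp >= 'A' && cp <= 'Z') {   /* ASCII upper case: fold to lower */
        out[0] = (uint16_t)(cp + ('a' - 'A'));
        return 1;
    }
    out[0] = cp;                    /* everything else passes through */
    return 1;
}

/* Pass 3: encode one 16 bit code point as 1 to 3 bytes of UTF-8.
 * Surrogates are not treated specially in this sketch. */
static size_t ucs2_to_utf8(uint16_t cp, char *out)
{
    if (cp < 0x80) {
        out[0] = (char)cp;
        return 1;
    }
    if (cp < 0x800) {
        out[0] = (char)(0xC0 | (cp >> 6));
        out[1] = (char)(0x80 | (cp & 0x3F));
        return 2;
    }
    out[0] = (char)(0xE0 | (cp >> 12));
    out[1] = (char)(0x80 | ((cp >> 6) & 0x3F));
    out[2] = (char)(0x80 | (cp & 0x3F));
    return 3;
}

/* Chain the three passes.  dst needs room for the worst case of
 * 3 * MAX_EXPAND * srclen + 1 bytes.  Returns bytes written, not
 * counting the terminating NUL. */
size_t convert_latin1_canon_utf8(const unsigned char *src, size_t srclen,
                                 char *dst)
{
    size_t n = 0;

    for (size_t i = 0; i < srclen; i++) {
        uint16_t expanded[MAX_EXPAND];
        size_t count = canon_expand(latin1_to_ucs2(src[i]), expanded);

        for (size_t j = 0; j < count; j++)
            n += ucs2_to_utf8(expanded[j], dst + n);
    }
    dst[n] = '\0';
    return n;
}
```

The extra CPU cost is visible in the inner loop: every input character
goes through two table steps and an encode step instead of one lookup.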
b) Should we make this 32 bit Unicode characters while we're
at it, and extend the UTF-8 converter to match?
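
As a rough idea of what extending the converter would involve, here is
a sketch of a UTF-8 encoder that takes 32 bit code points. The name
utf8_encode32 is hypothetical and not from the commit; the 4 byte
form and the U+10FFFF limit follow the standard UTF-8 definition.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper: encode one code point as UTF-8.
 * out must have room for 4 bytes.
 * Returns the number of bytes written (1 to 4), or 0 if cp is not
 * encodable (a surrogate, or above U+10FFFF). */
size_t utf8_encode32(uint32_t cp, unsigned char *out)
{
    if (cp < 0x80) {
        out[0] = (unsigned char)cp;
        return 1;
    }
    if (cp < 0x800) {
        out[0] = (unsigned char)(0xC0 | (cp >> 6));
        out[1] = (unsigned char)(0x80 | (cp & 0x3F));
        return 2;
    }
    if (cp < 0x10000) {
        if (cp >= 0xD800 && cp <= 0xDFFF)
            return 0;               /* surrogates are not characters */
        out[0] = (unsigned char)(0xE0 | (cp >> 12));
        out[1] = (unsigned char)(0x80 | ((cp >> 6) & 0x3F));
        out[2] = (unsigned char)(0x80 | (cp & 0x3F));
        return 3;
    }
    if (cp <= 0x10FFFF) {
        /* the only new case compared with a 16 bit encoder */
        out[0] = (unsigned char)(0xF0 | (cp >> 18));
        out[1] = (unsigned char)(0x80 | ((cp >> 12) & 0x3F));
        out[2] = (unsigned char)(0x80 | ((cp >> 6) & 0x3F));
        out[3] = (unsigned char)(0x80 | (cp & 0x3F));
        return 4;
    }
    return 0;                       /* beyond the Unicode range */
}
```

The encoder side is a small change; the larger cost would presumably
be in the tables, which would need wider entries.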
c) For that matter, should we just be outsourcing all this
crap to another library? Does anyone know a good library
that can do what Cyrus does (take one character at a time
and keep state)?
d) Whitespace compression. I'm currently mapping all
whitespace to ' ' instead of '', and then either stripping
all ' ' from the string, or only outputting one if the
previous character in the output string was not a space.
Rob tells me that there are some issues with Asian charsets,
where space doesn't carry any meaning. How best to handle that?
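
The two behaviours described in (d) can be sketched as follows. The
function and enum names are invented for the example, and the input is
assumed to have already had every whitespace character mapped to ' '.

```c
#include <stddef.h>

/* Hypothetical mode names, not taken from the commit. */
enum space_mode {
    SPACE_STRIP,      /* remove every ' ' */
    SPACE_COLLAPSE    /* keep a ' ' only if the previous output
                       * character was not a ' ' */
};

/* Rewrite s in place and return its new length. */
size_t squeeze_spaces(char *s, enum space_mode mode)
{
    size_t out = 0;

    for (size_t in = 0; s[in] != '\0'; in++) {
        if (s[in] == ' ') {
            if (mode == SPACE_STRIP)
                continue;
            if (out > 0 && s[out - 1] == ' ')
                continue;           /* previous output was a space */
        }
        s[out++] = s[in];
    }
    s[out] = '\0';
    return out;
}
```

Note that in collapse mode a single leading space survives, because
there is no previous output character to compare against. Neither mode
does anything about the Asian charset question; that would need a
decision on which characters count as meaningful separators.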
e) Interfaces, interfaces, interfaces. At the moment we have:
  * charset_compilepat - for use in:
    * charset_searchstring
    * charset_searchfile
  * charset_decode_mimebody - and
  * charset_encode_mimebody
  * charset_extractfile
The implementation I'm working on adds "int flags" as an
extra parameter to each of these, allowing CHARSET_CANON
and CHARSET_STRIPSPACE to be passed down to the translation
layer. Would people be happy with that as an interface? It's
somewhat invasive, needing changes through lots of imap/*.c and
sieve/*.c files.
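
To make the shape of the proposal concrete, here is a sketch of a
flags parameter being passed to a translation step. Only the two flag
names come from the text above. Their values, the sketch_translate
function, and its ASCII-only behaviour are assumptions for the example
and do not reflect the real charset_* signatures.

```c
#include <ctype.h>
#include <stdlib.h>
#include <string.h>

/* Flag names from the proposal; the bit values are assumed. */
#define CHARSET_CANON      (1 << 0)
#define CHARSET_STRIPSPACE (1 << 1)

/* Hypothetical stand-in for the translation layer, ASCII only.
 * CHARSET_CANON is approximated here by lowercasing;
 * CHARSET_STRIPSPACE drops whitespace entirely.
 * Returns a malloc'd string the caller must free, or NULL. */
char *sketch_translate(const char *src, int flags)
{
    char *dst = malloc(strlen(src) + 1);
    size_t n = 0;

    if (dst == NULL)
        return NULL;

    for (; *src != '\0'; src++) {
        unsigned char c = (unsigned char)*src;

        if ((flags & CHARSET_STRIPSPACE) && isspace(c))
            continue;
        if (flags & CHARSET_CANON)
            c = (unsigned char)tolower(c);
        dst[n++] = (char)c;
    }
    dst[n] = '\0';
    return dst;
}
```

A search caller would pass CHARSET_CANON | CHARSET_STRIPSPACE, while a
caller that wants the decoded text untouched would pass 0, so each
call site states what it needs.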
Bron.