RFC: Charset Conversion Routines
Alexey Melnikov
alexey.melnikov at isode.com
Tue Feb 24 06:13:47 EST 2009
Bron Gondwana wrote:
>I'm in the process of rewriting the lib/mkchartable.c
>and lib/charset.c with the eventual goal being a more
>flexible charset conversion API that can be used to
>make sieve rules match on the decoded values, and
>other funky things.
>
>It turns out to be quite a lot of changes. My initial
>work in progress is up here:
>
>http://github.com/brong/cyrus-imapd/commit/863b5b51dd27f184fa00de4ec5a6aca3308fc30e
>
>As you can see, it's quite a bit of code.
>
>
>Anyway - I'd like some feedback on a couple of things:
>
>a) It's going to use a little more CPU this way, because
> instead of having a table that converts _directly_ from
> the source charset to UTF-8 in search-canonical-form,
> it does one conversion to Unicode characters (16-bit),
> then another table converts each of those into a stream
> of zero to 15 characters (yes, something expands to 15
> separate codepoints, no, I don't want to know what it is!)
>
> Finally a third pass converts to UTF-8 from the
> character codepoints.
>
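To make the three passes in (a) concrete, here is a minimal sketch in C. It is not taken from the linked commit: every name (stage1_decode, stage2_canon, stage3_utf8, to_search_form) is hypothetical, the only source charset handled is ISO-8859-1, and the canonicalization rules are toy rules invented for illustration, not the contents of the real tables.

```c
#include <stddef.h>
#include <stdint.h>

#define MAX_EXPANSION 15  /* longest expansion mentioned in (a) */

/* Stage 1 (toy): ISO-8859-1 byte -> 16-bit code point (identity mapping). */
static uint16_t stage1_decode(unsigned char b)
{
    return (uint16_t)b;
}

/* Stage 2 (toy rules, invented for illustration): map one code point to
 * zero or more code points. Returns the count written to out[]. */
static size_t stage2_canon(uint16_t cp, uint16_t out[MAX_EXPANSION])
{
    if (cp == 0x00AD)                 /* soft hyphen: expands to nothing */
        return 0;
    if (cp == 0x00DF) {               /* sharp s: expands to two letters */
        out[0] = 'S';
        out[1] = 'S';
        return 2;
    }
    if (cp >= 'a' && cp <= 'z')       /* ASCII lowercase -> uppercase */
        cp = (uint16_t)(cp - 'a' + 'A');
    out[0] = cp;
    return 1;
}

/* Stage 3: encode one 16-bit code point as 1 to 3 bytes of UTF-8.
 * Surrogates are not checked; stage 1 above never produces them. */
static size_t stage3_utf8(uint16_t cp, char *out)
{
    if (cp < 0x80) {
        out[0] = (char)cp;
        return 1;
    }
    if (cp < 0x800) {
        out[0] = (char)(0xC0 | (cp >> 6));
        out[1] = (char)(0x80 | (cp & 0x3F));
        return 2;
    }
    out[0] = (char)(0xE0 | (cp >> 12));
    out[1] = (char)(0x80 | ((cp >> 6) & 0x3F));
    out[2] = (char)(0x80 | (cp & 0x3F));
    return 3;
}

/* Run all three stages over n input bytes. Returns the number of bytes
 * written (excluding the terminating NUL), or (size_t)-1 if out is too
 * small. */
size_t to_search_form(const unsigned char *in, size_t n, char *out, size_t cap)
{
    size_t used = 0;
    for (size_t i = 0; i < n; i++) {
        uint16_t expanded[MAX_EXPANSION];
        size_t k = stage2_canon(stage1_decode(in[i]), expanded);
        for (size_t j = 0; j < k; j++) {
            /* Reserve room for the worst case (3 bytes) plus the NUL. */
            if (used + 3 + 1 > cap)
                return (size_t)-1;
            used += stage3_utf8(expanded[j], out + used);
        }
    }
    if (used + 1 > cap)
        return (size_t)-1;
    out[used] = '\0';
    return used;
}
```

The extra CPU cost comes from each input character passing through three small steps instead of one table lookup; the gain is that stages 1 and 3 no longer need to know anything about the canonical form.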
>b) Should we make these 32-bit Unicode characters while we're
> at it, and extend the UTF-8 converter?
>
>
Yes!
Also upgrade the tables to Unicode 5.1.0, and change the normalization
to conform to RFC 5051 (the i;unicode-casemap collation).

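
For reference, extending the UTF-8 converter to 32-bit code points mostly means adding the four-byte form and rejecting values that are not Unicode scalar values. A rough sketch follows; the function name utf8_encode is hypothetical and not an existing Cyrus function.

```c
#include <stddef.h>
#include <stdint.h>

/* Encode one Unicode scalar value as UTF-8.
 * Returns the number of bytes written (1 to 4), or 0 if cp is a
 * surrogate (U+D800..U+DFFF) or lies above U+10FFFF. */
size_t utf8_encode(uint32_t cp, unsigned char out[4])
{
    if (cp < 0x80) {
        out[0] = (unsigned char)cp;
        return 1;
    }
    if (cp < 0x800) {
        out[0] = (unsigned char)(0xC0 | (cp >> 6));
        out[1] = (unsigned char)(0x80 | (cp & 0x3F));
        return 2;
    }
    if (cp < 0x10000) {
        if (cp >= 0xD800 && cp <= 0xDFFF)
            return 0;                 /* surrogates are not encodable */
        out[0] = (unsigned char)(0xE0 | (cp >> 12));
        out[1] = (unsigned char)(0x80 | ((cp >> 6) & 0x3F));
        out[2] = (unsigned char)(0x80 | (cp & 0x3F));
        return 3;
    }
    if (cp <= 0x10FFFF) {
        /* The four-byte form is what a 16-bit-only converter lacks. */
        out[0] = (unsigned char)(0xF0 | (cp >> 18));
        out[1] = (unsigned char)(0x80 | ((cp >> 12) & 0x3F));
        out[2] = (unsigned char)(0x80 | ((cp >> 6) & 0x3F));
        out[3] = (unsigned char)(0x80 | (cp & 0x3F));
        return 4;
    }
    return 0;                         /* beyond the Unicode range */
}
```
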
>c) For that matter, should we just be outsourcing all this
> crap to another library? Does anyone know a good library
> that can do what Cyrus does (take one character at a time
> and keep state?)
>
>
I am not sure about that, but if people know a good library...
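
To spell out the requirement in (c), this is roughly the interface shape a candidate library would have to offer: the caller feeds one byte at a time and the decoder carries its state between calls. The sketch below does this for UTF-8 input only; struct utf8_state and utf8_feed are hypothetical names, and it deliberately skips checks for overlong encodings and surrogates.

```c
#include <stdint.h>

/* Decoder state carried between calls. Zero-initialize before use. */
struct utf8_state {
    uint32_t cp;       /* code point bits accumulated so far */
    int remaining;     /* continuation bytes still expected */
};

/* Feed one byte.
 * Returns  1 and sets *out when a code point is complete,
 *          0 when more input is needed,
 *         -1 on malformed input (the state is reset).
 * Overlong forms and surrogates are NOT rejected in this sketch. */
int utf8_feed(struct utf8_state *st, unsigned char b, uint32_t *out)
{
    if (st->remaining == 0) {
        if (b < 0x80) {                       /* single-byte (ASCII) */
            *out = b;
            return 1;
        }
        if ((b & 0xE0) == 0xC0) {             /* lead byte of 2-byte form */
            st->cp = b & 0x1F;
            st->remaining = 1;
        } else if ((b & 0xF0) == 0xE0) {      /* lead byte of 3-byte form */
            st->cp = b & 0x0F;
            st->remaining = 2;
        } else if ((b & 0xF8) == 0xF0) {      /* lead byte of 4-byte form */
            st->cp = b & 0x07;
            st->remaining = 3;
        } else {
            return -1;                        /* stray continuation byte */
        }
        return 0;
    }
    if ((b & 0xC0) != 0x80) {                 /* expected a continuation */
        st->remaining = 0;
        return -1;
    }
    st->cp = (st->cp << 6) | (b & 0x3F);
    if (--st->remaining > 0)
        return 0;
    *out = st->cp;
    return 1;
}
```

Because all state lives in the struct, input can arrive in arbitrary chunks (for example, split across network reads) without the caller buffering partial sequences.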