RFC: Charset Conversion Routines

Tue Feb 24 06:13:47 EST 2009

Bron Gondwana wrote:

>I'm in the process of rewriting lib/mkchartable.c
>and lib/charset.c, with the eventual goal of a more
>flexible charset conversion API that can be used to
>make Sieve rules match on the decoded values, and
>other funky things.
>
>It turns out to be quite a lot of changes.  My initial
>work in progress is up here:
>
>http://github.com/brong/cyrus-imapd/commit/863b5b51dd27f184fa00de4ec5a6aca3308fc30e
>
>As you can see, it's quite a bit of code.
>
>
>Anyway - I'd like some feedback on a couple of things:
>
>a) It's going to use a little more CPU this way, because
>   instead of having a table that converts _directly_ from
>   the source charset to UTF-8 in search-canonical form,
>   it does one conversion to Unicode characters (16-bit),
>   then another table converts each of those into a stream
>   of zero to 15 characters (yes, something expands to 15
>   separate codepoints, no, I don't want to know what it is!)
>
>   Finally a third pass converts the character codepoints
>   to UTF-8.
>
>b) Should we make this 32-bit Unicode characters while we're
>   at it, and extend the UTF-8 converter?
>  
>
Yes!
And upgrade the tables to Unicode 5.1.0.
And also change the normalization to conform to RFC 5051
(the i;unicode-casemap collation).
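
To make the "extend the UTF-8 converter" part concrete, here is a
minimal sketch of an encoder that accepts 32-bit code points. It
assumes code points are carried as uint32_t; the name utf8_encode
is hypothetical and is not taken from lib/charset.c or from the
commit above.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper (not from lib/charset.c): encode one Unicode
 * code point as UTF-8 into out, which must have room for 4 bytes.
 * Returns the number of bytes written, or 0 if cp is a surrogate
 * or lies above U+10FFFF. */
size_t utf8_encode(uint32_t cp, unsigned char *out)
{
    if (cp < 0x80) {                        /* 1 byte: 0xxxxxxx */
        out[0] = (unsigned char)cp;
        return 1;
    }
    if (cp < 0x800) {                       /* 2 bytes: 110xxxxx 10xxxxxx */
        out[0] = (unsigned char)(0xC0 | (cp >> 6));
        out[1] = (unsigned char)(0x80 | (cp & 0x3F));
        return 2;
    }
    if (cp < 0x10000) {                     /* 3 bytes, covers the BMP */
        if (cp >= 0xD800 && cp <= 0xDFFF)
            return 0;                       /* surrogates are not characters */
        out[0] = (unsigned char)(0xE0 | (cp >> 12));
        out[1] = (unsigned char)(0x80 | ((cp >> 6) & 0x3F));
        out[2] = (unsigned char)(0x80 | (cp & 0x3F));
        return 3;
    }
    if (cp <= 0x10FFFF) {                   /* 4 bytes, beyond the BMP */
        out[0] = (unsigned char)(0xF0 | (cp >> 18));
        out[1] = (unsigned char)(0x80 | ((cp >> 12) & 0x3F));
        out[2] = (unsigned char)(0x80 | ((cp >> 6) & 0x3F));
        out[3] = (unsigned char)(0x80 | (cp & 0x3F));
        return 4;
    }
    return 0;                               /* out of Unicode range */
}
```

Only the last branch is new compared with a 16-bit converter, so
the extra cost of going to 32 bits on the output side should be
small.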

>c) For that matter, should we just be outsourcing all this
>   crap to another library?  Does anyone know a good library
>   that can do what Cyrus does (take one character at a time
>   and keep state?)
>  
>
I am not sure about that, but if people know a good library...
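
For comparison when evaluating candidates, this is roughly the
shape of interface I understand "one character at a time and keep
state" to mean. It is only a sketch: the struct and function names
are hypothetical, the input charset is assumed to be UTF-8 for the
sake of the example, and it does not reject overlong forms or
surrogates.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical decoder state, kept by the caller between calls. */
struct utf8_decoder {
    uint32_t cp;        /* code point assembled so far */
    int remaining;      /* continuation bytes still expected */
};

/* Feed one input byte.
 * Returns 1 and stores a code point in *out when one is complete,
 * 0 when more input is needed, and -1 on malformed input (the
 * state is reset so the caller can keep going). */
int utf8_decoder_push(struct utf8_decoder *d, unsigned char byte,
                      uint32_t *out)
{
    if (d->remaining == 0) {
        if (byte < 0x80) {                  /* plain ASCII */
            *out = byte;
            return 1;
        }
        if ((byte & 0xE0) == 0xC0) { d->cp = byte & 0x1F; d->remaining = 1; return 0; }
        if ((byte & 0xF0) == 0xE0) { d->cp = byte & 0x0F; d->remaining = 2; return 0; }
        if ((byte & 0xF8) == 0xF0) { d->cp = byte & 0x07; d->remaining = 3; return 0; }
        return -1;                          /* stray continuation or invalid lead */
    }
    if ((byte & 0xC0) != 0x80) {            /* expected 10xxxxxx */
        d->remaining = 0;
        return -1;
    }
    d->cp = (d->cp << 6) | (byte & 0x3F);
    if (--d->remaining > 0)
        return 0;                           /* sequence not finished yet */
    *out = d->cp;
    return 1;
}
```

A library that only offers whole-buffer conversion would need a
wrapper like this around it, which is worth checking before we
commit to one.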