RFC: Charset Conversion Routines

Bron Gondwana brong at fastmail.fm
Mon Feb 23 22:17:26 EST 2009


I'm in the process of rewriting lib/mkchartable.c and
lib/charset.c, with the eventual goal of a more flexible
charset conversion API that can be used to make sieve
rules match on the decoded values, and other funky things.

It turns out to be quite a lot of changes.  My initial
work in progress is up here:

http://github.com/brong/cyrus-imapd/commit/863b5b51dd27f184fa00de4ec5a6aca3308fc30e

As you can see, it's quite a bit of code.


Anyway - I'd like some feedback on a couple of things:

a) It's going to use a little more CPU this way.  Instead of
   having one table that converts _directly_ from the source
   charset to utf-8 in search-canonical-form, it first
   converts to unicode codepoints (16bit), then a second
   table converts each of those into a stream of zero to 15
   codepoints (yes, something expands to 15 separate
   codepoints, no, I don't want to know what it is!)

   Finally a third pass converts the codepoints to utf-8.
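
To make the three passes in (a) concrete, here is a minimal sketch.
It is not the code in the commit above: the function names are
made up, the source charset is assumed to be ISO-8859-1 (where the
byte value equals the codepoint), and the canonicalisation rules
are toy ones chosen only to show a 1:1, a 1:2 and a 1:0 expansion.

```c
#include <stddef.h>
#include <stdint.h>

/* The post says one codepoint can expand to at most 15. */
#define MAX_EXPANSION 15

/* Pass 1 (assumed charset: ISO-8859-1): byte -> 16-bit codepoint. */
static uint16_t latin1_to_unicode(unsigned char byte)
{
    return (uint16_t)byte;
}

/* Pass 2: toy canonicalisation, NOT the real Cyrus table.
 * Returns how many codepoints were written to out (0..15). */
static size_t canonicalise(uint16_t cp, uint16_t out[MAX_EXPANSION])
{
    if (cp >= 'A' && cp <= 'Z') {          /* fold ASCII case */
        out[0] = (uint16_t)(cp + ('a' - 'A'));
        return 1;
    }
    if (cp == 0x00DF) {                    /* sharp s -> "ss" */
        out[0] = 's';
        out[1] = 's';
        return 2;
    }
    if (cp == 0x00AD)                      /* soft hyphen -> nothing */
        return 0;
    out[0] = cp;
    return 1;
}

/* Pass 3: 16-bit codepoint -> 1 to 3 bytes of UTF-8. */
static size_t utf8_encode16(uint16_t cp, unsigned char *out)
{
    if (cp < 0x80) {
        out[0] = (unsigned char)cp;
        return 1;
    }
    if (cp < 0x800) {
        out[0] = (unsigned char)(0xC0 | (cp >> 6));
        out[1] = (unsigned char)(0x80 | (cp & 0x3F));
        return 2;
    }
    out[0] = (unsigned char)(0xE0 | (cp >> 12));
    out[1] = (unsigned char)(0x80 | ((cp >> 6) & 0x3F));
    out[2] = (unsigned char)(0x80 | (cp & 0x3F));
    return 3;
}

/* Runs all three passes; returns the number of bytes written.
 * Stops early if fewer than 3 bytes of output space remain. */
size_t convert_latin1(const unsigned char *in, size_t inlen,
                      unsigned char *out, size_t outcap)
{
    size_t n = 0;
    for (size_t i = 0; i < inlen; i++) {
        uint16_t expanded[MAX_EXPANSION];
        size_t count = canonicalise(latin1_to_unicode(in[i]), expanded);
        for (size_t j = 0; j < count; j++) {
            if (n + 3 > outcap)
                return n;
            n += utf8_encode16(expanded[j], out + n);
        }
    }
    return n;
}
```

The extra CPU cost is visible in the inner loop: every input byte
goes through two lookups and an encode step rather than a single
table hit.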

b) Should we make this 32bit unicode characters while we're
   at it, and extend the UTF-8 converter?
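
For (b), extending the UTF-8 converter mostly means adding the
4-byte case. A minimal sketch, assuming a hypothetical function
name and the standard UTF-8 limits (nothing above U+10FFFF, no
surrogates); it is not taken from the existing Cyrus code:

```c
#include <stddef.h>
#include <stdint.h>

/* Encodes one 32-bit codepoint as UTF-8.
 * Returns the number of bytes written (1..4), or 0 if the value
 * is not a valid Unicode scalar value. */
size_t utf8_encode32(uint32_t cp, unsigned char out[4])
{
    if (cp < 0x80) {
        out[0] = (unsigned char)cp;
        return 1;
    }
    if (cp < 0x800) {
        out[0] = (unsigned char)(0xC0 | (cp >> 6));
        out[1] = (unsigned char)(0x80 | (cp & 0x3F));
        return 2;
    }
    if (cp < 0x10000) {
        if (cp >= 0xD800 && cp <= 0xDFFF)
            return 0;                      /* surrogates are not encodable */
        out[0] = (unsigned char)(0xE0 | (cp >> 12));
        out[1] = (unsigned char)(0x80 | ((cp >> 6) & 0x3F));
        out[2] = (unsigned char)(0x80 | (cp & 0x3F));
        return 3;
    }
    if (cp <= 0x10FFFF) {
        out[0] = (unsigned char)(0xF0 | (cp >> 18));
        out[1] = (unsigned char)(0x80 | ((cp >> 12) & 0x3F));
        out[2] = (unsigned char)(0x80 | ((cp >> 6) & 0x3F));
        out[3] = (unsigned char)(0x80 | (cp & 0x3F));
        return 4;
    }
    return 0;                              /* beyond the Unicode range */
}
```

With 16bit codepoints only the first three branches are reachable,
so the encoder change itself is small; the table types would be the
bigger part of the work.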

c) For that matter, should we just be outsourcing all this
   crap to another library?  Does anyone know a good library
   that can do what Cyrus does (take one character at a time
   and keep state?)

d) Whitespace compression.  I'm currently mapping all
   whitespace to ' ' instead of '', and then either stripping
   all ' ' from the string, or only outputting a ' ' if the
   previous character on the output string was not a space.
   Rob tells me that there are some issues with Asian charsets
   and space not having any meaning - how best to handle?
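
The two whitespace behaviours in (d) can be sketched as below. This
assumes the input has already had every whitespace character mapped
to ' ', as described; the enum and function names are hypothetical,
and it does nothing about the Asian charset question.

```c
#include <stddef.h>

/* Hypothetical mode selector for the two behaviours described. */
enum space_mode {
    SPACE_COLLAPSE,   /* emit ' ' only if the previous output was not ' ' */
    SPACE_STRIP       /* drop every ' ' */
};

/* Copies in to out applying the chosen mode.
 * out must have room for strlen(in) + 1 bytes.
 * Returns the length of the result. */
size_t squash_spaces(const char *in, char *out, enum space_mode mode)
{
    size_t n = 0;
    for (; *in != '\0'; in++) {
        if (*in == ' ') {
            if (mode == SPACE_STRIP)
                continue;
            if (n > 0 && out[n - 1] == ' ')
                continue;
        }
        out[n++] = *in;
    }
    out[n] = '\0';
    return n;
}
```

Note that in collapse mode a leading or trailing space survives as a
single ' ', because the rule only looks at the previous output
character.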

e) Interfaces, interfaces, interfaces.  At the moment we have:

* charset_compilepat - for use in:
  * charset_searchstring
  * charset_searchfile
* charset_decode_mimebody - and
  * charset_encode_mimebody
* charset_extractfile

My current implementation adds "int flags" as an extra
parameter to each of these, allowing CHARSET_CANON and
CHARSET_STRIPSPACE to be passed down to the translation
layer.  Would people be happy with that as an interface?  It's
somewhat invasive, needing changes through lots of imap/*.c and
sieve/*.c files.
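
As a rough illustration of the shape of that interface: the flag
names below are the ones mentioned above, but the bit values, both
function names and the ASCII-only behaviour are assumptions made
for this sketch, not the signatures in the commit.

```c
#include <ctype.h>
#include <stddef.h>

/* Flag names from the proposal; the bit values are assumed. */
#define CHARSET_CANON      (1 << 0)
#define CHARSET_STRIPSPACE (1 << 1)

/* Hypothetical translation layer, ASCII only.  Whitespace is mapped
 * to ' ' (or dropped with CHARSET_STRIPSPACE); with CHARSET_CANON
 * other characters are case-folded as a stand-in for real
 * canonicalisation. */
static size_t translate_sketch(const char *in, char *out, int flags)
{
    size_t n = 0;
    for (; *in != '\0'; in++) {
        unsigned char c = (unsigned char)*in;
        if (isspace(c)) {
            if (flags & CHARSET_STRIPSPACE)
                continue;
            c = ' ';
        } else if (flags & CHARSET_CANON) {
            c = (unsigned char)tolower(c);
        }
        out[n++] = (char)c;
    }
    out[n] = '\0';
    return n;
}

/* Hypothetical caller-facing entry point: it only forwards flags,
 * which is the part every caller in imap/ and sieve/ would see.
 * out must have room for strlen(in) + 1 bytes. */
size_t charset_convert_sketch(const char *in, char *out, int flags)
{
    return translate_sketch(in, out, flags);
}
```

A search caller would pass CHARSET_CANON | CHARSET_STRIPSPACE, while
something that wants the decoded text untouched would pass 0.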

Bron.

