RFC: Charset Conversion Routines

Wed Mar 4 23:13:58 EST 2009

On Tue, Feb 24, 2009 at 02:17:26PM +1100, Bron Gondwana wrote:
> I'm in the process of rewriting the lib/mkchartable.c 
> and lib/charset.c with the eventual goal being a more
> flexible charset conversion API that can be used to
> make sieve rules match on the decoded values, and
> other funky things.

OK - significantly more work done.  It's now working correctly
on my testbed.  Behaviour is identical to the old code, using
a reasonably large snapshot of email I had lying around.
Reconstruct creates an identical cyrus.cache, and downloads etc.
all work correctly.

There are some pathological error cases that might work
slightly differently, though I got rid of the worst of those
with a change to the couple of charset.t files that had
GOSUBs in them.

http://github.com/brong/cyrus-imapd/commit/78591d3a5c2f2ed5cd4d1bf935fffa073081198c

brong at launde:/extra/src/git/cmu/cyrus-imapd$ git diff origin/master | diffstat
 b/imap/.cvsignore            |    1
 b/imap/Makefile.in           |    6
 b/imap/cyr_charset.c         |  160
 b/lib/Makefile.in            |   15
 b/lib/charset.c              | 1794 +--
 b/lib/charset.h              |   22
 b/lib/charset/iso-2022-jp.t  |   33
 b/lib/charset/iso-2022-kr.t  |   12
 b/lib/charset/unidata5_1.txt |19336 +++++++++++++++++++++++++++++++++++++++++++
 b/lib/chartable.h            |   27
 b/lib/mkchartable.pl         |  531 +
 lib/charset/unidata2.txt     | 6629 --------------
 lib/mkchartable.c            |  974 --
 13 files changed, 20866 insertions(+), 8674 deletions(-)

Yikes!

So the changes to the .t files just convert the ESC tables
into multibyte sequences for all valid escape codes in all
mode tables, allowing the "invalid escape code" to drop you
back in the current mode again.

cyr_charset is just a little tool to allow you to see what
the input in a particular charset produces as output.

unidata5_1.txt dominates the diffstat, because it's huge.  I've
made the code able to support the latest unicode standard
including 24 bit codepoints.
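
To show why wider codepoints matter on the output side, here is a
minimal sketch of encoding one codepoint as UTF-8.  It is not taken
from charset.c: sketch_utf8_encode is a made-up name and the real
uni_init layer may work quite differently.  Unicode itself stops at
U+10FFFF, so anything above U+FFFF ends up as a four-byte sequence.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper, for illustration only (not the charset.c code).
 * Encodes one codepoint as UTF-8 into out, which must hold 4 bytes.
 * Returns the number of bytes written, or 0 if cp cannot be encoded. */
size_t sketch_utf8_encode(unsigned long cp, unsigned char *out)
{
    assert(out != NULL);

    if (cp < 0x80) {                      /* 7 bits: plain ASCII */
        out[0] = (unsigned char)cp;
        return 1;
    }
    if (cp < 0x800) {                     /* up to 11 bits: 2 bytes */
        out[0] = (unsigned char)(0xC0 | (cp >> 6));
        out[1] = (unsigned char)(0x80 | (cp & 0x3F));
        return 2;
    }
    if (cp < 0x10000) {                   /* up to 16 bits: 3 bytes */
        if (cp >= 0xD800 && cp <= 0xDFFF)
            return 0;                     /* surrogates are not characters */
        out[0] = (unsigned char)(0xE0 | (cp >> 12));
        out[1] = (unsigned char)(0x80 | ((cp >> 6) & 0x3F));
        out[2] = (unsigned char)(0x80 | (cp & 0x3F));
        return 3;
    }
    if (cp <= 0x10FFFF) {                 /* beyond 16 bits: 4 bytes */
        out[0] = (unsigned char)(0xF0 | (cp >> 18));
        out[1] = (unsigned char)(0x80 | ((cp >> 12) & 0x3F));
        out[2] = (unsigned char)(0x80 | ((cp >> 6) & 0x3F));
        out[3] = (unsigned char)(0x80 | (cp & 0x3F));
        return 4;
    }
    return 0;                             /* outside the Unicode range */
}
```

A table format that only stores 16 bit values can never reach the
last branch, which is the practical reason for widening it.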

mkchartable is rewritten in perl rather than C, because it
was so very, very much easier.  I'd be willing to convert
it back if people really, really don't want to depend on
Perl, but I'd probably *sigh* a lot.

charset.c is pretty much totally rewritten.  Git tells me
it's 70% changed, and actually logs it as a "rewrite"
when committing.  Major, major changes to how just about
everything works.

All "translations" are chainable, so you write code like
this:

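/* one converter per stage; "charset" and "s" come from the caller */
struct convert_rock *translate = qp_init();        /* quoted-printable */
struct convert_rock *decode = table_init(charset); /* charset tables */
struct convert_rock *canon = canon_init();         /* search form */
struct convert_rock *toutf8 = uni_init();          /* unicode to utf8 */
struct convert_rock *tobuffer = buffer_init(0, 0); /* collects output */

/* chain the stages: each one feeds the next */
translate->next = decode;
decode->next = canon;
canon->next = toutf8;
toutf8->next = tobuffer;

/* push the input string in at the head of the chain */
convert_cat(translate, s);

res = buffer_cstring(tobuffer);

basic_free(translate);
basic_free(decode);
basic_free(canon);
basic_free(toutf8);
buffer_free(tobuffer);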

And you have a freshly malloced cstring in "res".
It's annoying that you have to alloc and free in
so many lines, but otherwise the API is simple to
use, and easy to mix-and-match as required.  Each
layer has a "state" object, and gets called with
a single character.  It changes its state as
required, and possibly calls convert_putc on its
"next" pointer with translated characters.
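
To make the per-character chaining concrete, here is a minimal,
self-contained sketch.  It is not the code from charset.c: only the
names convert_rock, convert_putc, convert_cat and "next" come from
the description above.  The struct layout, the squeeze/upper/sink
layers and demo_chain are all assumptions made up for illustration.

```c
#include <assert.h>
#include <ctype.h>
#include <stddef.h>
#include <string.h>

/* Assumed layout; the real struct convert_rock may differ. */
struct convert_rock;
typedef void convertproc_t(struct convert_rock *rock, int c);

struct convert_rock {
    convertproc_t *f;           /* called once per input character */
    struct convert_rock *next;  /* downstream layer, NULL for the sink */
    void *state;                /* this layer's private state */
};

/* Push one character into a layer. */
static void convert_putc(struct convert_rock *rock, int c)
{
    assert(rock != NULL && rock->f != NULL);
    rock->f(rock, c);
}

/* Push a NUL-terminated string into the head of a chain. */
static void convert_cat(struct convert_rock *rock, const char *s)
{
    while (*s) convert_putc(rock, (unsigned char)*s++);
}

/* Stateful layer: collapse runs of spaces into a single space. */
struct squeeze_state { int last_was_space; };

static void squeeze_putc(struct convert_rock *rock, int c)
{
    struct squeeze_state *st = rock->state;
    if (c == ' ') {
        if (st->last_was_space) return;  /* swallowed, nothing passed on */
        st->last_was_space = 1;
    } else {
        st->last_was_space = 0;
    }
    convert_putc(rock->next, c);
}

/* Stateless layer: uppercase each character and pass it on. */
static void upper_putc(struct convert_rock *rock, int c)
{
    convert_putc(rock->next, toupper(c));
}

/* Sink: append to a fixed-size buffer, dropping overflow. */
struct sink_state { char buf[64]; size_t len; };

static void sink_putc(struct convert_rock *rock, int c)
{
    struct sink_state *st = rock->state;
    if (st->len + 1 < sizeof(st->buf)) {
        st->buf[st->len++] = (char)c;
        st->buf[st->len] = '\0';
    }
}

/* Wire squeeze -> upper -> sink, run s through, copy result to out. */
void demo_chain(const char *s, char *out, size_t outlen)
{
    struct squeeze_state sq = { 0 };
    struct sink_state sk = { {0}, 0 };
    struct convert_rock sink = { sink_putc, NULL, &sk };
    struct convert_rock upper = { upper_putc, &sink, NULL };
    struct convert_rock squeeze = { squeeze_putc, &upper, &sq };

    assert(outlen > 0);
    convert_cat(&squeeze, s);
    strncpy(out, sk.buf, outlen - 1);
    out[outlen - 1] = '\0';
}
```

Dropping a layer is just a matter of pointing the previous "next" at
the one after it; here, setting squeeze.next to &sink would skip the
uppercasing without touching any other layer.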

Overall, it's a code saving, and it makes doing
things like converting to utf8 rather than search
form as easy as removing a translation layer.

On the downside, it does cost a little more CPU
for all the extra function calls, as opposed to
the direct A => B translation tables of the old
way.  Running a full squatter on my mailboxes
cost about 10% more CPU this way.  I think it's
justified for the flexibility and full unicode
handling capability and all that jazz.

I still need to document how this stuff works,
in particular how the almost-stateless search
consumer works!  I've explained it on paper to
Rob and Richard to make sure it makes sense to
them...

Comments and code review gratefully appreciated!
I'll be doing some more testing here, and then
possibly pushing it to production on one of our
machines for a full smoketest!

Bron.