adding multi-lingual support to cyrus sieve

Mark Keasling mark at air.co.jp
Thu Dec 12 23:33:24 EST 2002


Hi,

While the IMAP server does a great job of supporting multiple languages
for searching, the level of support provided by sieve is much, much lower.

Here is an outline of the modifications to cyrus and sieve, which I think
are necessary to enable multi-lingual support in cyrus sieve.  I intend
to make a set of bugs from this outline and add them to bugzilla.  Prior
to that I'd like to solicit comments regarding completeness.  In other
words, have I missed anything obvious?  My slant is getting Asian languages
supported, so Europeans may have been left out in the cold.

I.  Lib charset modifications to support sieve usage of utf-8.
    A.  Supported charset to Unicode (UTF-8) conversion
        Characters in a supported charset must be converted to the
        corresponding character in Unicode.  Note that in some cases
        multiple characters from the supported charset may be reduced
        to a single Unicode character.
    B.  UTF-8 to supported charset conversion
        Convert Unicode characters encoded in UTF-8 into the corresponding
        characters in a supported charset.  Note that some Unicode characters
        may not exist in the supported charset; such characters should be
        replaced with a substitute character, for example: ?.  Also note
        that where a Unicode character has been reduced from multiple
        characters in the supported charset, the most commonly used
        character should be selected.
    These modifications may be accomplished by replacing the current character
    transformation tables with charset-to-from-unicode and
    unicode-to-transform-string tables.  Since the Unicode to UTF-8
    transformation is algorithmic, a table isn't strictly necessary.
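
As a rough illustration of I.A, I.B and the remark that the Unicode to UTF-8
step is algorithmic, here is a minimal sketch in Python.  It is not the cyrus
lib charset code (which is C and table driven); Python's built-in codecs stand
in for the charset tables, and the function names are hypothetical.

```python
def encode_utf8(cp: int) -> bytes:
    """Encode one Unicode code point as UTF-8 using only bit arithmetic."""
    if cp < 0 or cp > 0x10FFFF or 0xD800 <= cp <= 0xDFFF:
        raise ValueError(f"not a Unicode scalar value: {cp:#x}")
    if cp < 0x80:                       # 1 byte:  0xxxxxxx
        return bytes([cp])
    if cp < 0x800:                      # 2 bytes: 110xxxxx 10xxxxxx
        return bytes([0xC0 | (cp >> 6), 0x80 | (cp & 0x3F)])
    if cp < 0x10000:                    # 3 bytes: 1110xxxx 10xxxxxx 10xxxxxx
        return bytes([0xE0 | (cp >> 12),
                      0x80 | ((cp >> 6) & 0x3F),
                      0x80 | (cp & 0x3F)])
    return bytes([0xF0 | (cp >> 18),    # 4 bytes: 11110xxx then three 10xxxxxx
                  0x80 | ((cp >> 12) & 0x3F),
                  0x80 | ((cp >> 6) & 0x3F),
                  0x80 | (cp & 0x3F)])


def to_utf8(data: bytes, charset: str) -> bytes:
    """I.A: decode bytes in a supported charset and re-encode them as UTF-8."""
    return data.decode(charset).encode("utf-8")


def from_utf8(data: bytes, charset: str) -> bytes:
    """I.B: convert UTF-8 to a supported charset.

    Characters missing from the target charset are replaced with '?',
    which is what Python's errors="replace" handler emits when encoding.
    """
    return data.decode("utf-8").encode(charset, errors="replace")
```

For example, from_utf8 applied to the UTF-8 bytes of "café 私" with charset
"iso-8859-1" keeps the é and substitutes ? for the kanji.  The sketch does not
cover the many-to-one reduction case mentioned in I.A and I.B.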

II.  Add UTF-8 Support to Sieve.
     A.  Header decoding and charset conversion for comparing message headers.
         Message headers which are RFC2047(RFC1522)+RFC2231 encoded using
         a supported charset must be decoded to UTF-8 for comparison.
     B.  Script charset verification
         1.  Verify that submitted scripts are UTF-8 encoded or reject.
         2.  Verify MIME encoded strings have a charset specified that matches
             the text.
             a.  For text/*, a charset should be specified or defaulted.
                 i.  Absent configuration, the default charset is the MIME
                     default.
                 ii.  Otherwise, it is the default configured by the system
                      administrator.
             b.  The text must be correctly encoded.
                 i.  Verify it is the specified charset.
                 ii.  If not specified, verify it is the default charset.
         3.  When a script is rejected due to incorrect encoding, the NO
             response should contain the reason.
             a.  Line # string has 8 bit data which is not utf-8 encoded.
             b.  Line # mime text has no charset specified.
             c.  Line # mime text not encoded in specified charset.
     C.  Fileinto should convert UTF-8 mailbox names to the MUTF-7 used by
         the IMAP server.  This is already in bugzilla.
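
To make II.A and II.B.1/II.B.3.a a little more concrete, here is a hedged
sketch in Python.  It uses the standard email.header module for RFC 2047
decoding (RFC 2231 parameter decoding is not handled), and both function names
are hypothetical; the real work would be done in the C sieve code.

```python
from email.header import decode_header, make_header


def header_to_utf8(raw: str) -> bytes:
    """II.A: decode RFC 2047 encoded-words and return the value as UTF-8."""
    return str(make_header(decode_header(raw))).encode("utf-8")


def check_script(script: bytes) -> list:
    """II.B.1 / II.B.3.a: list every line whose bytes are not valid UTF-8.

    Splitting on LF is safe because the byte 0x0A never occurs inside a
    multi-byte UTF-8 sequence.
    """
    errors = []
    for lineno, line in enumerate(script.split(b"\n"), start=1):
        try:
            line.decode("utf-8")
        except UnicodeDecodeError:
            errors.append(
                f"Line {lineno} string has 8 bit data which is not utf-8 encoded"
            )
    return errors
```

An empty list from check_script means the script may be accepted; otherwise
the collected strings could be returned as the reason in the NO response.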

III.  MIME and charset support for Sieve notification/vacation messages.
      A.  Header decoding and charset conversion of message headers included
          in notification messages.
          Message headers which are to be included in a notification or
          vacation message and are RFC2047(RFC1522)+RFC2231 encoded using
          a supported charset must be decoded and converted to UTF-8 for
          inclusion in the message generated by sieve.
      B.  Allow a script to specify the charset for sieve generated messages.
          1.  Add a :charset option to notify and vacation.  For example:
                  vacation :charset "iso-2022-jp" :subject "私は休み中です。"
                      "11月10日から15日まで私は休みです。";
              The :charset option sets the charset to be used when the
              message is sent.  The strings in the script are still required
              to be UTF-8 encoded.
              a.  The subject will be converted from UTF-8 to ISO-2022-JP
                  and RFC2047 encoded.
              b.  The message body will be converted from UTF-8 to ISO-2022-JP
                  before being sent.
          2.  The specified charset must be a supported charset; otherwise,
              the script is rejected.
      C.  Allow the system administrator to configure a default charset to be
          used when sending sieve generated messages.
          1.  The configured charset is read from imapd.conf and set by
              the sieve-message-charset parameter.
          2.  If the configured charset is not supported, sieve must log an
              error stating that the configured charset is not supported and
              that it is using the default charset (UTF-8) instead.
          3.  The subject and text for Sieve generated messages must be
              converted to the configured charset if supported.
          Note: When configuring the charset, it may also be desirable to set
          the content-transfer-encoding as well.
      D.  Make Sieve generate MIME messages.
          Unless a MIME body is already specified, sieve generated messages
          should be converted to MIME format.
          1.  If the message contains non-US-ASCII UTF-8 characters, add the
              following headers to the message:
                  MIME-Version: 1.0
                  Content-Type: TEXT/PLAIN; charset=utf-8
                  Content-Transfer-Encoding: quoted-printable
          2.  If a charset has been user specified or configured, add the
              appropriate headers.  For example:
                  MIME-Version: 1.0
                  Content-Type: TEXT/PLAIN; charset=iso-8859-8
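
The effect of III.B and III.D can be sketched with Python's standard email
package.  This is only an illustration under assumptions: build_vacation is a
hypothetical name, Python's codec registry stands in for the list of supported
charsets, and Python chooses the content-transfer-encoding itself (for UTF-8
it defaults to base64 rather than the quoted-printable proposed above).

```python
import codecs
from email.header import Header
from email.mime.text import MIMEText


def build_vacation(subject: str, body: str, charset: str = "utf-8") -> MIMEText:
    """Build a MIME vacation reply whose subject and body use `charset`.

    `subject` and `body` are the (UTF-8) strings taken from the script.
    """
    try:
        codecs.lookup(charset)      # III.B.2: reject unsupported charsets
    except LookupError:
        raise ValueError(f"unsupported charset: {charset}") from None

    # MIMEText converts the body to `charset` and adds the MIME-Version,
    # Content-Type and Content-Transfer-Encoding headers (III.D).
    msg = MIMEText(body, "plain", charset)
    # The Header object is RFC 2047 encoded in `charset` when the message
    # is serialized (III.B.1.a).
    msg["Subject"] = Header(subject, charset)
    return msg
```

Calling build_vacation with the subject and text from the example in III.B.1
and charset "iso-2022-jp" yields a message whose Content-Type carries
charset="iso-2022-jp"; msg.as_string() shows the full result.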

Planned Bugzilla Submission
I think these have been split into relatively non-overlapping chunks; however,
there are some dependencies.
--necessary-for-multi-lingual-support-in-sieve--
1.  I.A.  Supported charset to Unicode (UTF-8) conversion 
2.  II.A.  Header decoding and charset conversion for comparing message headers.
3.  II.B.1.  Verify that submitted scripts are UTF-8 encoded or reject.
4.  II.B.2.  Verify MIME encoded strings have a charset specified that matches the text.
5.  III.A.  Header decoding and charset conversion of message headers included in notification messages.
6.  III.D.  Make Sieve generate MIME messages.
--not-absolutely-critical-but-necessary-in-some-environments--
7.  I.B.  UTF-8 to supported charset conversion
8.  III.B.  Allow a script to specify the charset for sieve generated messages.
9.  III.C.  Allow the system administrator to configure a default charset for sieve generated messages.

Regards,
Mark Keasling <mark at air.co.jp>




