adding multi-lingual support to cyrus sieve
mark at air.co.jp
Thu Dec 12 23:33:24 EST 2002
While the IMAP server does a great job of supporting multiple languages
for searching, the level of support provided by sieve is much lower.
Here is an outline of the modifications to cyrus and sieve, which I think
are necessary to enable multi-lingual support in cyrus sieve. I intend
to make a set of bugs from this outline and add them to bugzilla. Prior
to that I'd like to solicit comments regarding completeness. In other
words, have I missed anything obvious? My slant is getting Asian languages
supported so Europeans may have been left out in the cold.
I. Lib charset modifications to support sieve usage of utf-8.
A. Supported charset to Unicode (UTF-8) conversion
Characters in a supported charset must be converted to the
corresponding character in Unicode. Note that in some cases
multiple characters from the supported charset may be reduced
to a single Unicode character.
B. UTF-8 to supported charset conversion
Convert Unicode characters encoded as UTF-8 into the corresponding
character in a supported charset. Note that some Unicode characters may
not exist in the supported charset; such characters should be replaced
with a substitute character, for example: ?. Also note that a Unicode
character may have been reduced from multiple characters in the
supported charset; in that case the most commonly used character
should be selected.
These modifications may be accomplished by replacing the current character
transformation tables with charset-to-from-unicode and
unicode-to-transform-string tables. Since the Unicode to UTF-8
transformation is algorithmic, a table isn't strictly necessary.
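
To make the two conversion directions in I.A and I.B concrete, here is a minimal sketch in Python. It is not the cyrus lib/charset code: Python's built-in codecs stand in for the proposed tables, and the function names to_utf8 and from_utf8 are made up for illustration. It does not attempt the "most commonly used character" selection described in I.B.

```python
def to_utf8(raw: bytes, charset: str) -> bytes:
    """I.A: decode bytes in a supported charset and re-encode them as UTF-8.

    Hypothetical helper; Python's codec for `charset` plays the role of the
    charset-to-unicode table.
    """
    return raw.decode(charset).encode("utf-8")


def from_utf8(utf8_bytes: bytes, charset: str) -> bytes:
    """I.B: convert UTF-8 bytes to a supported charset.

    Characters that do not exist in the target charset are replaced with
    the substitute character '?' (errors="replace").
    """
    text = utf8_bytes.decode("utf-8")
    return text.encode(charset, errors="replace")
```

For instance, from_utf8 applied to the UTF-8 bytes of "café 日本" with charset "iso-8859-1" keeps the "é" and turns each of the two kanji into "?".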
II. Add UTF-8 Support to Sieve.
A. Header decoding and charset conversion for comparing message headers.
Message headers which are RFC2047(RFC1522)+RFC2231 encoded using
a supported charset must be decoded to UTF-8 for comparison.
B. Script charset verification
1. Verify that submitted scripts are UTF-8 encoded or reject.
2. Verify MIME encoded strings have charset specified and matches text.
a. For text/*, a charset should be specified or defaulted.
i. Absent any configuration, the default charset is the MIME default.
ii. A configured default specified by system administrator.
b. The text must be correctly encoded.
i. Verify it is the specified charset.
ii. If not specified, verify it is the default charset.
3. When a script is rejected due to incorrect encoding the NO response
should contain the reason.
a. Line # string has 8 bit data which is not utf-8 encoded
b. Line # mime text has no charset specified.
c. Line # mime text not encoded in specified charset.
C. Fileinto should convert UTF-8 mailbox names to the MUTF-7 used by
the IMAP server. This is already in bugzilla.
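
As a rough illustration of II.A and II.B.1, the sketch below decodes an RFC2047 header to UTF-8 and finds the first line of a script that is not valid UTF-8. It is written in Python with the standard email package rather than against the sieve sources, and the function names are hypothetical. RFC2231 parameter decoding and the MIME checks of II.B.2 are not covered.

```python
from email.header import decode_header, make_header


def header_to_utf8(raw_header: str) -> bytes:
    """II.A: decode RFC2047 encoded-words and return the value as UTF-8."""
    # decode_header splits the value into (bytes, charset) chunks;
    # make_header reassembles them into a single Unicode string.
    return str(make_header(decode_header(raw_header))).encode("utf-8")


def first_non_utf8_line(script: bytes):
    """II.B.1: return the 1-based number of the first line that is not
    valid UTF-8, or None if the whole script is valid.

    Splitting on b"\\n" is safe because the byte 0x0A never occurs inside
    a multi-byte UTF-8 sequence.
    """
    for number, line in enumerate(script.split(b"\n"), start=1):
        try:
            line.decode("utf-8")
        except UnicodeDecodeError:
            return number
    return None
```

The line number returned by first_non_utf8_line is what the NO response in II.B.3.a would report in place of the "#".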
III. MIME and charset support for Sieve notification/vacation messages.
A. Header decoding and charset conversion of message headers included
in notification messages.
Message headers which are to be included in a notification or
vacation message and are RFC2047(RFC1522)+RFC2231 encoded using
a supported charset must be decoded and converted to UTF-8 for
inclusion in the message generated by sieve.
B. Allow a script to specify the charset for sieve generated messages.
1. Add a :charset option to notify and vacation. For example:
vacation :charset "iso-2022-jp" :subject "私は休み中です。"
The :charset option sets the charset to be used when the
message is sent. The strings in the script are still required
to be UTF-8 encoded.
a. The subject will be converted from UTF-8 to ISO-2022-JP
and RFC2047 encoded.
b. The message body will be converted from UTF-8 to ISO-2022-JP
before being sent.
2. The specified charset must be a supported charset; otherwise,
the script is rejected.
C. Allow the system administrator to configure a default charset to be
used when sending sieve generated messages.
1. The configured charset is read from imapd.conf and set by
the sieve-message-charset parameter.
2. If the configured charset is not supported, sieve must log an
error stating that the configured charset is not supported and
that it is using the default charset (UTF-8) instead.
3. The subject and text for Sieve generated messages must be
converted to the configured charset if supported.
Note: When configuring the charset, it may also be desirable to set
the content-transfer-encoding as well.
D. Make Sieve generate MIME messages.
Unless a MIME body is already specified, sieve generated messages
should be converted to MIME format.
1. If the message contains non-US-ASCII UTF-8 characters, add the
following headers to the message:
Content-Type: TEXT/PLAIN; charset=utf-8
2. If a charset has been user specified or configured, add the
appropriate headers. For example:
Content-Type: TEXT/PLAIN; charset=iso-8859-8
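
The behaviour described in III.B through III.D can be sketched as follows. This is only an illustration using Python's standard email package, not sieve's message generation code; build_reply and its parameters are hypothetical names, and the default of UTF-8 mirrors III.C.2.

```python
import codecs
from email.header import Header
from email.mime.text import MIMEText


def build_reply(subject: str, body: str, charset: str = "utf-8") -> MIMEText:
    """Build a MIME text/plain message in the requested charset.

    `subject` and `body` are Unicode strings, corresponding to the UTF-8
    strings in the script. An unsupported charset raises LookupError,
    standing in for rejecting the script (III.B.2).
    """
    codecs.lookup(charset)  # raises LookupError for an unknown charset

    # The body is converted to `charset` and Content-Type, MIME-Version and
    # Content-Transfer-Encoding headers are added (III.B.1.b, III.D).
    msg = MIMEText(body, "plain", charset)

    # The subject is converted to `charset` and RFC2047 encoded (III.B.1.a).
    msg["Subject"] = Header(subject, charset)
    return msg
```

Calling build_reply("私は休み中です。", "私は休み中です。", "iso-2022-jp") and then as_string() on the result shows the generated headers and the encoded subject.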
Planned Bugzilla Submission
I think these have been split into relatively non-overlapping chunks; however,
there are some dependencies.
1. I.A. Supported charset to Unicode (UTF-8) conversion
2. II.A. Header decoding and charset conversion for comparing message headers.
3. II.B.1. Verify that submitted scripts are UTF-8 encoded or reject.
4. II.B.2. Verify MIME encoded strings have charset specified and matches text.
5. III.A. Header decoding and charset conversion of message headers included in notification messages
6. III.D. Make Sieve generate MIME messages.
7. I.B. UTF-8 to supported charset conversion
8. III.B. Allow a script to specify the charset for sieve generated messages.
9. III.C. Allow the system administrator to configure a default charset for sieve generated messages.
Mark Keasling <mark at air.co.jp>