8-bit characters in headers

Adrian Buciuman impersonala at gmail.com
Thu Apr 28 19:48:20 EDT 2005


On 4/28/05, Henrique de Moraes Holschuh <hmh at debian.org> wrote:

> 
> What you *have* to do to get such a thing accepted is to, instead, write
> something that *fixes* the headers with the following capabilities:
> 
> 1. Notion of a default source charset, which is a hint of the charset to
>   encode *from* (because the input data does not have that information)
> 
> 2. Either a configurable destination charset, or use UTF-8 (I would much
>   rather you went the full way and made it configurable, I believe at
>   least the CJK people would appreciate that a lot).
> 
> 3. Functionality:
>   3.1:  Detect illegal 8-bit in headers, and apply the correction
>         algorithm described below (configurable)
>   3.2:  Pass-through any non-8bit headers.
>   3.3:  Reject messages with 8-bit headers.
> 
> Algo for charset conversion:
> 
> Step 1:  Look for certain hints of charsets, to try to determine the
>  correct source charset: UTF signatures, ISO-2022 escape sequences, etc.
> 
> Step 2:  If not found, use the default source charset.
> 
> Step 3:  Verify if the input sequence is *100% valid and correct* in the
>  chosen/detected charset.  If it is not, reject the message.
> 
> Step 4:  Convert to the destination charset (option: detected
>  charset, configured destination charset), and RFC-2047 encode the
>  header.
> 
> This needs to be done *before* any sieve processing, etc.
> 
> So far, nobody that keeps complaining about the "X" things has taken the
> time to do the above.
> 
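
For reference, the four quoted steps can be sketched roughly as below.
This is only an illustration, not Cyrus code: the function name
fix_header, its parameters and the choice of ISO-2022-JP as the single
escape-sequence example are assumptions made for the sketch.

```python
import codecs
from email.header import Header


def fix_header(raw, default_source="iso-8859-1", dest="utf-8"):
    """Return a storable header value, or None to reject the message.

    Hypothetical helper: `raw` is the header value as received (bytes).
    """
    # 3.2: plain 7-bit values without escape sequences pass through.
    if all(b < 0x80 for b in raw) and b"\x1b" not in raw:
        return raw.decode("ascii")

    # Step 1: look for hints of the source charset.
    if raw.startswith(codecs.BOM_UTF8):
        source = "utf-8-sig"        # codec that also strips the signature
    elif b"\x1b$" in raw or b"\x1b(" in raw:
        source = "iso2022_jp"       # one ISO-2022 variant, as an example
    else:
        source = default_source     # Step 2: fall back to the default

    try:
        # Step 3: strict decoding fails on any invalid sequence.
        text = raw.decode(source)
        # Step 4: convert to the destination charset and RFC 2047 encode.
        return Header(text, dest).encode()
    except UnicodeError:
        return None                 # 3.3: reject
```

Note that a single-byte default such as iso-8859-1 accepts every byte
sequence, so Step 3 only rejects anything when the detected or default
charset is a multi-byte one.
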
Hello,

Why not encode the header using unknown-8bit as the charset? This is
simpler, and has many advantages:

1. No information is lost. No silent change is made to a message. No
confusion can be caused by converting a valid word/name into a
different valid word/name. (Silently converting to "X" is regarded as
unethical by many network administrators. It can cause pain to
innocent users. Imagine a message subject which is converted into
"Our beloved Xenia has died" while Xenia is alive. In European
languages not all letters are 8-bit, so you will not receive XXXXXXXX
but some possibly meaningful text. X-ing can cause damage to business,
can be contrary to contracts, and can be illegal.)

If you can not do good, at least do not cause harm.

Rejecting mail can also be unethical (important messages will not be
delivered because of an 8-bit character) and can also cause problems.
If it is a mailing list message, the receiver will be silently
unsubscribed, and the receiver will not receive any error to fix his
mailer.

2. It is theoretically reversible at any time in the future (if
messages are later reprocessed, archived, downloaded using fetchmail,
etc.).

3. unknown-8bit is registered by IANA
http://www.iana.org/assignments/character-sets
It is used by some mail programs. See
http://www.google.com/search?q=unknown-8bit&start=10&sa=N

4. Broken messages can be found with a search at any time (but false
positives are possible). Senders can be notified. Statistics can be
made. Filters can be used (like: if message is from newsgroup
soc.culture.xyz and contains .......do .........)

5. MUAs can use whatever heuristic they want to find the real charset.
They may use user preferences or system defaults, may use rules
specific to a certain newsgroup, mailing list, or sender, or may
check whether a subject is plausible using a dictionary. They may also
warn the user, display the undecoded text, and so on. I've done tests
with some MUAs with good results. Some of them include support for
unknown-8bit because it can be used in contexts like "Content-Type:
text/plain; charset=unknown-8bit" (I believe sendmail may generate
this, following RFC 1428. How will Cyrus search in such a message,
BTW?). They seem happy to accept it in RFC-2047 headers, possibly as an
unintended side-effect of the charset-handling code. I don't know if
they use the same charset detection code as for a message with raw
8-bit characters or a different one. Others will treat it as an
unknown charset and display the undecoded string. I've found none, but
there may be some broken programs which will crash, corrupt the mail
store or execute the mail content.

6. This will fix this Cyrus problem forever. (By contrast, no
heuristic can achieve this: it will need to be adapted, patched, and
improved, and still it will not be perfect; it may find the wrong
charset and confuse users. The closer the heuristic is to the user,
the better it will work. So MUAs should guess the charset.)

7. The same code can be used to implement a site-default. (Replace
unknown-8bit with what you want and __do__ some plausibility checks)

8. It will work at any point in the mail path. Cyrus will do this, but
MTAs and news servers can also convert headers to unknown-8bit. Or
reverse it, if necessary.
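
A minimal sketch of the idea, assuming base64 ("B") encoded-words and
hypothetical helper names (encode_unknown_8bit, decode_unknown_8bit,
guess_text); it is not taken from Cyrus or from any MUA. It shows that
the raw bytes survive a round trip (points 1 and 2) and how a client
might apply its own charset preferences afterwards (point 5).

```python
import base64
import re

PREFIX = "=?unknown-8bit?B?"
SUFFIX = "?="
# An encoded-word may be at most 75 characters long, which leaves room
# for 56 base64 characters, i.e. 42 raw bytes per word.
CHUNK = 42

_WORD = re.compile(r"=\?unknown-8bit\?B\?([A-Za-z0-9+/=]*)\?=", re.I)


def encode_unknown_8bit(raw: bytes) -> str:
    """Wrap a raw header value in unknown-8bit encoded-words."""
    if all(b < 0x80 for b in raw):
        return raw.decode("ascii")          # nothing to fix
    words = []
    for i in range(0, len(raw), CHUNK):
        # The charset is unknown, so chunks are cut at byte boundaries;
        # a multi-byte character may end up split across two words.
        b64 = base64.b64encode(raw[i:i + CHUNK]).decode("ascii")
        words.append(PREFIX + b64 + SUFFIX)
    return " ".join(words)                  # folding is left to the caller


def decode_unknown_8bit(value: str) -> bytes:
    """Recover the original bytes from a value made by the encoder above."""
    words = _WORD.findall(value)
    if not words:
        return value.encode("ascii")
    return b"".join(base64.b64decode(w) for w in words)


def guess_text(raw: bytes, preferred=("utf-8", "iso-8859-2")) -> str:
    """Client-side guess: try the user's preferred charsets in order."""
    for charset in preferred:
        try:
            return raw.decode(charset)
        except UnicodeDecodeError:
            continue
    # Last resort: show the undecoded bytes as escapes.
    return raw.decode("ascii", errors="backslashreplace")
```

For example, decode_unknown_8bit(encode_unknown_8bit(raw)) returns raw
unchanged for any byte string, and the guessing step can be repeated
later with different preferences because the stored bytes never change.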

PROBLEMS:

1. How to handle appending messages fetched via other
servers/protocols to a Cyrus mailbox. I don't know the IMAP protocol
well enough, but I believe the client will have the old headers and
not expect them to be changed, so it may crash, corrupt the messages,
or do some other unwanted things. How is this handled with X-ing?
Rejecting and doing some conversion on the client store before
appending may be a solution.

2. 8-bit data in bodies is not fixed by this proposal. MTAs should
handle this via MIME. Cyrus should reject 8-bit data in bodies.

3. What to do with a message inside a message? How does the present
code handle this? Or are they not parsed by Cyrus? (Haven't tried.)

4. Care should be taken not to RFC-2047-encode text which must be
ASCII. Even when properly encoded, non-ASCII is not valid everywhere
in headers.
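
As a rough illustration of point 4, a deliberately conservative check
could look like the sketch below. RFC 2047 allows an encoded-word to
replace free text in Subject, Comments and extension (X-) fields, but
forbids it in places such as an addr-spec or a Received field. The
names UNSTRUCTURED and may_encode_whole_value are hypothetical.

```python
# Fields whose whole body is free text according to RFC 2047, section 5.
UNSTRUCTURED = {"subject", "comments"}


def may_encode_whole_value(field_name: str) -> bool:
    """True if the entire field body may be replaced by encoded-words."""
    name = field_name.strip().lower()
    return name in UNSTRUCTURED or name.startswith("x-")
```

Structured fields such as From or To would need real parsing, because
only the display-name and comment parts may carry encoded-words.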

Best regards,
Adrian Buciuman

---
Cyrus Home Page: http://asg.web.cmu.edu/cyrus
Cyrus Wiki/FAQ: http://cyruswiki.andrew.cmu.edu
List Archives/Info: http://asg.web.cmu.edu/cyrus/mailing-list.html



