How to filter based on "garbage" subjects ... ?

Tue Sep 30 12:06:52 EDT 2003

Hi Marc,

--On Tuesday, September 30, 2003 11:32 -0300 "Marc G. Fournier" 
<scrappy at hub.org> wrote:

|
| I've yet to be able to come up with a sieve rule that will allow me to
| filter all "garbage" subjects to a separate folder ... you know the ones
| that look like:
|
| Subject: =?euc-kr?q?(=B1=A4=B0=ED)=B5=F0=C1=F6=
|
| I've even tried to use Pine filtering to filter based on 8bit subjects,
| but it doesn't pick them up either ...
|
| For instance, under Pine, if I try to select all subjects with =B1= in
| them, which the above contains, it selects nothing, so I'm figuring there
| has to be some control characters in there somewhere ... ?
|
| Thoughts?
|

From the SIEVE RFC:

|       Implementations decode header charsets to UTF-8.  Two strings are
|       considered equal if their UTF-8 representations are identical.
|       Implementations should decode charsets represented in the forms
|       specified by [MIME] for both message headers and bodies.
|       Implementations must be capable of decoding US-ASCII, ISO-8859-1,
|       the ASCII subset of ISO-8859-* character sets, and UTF-8.

i.e. SIEVE should be decoding the =?euc-kr?.... header into its UTF-8 form
BEFORE doing the comparison with the text you provide. So the =B1
quoted-printable encoded byte will have been decoded into the UTF-8
representation of the corresponding euc-kr character, and thus won't match
the text you provide. Actually euc-kr is a multibyte character set, so in
fact the character is made up of the two bytes =B1 and =A4. By my reckoning
that is the Unicode character 0xad11 - I'll leave you to work out the UTF-8
encoding of that!
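
If you would rather let a machine do the reckoning, here is a minimal
Python sketch. It assumes Python's "euc_kr" codec maps these bytes the same
way a SIEVE implementation would; the variable names are just for
illustration.

```python
# The two bytes behind the =B1=A4 escapes in the Subject line.
raw = bytes([0xB1, 0xA4])

# Decode as EUC-KR: both bytes together form one character.
char = raw.decode("euc_kr")
code_point = ord(char)

# Re-encode as UTF-8, the form the header is decoded to before comparison.
utf8_bytes = char.encode("utf-8")

print(f"U+{code_point:04X}")
print(" ".join(f"{b:02x}" for b in utf8_bytes))
```

The second line of output is the byte sequence that a UTF-8 script would
need to contain for that one character.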

Basically you are going to have a hard time trying to filter on arbitrary
Unicode characters in some random character set, given that SIEVE expects
UTF-8 in its scripts.
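
To see why a search for =B1= comes up empty once the header has been
decoded, here is a rough Python sketch using the standard email.header
module. The encoded word below is a shortened, well-formed stand-in built
from the first few bytes of your Subject (the original is cut off in your
mail), so treat it as a hypothetical example rather than the real header.

```python
from email.header import decode_header

# Hypothetical, shortened encoded word based on the start of the Subject.
encoded = "=?euc-kr?q?(=B1=A4=B0=ED)?="

# decode_header returns (value, charset) pairs; join them into one string.
subject = "".join(
    part.decode(charset or "ascii") if isinstance(part, bytes) else part
    for part, charset in decode_header(encoded)
)

# After decoding, the raw escape text is gone ...
print("=B1=" in subject)
# ... and the decoded character is what a comparison actually sees.
print("\uad11" in subject)
```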

-- 
Cyrus Daboo