[Python-Dev] [Email-SIG] headers api for email package

Mon Apr 13 20:32:25 CEST 2009

On Tue, 14 Apr 2009 03:15:20 am Stephen J. Turnbull wrote:

> *People* see email as (rich-)text.

We do?

It's not clear what you actually mean by "(rich-)text". In the context 
of email, I understand it to mean HTML in the body, web-bugs, security 
exploits, 36pt hot-pink bold text on a lime-green background, and all 
the other wonderful things modern mail clients let you put in your 
email. But as far as I know, no mail client tries to render HTML tags 
inside mail headers, so you're probably not talking about HTML 
rich-text. I guess you mean Unicode characters. Am I right?

Now, correct me if I'm wrong, but I don't think mail headers can 
actually be anything *but* bytes. I see that my mail client, at least, 
sends bytes in the Subject header. If I try to send characters, e.g. 
the subject header "Testing-β-" (without the quotes), what actually 
gets sent is the bytes "=?utf-8?q?Testing-=CE=B2-?=" (again without the 
quotation marks). This seems to be covered by RFC 2047:

http://tools.ietf.org/html/rfc2047
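
To make that concrete, here is a rough sketch using the standard library's email.header module to take those wire bytes back to characters. The variable names are only for illustration, and this shows one way to do the decoding, not a proposal for what the new API should look like.

```python
from email.header import decode_header

# The header value as it appears on the wire: pure ASCII.
wire = "=?utf-8?q?Testing-=CE=B2-?="

# decode_header() undoes the RFC 2047 encoding and returns a list of
# (bytes, charset) pairs; it does not convert to characters itself.
parts = decode_header(wire)

# Turning the bytes into characters is a separate, explicit step.
text = "".join(chunk.decode(charset) for chunk, charset in parts)
print(parts)
print(text)
```

Note that the first step only gets you from encoded ASCII to raw bytes plus a charset label; the characters appear only once you trust that label enough to decode with it.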

If you're proposing converting those bytes into characters, that's all 
very well and good, but what's your strategy for dealing with the 
inevitable wrongly-formatted headers? If the header can't be correctly 
decoded into text, there still needs to be a way to get to the raw 
bytes. Mail-processing apps such as SpamBayes will want to inspect the 
raw bytes, and mail readers will need to deal with badly formatted 
mail in any case. The RFC states:

"However, a mail reader MUST NOT prevent the display or handling of a 
message because an 'encoded-word' is incorrectly formed."
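
As a sketch of what I mean by keeping the raw bytes reachable, something along these lines would do. The function name decode_or_raw is made up for this example and is not part of the email package; the point is only that a failed decode should not lose the original bytes.

```python
from email.errors import HeaderParseError
from email.header import decode_header

def decode_or_raw(raw):
    """Return (text, raw); text is None if the header cannot be decoded.

    Hypothetical helper, for illustration only.
    """
    try:
        parts = decode_header(raw.decode("ascii"))
        text = "".join(
            # decode_header() yields str for headers with no encoded-words
            chunk.decode(charset or "ascii") if isinstance(chunk, bytes) else chunk
            for chunk, charset in parts
        )
    except (UnicodeError, LookupError, HeaderParseError):
        # Non-ASCII bytes, truncated sequences, unknown charsets, bad base64.
        text = None
    return text, raw

good = decode_or_raw(b"=?utf-8?q?Testing-=CE=B2-?=")
bad = decode_or_raw(b"=?utf-8?q?Testing-=CE?=")  # truncated UTF-8 sequence
```

A mail reader could fall back to showing the raw value when text is None, while something like SpamBayes could ignore text altogether and look only at raw.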

[...]
> Then MTAs see email as a string of octets.  So guess what:
>
>  >>> bytes(message['Subject'])
>
> gives wire format.  Yow!  I think I'm just joking.  Right?

Er, I'm not sure. Are you joking? I hope not, because it is important to 
be able to get to the raw, unmodified bytes that the MTA sees, without 
all the fancy processing you suggest.

[...]
> Otherwise, you should have a unicode, and you simply look
> at the range of the string.  If it fits in ASCII, Bob's your uncle.
> If not, Bob's your aunt (and you use UTF-8).

Again, correct me if I'm wrong, but *all* valid mail headers must fit in 
ASCII. RFC 5335 defines an experimental approach to allowing full 
Unicode in mail headers, but surely it's going to be a while before 
that's common, let alone standard.

http://tools.ietf.org/html/rfc5335
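
For what it's worth, the "look at the range of the string" rule is easy enough to sketch with the existing email.header.Header class, and the result is ASCII on the wire either way. The function name encode_subject is my own invention for this example, and this is only one possible reading of the rule.

```python
from email.header import Header, decode_header

def encode_subject(text):
    """Pick us-ascii if the text fits, otherwise utf-8 (illustrative only)."""
    try:
        text.encode("ascii")
        charset = "us-ascii"
    except UnicodeEncodeError:
        charset = "utf-8"
    # Header.encode() returns a str containing only ASCII characters;
    # non-ASCII input becomes an RFC 2047 encoded-word.
    return Header(text, charset).encode()

plain = encode_subject("Plain subject")
fancy = encode_subject("Testing-β-")

# Round trip: the encoded form decodes back to the original characters.
roundtrip = b"".join(chunk for chunk, _ in decode_header(fancy)).decode("utf-8")
```

Either way, what goes out is ASCII, which is the point I'm making above.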

-- 
Steven D'Aprano