[Python-Dev] [Email-SIG] headers api for email package
Steven D'Aprano
steve at pearwood.info
Mon Apr 13 20:32:25 CEST 2009
On Tue, 14 Apr 2009 03:15:20 am Stephen J. Turnbull wrote:
> *People* see email as (rich-)text.
We do?
It's not clear what you actually mean by "(rich-)text". In the context
of email, I understand it to mean HTML in the body, web-bugs, security
exploits, 36pt hot-pink bold text on a lime-green background, and all
the other wonderful things modern mail clients let you put in your
email. But as far as I know, no mail client tries to render HTML tags
inside mail headers, so you're probably not talking about HTML
rich-text. I guess you mean Unicode characters. Am I right?
Now, correct me if I'm wrong, but I don't think mail headers can
actually be anything *but* bytes. I see that my mail client, at least,
sends bytes in the Subject header. If I try to send characters, e.g.
the subject header "Testing-β-" (without the quotes), what actually
gets sent is the bytes "=?utf-8?q?Testing-=CE=B2-?=" (again without the
quotation marks). This seems to be covered by RFC 2047:
http://tools.ietf.org/html/rfc2047
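As a minimal sketch of that round trip, assuming Python 3's standard-library email.header module (the variable names here are only illustrative), the encoded-word above can be turned back into characters like this:

```python
from email.header import decode_header

# The ASCII-only form that actually goes over the wire (RFC 2047 encoded-word).
wire = "=?utf-8?q?Testing-=CE=B2-?="

# decode_header() returns a list of (chunk, charset) pairs; an encoded-word
# comes back as the decoded bytes plus the charset named inside it.
raw, charset = decode_header(wire)[0]

# Only this last step turns bytes into characters.
subject = raw.decode(charset)
print(subject)
```

Note that the characters only exist after the final decode() call; everything before that is bytes or ASCII.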
If you're proposing converting those bytes into characters, that's all
very well and good, but what's your strategy for dealing with the
inevitable wrongly-formatted headers? If the header can't be correctly
decoded into text, there still needs to be a way to get to the raw
bytes. Apart from (e.g.) mail processing apps like SpamBayes which will
want to inspect the raw bytes, mail readers will need to deal with
badly formatted mail. The RFC states:
"However, a mail reader MUST NOT prevent the display or handling of a
message because an 'encoded-word' is incorrectly formed."
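One way to satisfy that requirement can be sketched as follows. This is only an illustration, not a proposal for the email package API: the helper name header_text_or_raw is hypothetical, and it assumes Python 3's email.header.decode_header. The idea is to attempt the decode but always hand back the untouched bytes as well:

```python
from email.errors import HeaderParseError
from email.header import decode_header


def header_text_or_raw(raw: bytes):
    """Return (text, raw); text is None if the header cannot be decoded.

    Hypothetical helper: the raw bytes are always returned unmodified, so a
    caller can still display or inspect a badly formed header.
    """
    try:
        chunks = decode_header(raw.decode("ascii"))
        text = "".join(
            # Chunks may be bytes (with an optional charset) or plain str.
            chunk.decode(charset or "ascii") if isinstance(chunk, bytes) else chunk
            for chunk, charset in chunks
        )
    except (UnicodeError, LookupError, HeaderParseError):
        # Non-ASCII wire data, bytes invalid for the charset, an unknown
        # charset, or a broken base64 payload: fall back to raw bytes only.
        text = None
    return text, raw


good = header_text_or_raw(b"=?utf-8?q?Testing-=CE=B2-?=")
bad = header_text_or_raw(b"=?utf-8?q?Testing-=CE?=")  # truncated UTF-8 sequence
```

A mail reader could show the decoded text when it is available and fall back to some escaped rendering of the raw bytes otherwise, while a filter such as SpamBayes could ignore the text and look only at the bytes.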
[...]
> Then MTAs see email as a string of octets. So guess what:
>
> >>> bytes(message['Subject'])
>
> gives wire format. Yow! I think I'm just joking. Right?
Er, I'm not sure. Are you joking? I hope not, because it is important to
be able to get to the raw, unmodified bytes that the MTA sees, without
all the fancy processing you suggest.
[...]
> Otherwise, you should have a unicode, and you simply look
> at the range of the string. If it fits in ASCII, Bob's your uncle.
> If not, Bob's your aunt (and you use UTF-8).
Again, correct me if I'm wrong, but *all* valid mail headers must fit in
ASCII. RFC 5335 defines an experimental approach to allowing full
Unicode in mail headers, but surely it's going to be a while before
that's common, let alone standard.
http://tools.ietf.org/html/rfc5335
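The "fits in ASCII, otherwise use UTF-8" rule quoted above can be sketched like this. It is a rough illustration only: the function name to_wire is hypothetical, and it assumes Python 3's email.header.Header. Either way, what goes on the wire is pure ASCII:

```python
from email.header import Header, decode_header


def to_wire(text: str) -> bytes:
    """Return an ASCII-only wire form for a header value (hypothetical helper)."""
    try:
        # Plain ASCII text can be sent as-is.
        return text.encode("ascii")
    except UnicodeEncodeError:
        # Anything else is wrapped as an RFC 2047 encoded-word in UTF-8;
        # the result of Header.encode() is itself ASCII-only.
        return Header(text, "utf-8").encode().encode("ascii")


plain = to_wire("Testing")
encoded = to_wire("Testing-β-")

# Decoding the wire form recovers the original characters.
raw, charset = decode_header(encoded.decode("ascii"))[0]
assert raw.decode(charset) == "Testing-β-"
```

Whether the library picks quoted-printable or base64 for the encoded-word is an implementation detail, so the sketch checks only the round trip and not the exact bytes.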
--
Steven D'Aprano