[Email-SIG] [Python-Dev] headers api for email package

Stephen J. Turnbull stephen at xemacs.org
Tue Apr 14 06:48:52 CEST 2009


Removing Python-Dev from the addressees.

Steven D'Aprano writes:
 > On Tue, 14 Apr 2009 03:15:20 am Stephen J. Turnbull wrote:
 > 
 > > *People* see email as (rich-)text.
 > 
 > We do?

Yup.  You don't see the email, you see a *presentation* of that email.
That presentation is usually text, plus possibly some other stuff
(fonts, highlighting, active links, images).  Thus the "(rich-)".

 > It's not clear what you actually mean by "(rich-)text".

I mean presentation.  I mean "human readable".  I mean Unicode.  I
mean "Do Not Feed The Program" (not for machine processing -- so your
associations with viruses are completely off the mark).

 > rich-text. I guess you mean Unicode characters. Am I right?

No.  I mean presentation, which for Python purposes includes but is
not limited to Unicode.

 > Now, correct me if I'm wrong, but I don't think mail headers can 
 > actually be anything *but* bytes.

On the wire.  email's Headers have applications other than putting
bytes on the wire.

 > If you're proposing converting those bytes into characters, that's all 
 > very well and good, but what's your strategy for dealing with the 
 > inevitable wrongly-formatted headers?

Whatever you want it to be.  There are a number of such strategies,
some of which should be among the batteries we include.
Header.__str__() will need to know how to find out which is in effect,
of course.
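
As a rough illustration of what "a number of such strategies" could look
like, here is a minimal sketch built on the error handlers that
bytes.decode() already provides.  The strategy names and the
decode_field() helper are hypothetical, invented for this example; they
are not part of the email package.

```python
# Hypothetical strategy names mapped onto real bytes.decode() error handlers.
STRATEGIES = {
    "strict": "strict",             # raise UnicodeDecodeError on bad bytes
    "replace": "replace",           # substitute U+FFFD for each bad byte
    "preserve": "surrogateescape",  # carry bad bytes through as lone surrogates
}


def decode_field(raw, strategy="replace", charset="ascii"):
    """Decode a raw header field using the strategy currently in effect."""
    return raw.decode(charset, errors=STRATEGIES[strategy])


bad = b"caf\xe9"  # a stray Latin-1 byte in a field that should be ASCII
lossy = decode_field(bad, "replace")
kept = decode_field(bad, "preserve")
# "preserve" is reversible: the original bytes can be recovered.
recovered = kept.encode("ascii", errors="surrogateescape")
```

With "preserve" the text form is not fit for display, but the raw bytes
survive a round trip, which is one way to keep both camps happy.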

 > If the header can't be correctly decoded into text, there still
 > needs to be a way to get to the raw bytes.

Sure.  That's what Header.__bytes__() will do.  Specifically, if you
have a Header that was parsed out of a message received over the wire,
it will return a verbatim copy of the header as received, folding
whitespace, CRLFs, and all.  If the Header was constructed (including
editing a received header), then __bytes__ will construct the wire
format, and optionally cache it as if it were a received header.  (But
this has some gotchas, see below.)
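
The behavior described above can be sketched roughly as follows.
SketchHeader, from_wire() and the value property are hypothetical names
made up for illustration, and the constructed path assumes an ASCII-only
value (RFC 2047 encoding and folding are left out to keep it short).

```python
class SketchHeader:
    """Hypothetical stand-in for the Header under discussion; not a real API."""

    def __init__(self, name, value, raw=None):
        self.name = name
        self._value = value  # decoded, unfolded text
        self._raw = raw      # verbatim wire bytes if parsed, else None

    @classmethod
    def from_wire(cls, raw):
        """Parse one received field, keeping the original bytes untouched."""
        name, _, rest = raw.partition(b":")
        text = rest.decode("ascii", errors="replace")
        # Naive unfolding: collapses every whitespace run to a single space.
        return cls(name.decode("ascii"), " ".join(text.split()), raw=raw)

    @property
    def value(self):
        return self._value

    @value.setter
    def value(self, new_value):
        self._value = new_value
        self._raw = None     # an edited header no longer matches what arrived

    def __str__(self):
        return self._value

    def __bytes__(self):
        if self._raw is None:
            # Constructed header: build the wire format and cache it.
            wire = "%s: %s\r\n" % (self.name, self._value)
            self._raw = wire.encode("ascii")
        return self._raw


received = b"Subject: hello\r\n world\r\n"
h = SketchHeader.from_wire(received)
verbatim = bytes(h)      # folding whitespace, CRLFs, and all
h.value = "changed"
rebuilt = bytes(h)       # reconstructed after the edit
```

Dropping the cached bytes on edit is one design choice among several; it
is the simplest way to avoid sending stale wire data.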

 > >  > > bytes(message['Subject'])
 > >
 > > gives wire format.  Yow!  I think I'm just joking.  Right?
 > 
 > Er, I'm not sure. Are you joking? I hope not, because it is important to 
 > be able to get to the raw, unmodified bytes that the MTA sees, without 
 > all the fancy processing you suggest.

Er, I'm not suggesting any processing in particular.  I'm suggesting
an API in which str(header) produces a text/plain rendering of the
field contents, with no folding, MIME words, or other wire format
detritus, suitable for human viewing, more or less (specifically, it
might be a rather long line).  bytes(header) produces the wire format,
either verbatim as received or as constructed based on client input.
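
For the str(header) half, a minimal sketch using what the existing
email.header module already offers (decode_header and make_header).  The
render() helper is hypothetical, and its unfolding is deliberately naive.

```python
import re
from email.header import decode_header, make_header


def render(raw):
    """Return a one-line, human-readable rendering of a raw header field."""
    text = raw.decode("ascii", errors="replace")
    _, _, body = text.partition(":")
    # Unfold: drop each line break that is followed by whitespace.
    unfolded = re.sub(r"\r?\n(?=[ \t])", "", body).strip()
    # Decode any RFC 2047 encoded words into text.
    return str(make_header(decode_header(unfolded)))


wire = b"Subject: =?utf-8?q?caf=C3=A9?=\r\n au lait\r\n"
shown = render(wire)
```

The result has no folding and no MIME words, and may be a rather long
line, as noted above.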

Note that an issue here is that a received header may be bogus, in
which case you *don't* want bytes(header) to simply return the
original and then spew over the wire.  Should it raise an Exception or
"fix up" the bytes?  I don't know, and thus I wonder if this proposed
API might just be a joke, not something you can dare use in a
production application.

Of course, str() and bytes() as proposed here are not necessarily what
you want.  So there will need to be ways to access the internal
representation of Header directly (or via further specialized
formatter functions if string or bytes format is preferred to
structured objects).

 > Again, correct me if I'm wrong, but *all* valid mail headers must fit in 
 > ASCII.

Of course, that's true on the wire.  I've assumed that everybody here
is assuming STD 11 (currently RFC 822 according to rfc-editor.org)
folding of long header lines and RFC 2047 encoding of characters
outside of the restricted-ASCII repertoire (RFC 5322 at least doesn't
permit all the ASCII control characters) before putting it on the
wire.  This is basically a solved problem, though, so I didn't bother
mentioning it.  Sorry for the confusion.
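
For reference, a short sketch of that "solved problem" using the
existing email.header.Header class, which does both the RFC 2047
encoding and the folding.  The to_wire() helper is a hypothetical
wrapper written for this example.

```python
from email.header import Header, decode_header, make_header


def to_wire(name, text):
    """Encode and fold one header field into ASCII-only wire bytes."""
    charset = "us-ascii" if text.isascii() else "utf-8"
    h = Header(text, charset=charset, header_name=name)
    # encode() emits RFC 2047 encoded words and folds long lines.
    return ("%s: %s\r\n" % (name, h.encode(linesep="\r\n"))).encode("ascii")


short = to_wire("Subject", "caf\u00e9")
long_wire = to_wire("Subject", "na\u00efve " * 30)

# Round trip: decoding the wire form gives back the original text.
body = short[len(b"Subject: "):].decode("ascii").strip()
round_trip = str(make_header(decode_header(body)))
```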

But what we're talking about *here* are email APIs that may or may not
be directly connected to a display or wire.  There is no reason why
headers *must* be represented as bytes, strings, or anything else in a
Header, and no reason why the bytes or str format *must* be RFC
compatible.

I think it's quite sensible to specify "bytes(header) will be RFC
5322-conforming", but we need to specify how to handle bogus headers
that we have received and not edited.  Should we ever raise an
Exception, and if so, in what contexts?  Should we "fix up" the
bogosity somehow?  Should we delete the offensive header?  Should we
pass it on verbatim, and leave it to a higher level to verify(!) and
decide what to do about it?  Do the RFCs say anything about all this
(e.g., with broken trace headers I think it's implied that we pass them
on verbatim)?
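
To make the options concrete, here is a sketch of the four choices
(raise, fix up, delete, pass on verbatim) as a selectable policy.  All
names here are hypothetical, and is_wire_safe() is a crude check invented
for the example, nowhere near a full RFC 5322 validator.

```python
import re


class BogusHeaderError(ValueError):
    """Hypothetical exception for a received header that is not wire-safe."""


def is_wire_safe(raw):
    """Crude check: ASCII only, has a colon, line breaks only as CRLF folds."""
    if any(b > 127 for b in raw) or b":" not in raw:
        return False
    body = raw[:-2] if raw.endswith(b"\r\n") else raw
    unfolded = re.sub(rb"\r\n(?=[ \t])", b"", body)
    return b"\r" not in unfolded and b"\n" not in unfolded


def emit(raw, policy="raise"):
    """Return the bytes to send for a received, unedited header field."""
    if is_wire_safe(raw) or policy == "verbatim":
        return raw                      # leave verification to a higher level
    if policy == "raise":
        raise BogusHeaderError(repr(raw))
    if policy == "drop":
        return b""                      # delete the offending header
    if policy == "fixup":
        # Replace non-ASCII bytes with "?" and rebuild as one unfolded line.
        cleaned = bytes(b if b < 128 else ord("?") for b in raw)
        return re.sub(rb"[\r\n]+", b"", cleaned) + b"\r\n"
    raise ValueError("unknown policy: %r" % (policy,))


bad = b"Subject: caf\xe9\r\n"
fixed = emit(bad, "fixup")
```

Which policy should be the default, and whether it should vary by header
type, is exactly the open question above; the sketch only shows that the
choice can be made explicit.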

