[Email-SIG] Thoughts on the general API, and the Header API.

Tue Jan 26 03:51:46 CET 2010

On Mon, 25 Jan 2010 16:55:15 -0800, Glenn Linderman <v+python at g.nevcal.com> wrote:
> On approximately 1/25/2010 12:10 PM, came the following characters from 
> the keyboard of R. David Murray:
> > So, those are my thoughts, and I'm sure I haven't thought of all the
> > corner cases.  The biggest question is, does it seem like this general
> > scheme is worth pursuing?
> 
> If it was stated, I missed it: is  from_full_header  a way of producing 
> an object from a raw data value?  Whereas __init__ would obviously be 

Yes.

> used to produce one from string or bytes values.  If so, then it would 

Well, StringHeader.from_full_header would take a string as input,
while BytesHeader.from_full_header would take bytes as input.
__init__ would be used to construct a header in your program:

    StringHeader('MyHeader', 'my value')
    BytesHeader(b'MyHeader', b'my value')

> be a requirement that this from_full_header API would never produce an 
> exception?  Rather it would produce an object with or without defects?

Yes.

> Are there any other *Header APIs that would be required not to produce 
> exceptions?  I don't yet perceive any.

I don't think so.  from_full_header is the only one involved in parsing
raw data.  Whether __init__ throws errors or records defects is an open
question, but I lean toward it throwing errors.  The question is open
because an email manipulating application may want to
convert to text to process an incoming message, and there are things
that a BytesHeader can hold that would cause errors when encoded to a
StringHeader (specifically, 8 bit bytes that aren't transfer encoded).
So it may be that decode, at least, should not throw errors but instead
record additional defects in the resulting StringHeader.  I think that
even in that case __init__ should still throw errors, though; decode
could deal with the defects before calling StringHeader.__init__, or
(more likely) catch the errors thrown by __init__, fix/record the defects,
and call it again.

Note, by the way, that by 'raw data' I mean what you are feeding in.
Raw data fed to a BytesHeader would be bytes, but raw data fed to
a StringHeader would be text (eg: if read from a file in text mode).
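
A minimal sketch of that split, assuming a hypothetical StringHeader
whose __init__ raises on bad input while from_full_header records
defects instead.  The attribute names (raw_header, defects) follow the
discussion; the validation rules and the use of plain strings for
defects are illustrative assumptions, not a settled design:

```python
class StringHeader:
    """Hypothetical text-mode header, for illustration only."""

    def __init__(self, name, value):
        # Programmatic construction: bad input is treated as a caller error.
        if not isinstance(name, str) or not isinstance(value, str):
            raise TypeError("StringHeader takes text, not bytes")
        if ":" in name:
            raise ValueError("invalid header name: %r" % (name,))
        self.name = name
        self.value = value
        self.raw_header = None   # None means "built by the program"
        self.defects = []

    @classmethod
    def from_full_header(cls, raw):
        # Parsing path: malformed input is recorded as a defect, not raised.
        defects = []
        name, sep, value = raw.partition(":")
        if not sep:
            defects.append("no colon in header line")
            name, value = "", raw
        header = cls(name.strip(), value.strip())
        header.raw_header = raw  # the 'raw data' is whatever was fed in
        header.defects = defects
        return header


good = StringHeader.from_full_header("Subject: hello")
bad = StringHeader.from_full_header("no colon here")
```

A BytesHeader would look the same with bytes in place of text.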

> The "charset" parameter... is that not mostly needed for data parts?

No, if you start with a unicode string in a StringHeader, you need to
know what charset to encode the unicode to and therefore to specify as
the charset in the RFC 2047 encoded words.

> Headers are either ASCII, or contain self-describing charset info.

That's true for BytesHeaders, but not for StringHeaders.  So as I
said above charset for StringHeader says which charset to put into
the encoded words when converting to BytesHeader form.
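
The role of the charset can be seen with the existing email.header
module, which already does this kind of conversion.  This only
illustrates the charset's role; the proposed StringHeader.encode is
hypothetical and may behave differently:

```python
from email.header import Header, decode_header

text_value = "Caf\u00e9 menu"

# The charset decides both how the unicode is turned into bytes and
# which charset name is written into the RFC 2047 encoded word.
encoded = Header(text_value, charset="utf-8").encode()

# The result is pure ASCII, so it can be put on the wire as bytes.
wire_bytes = encoded.encode("ascii")

# Going the other way, the charset is read back out of the encoded word.
decoded_bytes, charset = decode_header(encoded)[0]
round_tripped = decoded_bytes.decode(charset)
```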

I specified a charset parameter for 'decode' only to handle the case
of raw bytes data that contains 8 bit data that is not in encoded words
(ie: is not RFC compliant).  I am visualizing this as satisfying a use
case where you have non-email (non RFC compliant) data where you allow
8 bit data in the header bodies because it's an internal app and you
know the encoding.  You can then use decode(charset) to decode those
BytesHeaders into StringHeaders.
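
A rough sketch of that use case, using a hypothetical helper rather
than a real BytesHeader.decode.  It takes the "record a defect" option
discussed earlier; raising instead would be the other choice:

```python
def decode_raw_value(raw, charset="ascii"):
    """Hypothetical helper: decode bare header bytes to text.

    Returns (text, defects).  With the default charset, any 8 bit byte
    is reported as a defect and shown as U+FFFD in the text.
    """
    try:
        return raw.decode(charset), []
    except UnicodeDecodeError as exc:
        text = raw.decode(charset, errors="replace")
        return text, ["undecodable bytes: %s" % exc.reason]


# Internal app that knows its header bodies are latin-1:
known_text, known_defects = decode_raw_value(b"caf\xe9", "latin-1")

# Same bytes with the ASCII default: the 8 bit byte becomes a defect.
default_text, default_defects = decode_raw_value(b"caf\xe9")
```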

> I guess I could see an intermediate decode from string to some charset, 
> before serialization, as a hint that when generating headers, that all 
> the characters in the header that are not ASCII are in the specified 
> charset... and that that charset is the one to be used in the 
> self-describing serialized ASCII stream?  The full generality of the 

Exactly.

> RFCs, however,
> allows pieces of headers to be encoded using different charsets... with 
> this API, it would seem that that could only be created containing one 
> charset... the serialization primitives were made available, so that 
> piecewise construction of a header value could be done with different 
> charsets, and then the from_full_header API used to create the complex 
> value.  I don't see this as a severe limitation, I just want to 
> understand your intention, and document the limitation, or my 
> misunderstanding.

Right.  I'm visualizing the "normal case" being encoding a StringHeader
using the default utf-8 charset or another specified charset, turning
the words containing non-ASCII characters into encoded words using that
charset.  The utility methods that turn unicode into encoded words would
be exposed, and an application that needs to create a header with mixed
charsets can use those utilities to build RFC compliant bytes data and
pass that to one of the BytesHeader constructors.  (Make the common case
easy, and the complicated cases possible.)
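
For the mixed-charset case, the existing email.header.Header class
already allows piecewise construction, which gives an idea of what
such utilities could look like.  The proposed utilities are not
specified here, so treat this as an analogy only:

```python
from email.header import Header, decode_header

# Build one header value out of pieces in different charsets.
header = Header()
header.append("Gr\u00fc\u00dfe", charset="iso-8859-1")
header.append("\u041f\u0440\u0438\u0432\u0435\u0442", charset="koi8-r")

# The encoded form is ASCII, with one encoded word per charset.
encoded = header.encode()
wire_bytes = encoded.encode("ascii")

# Each piece carries its own charset when parsed back.
parts = decode_header(encoded)
charsets = [charset for _data, charset in parts]
texts = [data.decode(charset) for data, charset in parts]
```

The resulting bytes are what an application would hand to a
bytes-based constructor such as BytesHeader.from_full_header.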

> > BytesHeader would be exactly the same, with the exception of the signature
> > for serialize and the fact that it has a 'decode' method rather than an
> > 'encode' method.  Serialize would be different only in the fact that
> > it would have an additional keyword parameter, must_be_7bit=True.
> 
> I am not clear on why StringHeader's serialize would not need the  
> must_be_7bit  parameter... or do I misunderstand that 
> StringHeader.serialize produces wire-format data?

The latter.  StringHeader serialize does not produce wire-format data,
it produces text (for example, for display to the user).  If you want
wire format, you encode the StringHeader and use the resulting BytesHeader
serialize.
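
Sketched with hypothetical classes, the two serialize methods and the
path from text to wire format might look like the following.  Line
folding is omitted, and email.header.Header stands in for whatever
RFC 2047 encoder the real implementation would use:

```python
from email.header import Header


class BytesHeader:
    """Hypothetical bytes-mode header, for illustration only."""

    def __init__(self, name, value):
        self.name = name
        self.value = value

    def serialize(self, must_be_7bit=True):
        # Wire format: refuse bare 8 bit data unless the caller allows it.
        line = self.name + b": " + self.value
        if must_be_7bit and any(byte > 127 for byte in line):
            raise ValueError("8 bit data in header")
        return line


class StringHeader:
    """Hypothetical text-mode header, for illustration only."""

    def __init__(self, name, value):
        self.name = name
        self.value = value

    def serialize(self):
        # Text for display, not wire format.
        return "%s: %s" % (self.name, self.value)

    def encode(self, charset="utf-8"):
        # RFC 2047 encode the value, yielding a BytesHeader.
        encoded = Header(self.value, charset=charset).encode()
        return BytesHeader(self.name.encode("ascii"), encoded.encode("ascii"))


subject = StringHeader("Subject", "Caf\u00e9")
display_text = subject.serialize()
wire_format = subject.encode().serialize()
```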

> > The magic of this approach is in those encode/decode methods.
> >
> > Encoding a StringHeader would yield a BytesHeader containing the same
> > data, but encoded per RFC2047 using the specified charset.  Decoding a
> > BytesHeader would yield a StringHeader with the same data, but decoded to
> > unicode per RFC2047, with any 8bit parts decoded (in the unicode sense,
> > not the RFC2047 sense) using the specified charset (which would default to
> > ASCII, meaning bare 8bit bytes in headers would throw an error).  (What to
> > do with RFC2047 charsets like unknown-8bit is an open question...probably
> > throw an error).
> >    
> 
> Would the encoding to/from StringHeader/BytesHeader preserve the  
> from_full_header  state and value?

My thought is no.  Once you encode/decode the header, your program has
transformed it, and I think it is better to treat the original raw data
as gone.  The motivation for this is that the 'raw data' of a StringHeader
is the *text* string used to create it.  Keeping a bytes string 'raw data'
around as well would get us back into the mess that I developed this
approach to avoid, where we'd need to specify carefully the difference
between handling a header whose 'original' raw data was bytes vs string,
for each of the BytesHeader and StringHeader cases.  Better, I think,
to put the (small) burden on the application programmer: if you want to
preserve the original input data, do so by keeping the original object
around.  Once you mutate the object model, the original raw data for
the mutated piece is gone.

There are some use-case questions here, though, with regards to
preservation of as much original information/format as possible, and how
valuable that is.  I think we'll have to figure that out by examining
concrete use cases in detail.  (It is not something that the current email
package supports very well, by the way...headers currently get modified
significantly in the parse/generate cycle, even without bytes-to-string
transformations happening.)

> > (Encoding or decoding a Message would cause the Message to recursively
> > encode or decode its subparts.  This means you are making a complete
> > new copy of the Message in memory.  If you don't want to do that you
> > can walk the Message and convert it piece by piece (we could provide a
> > generator that does this).)
> 
> Walking it piece by piece would allow the old pieces to be discarded, to 
> save total memory consumption, where that is appropriate.
> 
> Perhaps one generator that would be commonly used, would be to convert 
> headers only, and leave MIME data parts alone, accessing and converting 
> them only with the registered methods?  This would mean that a "complete 
> copy" wouldn't generally be very big, if the data parts were excluded 
> from implicit conversion.  Perhaps the "external storage protocol" might 
> also only be defined for MIME data parts, and walking the tree with this 
> generator would not need to reference the MIME data parts, nor bring 
> them in from "external storage".

That's true.  The Bytes and String versions of binary MIME parts,
which are likely to be the large ones, will probably have a common
representation for the payload, and could potentially point to the same
object.  Breaking the expectation that 'encode' and 'decode'
return new objects (in analogy to how encode and decode of strings/bytes
works) might not be a good thing, though.

In any case, text MIME parts have the same bytes vs string issues as
headers do, and should, IMO, be converted from one to the other on
encode/decode.

Another possible approach would be some sort of 'encode/decode on demand'
system, although that would need to retain a pointer to the original
object, which might get us into suboptimal reference cycle difficulties.

These bits are implementation details, though, and don't affect the API
design.
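
As an illustration of the "headers only" generator idea, here is a
sketch against the current email package.  The generator name is made
up, and the real API would operate on the proposed Bytes/String
objects rather than on today's Message class:

```python
import email
from email.header import decode_header, make_header


def iter_decoded_headers(msg):
    """Yield (part, [(name, text_value), ...]) for each part of msg.

    Only headers are converted; payloads are never read or copied.
    """
    for part in msg.walk():
        yield part, [
            (name, str(make_header(decode_header(value))))
            for name, value in part.items()
        ]


raw = (
    "Subject: =?utf-8?b?Q2Fmw6k=?=\n"
    "MIME-Version: 1.0\n"
    'Content-Type: multipart/mixed; boundary="XX"\n'
    "\n"
    "--XX\n"
    "Content-Type: text/plain\n"
    "\n"
    "body\n"
    "--XX--\n"
)
msg = email.message_from_string(raw)
decoded = [headers for _part, headers in iter_decoded_headers(msg)]
```

Because the generator yields one part at a time, a caller can drop
each converted piece before moving to the next.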

> > raw_header would be the data passed in to the constructor if
> > from_full_header is used, and None otherwise.  If encode/decode call
> > the regular constructor, then this attribute would also act as a flag
> > as to whether or not the header was constructed from raw input data
> > or via program.
> >    
> 
> This _implies_ that  from_full_header always accepts raw data bytes... 
> even for the StringHeader.  And that implies the need for an implicit 
> decode, and therefore, perhaps a charset parameter?  No, not a charset 
> parameter, since they are explicitly contained in the header values.

Your confusion stems from my confusing use of the term 'raw data' to mean
whatever was input to the from_full_header constructor, which is
bytes for a BytesHeader and text for a StringHeader.

> Decode for header values may not need a charset value at all!

Normally it would not.  charset would be useful in decode only for
non-RFC compliant headers.

> No comments for the rest.

Thanks for your feedback.

--David