[Email-SIG] Thoughts on the general API, and the Header API.

Glenn Linderman glenn at nevcal.com
Tue Jan 26 05:10:01 CET 2010


On approximately 1/25/2010 6:51 PM, came the following characters from 
the keyboard of R. David Murray:
> On Mon, 25 Jan 2010 16:55:15 -0800, Glenn Linderman<v+python at g.nevcal.com>  wrote:
>    
>> Are there any other *Header APIs that would be required not to produce
>> exceptions?  I don't yet perceive any.
>>      
> I don't think so.  from_full_header is the only one involved in parsing
> raw data.  Whether __init__ throws errors or records defects is an open
> question, but I lean toward it throwing errors.  The reason there is an
> open question is because an email manipulating application may want to
> convert to text to process an incoming message, and there are things
> that a BytesHeader can hold that would cause errors when encoded to a
> StringHeader (specifically, 8 bit bytes that aren't transfer encoded).
> So it may be that decode, at least, should not throw errors but instead
> record additional defects in the resulting StringHeader.  I think that
> even in that case __init__ should still throw errors, though; decode
> could deal with the defects before calling StringHeader.__init__, or
> (more likely) catch the errors throw by __init__, fix/record the defects,
> and call it again.
>
> Note, by the way, that by 'raw data' I mean what you are feeding in.
> Raw data fed to a BytesHeader would be bytes, but raw data fed to
> a StringHeader would be text (eg: if read from a file in text mode).
>    

Glad you clarified that; it wasn't obvious without typed parameters to 
the APIs.
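
The error-handling split described above -- a strict __init__ that 
throws, with decode catching those errors, recording defects, and 
calling __init__ again -- could be sketched roughly like this (all 
class and method names here are just my reading of the proposal, not a 
real implementation):

```python
# Hypothetical sketch: decode() tolerates bad bytes by recording defects,
# while StringHeader.__init__ itself stays strict and raises.
class HeaderDefect(Exception):
    pass

class StringHeader:
    def __init__(self, name, value, defects=()):
        if '\x00' in value:          # stand-in for "invalid content" checks
            raise HeaderDefect('NUL byte in header value')
        self.name = name
        self.value = value
        self.defects = list(defects)

class BytesHeader:
    def __init__(self, name, value):
        self.name = name
        self.value = value           # raw bytes, possibly non-ASCII

    def decode(self, charset='ascii'):
        try:
            return StringHeader(self.name, self.value.decode(charset))
        except UnicodeDecodeError as e:
            # Fix the value, record the defect, and call __init__ again --
            # __init__ itself never has to tolerate bad input.
            text = self.value.decode(charset, errors='replace')
            return StringHeader(self.name, text,
                                defects=['undecodable bytes: %s' % e.reason])

s = BytesHeader('Subject', b'caf\xe9').decode()
print(s.value, s.defects)            # 0xE9 is not ASCII: defect recorded
```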

I had assumed that serialize and from_full_header would produce/consume 
bytes, and I think that showed up in my comments, and you've probably 
addressed that below.  Of course, the reason that I assumed that, is 
that there are no RFCs to describe a string format email message, either 
on the wire, in memory, or, particularly, stored in a file.  So it is 
really up to the application to define that, if it wants that.  Now 
since py3 has a natural string format manipulation capability, and since 
the email lib wants to provide the interface between them, I suppose it 
is a somewhat obvious thing that you might want to store a whole email 
message in string format... I say somewhat obvious, because you thought 
of it, but I didn't, until you clarified the above.

Perhaps the reason I didn't think of it, is simply that all the 
currently used email message storage containers of which I am aware use 
wire format.  So using string format for that purpose would require 
inventing a new storage container (perhaps a trivial extension of an 
existing one, but new, nonetheless).  I sort of expected email clients 
would, given the capabilities of the email lib, simply continue to 
save/read in wire format.  In fact, it may be the only choice of format 
that can completely preserve raw format messages for later processing, 
in the presence of defects.

>> The "charset" parameter... is that not mostly needed for data parts?
>>      
> No, if you start with a unicode string in a StringHeader, you need to
> know what charset to encode the unicode to and therefore to specify as
> the charset in the RFC 2047 encoded words.
>
>    
>> Headers are either ASCII, or contain self-describing charset info.
>>      
> That's true for BytesHeaders, but not for StringHeaders.  So as I
> said above charset for StringHeader says which charset to put into
> the encoded words when converting to BytesHeader form.
>
> I specified a charset parameter for 'decode' only to handle the case
> of raw bytes data that contains 8 bit data that is not in encoded words
> (ie: is not RFC compliant).  I am visualizing this as satisfying a use
> case where you have non-email (non RFC compliant) data where you allow
> 8 bit data in the header bodies because it's an internal app and you
> know the encoding.  You can then use decode(charset) to decode those
> BytesHeaders into StringHeaders.
>
>    
>> I guess I could see an intermediate decode from string to some charset,
>> before serialization, as a hint that when generating headers, that all
>> the characters in the header that are not ASCII are in the specified
>> charset... and that that charset is the one to be used in the
>> self-describing serialized ASCII stream?  The full generality of the
>>      
> Exactly.
>    

OK, I'm with you now on the charset parameter, for encoding and decoding.
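
To restate the two roles of the charset parameter in toy code (function 
names are hypothetical, and real RFC 2047 handling is more involved -- 
line folding and q-encoding are ignored here):

```python
import base64

def encode_header_value(text, charset='utf-8'):
    """StringHeader -> wire-safe value: words with non-ASCII characters
    become RFC 2047 encoded words using the *given* charset."""
    words = []
    for word in text.split(' '):
        if word.isascii():
            words.append(word)
        else:
            b64 = base64.b64encode(word.encode(charset)).decode('ascii')
            words.append('=?%s?b?%s?=' % (charset, b64))
    return ' '.join(words)

def decode_header_value(raw, charset='ascii'):
    """BytesHeader -> text: the charset only matters for bare 8-bit bytes
    that are *not* already in encoded words (non-RFC-compliant input)."""
    return raw.decode(charset)

print(encode_header_value('hello Perú'))          # second word gets encoded
print(decode_header_value(b'caf\xe9', 'latin-1')) # internal-app 8-bit data
```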


>> RFCs, however,
>> allows pieces of headers to be encoded using different charsets... with
>> this API, it would seem that that could only be created containing one
>> charset... the serialization primitives were made available, so that
>> piecewise construction of a header value could be done with different
>> charsets, and then the from_full_header API used to create the complex
>> value.  I don't see this as a severe limitation, I just want to
>> understand your intention, and document the limitation, or my
>> misunderstanding.
>>      
> Right.  I'm visualizing the "normal case" being encoding a StringHeader
> using the default utf-8 charset or another specified charset, turning
> the words containing non-ASCII characters into encoded words using that
> charset.  The utility methods that turn unicode into encoded words would
> be exposed, and an application that needs to create a header with mixed
> charsets can use those utilities to build RFC compliant bytes data and
> pass that to one of the BytesHeader constructors.  (Make the common case
> easy, and the complicated cases possible.)
>    

Thanks for this clarification also.
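
So the mixed-charset case would go something like this, if I follow 
(make_encoded_word stands in for whichever utility methods actually get 
exposed; the class is a stub):

```python
import base64

def make_encoded_word(text, charset):
    """Hypothetical exposed utility: one unicode word ->
    one RFC 2047 base64 encoded word in the given charset."""
    b64 = base64.b64encode(text.encode(charset)).decode('ascii')
    return '=?%s?b?%s?=' % (charset, b64)

class BytesHeader:
    def __init__(self, raw_header=None):
        self.raw_header = raw_header   # None unless built from raw data

    @classmethod
    def from_full_header(cls, raw):
        return cls(raw_header=raw)     # raw bytes preserved verbatim

# The application builds the mixed-charset pieces itself...
pieces = [make_encoded_word('café', 'latin-1'),
          make_encoded_word('Grüße', 'utf-8')]
# ...and hands the resulting RFC-compliant bytes to the raw constructor.
h = BytesHeader.from_full_header(' '.join(pieces).encode('ascii'))
print(h.raw_header)
```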


>>> BytesHeader would be exactly the same, with the exception of the signature
>>> for serialize and the fact that it has a 'decode' method rather than an
>>> 'encode' method.  Serialize would be different only in the fact that
>>> it would have an additional keyword parameter, must_be_7bit=True.
>>>        
>> I am not clear on why StringHeader's serialize would not need the
>> must_be_7bit  parameter... or do I misunderstand that
>> StringHeader.serialize produces wire-format data?
>>      
> The latter.  StringHeader serialize does not produce wire-format data,
> it produces text (for example, for display to the user).  If you want
> wire format, you encode the StringHeader and use the resulting BytesHeader
> serialize.
>    

OK, I'm with you here now too.  So it may be nice to have a recursive 
operation that would convert String format stuff to Bytes and then to 
wire format, in one go, discarding the intermediate Bytes format stuff 
along the way to avoid three copies of the data, for simple email 
clients that only use the String format interfaces.
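
Such a one-step operation could be layered on top of the two classes 
along these lines (a deliberately simplified sketch: this encode skips 
RFC 2047 encoded-word generation entirely, and to_wire is an invented 
name):

```python
class BytesHeader:
    def __init__(self, name, value):
        self.name, self.value = name, value
    def serialize(self, must_be_7bit=True):
        line = b'%s: %s' % (self.name, self.value)
        if must_be_7bit and any(b > 127 for b in line):
            raise ValueError('8-bit data in 7-bit serialization')
        return line + b'\r\n'

class StringHeader:
    def __init__(self, name, value):
        self.name, self.value = name, value
    def serialize(self):
        # Text output, e.g. for display to the user -- *not* wire format.
        return '%s: %s\n' % (self.name, self.value)
    def encode(self, charset='utf-8'):
        return BytesHeader(self.name.encode('ascii'),
                           self.value.encode(charset))
    def to_wire(self, charset='utf-8'):
        # One-step path: the intermediate BytesHeader is garbage right away.
        return self.encode(charset).serialize()

h = StringHeader('Subject', 'hello')
print(repr(h.serialize()))   # text form
print(repr(h.to_wire()))     # wire form
```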


>>> The magic of this approach is in those encode/decode methods.
>>>
>>> Encoding a StringHeader would yield a BytesHeader containing the same
>>> data, but encoded per RFC2047 using the specified charset.  Decoding a
>>> BytesHeader would yield a StringHeader with the same data, but decoded to
>>> unicode per RFC2047, with any 8bit parts decoded (in the unicode sense,
>>> not the RFC2047 sense) using the specified charset (which would default to
>>> ASCII, meaning bare 8bit bytes in headers would throw an error).  (What to
>>> do with RFC2047 charsets like unknown-8bit is an open question...probably
>>> throw an error).
>>>        
>> Would the encoding to/from StringHeader/BytesHeader preserve the
>> from_full_header  state and value?
>>      
> My thought is no.  Once you encode/decode the header, your program has
> transformed it, and I think it is better to treat the original raw data
> as gone.  The motivation for this is that the 'raw data' of a StringHeader
> is the *text* string used to create it.  Keeping a bytes string 'raw data'
> around as well would get us back into the mess that I developed this
> approach to avoid, where we'd need to specify carefully the difference
> between handling a header whose 'original' raw data was bytes vs string,
> for each of the BytesHeader and StringHeader cases.  Better, I think,
> to put the (small) burden on the application programmer: if you want to
> preserve the original input data, do so by keeping the original object
> around.  Once you mutate the object model, the original raw data for
> the mutated piece is gone.
>
> There are some use-case questions here, though, with regards to
> preservation of as much original information/format as possible, and how
> valuable that is.  I think we'll have to figure that out by examining
> concrete use cases in detail.  (It is not something that the current email
> package supports very well, by the way...headers currently get modified
> significantly in the parse/generate cycle, even without bytes-to-string
> transformations happening.)
>    

Not every transformation is intended to be a change.  Until there is a 
change, it would be nice to be able to retain the original byte stream, 
for invertibility, without requiring that a simple email client deal 
with bytes interfaces for RFC conformant messages.

I hear you regarding the mess... here's a brainstorming idea, tossed 
out mostly to get your creative juices flowing in this direction, not 
because I think it is "definitely the way to go".  The decode API could, 
in addition to your description, have an option to preserve itself and 
the decode charset, within the String object... If encode "discovers" a 
preserved Bytes object, and the same charset is provided, it would 
return the preserved Bytes object, rather than creating a new one.  
There may be no need to drop the Bytes object explicitly, as it seems 
the only API for making changes to a Header object is to create a new 
one, and substitute the new one for the old one.  Or maybe 
from_full_header does a modify.  Or maybe the properties are assignable 
(that is not explicitly stated, by the way).  So if there are modify 
operations, they should drop the Bytes object.
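
For what it's worth, the preserve-and-return idea could be as small as 
this (attribute names invented; a brainstorm sketch, not a proposal for 
the actual API surface):

```python
class BytesHeader:
    def __init__(self, raw):
        self.raw = raw

class StringHeader:
    def __init__(self, value, _source=None, _charset=None):
        self.value = value
        self._source = _source       # preserved BytesHeader, if any
        self._charset = _charset

    def encode(self, charset='utf-8'):
        if self._source is not None and charset == self._charset:
            return self._source      # bit-for-bit invertible, no re-encode
        return BytesHeader(self.value.encode(charset))

def decode(bytes_header, charset='utf-8', preserve=True):
    text = bytes_header.raw.decode(charset)
    if preserve:
        return StringHeader(text, _source=bytes_header, _charset=charset)
    return StringHeader(text)

b = BytesHeader(b'caf\xc3\xa9')
s = decode(b, 'utf-8')
print(s.encode('utf-8') is b)        # the original object comes back
```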


>>> (Encoding or decoding a Message would cause the Message to recursively
>>> encode or decode its subparts.  This means you are making a complete
>>> new copy of the Message in memory.  If you don't want to do that you
>>> can walk the Message and convert it piece by piece (we could provide a
>>> generator that does this).)
>>>        
>> Walking it piece by piece would allow the old pieces to be discarded, to
>> save total memory consumption, where that is appropriate.
>>
>> Perhaps one generator that would be commonly used, would be to convert
>> headers only, and leave MIME data parts alone, accessing and converting
>> them only with the registered methods?  This would mean that a "complete
>> copy" wouldn't generally be very big, if the data parts were excluded
>> from implicit conversion.  Perhaps the "external storage protocol" might
>> also only be defined for MIME data parts, and walking the tree with this
>> generator would not need to reference the MIME data parts, nor bring
>> them in from "external storage".
>>      
> That's true.  The Bytes and String versions of binary MIME parts,
> which are likely to be the large ones, will probably have a common
> representation for the payload, and could potentially point to the same
> object.  That breaking of the expectation that 'encode' and 'decode'
> return new objects (in analogy to how encode and decode of strings/bytes
> works) might not be a good thing, though.
>    

Well, one generator could provide the expectation that everything is 
new; another could provide different expectations.  The differences 
between them, and the tradeoffs, would be documented, of course, were 
both provided.  I'm not convinced that treating headers and data exactly 
the same at all times is a good thing... a convenient option at times, 
perhaps, but I can see it as a serious inefficiency in many use cases 
involving large data.
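
A headers-only conversion generator of the kind I mean might look like 
this (Part is a stand-in for whatever the real message-node class ends 
up being):

```python
class Part:
    def __init__(self, headers, payload, is_binary):
        self.headers = headers       # dict of name -> bytes
        self.payload = payload
        self.is_binary = is_binary

def decode_headers_only(parts, charset='utf-8'):
    """Yield converted parts: headers become strings, but binary MIME
    payloads are passed through untouched (and thus never copied)."""
    for part in parts:
        headers = {name: raw.decode(charset)
                   for name, raw in part.headers.items()}
        payload = (part.payload if part.is_binary
                   else part.payload.decode(charset))
        yield Part(headers, payload, part.is_binary)

blob = b'\x00' * 1024                # imagine megabytes of image data
msg = [Part({'Content-Type': b'image/png'}, blob, True)]
(converted,) = decode_headers_only(msg)
print(converted.payload is blob)     # large data not duplicated
```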

This deserves a bit more thought/analysis/discussion, perhaps.  More 
than I have time for tonight, but I may reply again, perhaps after 
others have responded, if they do.

> In any case, text MIME parts have the same bytes vs string issues as
> headers do, and should, IMO, be converted from one to the other on
> encode/decode.
>    

To me, your first phrase implies that they should share common 
encode/decode routines; your second does not.  I can clearly see a use case 
where your opinion is the right approach, but I think there are use 
cases where it might not be... while text MIME parts are generally 
smaller than binary MIME parts, that is neither a requirement, nor 
always true (think about transferring an XML format database... could be 
huge... and is text of sorts -- human decipherable, more easily than hex 
dumps, but not what I would call "human readable").


> Another possible approach would be some sort of 'encode/decode on demand'
> system, although that would need to retain a pointer to the original
> object, which might get us into suboptimal reference cycle difficulties.
>    

Hmm.  Brainstorming again.  decode could minimally create the String 
format object, with only the Bytes format object and charset parameter 
set (from the above brainstorming idea).  Then the real decoding could 
be done if the properties are accessed.  If the properties are not 
accessed (because the client/application makes its decisions based on 
access to other components of the email), the decoding need never be 
done for some objects.  Perhaps this would also neatly deal with my 
desire to delay the decode of MIME data parts as well?
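
Minimally, something like this (again just a brainstorm sketch, and 
LazyStringHeader is an invented name):

```python
class LazyStringHeader:
    def __init__(self, bytes_value, charset='utf-8'):
        self._bytes = bytes_value    # the retained Bytes-side data
        self._charset = charset
        self._value = None           # not decoded yet

    @property
    def value(self):
        # First access triggers the real decode; later accesses are free.
        if self._value is None:
            self._value = self._bytes.decode(self._charset)
        return self._value

    @property
    def decoded(self):
        return self._value is not None

h = LazyStringHeader(b'hello')
print(h.decoded)                     # nothing decoded yet
print(h.value)                       # now it is
```

Of course, _bytes here is exactly the retained pointer to the original 
object that you note could get us into reference cycle difficulties.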

> These bits are implementation details, though, and don't affect the API
> design.
>    

Well, one impact of the above brainstorming would be an interface to 
create the StringHeader containing the BytesHeader and charset 
parameters.  Or maybe that would be a private interface, not considered 
to be part of the API?


>>> raw_header would be the data passed in to the constructor if
>>> from_full_header is used, and None otherwise.  If encode/decode call
>>> the regular constructor, then this attribute would also act as a flag
>>> as to whether or not the header was constructed from raw input data
>>> or via program.
>>>
>>>        
>> This _implies_ that  from_full_header always accepts raw data bytes...
>> even for the StringHeader.  And that implies the need for an implicit
>> decode, and therefore, perhaps a charset parameter?  No, not a charset
>> parameter, since they are explicitly contained in the header values.
>>      
> Your confusion was my confusing use of the term 'raw data' to mean
> whatever was input to the from_full_header constructor, which is
> bytes for a BytesHeader and text for a StringHeader.
>    

If we are going to invent a new "string format raw data" element, maybe 
we should invent a term to describe it, also... maybe "raw data" should 
be split into "raw bytes" and "raw string", and "raw data" become a 
synonym for "raw bytes", as that is what it was historically?


-- 
Glenn
------------------------------------------------------------------------
“Everyone is entitled to their own opinion, but not their own facts. In 
turn, everyone is entitled to their own opinions of the facts, but not 
their own facts based on their opinions.” -- Guy Rocha, retiring NV 
state archivist

