[Email-SIG] fixing the current email module

Sat Oct 10 05:46:23 CEST 2009

On approximately 10/9/2009 6:25 PM, came the following characters from 
the keyboard of R. David Murray:
> On Fri, 9 Oct 2009 at 17:54, Glenn Linderman wrote:
>> On approximately 10/9/2009 4:20 PM, came the following characters 
>> from the keyboard of R. David Murray:
>>>  On Fri, 9 Oct 2009 at 13:26, Glenn Linderman wrote:
>>> > On approximately 10/9/2009 8:10 AM, came the following characters
>>> > from the keyboard of Stephen J. Turnbull:
>>> > > Glenn Linderman writes:
>>> > > > > > produce a defect report, but then simply converted to
>>> > > > > > Unicode as if it were Latin-1 (since there is no other
>>> > > > > > knowledge available that could produce a better
>>> > > > > > conversion).
>>> > > > >
>>> > > > > No, that is already corruption.  Most clients will assume
>>> > > > > that string is valid as a header, because it's valid as a
>>> > > > > string.
>>> > > >
>>> > > > Sure it is corruption.  That's why there is a defect report.
>>> > > > But the conversion technique is appropriate, per the Postel
>>> > > > principle.
>>> > >
>>> > > Actually, I would say you are emitting leniently, in violation
>>> > > of the Postel principle.
>>> >
>>> > You can say that, but I don't have to believe it.  I'm talking
>>> > about accepting; the message has arrived, it is here, the client
>>> > is trying to look at it, and I'm talking about ways the client can
>>> > look at not-quite-perfect data, knowing that it is not quite
>>> > perfect, but still being able to see it.  I'm not at all talking
>>> > about emitting data.  You seem to be calling the email package
>>> > helping the client to accept not-quite-perfect data, as a form of
>>> > emitting data.  It is not.
>>>
>>>  IMO, the appropriate way for the email package to provide the API you
>>>  are talking about is to provide the client with a way to get at the
>>>  raw byte string, which I think everyone agrees on.  If the client wants
>>>  to decode it as if it were latin-1 to process it, it can then do that.
>>
>> That certainly works, but it isn't very helpful... that forces the 
>> client application to reproduce the logic to parse the header value 
>> and decode the parts that can be decoded successfully, and that is 
>> exactly the sort of thing Stephen was complaining about when he 
>> thought I was suggesting that to be a requirement (but he was 
>> confused about what I was suggesting).
>
> I wasn't clear, sorry :).  The current API has a 'decode_header' function,
> which doesn't do the byte-to-unicode decode (yeah, there's another naming
> problem here...we have two types of decoding and only one word for both)
> but instead returns (bytes, charset) tuples.  This piece of the API is
> broken in python3, and I don't think it is the right API going forward,
> but that _kind_ of API is what I meant by 'getting at the raw byte
> string':  the byte string that failed the bytes-to-unicode decoding,
> not the entire header (though there will also be a way to get that if
> you need it, I presume.) 

Yeah, that'd be better. 

Of course, when returning Unicode strings, there would be no particular 
need to identify the various charsets in which the header was 
transmitted, except for invertibility and error handling, unless the 
client wanted to track that for some reason.

If the goal is to preserve invertibility, then maybe tuples like (str, 
charset, defect) would be better, where defect would be None for good 
data.  If defect were "non-ASCII", then you'd know the str was converted 
as if it were charset.  [That would be Latin-1 in my book; if the email 
package had rules, or the API had parameters, for how to deal with 
non-ASCII stuff, some other charset could perhaps be specified, but if 
that fails it might still have to fall back to Latin-1.]  If defect were 
"ASCII", then you'd know that the str looked like an encoded word, but 
couldn't be decoded because the charset wasn't recognized, or the 
decoding via that charset failed, so the encoded word itself was 
supplied, unchanged.
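
As a rough illustration of the (str, charset, defect) idea, here is a
minimal sketch.  The function names are hypothetical and are not part of
the email package.  It assumes Latin-1 as the fallback, assumes the
charset field is "ascii" when defect is "ASCII", and skips details such
as dropping the whitespace between adjacent encoded words.

```python
import base64
import quopri
import re

# One RFC 2047 encoded-word: =?charset?encoding?payload?=
# The character classes are printable ASCII without "?".
ENCODED_WORD = re.compile(
    r"=\?([\x21-\x3e\x40-\x7e]+)\?([bBqQ])\?([\x21-\x3e\x40-\x7e]*)\?="
)


def _decode_word(charset, encoding, payload):
    """Decode one encoded-word payload; raises on a bad charset or payload."""
    if encoding.lower() == "b":
        raw = base64.b64decode(payload, validate=True)
    else:
        # header=True also turns "_" into a space, as the Q encoding requires.
        raw = quopri.decodestring(payload.encode("ascii"), header=True)
    return raw.decode(charset)


def _plain(run):
    """Describe a run of text found outside any encoded word."""
    if run.isascii():
        return (run, "ascii", None)
    # Raw 8-bit data: kept as if it were Latin-1, and flagged as a defect.
    return (run, "latin-1", "non-ASCII")


def decode_header_parts(raw):
    """Split raw header bytes into a list of (str, charset, defect) tuples."""
    # Latin-1 maps every byte to one code point, so this cannot fail and
    # can be undone exactly with .encode("latin-1").
    text = raw.decode("latin-1")
    parts = []
    pos = 0
    for match in ENCODED_WORD.finditer(text):
        if match.start() > pos:
            parts.append(_plain(text[pos:match.start()]))
        charset, encoding, payload = match.groups()
        try:
            decoded = _decode_word(charset, encoding, payload)
            parts.append((decoded, charset, None))
        except (LookupError, ValueError):
            # Unknown charset or failed decode: supply the encoded word itself.
            parts.append((match.group(0), "ascii", "ASCII"))
        pos = match.end()
    if pos < len(text):
        parts.append(_plain(text[pos:]))
    return parts
```

For example, decode_header_parts(b"Caf\xe9 =?utf-8?q?caf=C3=A9?=") would
give one tuple with defect "non-ASCII" for the raw 8-bit run, followed
by one tuple with defect None for the encoded word.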

Correspondingly, a header value could be set by supplying such a list, 
even with defect values as described above.  That permits invertibility 
and passing on what was obtained, so that if there are overriding local 
conventions (yep, such things used to be used, and maybe still are in 
some areas), the data would be preserved as well as possible, and the 
email package could support creation of messages according to the local 
conventions.
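
The reverse direction could look something like the sketch below.
Again, encode_header_parts is a hypothetical name, not an existing
function in the email package.  It assumes good non-ASCII data is always
re-encoded with the B encoding, and it treats a tuple with charset ASCII
and defect None as a plain run, so a header that originally used the Q
encoding, or an ASCII encoded word, would not come back byte for byte
unless the tuple also recorded how each word was encoded.

```python
import base64


def encode_header_parts(parts):
    """Rebuild raw header bytes from a list of (str, charset, defect) tuples."""
    chunks = []
    for text, charset, defect in parts:
        if defect is not None or charset.lower() in ("ascii", "us-ascii"):
            # Plain ASCII runs, undecodable encoded words (defect "ASCII"),
            # and raw 8-bit data that was decoded as if it were `charset`
            # (defect "non-ASCII") all go back out byte for byte.
            chunks.append(text.encode(charset))
        else:
            # Good non-ASCII data: emit an encoded word.  This sketch ignores
            # the 75-character limit on encoded words.
            payload = base64.b64encode(text.encode(charset)).decode("ascii")
            chunks.append(f"=?{charset}?b?{payload}?=".encode("ascii"))
    return b"".join(chunks)


parts = [
    ("Caf\u00e9 ", "latin-1", "non-ASCII"),   # raw 8-bit bytes, passed on as is
    ("caf\u00e9", "utf-8", None),             # good data, re-encoded
    (" ", "ascii", None),                     # ASCII run between encoded words
    ("=?x-bogus?q?abc?=", "ascii", "ASCII"),  # undecodable word, passed on as is
]
raw_header = encode_header_parts(parts)
```

The defect tuples are what make passing data through possible here: the
bytes that could not be interpreted on the way in are exactly the bytes
that go back out.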

I'd hope that a separate tuple would be used for each encoded-word, and 
that a tuple with charset ASCII and defect None would describe a run of 
ASCII between encoded words.  Yes, an encoded word can be encoded in 
ASCII for rare use (if the input word looks like an encoded word), so a 
header could contain a sequence of charset ASCII, defect None tuples, 
but otherwise a plain ASCII header value would have a single entry in 
the list of tuples.

-- 
Glenn -- http://nevcal.com/
===========================
A protocol is complete when there is nothing left to remove.
-- Stuart Cheshire, Apple Computer, regarding Zero Configuration Networking