[Email-SIG] fixing the current email module

Fri Oct 9 14:23:23 CEST 2009

On Oct 8, 2009, at 6:50 PM, Glenn Linderman wrote:

> On approximately 10/8/2009 4:40 AM, came the following characters  
> from the keyboard of Stephen J. Turnbull:
>> Glenn Linderman writes:
>>
>> > >  > If conversions are avoided, then octets are unlikely to be
>> > >  > out of range?
>> > >
>> > > Haven't looked in your spam bucket recently, I guess.  Spammers
>> > > regularly put 8 bit characters into headers (and into bodies in
>> > > messages without a Content-Type header), for one thing.
>> >
>> > I'm aware of that, but if conversions are not done, octets are
>> > unlikely to be _reported_ to be out of range....
>>
>> Conversions will eventually be done.  "Best it were done quickly."
>>
>
> Disagree.  Deferring the conversions defers failure issues to the  
> point where the code (hopefully) somewhat understands the type of  
> data being manipulated, and can then handle it appropriately.   
> Converting up front causes errors in things that may never be  
> touched or needed, so the error detection and handling is wasteful.

I'm with Stephen here.  Remember, we're saying the parser should never  
throw an exception, so any such conversion exception happens when you  
manipulate the model directly.  That /has/ to error early because  
otherwise it is impossible to debug.
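
To make the distinction concrete, here is a minimal sketch, not the
email package's actual code.  SketchMessage, parse_headers and
set_header are hypothetical names invented for illustration.  The
parser records defects and keeps the raw octets, while direct
manipulation of the model raises at the call site.

```python
class SketchMessage:
    """Hypothetical model object; not the email package's Message class."""

    def __init__(self):
        self.headers = {}   # header name -> raw bytes, kept as received
        self.defects = []   # problems noticed while parsing

    def set_header(self, name, value):
        # Direct manipulation: fail here, at the call site, so the
        # traceback points at the code that supplied the bad value.
        try:
            raw = value.encode("ascii")
        except UnicodeEncodeError as exc:
            raise ValueError(f"header {name!r} is not ASCII") from exc
        self.headers[name] = raw


def parse_headers(data):
    """Lenient parse: never raises on bad octets, only records defects."""
    msg = SketchMessage()
    for line in data.splitlines():
        name, sep, value = line.partition(b":")
        if not sep:
            msg.defects.append(f"malformed header line: {line!r}")
            continue
        value = value.strip()
        if any(octet > 127 for octet in value):
            msg.defects.append(f"8-bit octets in header {name!r}")
        # The raw octets are stored untouched; nothing is converted yet.
        msg.headers[name.decode("ascii", "replace")] = value
    return msg
```

Under these assumptions a spam message with 8-bit junk in Subject
still parses and simply carries a defect, whereas calling
set_header() with a non-ASCII string fails immediately.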

> So for headers, which are supposed to be ASCII, or encoded via RFC
> rules to ASCII (no 8-bit chars), the discovery of an 8-bit char
> should produce a defect report, and the value should then simply be
> converted to Unicode as if it were Latin-1 (since there is no other
> knowledge available that could produce a better conversion).  And if the
> result of that is not expected by the client (your definition), then  
> the client should either notice the defect report and reject it  
> based on that, or attempt to parse it, and reject it if it  
> encounters unexpected syntax.  After all, this is, for that client,  
> "raw user input" (albeit from a remote source) so fully error  
> checking the input is appropriate.

Sure, but I can also think of lots of other things the client might  
do, including blowing away the header value and substituting their  
own, doing the moral equivalent of a str.replace(), etc. etc.  It's  
not our job to decide.  It's our job to provide the highest fidelity
information we can and the best APIs for clients to do what they want.
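
A rough sketch of what that could look like from the client side,
assuming a hypothetical decode_header_value() helper that returns
the Latin-1 fallback text together with a list of defects.  None of
these names exist in the email package; they only illustrate that
the policy belongs to the client.

```python
def decode_header_value(raw):
    """Return (text, defects) for a raw header value; never raises."""
    try:
        return raw.decode("ascii"), []
    except UnicodeDecodeError:
        # Latin-1 maps every octet to a code point, so this cannot fail.
        return raw.decode("latin-1"), ["8-bit octets in header value"]


def strict_client(raw):
    # Policy 1: notice the defect report and reject the value outright.
    text, defects = decode_header_value(raw)
    return None if defects else text


def substituting_client(raw, fallback="(unreadable header)"):
    # Policy 2: blow away the header value and substitute our own.
    text, defects = decode_header_value(raw)
    return fallback if defects else text


def scrubbing_client(raw):
    # Policy 3: the moral equivalent of a str.replace() on the bad characters.
    text, _ = decode_header_value(raw)
    return "".join(ch if ord(ch) < 128 else "?" for ch in text)
```

All three clients start from the same text and the same defect list;
the library does not pick one of these policies for them.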

> The problem with the APIs that are spelled __str__ and __bytes__ is
> that there is no way to return errors other than exceptions... the
> Python way.  Since the email library is trying to avoid raising
> exceptions in large blocks of its code, it is non-Pythonic (which is
> what Oleg is probably complaining about, in part).  But because it
> needs to avoid exceptions, and is therefore
> non-Pythonic, it may be inappropriate to spell very many of its APIs  
> __str__ and __bytes__, because that is Pythonic, and requires  
> exceptions.  Once you become non-Pythonic in one area, you may have  
> to also be non-Pythonic in some other areas...

As was pointed out in a previous message, we shouldn't be too  
concerned with __str__ and __bytes__ right now.  We'll design
non-magical APIs for everything and they'll do the right thing.  We'll
then alias what seems appropriate as __str__ and __bytes__ and they'll
be as Pythonic as makes sense.  When I say that, I'm thinking about
the semantic differences Message objects currently have in their
dict-like-plus API (which I still think makes perfect practical sense).
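
As a sketch of the "explicit API first, alias later" idea, with
SketchHeader, as_text and as_octets being made-up names rather than
anything the email package defines:

```python
class SketchHeader:
    """Hypothetical header object holding the raw octets of its value."""

    def __init__(self, name, raw_value):
        self.name = name            # str, assumed to be ASCII
        self.raw_value = raw_value  # bytes, exactly as received

    def as_text(self, errors="replace"):
        # Explicit API: the caller chooses how undecodable octets are
        # handled ("replace", "ignore", or "strict" to get an exception).
        return f"{self.name}: {self.raw_value.decode('ascii', errors)}"

    def as_octets(self):
        # Explicit API: return the header exactly as it arrived.
        return self.name.encode("ascii") + b": " + self.raw_value

    # The magic methods are thin aliases for the explicit ones, so
    # str() and bytes() inherit whatever defaults were chosen above.
    __str__ = as_text
    __bytes__ = as_octets
```

With the default of errors="replace", str() on such a header does
not raise for 8-bit input, and a caller who wants the exception can
still ask for it with as_text("strict").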

-Barry
