[Python-3000] email libraries: use byte or unicode strings?

Fri Nov 7 05:58:04 CET 2008

I think we should move the discussion of the pragmatics of the email
module to the email-sig list (as Barry is already doing).  But this is
probably my last post in this discussion until Nov 14 or so; I'm not
sure I'll be connected while I'm in Shanghai.

Due to travel prep, I don't have time to go into detail, but two comments:

 > 1b) If it is returned as Latin-1 decoded Unicode, then once the proper 
 > encoding is intuited, the Unicode data can be reencoded as bytes
 > using Latin-1 (this is a fully reversible, no data loss
 > reencoding), and then decoded properly into Unicode.

This is true, as written.  But it's an answer to the wrong question.
What to *do with* broken data is the app's decision and the app's
responsibility.  IMO the *email* module's responsibility is to inform
the app that it couldn't decode in a conforming way, and provide the
raw data in case the app thinks it can do better.
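
For reference, the Latin-1 round trip described in the quoted point 1b
can be sketched as below.  This is only an illustration: the sample
text and the choice of KOI8-R as the "real" encoding are assumptions
made up for the example, not anything taken from the email module.

```python
# Hypothetical undeclared header bytes; KOI8-R is an assumed example encoding.
raw = "Привет".encode("koi8_r")

# Step 1: decode as Latin-1.  Each byte 0x00-0xFF maps to exactly one
# code point, so this cannot fail, but the result is not meaningful text.
provisional = raw.decode("latin-1")

# Step 2: once the real encoding has been guessed, re-encode as Latin-1
# to get back exactly the original bytes.
recovered = provisional.encode("latin-1")

# Step 3: decode the recovered bytes with the guessed encoding.
text = recovered.decode("koi8_r")

# The round trip is lossless for every possible byte value.
all_bytes = bytes(range(256))
lossless = all_bytes.decode("latin-1").encode("latin-1") == all_bytes
```

The intermediate `provisional` string carries no marker saying that it
is not real text, which is why the round trip does not by itself tell
the app that decoding failed.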

Barry says that it's desirable that the parser *not* raise exceptions.
In that case, returning bytes where Unicode strings are expected is a
way to accomplish all the desiderata.
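
A minimal sketch of that idea, assuming a hypothetical helper named
`decode_or_raw` (it is not part of the email module, and the real
parser may well look different): return a str when decoding succeeds,
and hand back the untouched bytes when it does not, without raising.

```python
from typing import Union


def decode_or_raw(raw: bytes, charset: str) -> Union[str, bytes]:
    """Hypothetical helper: decode if possible, else return the raw bytes."""
    try:
        return raw.decode(charset)
    except (UnicodeDecodeError, LookupError):
        # UnicodeDecodeError: the bytes are not valid in the declared charset.
        # LookupError: the declared charset name is unknown to Python.
        return raw


good = decode_or_raw(b"caf\xc3\xa9", "utf-8")   # valid UTF-8
bad = decode_or_raw(b"caf\xe9", "utf-8")        # Latin-1 bytes mislabeled as UTF-8
unknown = decode_or_raw(b"abc", "x-no-such-charset")

# The app can tell conforming from non-conforming data by type alone.
needs_app_decision = [isinstance(v, bytes) for v in (good, bad, unknown)]
```

The return type itself is the signal: a bytes result tells the app that
conforming decoding was not possible, and the app still has the raw
data if it thinks it can do better.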

 > I've now looked briefly at the email module APIs.  They seem quite 
 > flexible to me.  I don't know what happens under the covers.  It seems 
 > that the API is already set up flexibly enough to handle both bytes and 
 > Unicode!!!  Perhaps it is just the implementation that should be 
 > adjusted.

Well, yes, that's what everybody is hoping for.  I agree with Barry's
assessment that at most minor, backward-compatible changes should do
the trick, so including email in Python 3.0 is OK IMO.  However,
Barry has already said that he has looked at trying to fix some of the
known issues, and he's not sure it can be done without an API break.