[Python-3000] email libraries: use byte or unicode strings?

Wed Nov 5 23:45:07 CET 2008

On approximately 11/5/2008 12:38 PM, came the following characters from 
the keyboard of Barry Warsaw:
> On Oct 30, 2008, at 6:17 PM, Andrew McNamara wrote:
> 
>> That's a trickier case, but I think it should use bytes internally. One of
>> the early goals of email was that it be able to cope with malformed MIME -
>> this includes incorrectly encoded messages. So I think it must keep a
>> bytes representation internally.
>>
>> However - charset encoding is part of the MIME spec, so users have a
>> reasonable expectation that the mime lib will present them with unicode.
>> So the API needs to be unicode.
>>
>>> The latter doesn't though, and it needs a lot of work (we tried and 
>>> failed
>>> at pycon).
>>
>> Yes, it's hard. I think we're going to have to break the API.
> 
> I did make a start on a new API for email to work better with bytes and 
> unicode.  I didn't get that far before other work intruded.  My current 
> thinking is that you need separate APIs where appropriate to access 
> email content as unicodes (or decoded data in general).  For example, 
> normally headers and their values would be bytes, but there would be an 
> API to retrieve the decoded values as unicodes.
> 
> Similarly, where get_payload() now takes a 'decoded' option, there would 
> be a separate API for retrieving the decoded payload.  This is a bit 
> trickier because depending on the content-type, you might want a 
> unicode, or an image, or a sound file, etc.
> 
> Another tricky issue is how to set these things.  We have to get in the 
> habit of writing
> 
>     message[b'Subject'] = b'Hello'
> 
> but that's really gross, and of course email_from_string() would have to 
> become email_from_bytes().  Maybe the API accepts unicode strings but 
> only if they are ASCII?
> 
> There are lots of other problems with the email package, and while it's 
> made my life much better on the whole, it is definitely in need of 
> improvement.  Unfortunately, I don't see myself having much time to 
> attack it in the near future.  Maybe we can make it a Pycon sprint 
> (instead of spending all that time on the bzr experiment ;), or, if 
> someone else wants to lead the dirty work, I would definitely pitch in 
> with my thoughts on API and implementation.

I would find

	message[b'Subject'] = b'Hello'

to be totally gross.

While RFC email is all ASCII unless 8bit transfer is legal, there 
are internal encodings that permit the expression of Unicode in 
nearly any component of the email, except for header identifiers.  But 
there are never Unicode characters in the transfer, as they always get 
encoded (there can be UTF-8 byte sequences, of course, if 8bit transfer 
is legal; if it is not, then even UTF-8 byte sequences must be further 
encoded).
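
To illustrate the point about internal encodings, here is a minimal 
sketch using the standard library's email.header module.  The subject 
text is made up for the example; the only claim is that the encoded 
form is pure ASCII and decodes back to the original Unicode.

```python
from email.header import Header, decode_header

# A non-ASCII header value is carried as an RFC 2047 encoded-word,
# so only ASCII characters appear in the transfer.
encoded = Header("Grüße", charset="utf-8").encode()
wire_bytes = encoded.encode("ascii")  # succeeds: the encoded form is ASCII

# Decoding yields the raw bytes plus the charset needed to interpret them.
raw, charset = decode_header(encoded)[0]
decoded = raw.decode(charset)
print(encoded, "->", decoded)
```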

Depending on the level of email interface, there should be no interface 
that cannot be expressed in terms of Unicode, plus an encoding to use 
for the associated data.  Even 8-bit binary can be translated into a 
sequence of Unicode codepoints with the same numeric value, for example. 
That isn't particularly efficient, though, so providing a few 
interfaces that accept binary blobs to encode in various ways would be 
handy.  Of course binary data should allow specification of an 
associated encoding also.
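
The byte-to-codepoint translation mentioned above can be sketched with 
the latin-1 codec, which maps byte N to codepoint N.  The base64 line 
is just one assumed example of what a blob-accepting interface might do 
with the data instead.

```python
import base64

blob = bytes(range(256))  # every possible 8-bit value

# latin-1 maps byte N to codepoint N, so the mapping is lossless both ways.
as_text = blob.decode("latin-1")
round_tripped = as_text.encode("latin-1")

# A blob-accepting interface could instead encode the bytes directly,
# for example as base64, which yields ASCII-only output.
as_base64 = base64.b64encode(blob)
```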

I haven't looked at the details of the Python libraries yet, but it is a 
subject I eventually want to get familiar with, as I've written Perl 
scripts to read and write email, and have tweaked a couple of open source 
email clients a bit.  The Python POP, IMAP, SMTP and NNTP libraries sound 
like they raise the level of abstraction a bit, and should make it even 
easier to read and write email.

So many projects, so many ideas, but limited time :(  Helping with this 
would be something I would really enjoy, but I'm significantly 
backlogged at present.

Maybe I should outline what would be nice to see, before delving into 
the interfaces.  This could be helpful if you invent a new interface; 
maybe some of the ideas would help avoid designs that require the above 
grossness.  Alternately, perhaps viewing these comments as an extremely 
high-level set of expectations could help view the existing interfaces 
in a way that can achieve these goals, without major rework, even if 
there are a few warts.

I'll speak in terms of creating and sending a message, but receiving 
should be similar, and simpler (because encoding choices were already 
made, and only need to be decoded).

It would be nice to specify, at "message creation" time, the preferred 
types of encodings that should be used, and then not have to think about 
the encodings any more, and just provide Unicode at the interfaces (or 
binary for certain blobs).
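
As a rough sketch of that idea, the email.policy and EmailMessage APIs 
found in later Python 3 releases (they did not exist when this was 
written) let the encoding preference be fixed once at creation.  The 
addresses, text and attachment below are invented for illustration.

```python
from email import policy
from email.message import EmailMessage

# Encoding preferences are chosen once, when the message is created.
prefs = policy.SMTP.clone(cte_type="7bit")
msg = EmailMessage(policy=prefs)

# From here on, only Unicode strings cross the interface...
msg["Subject"] = "Grüße aus Köln"
msg["To"] = "recipient@example.com"
msg.set_content("Schöne Grüße!\n")

# ...except for binary blobs, which get their own entry point.
msg.add_attachment(b"\x00\x01\xff", maintype="application",
                   subtype="octet-stream", filename="blob.bin")

wire = msg.as_bytes()  # all encoding decisions are applied by the library
```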

It is not clear that the message should be encoded on the fly; rather, 
it should probably be encoded after negotiation with the server, once 
it is known whether 8bit transfer is legal.
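
One way to picture deferred encoding, again assuming the later 
email.policy API: keep the content as Unicode and only build the 
bytestream once the server's answer is known.  build_message() and its 
supports_8bitmime flag are hypothetical names; in practice the flag 
might come from smtplib's SMTP.has_extn("8bitmime") after ehlo().

```python
from email import policy
from email.message import EmailMessage

def build_message(subject, body, supports_8bitmime):
    """Encode only after the server's capabilities are known (hypothetical helper)."""
    cte = "8bit" if supports_8bitmime else "7bit"
    msg = EmailMessage(policy=policy.SMTP.clone(cte_type=cte))
    msg["Subject"] = subject
    msg.set_content(body)
    return msg.as_bytes()

# Same Unicode input, two different bytestreams.
wire_8bit = build_message("Hello", "Schöne Grüße!\n", supports_8bitmime=True)
wire_7bit = build_message("Hello", "Schöne Grüße!\n", supports_8bitmime=False)
```

With 8bit allowed, the body can carry raw UTF-8 byte sequences; without 
it, the library falls back to an ASCII-only transfer encoding.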

Once the message is complete, it should be retrievable as a blob 
(perhaps pickle is appropriate, or just any of the possible email 
bytestreams that could be sent) that can be re-instantiated later.

Once the message is sendable, the actual binary bytestream sent should 
be available for retrieval.  This could be used as the "sent" log, if 
one is desired (it usually is for email clients).  The saved bytestream 
should be able to be used to re-instantiate an equivalent message object 
later.
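
A small sketch of that round trip, assuming the later EmailMessage API 
and invented message content: the bytes that would be sent double as 
the "sent" log, and parsing them gives back an equivalent message.

```python
from email import message_from_bytes, policy
from email.message import EmailMessage

original = EmailMessage(policy=policy.SMTP)
original["Subject"] = "Grüße"
original.set_content("Schöne Grüße!\n")

# The bytestream that would go to the server; this could be the "sent" log.
sent_log = original.as_bytes()

# Later: re-instantiate a message object from the saved bytes.
restored = message_from_bytes(sent_log, policy=policy.SMTP)
print(restored["Subject"])
print(restored.get_content().strip())
```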

-- 
Glenn -- http://nevcal.com/
===========================
A protocol is complete when there is nothing left to remove.
-- Stuart Cheshire, Apple Computer, regarding Zero Configuration Networking