[Python-3000] email libraries: use byte or unicode strings?

Thu Nov 6 00:39:55 CET 2008

On approximately 11/5/2008 2:59 PM, came the following characters from 
the keyboard of Andrew McNamara:
>> I would find
>>
>> 	message[b'Subject'] = b'Hello'
>>
>> to be totally gross.
>>
>> While RFC Email is all ASCII, except if 8bit transfer is legal, there 
>> are internal encodings provided that permit the expression of Unicode in 
>> nearly any component of the email, except for header identifiers.  But 
>> there are never Unicode characters in the transfer, as they always get 
>> encoded (there can be UTF-8 byte sequences, of course, if 8bit transfer 
>> is legal; if it is not, then even UTF-8 byte sequences must be further 
>> encoded).
>>
>> Depending on the level of email interface, there should be no interface 
>> that cannot be expressed in terms of Unicode, plus an encoding to use 
>> for the associated data.  Even 8-bit binary can be translated into a 
>> sequence of Unicode codepoints with the same numeric value, for example. 
> 
> One significant problem is that the email module is intended to be
> able to work with malformed e-mail without mangling it too badly. The
> malformed e-mail should also make a round-trip through the email module
> without being further mangled.

This is an interesting perspective... "stuff em" does come to mind :)

But I'm not at all clear on what you mean by a round-trip through the 
email module.  Let me see... if you are creating an email, then (1) you 
should encode it properly, and (2) a round-trip is mostly meaningless, 
unless you send it to yourself.  So you probably mean email that is received, 
and that you want to send on.  In this case, there is already a 
composed/encoded form of the email in hand; it could simply be sent as 
is without decoding or re-encoding.  That would be quite a clean round-trip!

> I think this requires the underlying processing to be all based on bytes,

Notice that I said _nothing_ about the underlying processing in my 
comments, only the API.  I fully agree that some, perhaps most, of the 
underlying processing has to be aware of bytes, and use and manipulate 
bytes.

> but doesn't preclude layers on top that parse the charset hints. The
> rules about encoding are strict, but not always followed. For instance,
> the headers *must* be ASCII (the header body can, however, be encoded -
> see rfc2047). 

Indeed, the headers must be ASCII, and once encoded, the header body is 
also.
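
To illustrate that point, here is a minimal sketch using the existing
email.header module.  The subject text is just an example I made up; the
point is only that the RFC 2047 form on the wire is pure ASCII, while the
caller sees Unicode on both ends.

```python
from email.header import Header, decode_header

# Example subject containing non-ASCII characters (illustrative value only).
subject = "Grüße"

# Encode as an RFC 2047 encoded-word; the result is a str holding only ASCII.
encoded = Header(subject, charset="utf-8").encode()
wire_bytes = encoded.encode("ascii")  # would raise if anything non-ASCII remained

# Decoding yields the raw bytes plus the charset named in the encoded-word.
decoded_bytes, charset = decode_header(encoded)[0]
recovered = decoded_bytes.decode(charset)
```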

> Spammers often ignore this, and you might be inclined to
> say "stuff em'", but this would make the SpamBayes authors rather unhappy.

And so it is quite possible to deliberately misinterpret the improperly 
encoded headers as 8-bit octets that map one-to-one onto Unicode 
codepoints (the so-called "Latin-1" conversion).  For spam, that is 
certainly good enough.  And round-tripping then means that if no API is 
used to change a header, you emit the original bytes for that header.
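
A small sketch of why the Latin-1 conversion is lossless: every byte value
0-255 maps to the Unicode codepoint with the same number, so encoding back
to Latin-1 reproduces the original octets exactly.  The header bytes below
are an invented example of a malformed header, not taken from real mail.

```python
# An improperly encoded header: raw 8-bit octets with no charset declared.
raw = b"Subject: caf\xe9 \xff\x80 sp\xe1m"

# Latin-1 decoding never fails; each byte becomes the same-numbered codepoint.
text = raw.decode("latin-1")

# Encoding back to Latin-1 recovers the original bytes unchanged.
round_tripped = text.encode("latin-1")

# The same holds for every possible byte value.
all_bytes = bytes(range(256))
all_round_tripped = all_bytes.decode("latin-1").encode("latin-1")
```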

> One solution is to provide two sets of classes - the underlying
> bytes-based one, and another unicode-based one, built on top of the
> bytes classes, that implements the same API, but that may fail due to
> encoding errors.

I think you meant "decoding" errors, there?

I guess I'm not terribly concerned about the readability of improperly 
encoded email messages, whether they are spam or ham.  For the purposes 
of SpamBayes (which I assume is similar to spamassassin, only written in 
Python), it doesn't matter if the data is readable, only that it is 
recognizably similar.  So a consistent mis-transliteration is as good as 
a correct decoding.

For ham, the correspondent should be informed that there are problems 
with their software, so that they can upgrade or reconfigure it.  And a 
mis-transliteration is likely the best that can be provided in that case 
anyway... unless the mail API provides for ignoring the incoming 
(incorrect or missing) encoding directives and using one provided by the 
API, and the client can select a few until they stumble on one that 
produces a readable result.  But if the mis-transliteration is done 
using the Latin-1 conversion to Unicode, the client, if it chooses to 
want to do that sort of heuristic analysis, can reencode to Latin-1, and 
then decode using some other encoding(s), independently of the mail APIs 
providing such a facility.
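
As a rough sketch of that client-side heuristic: the function name
redecode and the list of candidate encodings are my own inventions for
illustration, not part of any mail API.  It assumes the mail layer handed
back text produced by the Latin-1 conversion.

```python
def redecode(mis_transliterated, candidates):
    """Re-encode Latin-1-converted text to bytes, then try other encodings.

    Hypothetical helper: returns a dict mapping each candidate encoding
    that decodes without error to the text it produces.
    """
    raw = mis_transliterated.encode("latin-1")  # recover the original octets
    results = {}
    for name in candidates:
        try:
            results[name] = raw.decode(name)
        except UnicodeDecodeError:
            continue  # this candidate cannot represent these bytes
    return results

# Simulate mail whose body was really KOI8-R but arrived with no usable label.
original = "Привет".encode("koi8-r")
as_latin1 = original.decode("latin-1")

# The client tries a few encodings and picks whichever looks readable.
options = redecode(as_latin1, ["utf-8", "koi8-r", "cp1251"])
```

Note that a candidate decoding without error does not make it correct; the
client (or the user) still has to judge which result is readable.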

I do hope to learn and use the Python mail APIs, and I was hoping to do 
that in Python 3.0 (and am sorry, but not surprised, to hear that this 
is an area of problems at present), and I was hoping that the interfaces 
that would be presented by Python 3.0 mail APIs would be in terms of 
Unicode, for the convenience of being abstracted away from the plethora 
of encodings that are defined at the mail transport layer.  (Not that I 
don't understand those encodings, but it is something that certainly can 
and should be mostly hidden under the covers.)

-- 
Glenn -- http://nevcal.com/
===========================
A protocol is complete when there is nothing left to remove.
-- Stuart Cheshire, Apple Computer, regarding Zero Configuration Networking