[Python-Dev] Polymorphic best practices [was: (Not) delaying the 3.2 release]

Fri Sep 17 21:25:39 CEST 2010

  On 16/09/2010 23:05, Antoine Pitrou wrote:
> On Thu, 16 Sep 2010 16:51:58 -0400
> "R. David Murray"<rdmurray at bitdance.com>  wrote:
>> What do we store in the model?  We could say that the model is always
>> text.  But then we lose information about the original bytes message,
>> and we can't reproduce it.  For various reasons (mailman being a big one),
>> this is not acceptable.  So we could say that the model is always bytes.
>> But we want access to (for example) the header values as text, so header
>> lookup should take string keys and return string values[2].
> Why can't you have both in a single class? If you create the class
> using a bytes source (a raw message sent by SMTP, for example), the
> class automatically parses and decodes it to unicode strings; if you
> create the class using a unicode source (the text body of the e-mail
> message and the list of recipients, for example), the class
> automatically creates the bytes representation.
>
I think something like this would be great for WSGI. Rather than focus 
on whether bytes *or* text should be used, use a higher level object 
that provides a bytes view, and (where possible/appropriate) a unicode 
view too.
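
A minimal sketch of such a higher-level object, to make the idea concrete. The class name DualView and its attributes are hypothetical, not an existing WSGI or stdlib API; it assumes the bytes are the source of truth and the text view is derived lazily, and only where decoding succeeds.

```python
class DualView:
    """Hypothetical wrapper: always offers bytes, offers text when decodable."""

    def __init__(self, raw, encoding="utf-8"):
        if not isinstance(raw, bytes):
            raise TypeError("raw must be bytes; use from_text() for str input")
        self._raw = raw
        self._encoding = encoding

    @classmethod
    def from_text(cls, text, encoding="utf-8"):
        # Building from text creates the bytes representation up front.
        return cls(text.encode(encoding), encoding)

    @property
    def as_bytes(self):
        # The bytes view is always available and always lossless.
        return self._raw

    @property
    def as_text(self):
        # The text view is computed on demand; None means "no text view".
        try:
            return self._raw.decode(self._encoding)
        except UnicodeDecodeError:
            return None
```

Returning None for an undecodable value is only one possible design; raising, or decoding with surrogateescape, would be equally defensible.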

Michael


> (of course all processing can be done lazily for performance reasons)
>
>> What about email files on disk?  They could be bytes, or they could be,
>> effectively, text (for example, utf-8 encoded).
> Such a file can be two things:
> - the raw encoding of a whole message (including headers, etc.), then
>    it should be fed as a bytes object
> - the single text body of a hypothetical message, then it should be fed
>    as a unicode object
>
> I don't see any possible middle-ground.
>
>> On disk, using utf-8,
>> one might store the text representation of the message, rather than
>> the wire-format (ASCII encoded) version.  We might want to write such
>> messages from scratch.
> But then the user knows the encoding (by "user" I mean what/whoever
> calls the email API) and mentions it to the email package.
>
> What I'm having an issue with is that you are talking about a bytes
> representation and a unicode representation of a message. But they
> aren't representations of the same things:
> - if it's a bytes representation, it will be the whole, raw message
>    including envelope / headers (also, MIME sections etc.)
> - if it's a unicode representation, it will only be a section of the
>    message decodable as such (a text/plain MIME section, for example;
>    or a decoded header value; or even a single e-mail address part of a
>    decoded header)
>
> So, there doesn't seem to be any reason for having both a BytesMessage
> and a UnicodeMessage at the same abstraction level. They are both
> representing different things at different abstraction levels. I don't
> see any potential for confusion: raw assembled e-mail message = bytes;
> decoded text section of a message = unicode.
>
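
For illustration, the two abstraction levels can be seen with the email package as it exists in current Python 3 (message_from_bytes, policy.default and get_content are real, but newer than this thread). The sample message and the example.com address are made up for the sketch.

```python
from email import message_from_bytes, policy

# Whole, raw assembled message: bytes.
raw = (
    b"From: someone@example.com\r\n"
    b"Subject: hello\r\n"
    b"MIME-Version: 1.0\r\n"
    b'Content-Type: text/plain; charset="utf-8"\r\n'
    b"Content-Transfer-Encoding: 8bit\r\n"
    b"\r\n"
    b"caf\xc3\xa9\r\n"
)

msg = message_from_bytes(raw, policy=policy.default)

# Decoded pieces of the message: str.
subject = msg["Subject"]   # a decoded header value
body = msg.get_content()   # the decoded text/plain section
```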
> As for the problem of potential "bogus" raw e-mail data
> (e.g., undecodable headers), well, I guess the library has to make a
> choice between purity and practicality, or perhaps let the user choose
> themselves. For example, through a `strict` flag. If `strict` is true,
> raise an error as soon as a non-decodable byte appears in a header; if
> `strict` is false, decode it through a default (encoding, errors)
> convention which can be overridden by the user (a sensible possibility
> being "utf-8, surrogateescape" to allow for lossless round-tripping).
>
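
A rough sketch of how such a `strict` flag might behave. The function name decode_header_bytes and its signature are hypothetical, not part of the email package; the defaults follow the "utf-8, surrogateescape" suggestion above.

```python
def decode_header_bytes(raw, strict=False, encoding="utf-8",
                        errors="surrogateescape"):
    """Hypothetical helper: decode one raw header value.

    strict=True  -> raise UnicodeDecodeError on the first undecodable byte.
    strict=False -> decode with the (encoding, errors) convention, which
                    the caller can override.
    """
    if strict:
        return raw.decode(encoding, "strict")
    return raw.decode(encoding, errors)


# Lenient mode keeps the bogus byte as a lone surrogate, so it round-trips.
text = decode_header_bytes(b"bogus\xffname")
roundtrip = text.encode("utf-8", "surrogateescape")
```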
>> As I said above, we could insist that files on
>> disk be in wire-format, and for many applications that would work fine,
>> but I think people would get mad at us if we didn't support text files[3].
> Again, this simply seems to be two different abstraction levels:
> pre-generated raw email messages including headers, or a single text
> waiting to be embedded in an actual e-mail.
>
>> Anyway, what polymorphism means in email is that if you put in bytes,
>> you get a BytesMessage, if you put in strings you get a StringMessage,
>> and if you want the other one you convert.
> And then you have two separate worlds while ultimately the same
> concepts are underlying. A library accepting BytesMessage will crash
> when a program wants to give a StringMessage and vice-versa. That
> doesn't sound very practical.
>
>> [1] Now that surrogateescape exists, one might suppose that strings
>> could be used as an 8bit channel, but that only works if you don't need
>> to *parse* the non-ASCII data, just transmit it.
> Well, you can parse it, precisely. Not only that, but it round-trips if you
> unparse it again:
>
>>>> header_bytes = b"From: bogus\xFFname<someone at python.com>"
>>>> name, value = header_bytes.decode("utf-8", "surrogateescape").split(":", 1)
>>>> name
> 'From'
>>>> value
> ' bogus\udcffname<someone at python.com>'
>>>> "{0}:{1}".format(name, value).encode("utf-8", "surrogateescape")
> b'From: bogus\xffname<someone at python.com>'
>
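
The session above, wrapped into two small functions as a sketch. The names split_header and join_header are hypothetical. It uses partition so that a value containing colons is not split further, and it relies on the same handler being used on the way out: a plain strict encode of the escaped text fails.

```python
def split_header(raw):
    """Parse b'Name: value' into (name, value), keeping undecodable bytes."""
    text = raw.decode("utf-8", "surrogateescape")
    name, _, value = text.partition(":")  # split on the first colon only
    return name, value


def join_header(name, value):
    """Inverse of split_header: rebuild the original bytes."""
    return "{0}:{1}".format(name, value).encode("utf-8", "surrogateescape")


raw = b"From: bogus\xFFname<someone at python.com>"
name, value = split_header(raw)
rebuilt = join_header(name, value)

# The escaped byte is a lone surrogate in the str, so a strict encode
# raises UnicodeEncodeError instead of silently producing wrong bytes.
try:
    value.encode("utf-8")
    strict_encode_ok = True
except UnicodeEncodeError:
    strict_encode_ok = False
```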
>
> In the end, what I would call a polymorphic best practice is "try to
> avoid bytes/str polymorphism if your domain is well-defined
> enough" (which I admit URLs aren't necessarily; but there's no
> question a single text/XXX e-mail section is text, and a whole
> assembled e-mail message is bytes).
>
> Regards
>
> Antoine.


-- 
http://www.ironpythoninaction.com/