[Web-SIG] Python 3.0 and WSGI 1.0.

Robert Brewer fumanchu at aminus.org
Fri May 8 19:37:10 CEST 2009


P.J. Eby wrote:
> At 08:07 AM 5/8/2009 -0700, Robert Brewer wrote:
>> I decided that that single type should be byte strings because I want
>> WSGI middleware and applications to be able to choose what encoding
>> their output is. Passing unicode to the server would require some
>> out-of-band method of telling the server which encoding to use per
>> response, which seemed unacceptable.
> 
> I find the above baffling, since PEP 333 explicitly states that
> when using unicode types, they're not actually supposed to *be*
> unicode -- they're just bytes decoded with latin-1.

It also explicitly states that "HTTP does not directly support Unicode,
and neither does this interface. All encoding/decoding must be handled
by the application; all strings passed to or from the server must be
standard Python BYTE STRINGS (emphasis mine), not Unicode objects. The
result of using a Unicode object where a string object is required, is
undefined."

PEP 333 is difficult to interpret because it uses the name "str"
synonymously with the concept "byte string", which Python 3000 defies. I
believe the intent was to differentiate unicode from bytes, not elevate
whatever type happens to be called "str" on your Python du jour. It was
and is a mistake to standardize on type names ("str") across platforms
and not on type behavior ("byte string").

If Python 3 WSGI apps emit unicode strings (py3k type 'str'), you're
effectively saying the server will always call
"chunk.encode('latin-1')". That negates any benefit of using unicode as
the type for the response. That's not "supporting unicode"; that's using
unicode exactly as if it were an opaque byte string, which seems silly
to me when there is a perfectly useful byte string type.
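
The only reason that round trip is even safe is that latin-1 maps each
byte value 0-255 to the code point with the same number, so
decode('latin-1') followed by encode('latin-1') is a lossless no-op. A
rough Python 3 sketch (sample data made up for illustration):

    payload = "naïve café".encode('utf-8')      # the app's real encoding
    masked = payload.decode('latin-1')          # bytes disguised as str
    assert masked.encode('latin-1') == payload  # server recovers the bytes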

> So, the server doesn't need to know "what encoding to use" -- it's
> latin-1, plain and simple.  (And it's an error for an application to
> produce a unicode string that can't be encoded as latin-1.)
>
> To be even more specific: an application that produces strings can
> "choose what encoding to use" by encoding in it, then decoding those
> bytes via latin-1.  (This is more or less what Jython and IronPython
> users are doing already, I believe.)

That may make sense for Jython and IronPython if they truly do not have
a usable byte string type. But it makes less sense for Python 3, which
does have one. My way:

    App                                Server
    ---                                ------
    bchunk = uchunk.encode('utf-8')
    yield bchunk
                                       write(bchunk)
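
Concretely, an app under this scheme encodes once and hands the server
finished bytes. A rough sketch, assuming the familiar WSGI 1.0 calling
convention carries over unchanged to Python 3:

    def app(environ, start_response):
        # The app picks its own encoding and yields ready-to-send bytes.
        body = "Héllo, wörld".encode('utf-8')
        start_response('200 OK',
                       [('Content-Type', 'text/plain; charset=utf-8'),
                        ('Content-Length', str(len(body)))])
        return [body]  # the server write()s these bytes verbatim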

Your way:

    App                                Server
    ---                                ------
    bchunk = uchunk.encode('utf-8')
    uchunk = bchunk.decode('latin-1')
    yield uchunk
                                       bchunk = uchunk.encode('latin-1')
                                       write(bchunk)
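
Both flows put identical bytes on the wire; the second just makes two
extra passes over the data to get there. In plain Python 3 (sketch
only):

    uchunk = "Héllo, wörld"
    my_way = uchunk.encode('utf-8')
    your_way = uchunk.encode('utf-8').decode('latin-1').encode('latin-1')
    assert my_way == your_way  # same bytes, two extra copies made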

I don't see any benefit to that.


Robert Brewer
fumanchu at aminus.org