[Web-SIG] WSGI for Python 3

Sat Jul 17 01:20:50 CEST 2010

On Fri, 2010-07-16 at 17:11 -0500, Ian Bicking wrote:
> On Fri, Jul 16, 2010 at 5:08 PM, Chris McDonough <chrism at plope.com>
> wrote:
>         On Fri, 2010-07-16 at 17:47 -0400, Tres Seaver wrote:
>         
>         > > In the past when we've gotten down to specifics, the only
>         holdup has been
>         > > SCRIPT_NAME/PATH_INFO, hence my suggestion to eliminate
>         those.
>         >
>         > I think I favor PJE's suggestion:  let WSGI deal only in
>         bytes.
>         
>         
>         I'd prefer that WSGI 2 was defined in terms of a "bytes with
>         benefits"
>         type (Python 2's ``str`` with an optional encoding attribute
>         as a hint
>         for cast to unicode str) instead of Python 3-style bytes.
>         
>         But if I had to make the Hobson's choice between Python 3
>         style bytes
>         and Python 3 style str, I'd choose bytes.  If I then needed to
>         write
>         middleware or applications, I'd use WebOb or an equivalent
>         library to
>         enable a policy which converted those bytes to strings on my
>         behalf.
>         Making it easy to write "raw" middleware or applications
>         without using
>         such a library doesn't seem as compelling a goal as being able
>         to easily
>         write one which allowed me direct control at the raw level.
> 
> What are the concrete problems you envision with text request headers,
> text (URL-quoted) path, and text response status and headers?

Documentation is the main reason.  For example, the documentation for
making sense of path_info segments in a WSGI that used unicodey-strings
would, as I understand it, read something like this:

"""
The PATH_INFO environment variable is a string.  To decode it,

- First, split it on slashes::

    segments = PATH_INFO.split('/')

- Then turn each segment into bytes::

    bytes_segments = [ bytes(x, encoding='latin-1') for x in segments ]

- Then, de-encode each segment's urlencoded portions::

    urldecoded_segments = [ urllib.parse.unquote_to_bytes(x)
                            for x in bytes_segments ]

- Then decode each urldecoded segment to text using the encoding
  expected by your application::

    app_segments = [ str(x, encoding='utf-8') for x in 
                     urldecoded_segments ]

.. note:: We encode to latin-1 above because WSGI tunnels the bytes
   representing the PATH_INFO by way of a string type which contains
   bytes as characters.
"""

That looks pretty apologetic to me.  To be honest, I'm not even sure it
will work reliably if old URLs still need to be supported and
existing/legacy applications have emitted them without proper
url-encoding.  http://bugs.python.org/issue8136 contains a variation on
this theme.
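
For concreteness, the four steps in the text-based draft above can be
sketched as runnable Python 3.  This is only a sketch under assumptions:
the function name ``decode_path_info`` and its ``app_encoding`` parameter
are made up for illustration, and ``urllib.parse.unquote_to_bytes`` is
assumed as the unquote that returns bytes.

```python
from urllib.parse import unquote_to_bytes


def decode_path_info(path_info, app_encoding="utf-8"):
    """Hypothetical helper: decode a latin-1-tunnelled PATH_INFO string."""
    # 1. Split on slashes first, so an escaped %2F stays inside its segment.
    segments = path_info.split("/")
    # 2. Recover the raw bytes that were tunnelled as latin-1 characters.
    bytes_segments = [s.encode("latin-1") for s in segments]
    # 3. Undo the percent-escapes, staying in bytes.
    urldecoded_segments = [unquote_to_bytes(b) for b in bytes_segments]
    # 4. Decode to text with the encoding the application expects
    #    (assumed UTF-8 here; raises UnicodeDecodeError on bad input).
    return [b.decode(app_encoding) for b in urldecoded_segments]


print(decode_path_info("/caf%C3%A9/a%20b"))  # ['', 'café', 'a b']
```

Note that the last step will raise on legacy URLs whose bytes are not
valid in the chosen encoding, which is the reliability worry above.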

I'd much rather be able to say:

"""
The PATH_INFO environment variable is a ``bytes-with-benefits`` type.
To decode it:

- First, split it on slashes::

    segments = PATH_INFO.split('/')

- Then, de-encode each segment's urlencoded portions::

    urldecoded_segments = [ urllib.unquote(x) for x in segments ]

- Then decode each urldecoded segment to text using the encoding
  expected by your application::

    app_segments = [ str(x, encoding='utf-8') for x in 
                     urldecoded_segments ]
"""
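
As a rough approximation of the shorter recipe, here is the same thing
with PATH_INFO as plain Python 3 ``bytes``.  This is an assumption for
illustration only: a real ``bytes-with-benefits`` type does not exist, so
the optional encoding hint is not modelled, and the name
``decode_path_info_bytes`` is hypothetical.

```python
from urllib.parse import unquote_to_bytes


def decode_path_info_bytes(path_info, app_encoding="utf-8"):
    """Hypothetical helper: decode a PATH_INFO that arrives as bytes."""
    # Split on slashes; no latin-1 round trip is needed for bytes input.
    segments = path_info.split(b"/")
    # Undo the percent-escapes, staying in bytes.
    urldecoded_segments = [unquote_to_bytes(s) for s in segments]
    # Decode to text with the encoding the application expects.
    return [s.decode(app_encoding) for s in urldecoded_segments]


print(decode_path_info_bytes(b"/caf%C3%A9/a%20b"))  # ['', 'café', 'a b']
```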

Let me know if I'm missing something.

- C