[Web-SIG] WSGI 2: Decoding the Request-URI

Henry Precheur henry at precheur.org
Thu Aug 20 20:03:13 CEST 2009


On Sun, Aug 16, 2009 at 08:06:03PM -0700, Robert Brewer wrote:
> However, we quite often use only a portion of the URI when attempting
> to locate an appropriate handler; sometimes just the leading "/"
> character! The remaining characters are often passed as function
> arguments to the handler, or stuck in some parameter list/dict. In
> many cases, the charset used to decode these values either: is
> unimportant; follows complex rules from one resource to another; or is
> merely reencoded, since the application really does care about bytes
> and not characters. Falling back to ISO-8859-1 (and minting a new WSGI
> environ entry to declare the charset which was used to decode) can
> handle all of these cases. Server configuration options cannot, at
> least not without their specification becoming unwieldy.

(Just to make things clear: I am not talking only about REQUEST_URI
here, but about all request headers.)


Decoding everything using ISO-8859-1 has the nice property of keeping
the information intact. It would be a good heuristic if everything, with
a few exceptions, were decoded using ISO-8859-1: just transcode the few
problematic cases at the application level and everybody is happy. A
string decoded from ISO-8859-1 is like a bytes object with a string
'interface' on top of it.
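
That property is easy to check: ISO-8859-1 decoding never fails, since
every byte maps to exactly one character, and encoding back returns the
original bytes:

  >>> raw = b'fran\xc3\xa7ois'           # UTF-8 bytes for 'françois'
  >>> text = raw.decode('iso-8859-1')    # never fails: one character per byte
  >>> text.encode('iso-8859-1') == raw   # and it round-trips losslessly
  True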


But it sweeps the encoding problem under the carpet. The problem with
Python 2 was that str and unicode were almost the same; so much the same
that it was possible to mix them without too many problems:

  >>> 'foo' == u'foo'
  True

Python 3 made bytes and strings 'incompatible' to force programmers to
handle the encoding problem as soon as possible:

  >>> b'foo' == 'foo'
  False

By passing `str` objects to the application, the application author
could believe that the encoding problem has been handled. But in most
cases it hasn't been handled at all: the application author still has to
transcode all the incorrectly encoded strings. We are back to Python 2's
bad old days, where we can't be sure that what we got is properly
encoded:

  Was that string encoded using latin-1? Maybe a middleware transcoded
  it to UTF-8 before the application was called. Maybe the application
  itself transcoded it at some point, but then we need to keep track of
  what was transcoded. Maybe the application should transcode everything
  when it is called.
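
Concretely, an application that knows a URI is really UTF-8 has to undo
the gateway's latin-1 decoding by hand. A sketch, assuming PATH_INFO was
decoded with ISO-8859-1 and the client actually sent UTF-8:

  >>> path = '/fran\xc3\xa7ois'       # what the gateway would hand over
  >>> path.encode('iso-8859-1').decode('utf-8')
  '/françois'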

Also, EVERY application author will have to read the PEP, especially
the paragraph saying:

  > Everything we give you is a string, but you still have to deal
  > with the encoding mess.

Otherwise he will run into weird problems, just like with Python 2,
because the interface is not clear. Strings are supposed to be text, and
only text. Decoding everything from ISO-8859-1 means strings are not
text anymore, they are 'encoded data' [1].


`bytes` objects are supposed to be 'encoded data' and binary blobs. By
giving applications bytes, the author knows right away that he has to
decode them. No need to read the PEP.


`bytes` can do everything `str` can do, with the notable exception of
`format()`.

  >>> b'foo bar'.title()
  b'Foo Bar'

  >>> b'/foo/bar/fran\xc3\xa7ois'.split(b'/')
  [b'', b'foo', b'bar', b'fran\xc3\xa7ois']

  >>> import re
  >>> re.match(br'/bar/(\w+)/(\d+)', b'/bar/foo/1234').groups()
  (b'foo', b'1234')

I understand that `bytes` is an unfamiliar beast. But I believe the
encoding problem is the realm of the application, not the realm of the
gateway. Let the application handle the encoding problem and don't give
it a half-baked solution.
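
To make that concrete, here is a minimal sketch of what handling it in
the application could look like, assuming a bytes-valued PATH_INFO; the
UTF-8-then-latin-1 fallback policy is just an example:

  def decode_path(raw_path, charset='utf-8'):
      # Try the charset the application expects, and fall back to
      # ISO-8859-1, which accepts any byte sequence.
      try:
          return raw_path.decode(charset)
      except UnicodeDecodeError:
          return raw_path.decode('iso-8859-1')

  >>> decode_path(b'/fran\xc3\xa7ois')
  '/françois'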


Using bytes also has its own set of problems: the standard library
doesn't support bytes very well. For example urllib.parse.unquote()
doesn't work with bytes, and other parts of urllib.parse have similar
issues.
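
For instance (Python 3.1 behaviour, as far as I can tell; unquote()
only accepts str, while unquote_to_bytes() is the bytes-only escape
hatch):

  >>> from urllib.parse import unquote, unquote_to_bytes
  >>> unquote('fran%C3%A7ois')             # str works
  'françois'
  >>> unquote(b'fran%C3%A7ois')            # bytes blow up
  Traceback (most recent call last):
    ...
  TypeError
  >>> unquote_to_bytes(b'fran%C3%A7ois')   # the workaround
  b'fran\xc3\xa7ois'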

[1] http://docs.python.org/3.1/whatsnew/3.0.html#text-vs-data-instead-of-unicode-vs-8-bit

-- 
  Henry Prêcheur

