[Web-SIG] Request for Comments on upcoming WSGI Changes

Tue Sep 22 06:09:36 CEST 2009

On Mon, Sep 21, 2009 at 07:40:54PM -0700, Robert Brewer wrote:
> The decoding doesn't change spontaneously.
> You either get the correct one or you get an incorrect one. If it's
> incorrect, you fix it, one time, via a WSGI component which you've
> configured to determine the "correct" decoding. Then every other WSGI
> component "below" that one can go back to trusting the decoding was
> correct. In fact, if you do that transcoding right away, no other WSGI
> components need to be rewritten to take advantage of unicode. You just
> have to deploy a single transcoder, that's 6 lines of code max.

And you can do that with utf8+surrogateescape too. Except that you don't
have to work out which encoding the gateway used: it's always
utf8+surrogateescape.

> With utf8+surrogateescape, you don't transcode once, you transcode in
> every WSGI component in your stack that needs to "correct" the
> decoding. You have to do it more than once because, each time you
> encode/re-decode, you use the result and then throw it away. Any
> subsequent WSGI components have to encode/re-decode--you cannot store
> the redecoded URI in SCRIPT_NAME/PATH_INFO, because the
> utf8+surrogateescape scheme says...well, it's always utf8-decoded.

You're missing something REALLY important about surrogateescape: you can
ALWAYS get the original bytes back.

    >>> b = b'fran\xe7cois'
    >>> s = b.decode('utf8', 'surrogateescape')
    >>> s
    'fran\udce7cois'
    >>> s.encode('utf8', 'surrogateescape')
    b'fran\xe7cois'

See? I got my latin-1 byte '\xe7' back! That works because '\udce7' is
not a normal character: it is a lone low surrogate. Undecodable bytes are
mapped to the range U+DC80..U+DCFF, which never occurs in validly decoded
text, so nothing can collide with it.

The only thing you have to do is to pass 'surrogateescape' each time you
call encode/decode.
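
To make that harder to forget, the two calls can be wrapped once. This is
only a sketch: the names decode_raw and encode_raw are made up for
illustration and are not part of any WSGI proposal. It also checks that
every possible byte value survives the round trip.

```python
def decode_raw(raw: bytes) -> str:
    """Decode gateway bytes; undecodable bytes become lone surrogates."""
    return raw.decode('utf8', 'surrogateescape')


def encode_raw(text: str) -> bytes:
    """Recover the exact bytes that decode_raw() was given."""
    return text.encode('utf8', 'surrogateescape')


# Valid UTF-8 decodes to ordinary characters, as usual.
assert decode_raw('fran\xe7ois'.encode('utf8')) == 'fran\xe7ois'

# An invalid byte 0xNN is mapped to the lone surrogate U+DCNN.
assert decode_raw(b'\xe7') == '\udce7'

# Every possible byte value survives the round trip.
for n in range(256):
    raw = bytes([n])
    assert encode_raw(decode_raw(raw)) == raw
```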

> In addition, *every* component that needs to compare URI's then has to
> be configured with the same logic, however convoluted, to perform the
> "correct" decoding again. It's not just routing middleware: caches
> need to reliably compare decoded URI's; so do sessions; so does auth
> (especially!); so do static files. And Heaven forfend you actually
> decode differently in two different components!

I don't understand why I would need to throw away the decoded string.

This works perfectly well as far as I know:

    environ['PATH_INFO'] = environ['PATH_INFO'].\
          encode('utf8', 'surrogateescape').\
          decode(SITE_ENCODING)

utf8+surrogateescape provides the same possibilities as
wsgi.uri_encoding. You can transcode without losing information when you
know what the correct encoding is. But utf8+surrogateescape is simpler
because there's no need to pass around the name of the encoding in an
additional variable.
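
For what it's worth, the "transcode once" component could look roughly
like the sketch below. The class name TranscodeMiddleware, the
site_encoding argument and demo_app are assumptions made for this
example, and the sketch assumes the gateway decoded SCRIPT_NAME and
PATH_INFO with utf8+surrogateescape. It does no error handling: decoding
with an encoding other than latin-1 may raise UnicodeDecodeError.

```python
class TranscodeMiddleware:
    """Hypothetical middleware: re-decode the path variables one time."""

    KEYS = ('SCRIPT_NAME', 'PATH_INFO')

    def __init__(self, app, site_encoding):
        self.app = app
        self.site_encoding = site_encoding

    def __call__(self, environ, start_response):
        for key in self.KEYS:
            value = environ.get(key)
            if value is not None:
                # Get the original bytes back, then decode them properly.
                raw = value.encode('utf8', 'surrogateescape')
                environ[key] = raw.decode(self.site_encoding)
        return self.app(environ, start_response)


def demo_app(environ, start_response):
    """Stand-in application that echoes the path it was given."""
    start_response('200 OK', [('Content-Type', 'text/plain; charset=utf-8')])
    return [environ['PATH_INFO'].encode('utf8')]


# Simulate what the gateway would hand over for a latin-1 request path.
environ = {
    'SCRIPT_NAME': '',
    'PATH_INFO': b'/fran\xe7ois'.decode('utf8', 'surrogateescape'),
}
app = TranscodeMiddleware(demo_app, 'latin-1')
body = app(environ, lambda status, headers: None)

# Components below the middleware see the corrected string.
assert environ['PATH_INFO'] == '/fran\xe7ois'
```

Everything below the middleware reads the stored, corrected value from
environ, so nothing has to be re-decoded or thrown away.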

-- 
  Henry Prêcheur