[Web-SIG] Request for Comments on upcoming WSGI Changes

Mon Sep 21 20:23:47 CEST 2009

René Dudfield wrote:
> On Mon, Sep 21, 2009 at 6:05 PM, Robert Brewer <fumanchu at aminus.org>
> wrote:
> > Armin Ronacher wrote:
> >> WSGI will demand UTF-8 URLs and only
> >> provide iso-XXX support for backwards compatibility.
> >
> > WSGI cannot demand that; a recommendation for utf-8 in a few draft
> > specifications is at least a decade removed from ubiquitous
> > implementation. We can default to utf-8 at best. I discussed this at
> > length in
> > http://mail.python.org/pipermail/web-sig/2009-August/003948.html
> >
> 
> That post does have good arguments why "a single encoding is not
> acceptable".  utf-8 seems the most common choice for a default at
> this point... but we do need a way to specify encoding.
> 
> Is that what you're saying Robert?  Do you have a suggestion for
> specifying encodings?

CherryPy 3.2 does this (pseudocode):

    try:
        decode_uri(userdefault or 'utf-8')
    except UnicodeDecodeError:
        decode_uri('iso-8859-1')
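
A runnable sketch of that fallback, assuming Python 3 and a percent-encoded path as input. It is not CherryPy's actual code: the function name, its signature, and the returned (text, encoding) pair are illustrative only.

```python
from urllib.parse import unquote_to_bytes


def decode_uri(raw_path, user_default=None):
    """Return (text, encoding_used) for a percent-encoded path.

    Hypothetical helper: tries the configured default (or utf-8) first,
    then falls back to iso-8859-1.
    """
    # Undo the percent-encoding first so we are working with raw bytes.
    raw = unquote_to_bytes(raw_path)
    preferred = user_default or 'utf-8'
    try:
        return raw.decode(preferred), preferred
    except UnicodeDecodeError:
        # iso-8859-1 maps every byte to a code point, so this cannot fail.
        return raw.decode('iso-8859-1'), 'iso-8859-1'
```

Returning the encoding alongside the text lets the caller record which attempt succeeded.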

> I think surrogateescape will handle the issues with allowing bytes to
> be stored in utf-8.
>     http://www.python.org/dev/peps/pep-0383/
> 
> However, I think that is only implemented in Python 3.1?... but maybe
> there is some way to have it work on other Pythons too?

As Henry Prêcheur says, "that's not an issue if the 'new' WSGI sticks to native strings." Which I'd be happy with.
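
For reference, a small sketch of what the surrogateescape error handler from PEP 383 does on Python 3.1 and later. The helper name and the sample bytes are made up for illustration.

```python
def roundtrip_with_surrogateescape(raw):
    """Decode bytes as utf-8, smuggling undecodable bytes through as
    lone surrogates, then encode back to the original bytes."""
    text = raw.decode('utf-8', errors='surrogateescape')
    restored = text.encode('utf-8', errors='surrogateescape')
    return text, restored


# 0xE9 is 'é' in iso-8859-1 but is not valid utf-8 on its own.
text, restored = roundtrip_with_surrogateescape(b'/caf\xe9')
```

The undecodable byte 0xE9 comes back as the lone surrogate U+DCE9 in the string, and encoding with the same handler restores the original bytes exactly.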

> How about...
> 
> Being able to request which encoding you want has the benefit of only
> having to store one representation before 'baking' the result into the
> environ.  So if someone only ever wants utf-8 they can get it...
> however if they choose to 'bake' the environ then they can request
> something else.  This is similar to a per server setting, but I think
> should work with middleware too?

As noted above, it *is* a per-server setting in CherryPy 3.2. And any middleware can certainly be configured as its authors see fit; I don't see a need for a generic mechanism to specify what encodings middleware should try. However, we still need a generic mechanism for declaring which encoding was successfully used; this is 'wsgi.uri_encoding'.
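
A minimal sketch of how an origin server might record that, assuming Python 3. Only the 'wsgi.uri_encoding' key comes from the proposal; the function name, its parameters, and the choice to raise ValueError are hypothetical.

```python
from urllib.parse import unquote_to_bytes


def populate_path(environ, raw_path, encodings=('utf-8', 'iso-8859-1')):
    """Decode the request path once and declare which encoding worked.

    Hypothetical server-side helper: 'encodings' stands in for a
    per-server setting listing the candidates to try, in order.
    """
    raw = unquote_to_bytes(raw_path)
    for enc in encodings:
        try:
            environ['PATH_INFO'] = raw.decode(enc)
        except UnicodeDecodeError:
            continue  # try the next candidate
        # Downstream code can read this key instead of guessing.
        environ['wsgi.uri_encoding'] = enc
        return environ
    raise ValueError('no candidate encoding could decode the path')
```

An application or middleware then only needs to read environ['wsgi.uri_encoding'] to know how PATH_INFO was produced.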

> Since multiple versions of things should be
> available once baked, middleware that wants to modify things will
> need to change each version of them.
> 
> These 'baking' methods could live in wsgi to simplify modifying the
> environ's multiple versions of things. It would just have some get/set
> functions to put correct handling of encodings in one place.  Of
> course middleware is still free to change things as it wants.

I still don't see why the environ should have multiple versions of anything. It's not as if the HTTP request gives us multiple Request-URIs. There's a single processing step that has to happen somewhere: decoding the bytes of the Request-URI to unicode. For the vast majority of apps, it should only happen once. Twice is acceptable to me for some apps. As I pointed out in the linked email, doing that as soon as possible (i.e. in the WSGI origin server) allows URIs to be compared as character strings more easily. If you deploy a piece of middleware that transcodes (based on more information than servers want to deal with), it had better be nearly first in the stack so routing works reliably.
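
A sketch of what such transcoding middleware could look like, assuming Python 3 and a server that has already set 'wsgi.uri_encoding'. The class name and constructor argument are hypothetical, and how the target encoding gets chosen (the "more information" part) is left out entirely.

```python
class TranscodeURI:
    """Hypothetical middleware that re-decodes PATH_INFO with another encoding."""

    def __init__(self, app, target_encoding):
        self.app = app
        self.target = target_encoding

    def __call__(self, environ, start_response):
        current = environ.get('wsgi.uri_encoding', 'utf-8')
        if current != self.target:
            try:
                # Recover the original bytes, then decode them again.
                raw = environ['PATH_INFO'].encode(current)
                decoded = raw.decode(self.target)
            except UnicodeError:
                pass  # leave the server's decoding untouched
            else:
                environ['PATH_INFO'] = decoded
                environ['wsgi.uri_encoding'] = self.target
        return self.app(environ, start_response)
```

This is the second (and last) decode in the stack; wrapping the router with it, rather than the other way around, means routing compares the transcoded strings.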

Robert Brewer
fumanchu at aminus.org