[Web-SIG] Request for Comments on upcoming WSGI Changes

Mon Sep 21 22:15:09 CEST 2009

P.J. Eby wrote:
> At 11:23 AM 9/21/2009 -0700, Robert Brewer wrote:
> >I still don't see why the environ should have multiple versions of
> >anything. It's not as if the HTTP request gives us multiple
> >Request-URI's. There's a single processing step that has to happen
> >somewhere: decoding the bytes of the Request-URI to unicode. For the
> >vast majority of apps, it should only happen once. Twice is
> >acceptable to me for some apps. As I pointed out in the linked
> >email, doing that as soon as possible (i.e. in the WSGI origin
> >server) allows URI's to be compared as character strings more
> >easily. If you deploy a piece of middleware that transcodes (based
> >on more information than servers want to deal with), it had better
> >be nearly first in the stack so routing works reliably.
> 
> The problem with this whole approach is that it's not
> composable.  You can't stick in an application under a router that
> uses a different method for grokking its subtree of the URI space,
> unless it knows what's been done to the URI and can un-do it.

I don't understand. If SCRIPT_NAME/PATH_INFO/QUERY_STRING are unicode, the only answer to "what's been done to the URI?" can be "wsgi.uri_encoding", which allows someone to un-do it. What more do you want?

1. Bytes arrive. The server decodes with utf-8 and sets 'wsgi.uri_encoding' to 'utf-8'.
2. Middleware says "oops, that's wrong". It encodes back to bytes using 'utf-8' and re-decodes with koi-8, changing wsgi.uri_encoding to 'koi-8'.
3. Further middlewares and the app use the unicode value, and don't really care what encoding was used.
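
A rough sketch of those three steps in Python 3, not a definitive implementation. The helper names server_decode and transcode_path are made up for illustration, and the "surrogateescape" error handler is my assumption: it is what keeps step 1 reversible when the bytes turn out not to be valid utf-8. Python spells the codec "koi8-r", which stands in for "koi-8" here.

```python
def server_decode(raw_path: bytes, environ: dict) -> None:
    """Step 1: the origin server decodes the raw bytes and records the codec."""
    # Assumption: "surrogateescape" smuggles undecodable bytes through as lone
    # surrogates, so the original bytes can always be recovered later.
    environ["PATH_INFO"] = raw_path.decode("utf-8", "surrogateescape")
    environ["wsgi.uri_encoding"] = "utf-8"


def transcode_path(environ: dict, new_encoding: str) -> None:
    """Step 2: middleware undoes the server's guess and re-decodes."""
    old_encoding = environ["wsgi.uri_encoding"]
    # Re-encode with the recorded codec to get the original bytes back.
    raw_path = environ["PATH_INFO"].encode(old_encoding, "surrogateescape")
    environ["PATH_INFO"] = raw_path.decode(new_encoding)
    environ["wsgi.uri_encoding"] = new_encoding


# Step 3: everything downstream just reads the unicode value.
environ = {}
server_decode("/привет".encode("koi8-r"), environ)
transcode_path(environ, "koi8-r")
```

The only state the middleware needs is the environ itself: the unicode value plus 'wsgi.uri_encoding' is enough to recover the bytes.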

> Maybe I'm missing something here, but the only way I see to preserve
> composability here is to use latin-1 or bytes.
> 
> The fundamental problem is that, like it or not, HTTP headers are
> actually byte strings.  The *only* reason we ever supported unicode
> in WSGI was to handle platforms where there's no such thing as a
> non-unicode string, and there we made it explicit that it's just a
> way of manipulating *bytes*, not unicode.
> 
> ISTM that very few (if any) of the proposals floating around for
> modifying WSGI are taking this concept into account.  Most of them
> sound to me like people saying, "yeah, but this particular hack will
> work for *my* apps...  so everybody else must be doing something
> stupid."
> 
> But WSGI was built on the principle of *equally inconveniencing
> everyone*, specifically to avoid an impossible attempt at consensus
> between incompatible ways of doing things.  (E.g., nine million
> request/response APIs.)
> 
> So, if the only problem we're going to cause by using bytes
> everywhere is to make everyone need to change their routing code on
> Python 3, I vote +1000.  ;-)

That's not the only problem. Using native strings wherever possible makes web programming in Python easier, regardless of version. In Python 3, that happens to be unicode, for good reasons.

For HTTP, there's a more specific reason: URIs should be compared for equivalence character by character, not byte by byte. See http://tools.ietf.org/html/rfc3986#section-6.2.1. That includes routing middleware.
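
To make the difference concrete, here is a minimal sketch. The function name same_path and the example path are hypothetical. It assumes each side knows which encoding its bytes are in, and it ignores percent-encoding and other normalization.

```python
def same_path(a: bytes, a_encoding: str, b: bytes, b_encoding: str) -> bool:
    """Compare two paths character by character after decoding each one."""
    return a.decode(a_encoding) == b.decode(b_encoding)


# The same characters, encoded two different ways.
utf8_path = "/café".encode("utf-8")
latin1_path = "/café".encode("latin-1")

bytes_equal = utf8_path == latin1_path                               # False
chars_equal = same_path(utf8_path, "utf-8", latin1_path, "latin-1")  # True
```

A router that matches on bytes would treat these as two different resources; one that matches on characters would not.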

Robert Brewer
fumanchu at aminus.org