[Web-SIG] Request for Comments on upcoming WSGI Changes

P.J. Eby pje at telecommunity.com
Mon Sep 21 23:24:13 CEST 2009


At 01:15 PM 9/21/2009 -0700, Robert Brewer wrote:
>I don't understand. If SCRIPT_NAME/PATH_INFO/QUERY_STRING are 
>unicode, the only answer to "what's been done to the URI?" can be 
>"wsgi.uri_encoding", which allows someone to un-do it. What more do you want?

To be sure that there's no possible way for all the broken middleware 
out there to mess this up.

Let me put it this way: out of all the times I've seen people post 
example WSGI 1 middleware code, I don't remember *any* where the 
middleware was actually complying with the spec correctly...  and 
that includes examples I wrote myself.  So I'm not real impressed 
with any solution that requires middleware to get it right.

That having been said, I'm beginning to think that PEP 383 
(surrogateescape) is actually the way to go, now that I've looked 
over the PEP, docs, and Ian's posts here about it.

First, it's compatible with CGI (os.environ) right off the bat, as 
well as being the standard way to handle this sort of issue in Python 3.

Second, it's redundancy-free: you don't need a separate environ key 
to know what's going on.

Third, it's unconditional: if you want bytes or a non-UTF-8 encoding 
you perform the same steps every time.
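
As a rough illustration of "the same steps every time", here is a minimal Python 3 sketch. The helper names (to_bytes, redecode) and the sample value are made up for the example and are not part of any WSGI proposal; it assumes the environ value was produced by UTF-8 decoding with the surrogateescape handler.

```python
def to_bytes(value):
    """Recover the original bytes from a surrogate-escaped native string."""
    return value.encode("utf-8", "surrogateescape")


def redecode(value, encoding):
    """Reinterpret the original bytes in some other (non-UTF-8) encoding."""
    return to_bytes(value).decode(encoding)


# Hypothetical request path sent as Latin-1 bytes, which is not valid UTF-8.
raw = b"/caf\xe9"

# What a server following this approach would put in environ.
environ_value = raw.decode("utf-8", "surrogateescape")

print(repr(environ_value))                       # '/caf\udce9'
print(to_bytes(environ_value))                   # b'/caf\xe9'
print(repr(redecode(environ_value, "latin-1")))  # '/café'
```

The point is that the application never has to ask which case it is in: the encode step is identical whether or not the original bytes were valid UTF-8.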

Up until now, I've not paid much attention because so many people 
kept saying you can't get surrogateescape on Python 2.  However, 
that's only an issue for code that *needs the original byte string*, 
as the old codec error handler API is sufficient for doing 
decoding.  (Meaning you could register a handler for it on older Pythons.)
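
A sketch of what such a decode-only handler might look like, using codecs.register_error. This is written in Python 3 syntax so it can be checked against the built-in handler; on Python 2 the same idea would need unichr() and ord(). The handler name "sketch_surrogateescape" and the function name are invented for the example, and the handler deliberately refuses to encode, which is the limitation mentioned above.

```python
import codecs


def escape_undecodable(exc):
    """Decode-only error handler: map each undecodable byte to U+DC00 + byte."""
    if not isinstance(exc, UnicodeDecodeError):
        # Encoding back to the original bytes is not supported by this sketch.
        raise exc
    bad = exc.object[exc.start:exc.end]
    # Undecodable bytes in UTF-8 input are >= 0x80, so they land in
    # U+DC80..U+DCFF, matching the mapping described in PEP 383.
    replacement = "".join(chr(0xDC00 + byte) for byte in bad)
    return replacement, exc.end


# Hypothetical name, chosen to avoid clashing with the built-in handler.
codecs.register_error("sketch_surrogateescape", escape_undecodable)

decoded = b"/caf\xe9".decode("utf-8", "sketch_surrogateescape")
print(repr(decoded))  # '/caf\udce9'
```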

I think this approach would let us have our cake and eat it too, for 
the most part.  WSGI on Python 2.x uses byte strings for these, and 
then 3.x works transparently.  It's a bit of a stretch to call it a 
"clarification" of WSGI 1.0, but since for all intents and purposes 
WSGI doesn't really *run* on Python 3, it might be the way to go.

To be clear, I'm talking about simply allowing (on Python 3 and in 
WSGI versions > 1.0) for all environ values to be UTF-8-decoded, 
surrogate-escaped unicode values, in the "native string" case.  (This 
would further imply that a CGI gateway would have to check whether 
the system encoding is UTF-8, and if not, transcode accordingly.)
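
For the CGI gateway case, the transcoding step could look roughly like the following Python 3 sketch. The function name transcode_environ_value is hypothetical, and it assumes os.environ values were decoded with the system (filesystem) encoding plus surrogateescape, as PEP 383 describes.

```python
import codecs
import sys


def transcode_environ_value(value, system_encoding=None):
    """Turn an os.environ string into a UTF-8-decoded, surrogate-escaped one."""
    if system_encoding is None:
        system_encoding = sys.getfilesystemencoding()
    # Normalize spellings such as "UTF-8", "utf8" and "utf_8".
    if codecs.lookup(system_encoding).name == "utf-8":
        return value  # already in the desired form
    # Undo the system decoding to get the original bytes back ...
    raw = value.encode(system_encoding, "surrogateescape")
    # ... then redo it as UTF-8, escaping whatever is not valid UTF-8.
    return raw.decode("utf-8", "surrogateescape")


# Simulate what os.environ would hold on a Latin-1 system for UTF-8 bytes.
mangled = b"/caf\xc3\xa9".decode("latin-1")
print(repr(transcode_environ_value(mangled, "latin-1")))  # '/café'
```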


