[Web-SIG] Request for Comments on upcoming WSGI Changes
P.J. Eby
pje at telecommunity.com
Mon Sep 21 23:24:13 CEST 2009
At 01:15 PM 9/21/2009 -0700, Robert Brewer wrote:
>I don't understand. If SCRIPT_NAME/PATH_INFO/QUERY_STRING are
>unicode, the only answer to "what's been done to the URI?" can be
>"wsgi.uri_encoding", which allows someone to un-do it. What more do you want?
To be sure that there's no possible way for all the broken middleware
out there to mess this up.
Let me put it this way: out of all the times I've seen people post
example WSGI 1 middleware code, I don't remember *any* where the
middleware was actually complying with the spec correctly... and
that includes examples I wrote myself. So I'm not real impressed
with any solution that requires middleware to get it right.
That having been said, I'm beginning to think that PEP 383
(surrogateescape) is actually the way to go, now that I've looked
over the PEP, docs, and Ian's posts here about it.
First, it's compatible with CGI (os.environ) right off the bat, as
well as being the standard way to handle this sort of issue in Python 3.
Second, it's redundancy-free: you don't need a separate environ key
to know what's going on.
Third, it's unconditional: if you want bytes or a non-UTF-8 encoding
you perform the same steps every time.
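
To illustrate the "same steps every time" point, here is a minimal Python 3 sketch. The helper names environ_bytes and environ_text are hypothetical, not part of any WSGI spec, and it assumes the server decoded the value as UTF-8 with surrogateescape:

```python
def environ_bytes(value):
    # Undo the server's decoding: escaped surrogates turn back into
    # the original undecodable bytes.
    return value.encode("utf-8", "surrogateescape")


def environ_text(value, encoding):
    # Reinterpret the original bytes in whatever encoding the app expects.
    return environ_bytes(value).decode(encoding)


# A Latin-1 encoded path that is not valid UTF-8.
raw_path = b"/caf\xe9"

# What a server following this proposal would put into environ.
path_info = raw_path.decode("utf-8", "surrogateescape")

original = environ_bytes(path_info)
as_latin1 = environ_text(path_info, "latin-1")
```

The application never needs to consult a separate environ key: the round trip through UTF-8 plus surrogateescape is the same whether or not the value was valid UTF-8 to begin with.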
Up until now, I've not paid much attention because so many people
kept saying you can't get surrogateescape on Python 2. However,
that's only an issue for code that *needs the original byte string*,
as the old codec error handler API is sufficient for doing
decoding. (Meaning you could register a handler for it on older Pythons.)
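
As a rough sketch of what registering such a handler could look like (written here in Python 3 syntax so it can be compared against the built-in; the handler name "sketch_surrogateescape" is made up for illustration, and an older Python would need unichr and ord in place of chr and integer iteration):

```python
import codecs


def escape_undecodable(exc):
    # Only the decoding direction is handled, per the point above.
    if not isinstance(exc, UnicodeDecodeError):
        raise exc
    bad = exc.object[exc.start:exc.end]
    # Map each offending byte 0xNN to the lone surrogate U+DCNN,
    # which is the mapping PEP 383 describes.
    replacement = "".join(chr(0xDC00 + b) for b in bad)
    # Resume decoding right after the offending bytes.
    return replacement, exc.end


# Hypothetical name, chosen so it does not clash with the built-in handler.
codecs.register_error("sketch_surrogateescape", escape_undecodable)

raw = b"/caf\xe9"
decoded = raw.decode("utf-8", "sketch_surrogateescape")
```

This only covers decoding; getting the original bytes back still needs an encoder that understands the escaped surrogates, which is the part that is harder to supply on older Pythons.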
I think this approach would let us have our cake and eat it too, for
the most part. WSGI on Python 2.x uses byte strings for these, and
then 3.x works transparently. It's a bit of a stretch to call it a
"clarification" of WSGI 1.0, but since for all intents and purposes
WSGI doesn't really *run* on Python 3, it might be the way to go.
To be clear, I'm talking about simply allowing (on Python 3 and in
WSGI versions > 1.0) for all environ values to be UTF-8-decoded,
surrogate-escaped unicode values, in the "native string" case. (This
would further imply that a CGI gateway would have to check whether
the system encoding is UTF-8, and if not, transcode accordingly.)
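
A sketch of the transcoding step a CGI gateway might perform on Python 3. The function name transcode_environ_value is hypothetical, and it assumes os.environ values were decoded with the system (filesystem) encoding plus surrogateescape, as Python 3 does on POSIX systems:

```python
import codecs
import sys


def transcode_environ_value(value, system_encoding=None):
    # Assumption: value came from os.environ, i.e. it was decoded with
    # the system encoding and the surrogateescape error handler.
    if system_encoding is None:
        system_encoding = sys.getfilesystemencoding()
    # Normalize spellings such as "UTF8" or "utf_8" before comparing.
    if codecs.lookup(system_encoding).name == "utf-8":
        return value  # already in the proposed form
    # Recover the raw bytes, then re-decode them as UTF-8,
    # escaping anything that is not valid UTF-8.
    raw = value.encode(system_encoding, "surrogateescape")
    return raw.decode("utf-8", "surrogateescape")


# Simulate a Latin-1 system that received UTF-8 bytes for "/café".
seen_by_gateway = b"/caf\xc3\xa9".decode("latin-1", "surrogateescape")
fixed = transcode_environ_value(seen_by_gateway, "latin-1")
```

On a UTF-8 system the function returns the value untouched, so the gateway's extra cost is limited to systems with a different encoding.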