[Web-SIG] Should PEP 3333 be Python 3-only? What about transcoding?

P.J. Eby pje at telecommunity.com
Thu Nov 4 00:19:34 CET 2010


As I've been tidying up wsgiref in the stdlib for PEP 3333, I've 
noticed a bit of an issue with the PEP where CGI variables are concerned.

Currently, the CGI example is the same as it is in PEP 333, which 
means that it's correct code for Python 2.x, but wrong for 3.x due to 
the environment transcoding issue.  (See 
http://bugs.python.org/issue10155 for details.)

There are other code sample differences, too.  In effect, PEP 3333 is 
still using Python 2 code samples, because it's trying to cover every 
version of Python from 2.1 through 3.2.

Should we ditch that, and say, "hey, if you want Python 2.x code 
samples, go see PEP 333?"

That will simplify a couple of things, but still won't address the 
transcoding issue.

Specifically, the problem is that on Python 3, os.environ contains 
*unicode*, not bytes masquerading as unicode.  Unfortunately, this 
means that it very possibly contains garbage for CGI variables, as 
the web server puts bytes in the environment, then Python converts 
those bytes to unicode using the system encoding + surrogateescape.

To get back to bytes, then, we have to encode using the same 
combination, then decode those bytes as latin-1 to get back to a 
WSGI-compatible string.
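
A minimal sketch of that round trip, for illustration only.  The 
function name to_wsgi_string is hypothetical, and it assumes the 
environment was decoded with sys.getfilesystemencoding() plus 
surrogateescape, which may not hold on every platform:

```python
import sys

def to_wsgi_string(value, enc=None, errors="surrogateescape"):
    """Undo the environment decoding, then re-read the bytes as latin-1.

    `enc` is assumed to be the encoding Python used when it built
    os.environ; the filesystem encoding is used if none is given.
    """
    if enc is None:
        enc = sys.getfilesystemencoding()
    # Step 1: recover the bytes the web server originally supplied.
    raw = value.encode(enc, errors)
    # Step 2: expose those bytes as a "bytes-in-unicode" WSGI string.
    return raw.decode("iso-8859-1")
```

Because latin-1 maps every byte to exactly one code point, the second 
step cannot fail, and the application can get the original bytes back 
with value.encode("iso-8859-1").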

The hitch is this: not everything in os.environ comes from an HTTP 
request, and whatever doesn't shouldn't be transcoded in such a 
fashion.  For example, if you transcode TMP or HOME or even 
DOCUMENT_ROOT that way, you're going to get rubbish.

In wsgiref for the stdlib, I've used a variation of And Clover's 
patch in issue #10155 to implement something that *only* transcodes 
CGI variables that come from the web client request, but it's 
dreadfully complex.
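
To give a rough idea of the selective approach, here is a heavily 
simplified sketch.  It is not the wsgiref code: the key list, the 
HTTP_ prefix rule, and the names REQUEST_KEYS, needs_transcode and 
selective_transcode are all assumptions made for illustration, and a 
real implementation has to deal with more variables and with 
server-specific behaviour:

```python
import sys

# Illustrative subset of request-derived CGI variables (an assumption,
# not a complete or authoritative list).
REQUEST_KEYS = {"SCRIPT_NAME", "PATH_INFO", "QUERY_STRING"}

def needs_transcode(key):
    """Guess whether a variable carries bytes from the client request."""
    return key in REQUEST_KEYS or key.startswith("HTTP_")

def selective_transcode(environ, enc=None, errors="surrogateescape"):
    """Return a copy of environ with only request variables transcoded."""
    if enc is None:
        enc = sys.getfilesystemencoding()
    result = {}
    for key, value in environ.items():
        if needs_transcode(key):
            # Recover the server's bytes, then re-read them as latin-1.
            value = value.encode(enc, errors).decode("iso-8859-1")
        # Everything else (TMP, HOME, ...) is passed through untouched.
        result[key] = value
    return result
```

Even this toy version shows where the complexity comes from: the hard 
part is deciding which keys qualify, not the transcoding itself.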

This isn't really a problem in wsgiref, because as far as I know, 
nobody else has bothered to make another CGI WSGI runner besides the 
one in wsgiref, and the sample in the PEP.

But it is a problem for the PEP, because the complexity involved is 
high -- so high it would completely obscure the essential simplicity 
of the CGI example, if it were written in-line.

There are many possible ways to address this, but my current leaning is to:

1. Change the PEP 3333 code samples to Python 3 only, and 
backreference PEP 333 for Python 2 code samples

2. Make the CGI sample in 3333 do an indiscriminate transcode (which 
only takes a few lines) and add a note to indicate that a robust CGI 
implementation should only do it to CGI variables, suggesting the 
wsgiref.handlers.read_environ() code as an example.

Any thoughts?
