[Web-SIG] Proposal to remove SCRIPT_NAME/PATH_INFO

Ian Bicking ianb at colorstudy.com
Wed Sep 23 04:22:48 CEST 2009


OK, I mentioned this in the last thread, but... I can't keep up with all
this discussion, and I bet you can't either.

So, here's a rough proposal for WSGI and unicode:

I propose we switch primarily to "native" strings: str on both Python 2 and
3.

Specifically:

environ keys: native
environ CGI values: native
wsgi.* (that is text): native
response status: native
response headers: native

wsgi.input remains byte-oriented, as does the response app_iter.

I then propose that we eliminate SCRIPT_NAME and PATH_INFO.  Instead we
have:

wsgi.script_name
wsgi.path_info (I'm not entirely set on these names)

Together these form the original request path.  The path is not URL decoded,
so it should be ASCII.  (I believe non-ASCII could be rejected by the server,
with Bad Request?  A server could also choose to treat it as UTF8 or Latin1
and percent-encode unsafe characters to make it ASCII.)  Thus to re-form the
URL, you do:

url = (environ['wsgi.url_scheme'] + '://' + environ['HTTP_HOST'] +
       environ['wsgi.script_name'] + environ['wsgi.path_info'] + '?' +
       environ['QUERY_STRING'])
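
A rough Python 3 sketch of the same idea, for illustration only.  The keys
wsgi.script_name and wsgi.path_info are the names proposed here, not part of
any existing spec, and ascii_safe_path and reconstruct_url are hypothetical
helpers.  One assumption that differs from the expression above: the '?' is
left off when the query string is empty.

```python
def ascii_safe_path(raw_path: bytes) -> str:
    """Hypothetical server-side helper: make a raw request path pure ASCII."""
    # Only bytes outside printable ASCII are percent-encoded.  Existing %XX
    # escapes and reserved characters are left alone, because the path is
    # deliberately NOT URL-decoded.
    return ''.join(
        chr(b) if 0x21 <= b <= 0x7E else '%{:02X}'.format(b)
        for b in raw_path
    )


def reconstruct_url(environ: dict) -> str:
    """Hypothetical application-side helper: rebuild the request URL."""
    url = (environ['wsgi.url_scheme'] + '://' + environ['HTTP_HOST'] +
           environ['wsgi.script_name'] + environ['wsgi.path_info'])
    query = environ.get('QUERY_STRING', '')
    if query:  # assumption: skip the '?' for an empty query string
        url += '?' + query
    return url


environ = {
    'wsgi.url_scheme': 'http',
    'HTTP_HOST': 'example.com',
    'wsgi.script_name': '/app',
    'wsgi.path_info': ascii_safe_path(b'/caf\xc3\xa9'),
    'QUERY_STRING': 'x=1',
}
print(reconstruct_url(environ))
```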

All incoming headers will be treated as Latin1.  If an application suspects
another encoding, it is up to the application to transcode the header into
another encoding.  The transcoded value should not be put into the environ.
In most cases headers should be ASCII, and Latin1 is simply a fallback that
allows all bytes to be represented in both Python 2 and 3.
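
To illustrate why Latin1 works as a lossless fallback, here is a small
Python 3 sketch.  The function names are hypothetical; the point is only
that every byte value maps to exactly one Latin1 code point, so an
application can always get the original bytes back and transcode them
itself.

```python
def header_bytes_to_native(raw: bytes) -> str:
    """What a server would do under this proposal: decode as Latin1."""
    return raw.decode('latin1')


def transcode_header(value: str, encoding: str = 'utf-8') -> str:
    """What an application would do if it suspects another encoding."""
    # Going back through Latin1 recovers the exact bytes sent by the client.
    return value.encode('latin1').decode(encoding)


# Every possible byte survives the Latin1 round trip unchanged.
all_bytes = bytes(range(256))
round_tripped = all_bytes.decode('latin1').encode('latin1')

# A client that (unwisely) sent UTF8 bytes in a header:
raw = 'café'.encode('utf-8')
native = header_bytes_to_native(raw)   # mojibake, but no information lost
text = transcode_header(native)        # the application's own copy
```

Per the rule above, text would stay in the application; it would not be
written back into the environ.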

Similarly all outgoing headers will be Latin1.  Thus if you (against good
sense) decide to put UTF8 into a cookie, you can do:

headers.append(('Set-Cookie', unicode_text.encode('UTF8').decode('latin1')))

The server will then encode the text as latin1, sending the UTF8 bytes.
This is lame, but non-ASCII in headers is lame.  It would be preferable to
do:

headers.append(('Set-Cookie', urllib.quote(unicode_text.encode('UTF8'))))

This sends different text, but is highly preferable.  If you wanted to parse
a cookie that was set as UTF8, you'd do:

parse_cookie(environ['HTTP_COOKIE'].encode('latin1').decode('utf8'))

Again, it would be better to do:

parse_cookie(urllib.unquote(environ['HTTP_COOKIE']).decode('utf8'))
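
For comparison, a Python 3 sketch of both approaches applied to a single
cookie value.  This assumes Python 3's urllib.parse.quote and unquote, which
use UTF8 by default; the variable names are made up for illustration, and
cookie parsing itself is left out.

```python
from urllib.parse import quote, unquote

value = 'naïve'

# The "lame" way: smuggle UTF8 bytes through a Latin1 native string.
header_value = value.encode('utf-8').decode('latin1')
wire_bytes = header_value.encode('latin1')       # what the server would send
incoming = wire_bytes.decode('latin1')           # what the environ would hold
recovered = incoming.encode('latin1').decode('utf-8')

# The preferable way: percent-encode, so the header stays pure ASCII.
quoted = quote(value)            # quote() encodes str as UTF8 by default
unquoted = unquote(quoted)       # unquote() decodes as UTF8 by default
```

The quoted form survives any proxy or server that mangles non-ASCII header
bytes, which is the main reason to prefer it.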

Other variables like environ['wsgi.url_scheme'], environ['CONTENT_TYPE'],
etc., will be native strings.  A Python 3 hello world app will then look like:

def hello_world(environ):
    return ('200 OK', [('Content-type', 'text/html; charset=utf8')],
            ['Hello World!'.encode('utf8')])
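
To make the division of labour concrete, here is a sketch of what a server
might do with that return value: native strings for status and headers get
encoded as Latin1, while the body is already bytes.  serialize_response is a
hypothetical helper, the return-a-tuple calling convention is just the one
used in the example above, and a real server would stream the body rather
than join it.

```python
def hello_world(environ):
    # Same app as above, repeated so this sketch is self-contained.
    return ('200 OK', [('Content-type', 'text/html; charset=utf8')],
            ['Hello World!'.encode('utf8')])


def serialize_response(status, headers, app_iter):
    """Hypothetical server-side step: turn the app's result into wire bytes."""
    head = 'HTTP/1.1 ' + status + '\r\n'
    for name, value in headers:
        head += name + ': ' + value + '\r\n'
    head += '\r\n'
    # Status and headers are native strings, encoded as Latin1 on the way out;
    # the body chunks are bytes and pass through untouched.
    return head.encode('latin1') + b''.join(app_iter)


wire = serialize_response(*hello_world({}))
```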

start_response and changes to wsgi.input are incidental to what I'm
proposing here (except that wsgi.input will be bytes); we can decide about
them separately.



Outstanding issues:

Well, the biggie: is it right to use native strings for the environ values,
and response status/headers?  Specifically, tricks like the latin1
transcoding won't work in Python 2, but will in Python 3.  Is this weird?
Or just something you have to think about when using the two Python
versions?

What happens if you give unicode text in the response headers that cannot be
encoded as Latin1?
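
For reference, this is what Python 3 itself does in that case.  The sketch
below only shows the language behaviour; turning it into a ValueError is one
possible choice, not a decision about what servers should do, and
encode_header_value is a hypothetical name.

```python
def encode_header_value(value: str) -> bytes:
    """Hypothetical server helper: encode a header value for the wire."""
    try:
        return value.encode('latin1')
    except UnicodeEncodeError as exc:
        # Code points above U+00FF have no Latin1 representation.
        raise ValueError(
            'header value cannot be encoded as Latin1: %r' % value
        ) from exc


ok = encode_header_value('café')     # U+00E9 fits in Latin1
try:
    encode_header_value('€')         # U+20AC does not
    failed = False
except ValueError:
    failed = True
```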

Should some things specifically be ASCII?  E.g., status.

Should some things be unicode on Python 2?

Is there a common case here that would be inefficient?



-- 
Ian Bicking  |  http://blog.ianbicking.org  |
http://topplabs.org/civichacker