[Web-SIG] WSGI for Python 3

Tres Seaver tseaver at palladion.com
Fri Jul 16 23:47:40 CEST 2010


Ian Bicking wrote:

>> IOW, the bytes/string discussion on Python-dev has kind of led me to
>> realize that we might just as well make the *entire* stack bytes (incoming
>> and outgoing headers *and* streams), and rewrite that bit in PEP 333 about
>> using str on "Python 3000" to say we go with bytes on Python 3+ for
>> everything that's a str in today's WSGI.
>>
> 
> This was my first intuition too, until I started thinking in more detail
> about the particular values involved.  Some obviously are textish, like
> environ['SERVER_NAME'].  Not a very useful value, but definitely text.
> 
> Basically all the internal strings are textish, so we're left with:

What do you mean by "internal"?  Anything in the headers or the CGI
environment is intrinsically "bytes-ish" to me.  Do you mean that you
want application programmers to have them transparently decoded?  If so,
we can make that the responsibility of the non-middleware framework /
application.
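
A rough sketch of what application-side decoding could look like, assuming
a hypothetical bytes-only environ. The helper name `decode_environ_value`,
the str keys, and the charsets chosen per key are all illustrative
assumptions, not part of any spec:

```python
def decode_environ_value(environ, key, charset="utf-8", errors="strict"):
    """Decode one bytes value from a (hypothetical) bytes-only environ."""
    raw = environ[key]
    if not isinstance(raw, bytes):
        raise TypeError("expected bytes for %r, got %s"
                        % (key, type(raw).__name__))
    return raw.decode(charset, errors)


# Hypothetical environ, as a bytes-only server might hand it over.
environ = {
    "SERVER_NAME": b"example.com",
    "PATH_INFO": b"/caf\xc3\xa9",
}

# The application, not the server or middleware, picks the charset per key.
server_name = decode_environ_value(environ, "SERVER_NAME", "ascii")
path_info = decode_environ_value(environ, "PATH_INFO", "utf-8")
```

Middleware would pass the bytes through untouched; only the outermost
framework or application would call something like this.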

> wsgi.url_scheme
> SCRIPT_NAME/PATH_INFO
> QUERY_STRING
> HTTP_*, CONTENT_TYPE, CONTENT_LENGTH (headers)
> response status
> response headers (name and value)
> 
> And there's a few things like REMOTE_USER that are kind of in the middle.
> Everyone is in agreement that bodies should be bytes.
> 
> One initial problem is that the Python 3 stdlib handles bytes poorly, so for
> instance there's no good way to reconstruct the URL using the stdlib.  That
> explains certain tensions, but I think we should ignore that, and in fact
> that's what Python-Dev seemed to say pretty clearly.

python-dev seems to me to be coming to the realization that they should
have tried harder to make real-world apps work before they froze their
choices.

> Now, the other keys:
> 
> wsgi.url_scheme: clearly ASCII
> 
> SCRIPT_NAME/PATH_INFO: often UTF-8, could be no encoding, could be some old
> legacy encoding.
> raw request path: should be ASCII (non-ASCII should be URL-encoded).  URL
> encoding happens at the byte layer, so a server could reasonably URL encode
> any non-ASCII characters without imposing any encoding.
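
To illustrate the point about URL encoding working at the byte layer, here
is a small sketch using `urllib.parse.quote_from_bytes` and
`unquote_to_bytes` from the Python 3 stdlib. The sample paths are made up;
the same character is shown in two different encodings:

```python
from urllib.parse import quote_from_bytes, unquote_to_bytes

# The same path as raw bytes in two different encodings (sample data).
raw_utf8 = b"/caf\xc3\xa9"
raw_latin1 = b"/caf\xe9"

# Percent-encoding never needs to know which encoding the bytes are in.
quoted_utf8 = quote_from_bytes(raw_utf8)      # '/caf%C3%A9'
quoted_latin1 = quote_from_bytes(raw_latin1)  # '/caf%E9'

# Unquoting gives back exactly the original bytes.
roundtrip_utf8 = unquote_to_bytes(quoted_utf8)
roundtrip_latin1 = unquote_to_bytes(quoted_latin1)
```

Both quoted results are pure ASCII, and no charset was imposed on either
path along the way.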
> 
> QUERY_STRING: should be ASCII, same as raw request path
> 
> headers: Most are ASCII.  Latin1 is a reasonable fallback and suggested by
> the specification.  The spec also implies you have to use the RFC2047 inline
> encoding (like =?iso-8859-1?q?some=20text?=), but nothing supports this and
> supporting it would probably be a bad idea for security reasons.  The
> Atompub spec (reasonably modern) specifically says Title headers should be
> encoded with RFC2047 (if they are not ISO-8859-1):
> http://tools.ietf.org/html/draft-ietf-atompub-protocol-08#page-17 --
> decoding this kind of encoding at the application layer seems reasonable to
> me.
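
As a sketch of decoding RFC 2047 encoded words at the application layer,
the stdlib's `email.header.decode_header` can do the parsing. The wrapper
`decode_rfc2047` and its ISO-8859-1 fallback are assumptions made for this
example, not something any WSGI spec prescribes:

```python
from email.header import decode_header


def decode_rfc2047(value, fallback="iso-8859-1"):
    """Decode a header value that may contain RFC 2047 encoded words."""
    parts = []
    for chunk, charset in decode_header(value):
        if isinstance(chunk, bytes):
            # Encoded words come back as bytes plus their declared charset;
            # unencoded runs next to them come back as bytes with no charset.
            parts.append(chunk.decode(charset or fallback))
        else:
            # A value with no encoded words at all is returned as str.
            parts.append(chunk)
    return "".join(parts)


title = decode_rfc2047("=?iso-8859-1?q?some=20text?=")
plain = decode_rfc2047("some text")
```

An unknown charset name in the header would raise `LookupError` here, so a
real application would want to guard against untrusted input.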
> 
> cookie header: this specific header can easily have multiple encodings, as
> the browser encodes data then treats it as opaque bytes, so a cookie can be
> set via UTF-8 one place, Latin1 another, and those coexist in one header.
> That is, there is no real encoding and this should be treated as bytes.
> (Latin1 is an approximation of bytes... a spotty way to treat bytes, but
> entirely workable.)
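
The "Latin1 is an approximation of bytes" remark can be checked directly:
every byte value maps to exactly one code point under Latin-1, so decoding
and re-encoding is lossless. A small sketch, with a made-up cookie value
that mixes UTF-8 and Latin-1 in one header:

```python
# Every possible byte value survives a Latin-1 round trip.
all_bytes = bytes(range(256))
as_text = all_bytes.decode("latin-1")
roundtrip = as_text.encode("latin-1")

# Hypothetical cookie header: one value set as UTF-8, another as Latin-1.
cookie_raw = b"a=\xc3\xa9; b=\xe9"

# Latin-1 decoding cannot fail, and the original bytes are recoverable.
cookie_text = cookie_raw.decode("latin-1")
recovered = cookie_text.encode("latin-1")

# Decoding the whole header as UTF-8 fails on the Latin-1 part.
try:
    cookie_raw.decode("utf-8")
    utf8_ok = True
except UnicodeDecodeError:
    utf8_ok = False
```

The decoded text is not meaningful as text (the UTF-8 value shows up as
two characters), which is the "spotty" part; it is only a carrier for the
bytes.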
> 
> response status: I believe the spec says this must be Latin1/ISO-8859-1.  In
> practice it is almost always ASCII, and since it is not user-visible it's
> not something that really needs localization.
> 
> response headers: the spec implies Latin1, in practice the Set-Cookie header
> is bytes (since interoperation with wonky legacy systems is not uncommon).
> I'm not sure of any other exceptions?
> 
> 
> So... to me it seems pretty reasonable for HTTP specifically that text can
> work.  And it feels weird that, say, environ['SERVER_NAME'] be text and
> environ['HTTP_HOST'] not, and I don't know what environ['REMOTE_ADDR']
> should be in that mode.  And it would also be weird if
> environ['SERVER_NAME'] was bytes.


> In the past when we've gotten down to specifics, the only holdup has been
> SCRIPT_NAME/PATH_INFO, hence my suggestion to eliminate those.

I think I favor PJE's suggestion:  let WSGI deal only in bytes.
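
For concreteness, a bytes-only application might look roughly like the
sketch below. This is purely hypothetical: it is not the interface defined
by PEP 333, and the names `bytes_app` and the stand-in `start_response`
are invented for the example:

```python
def bytes_app(environ, start_response):
    """Hypothetical bytes-only app: status, headers and body are all bytes."""
    path = environ.get("PATH_INFO", b"/")
    body = b"You asked for " + path + b"\n"
    start_response(b"200 OK", [
        (b"Content-Type", b"text/plain"),
        (b"Content-Length", str(len(body)).encode("ascii")),
    ])
    return [body]


# Minimal stand-in for a server, just to exercise the callable.
captured = {}


def start_response(status, headers):
    captured["status"] = status
    captured["headers"] = headers


chunks = bytes_app({"PATH_INFO": b"/caf\xc3\xa9"}, start_response)
```

Nothing in the app had to guess an encoding for the path; it was copied
into the response as the bytes it arrived as.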



Tres.
--
===================================================================
Tres Seaver          +1 540-429-0999          tseaver at palladion.com
Palladion Software   "Excellence by Design"    http://palladion.com