[Web-SIG] WSGI for Python 3

Ian Bicking ianb at colorstudy.com
Fri Jul 16 21:28:25 CEST 2010


On Fri, Jul 16, 2010 at 1:40 PM, P.J. Eby <pje at telecommunity.com> wrote:

> At 11:07 AM 7/16/2010 -0500, Ian Bicking wrote:
>
>> And this doesn't help with Python 3: either we have byte values of
>> SCRIPT_NAME and PATH_INFO in Python 3, or we have text values.  I think
>> bytes will be more awkward to port to than text, and inconsistent with other
>> WSGI values.
>>
>
> OTOH, it has the tremendous advantage of pushing the encoding question onto
> the app (or framework) developer...  who's really the only one who can make
> the right decision for their particular application.  And personally, I'd
> rather have clear boundaries between text and bytes, such that porting (even
> if tedious or awkward) is *consistent*, and clear as to when you're
> finished, not, "oh, did I check to make sure I converted SCRIPT_NAME and
> PATH_INFO...  not just in my app code, but in all the library code I call
> *from* my app?"
>
> IOW, the bytes/string discussion on Python-dev has kind of led me to
> realize that we might just as well make the *entire* stack bytes (incoming
> and outgoing headers *and* streams), and rewrite that bit in PEP 333 about
> using str on "Python 3000" to say we go with bytes on Python 3+ for
> everything that's a str in today's WSGI.
>

This was my first intuition too, until I started thinking in more detail
about the particular values involved.  Some obviously are textish, like
environ['SERVER_NAME'].  Not a very useful value, but definitely text.

Basically all the internal strings are textish, so we're left with:

wsgi.url_scheme
SCRIPT_NAME/PATH_INFO
QUERY_STRING
HTTP_*, CONTENT_TYPE, CONTENT_LENGTH (headers)
response status
response headers (name and value)

And there are a few things like REMOTE_USER that are kind of in the middle.
Everyone is in agreement that bodies should be bytes.

One initial problem is that the Python 3 stdlib handles bytes poorly, so for
instance there's no good way to reconstruct the URL using the stdlib.  That
explains certain tensions, but I think we should ignore that, and in fact
that's what Python-Dev seemed to say pretty clearly.

Now, the other keys:

wsgi.url_scheme: clearly ASCII

SCRIPT_NAME/PATH_INFO: often UTF-8, could be no encoding, could be some old
legacy encoding.
raw request path: should be ASCII (non-ASCII should be URL-encoded).  URL
encoding happens at the byte layer, so a server could reasonably URL encode
any non-ASCII characters without imposing any encoding.
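
A rough sketch of what "URL encoding at the byte layer" could look like on
Python 3.  The helper name to_ascii_path is made up for illustration, and
treating only "/" as safe is an assumption rather than anything WSGI
specifies.  The point is just that the server never has to guess a charset:

```python
from urllib.parse import quote_from_bytes, unquote_to_bytes


def to_ascii_path(raw_path: bytes) -> str:
    """Hypothetical helper: percent-encode raw path bytes into ASCII text.

    Bytes outside the unreserved set (and "/") become %XX escapes, so no
    character encoding is ever imposed on the path.
    """
    return quote_from_bytes(raw_path, safe="/")


# The same visible path, as two different clients might send it.
utf8_path = "/caf\u00e9".encode("utf-8")
latin1_path = "/caf\u00e9".encode("latin-1")

print(to_ascii_path(utf8_path))    # /caf%C3%A9
print(to_ascii_path(latin1_path))  # /caf%E9

# The original bytes are fully recoverable by the application.
assert unquote_to_bytes(to_ascii_path(utf8_path)) == utf8_path
assert unquote_to_bytes(to_ascii_path(latin1_path)) == latin1_path
```

The application can then unquote to bytes and pick whatever decoding it
believes is right for that particular URL.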

QUERY_STRING: should be ASCII, same as raw request path

headers: Most are ASCII.  Latin1 is a reasonable fallback and suggested by
the specification.  The spec also implies you have to use the RFC2047 inline
encoding (like =?iso-8859-1?q?some=20text?=), but nothing supports this and
supporting it would probably be a bad idea for security reasons.  The
Atompub spec (reasonably modern) specifically says Title headers should be
encoded with RFC2047 (if they are not ISO-8859-1):
http://tools.ietf.org/html/draft-ietf-atompub-protocol-08#page-17 --
decoding this kind of encoding at the application layer seems reasonable to
me.
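
For what it's worth, decoding RFC2047 at the application layer is not much
code with the stdlib.  This is only a sketch: decode_rfc2047 is a name I made
up, and it assumes the header value has already reached the app as text.  It
leans on email.header.decode_header, which is a real stdlib function:

```python
from email.header import decode_header


def decode_rfc2047(value: str) -> str:
    """Hypothetical app-level helper: decode RFC 2047 encoded-words in a header."""
    parts = []
    for chunk, charset in decode_header(value):
        if isinstance(chunk, bytes):
            # Encoded words carry their own charset; unlabeled byte chunks
            # are assumed to be ASCII here.
            parts.append(chunk.decode(charset or "ascii", errors="replace"))
        else:
            # No encoded-words were found; the value comes back unchanged.
            parts.append(chunk)
    return "".join(parts)


print(decode_rfc2047("=?iso-8859-1?q?some=20text?="))  # some text
print(decode_rfc2047("plain title"))                   # plain title
```

An app would call this only on the headers where it expects such encoding
(a Title header, say), which sidesteps the security worry about servers
decoding every header blindly.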

cookie header: this specific header can easily have multiple encodings, as
the browser encodes data then treats it as opaque bytes, so a cookie can be
set via UTF-8 one place, Latin1 another, and those coexist in one header.
That is, there is no real encoding and this should be treated as bytes.
(Latin1 is an approximation of bytes... a spotty way to treat bytes, but
entirely workable.)
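
To make the "Latin1 is an approximation of bytes" point concrete, here is a
small sketch.  The cookie names and values are invented for the example, and
the parsing is deliberately naive.  Because Latin1 maps every byte to exactly
one code point, the decode can't fail and the original bytes always come back:

```python
# One Cookie header carrying two values in two different encodings
# (made-up data for illustration).
raw = (
    b"a=" + "caf\u00e9".encode("utf-8")
    + b"; b=" + "caf\u00e9".encode("latin-1")
)

# Decoding as Latin1 never raises: each byte becomes one code point.
as_text = raw.decode("latin-1")

# The round trip is lossless, so the text value is really bytes in disguise.
assert as_text.encode("latin-1") == raw

# Naive split on the text form, then re-decode each value with the
# encoding the application knows it used when setting that cookie.
cookies = dict(item.split("=", 1) for item in as_text.split("; "))
a = cookies["a"].encode("latin-1").decode("utf-8")
b = cookies["b"].encode("latin-1").decode("latin-1")

print(a, b)  # café café
```

The awkward part is that cookies["a"] looks like text but is mojibake until
the app re-encodes it, which is what I mean by spotty.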

response status: I believe the spec says this must be Latin1/ISO-8859-1.  In
practice it is almost always ASCII, and since it is not user-visible it's
not something that really needs localization.

response headers: the spec implies Latin1, in practice the Set-Cookie header
is bytes (since interoperation with wonky legacy systems is not uncommon).
I'm not sure of any other exceptions?


So... to me it seems pretty reasonable for HTTP specifically that text can
work.  And it feels weird that, say, environ['SERVER_NAME'] be text and
environ['HTTP_HOST'] not, and I don't know what environ['REMOTE_ADDR']
should be in that mode.  And it would also be weird if
environ['SERVER_NAME'] was bytes.

In the past when we've gotten down to specifics, the only holdup has been
SCRIPT_NAME/PATH_INFO, hence my suggestion to eliminate those.

-- 
Ian Bicking  |  http://blog.ianbicking.org