[Web-SIG] WSGI 2

Ian Bicking ianb at colorstudy.com
Wed Aug 12 06:42:51 CEST 2009


On Tue, Aug 11, 2009 at 11:19 PM, Robert Brewer <fumanchu at aminus.org> wrote:

> > 5. When running under Python 3, servers MUST provide CGI HTTP and
> > server variables as strings. Where such values are sourced from a byte
> > string, be that a Python byte string or C string, they should be
> > converted as 'UTF-8'. If a specific web server infrastructure is able
> > to support different encodings, then the WSGI adapter MAY provide a
> > way for a user of the WSGI adapter to customise on a global basis, or
> > on a per value basis what encoding is used, but this is entirely
> > optional. Note that there is no requirement to deal with RFC 2047.
>
> We're passing unicode for almost everything.
>
> REQUEST_METHOD and wsgi.url_scheme are parsed from the Request-Line, and
> must be ascii-decodable. So are SERVER_PROTOCOL and our custom
> ACTUAL_SERVER_PROTOCOL entries.
>
> The original bytes of the Request-URI are stored in REQUEST_URI. However,
> PATH_INFO and QUERY_STRING are parsed from it, and decoded via a
> configurable charset, defaulting to UTF-8. If the path cannot be decoded
> with that charset, ISO-8859-1 is tried. Whichever is successful is stored at
> environ['REQUEST_URI_ENCODING'] so middleware and apps can transcode if
> needed. Our origin server always sets SCRIPT_NAME to '', but if we populated
> it, we would make it decoded by the same charset.
>

My understanding is that PATH_INFO *should* be UTF-8 regardless of what
encoding a page might be in.  At least that's what I got when testing with
Firefox.  It might not be valid UTF-8 if it was manually constructed, but
then there's little reason to think it is valid anything; only the raw bytes
or REQUEST_URI are likely to be an accurate representation.  (Frankly I wish
PATH_INFO were not url-decoded, which would remove this issue entirely:
REQUEST_URI, or any url-encoded value, should really be ASCII, and I don't
know of reasonable cases where this wouldn't be true.)
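
To illustrate the point about url-encoded values being ASCII, here is a
minimal Python 3 sketch using the standard urllib.parse module.  The path
"/café" is just a made-up example, and this is not what any particular
server does; it only shows that the charset question appears at the moment
the value is url-decoded.

```python
from urllib.parse import quote, unquote_to_bytes

# Hypothetical request path containing a non-ASCII character.
raw_path = "/caf\u00e9"

# quote() percent-encodes the UTF-8 bytes by default, so the
# url-encoded form is pure ASCII.
encoded = quote(raw_path)
encoded_ascii = encoded.encode("ascii")  # does not raise

# Url-decoding yields bytes, not text; a charset must now be chosen
# to turn them into a string.
decoded_bytes = unquote_to_bytes(encoded)
```

As long as a value stays url-encoded, no guess about the encoding is needed.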

I suppose ISO-8859-1 is a reasonable fallback in this case, as every byte
decodes under it, so it can be used to reconstruct the original request path
(the surrogateescape error handler would serve the same purpose, but is only
available in Python 3).
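
A rough Python 3 sketch of that fallback, for what it's worth.  The helper
names decode_path and original_bytes are mine, invented for illustration;
this is not CherryPy's actual code, just the idea of recording which charset
succeeded so the bytes can be recovered later.

```python
def decode_path(raw, charset="utf-8"):
    """Decode raw path bytes, falling back to ISO-8859-1.

    Returns (text, encoding_used); the second value plays the role of
    something like environ['REQUEST_URI_ENCODING'].
    """
    try:
        return raw.decode(charset), charset
    except UnicodeDecodeError:
        # ISO-8859-1 maps every byte to a code point, so this cannot fail.
        return raw.decode("iso-8859-1"), "iso-8859-1"


def original_bytes(text, encoding):
    """Recover the bytes that decode_path() was given."""
    return text.encode(encoding)


# A path that is not valid UTF-8 (a lone 0xE9 byte).
bad = b"/caf\xe9"
text, used = decode_path(bad)
recovered = original_bytes(text, used)

# The surrogateescape error handler gives the same round trip
# without switching charsets.
escaped = bad.decode("utf-8", "surrogateescape")
roundtrip = escaped.encode("utf-8", "surrogateescape")
```

Either way the application can get back to the bytes the client sent, which
is the only thing that really matters when the path isn't valid UTF-8.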

-- 
Ian Bicking  |  http://blog.ianbicking.org  |
http://topplabs.org/civichacker