[Web-SIG] WSGI 2

Wed Aug 12 06:58:50 CEST 2009

2009/8/12 Ian Bicking <ianb at colorstudy.com>:
> On Tue, Aug 11, 2009 at 11:19 PM, Robert Brewer <fumanchu at aminus.org> wrote:
>>
>> > 5. When running under Python 3, servers MUST provide CGI HTTP and
>> > server variables as strings. Where such values are sourced from a byte
>> > string, be that a Python byte string or C string, they should be
>> > converted as 'UTF-8'. If a specific web server infrastructure is able
>> > to support different encodings, then the WSGI adapter MAY provide a
>> > way for a user of the WSGI adapter to customise on a global basis, or
>> > on a per value basis what encoding is used, but this is entirely
>> > optional. Note that there is no requirement to deal with RFC 2047.
>>
>> We're passing unicode for almost everything.
>>
>> REQUEST_METHOD and wsgi.url_scheme are parsed from the Request-Line, and
>> must be ascii-decodable. So are SERVER_PROTOCOL and our custom
>> ACTUAL_SERVER_PROTOCOL entries.
>>
>> The original bytes of the Request-URI are stored in REQUEST_URI. However,
>> PATH_INFO and QUERY_STRING are parsed from it, and decoded via a
>> configurable charset, defaulting to UTF-8. If the path cannot be decoded
>> with that charset, ISO-8859-1 is tried. Whichever is successful is stored at
>> environ['REQUEST_URI_ENCODING'] so middleware and apps can transcode if
>> needed. Our origin server always sets SCRIPT_NAME to '', but if we populated
>> it, we would make it decoded by the same charset.
>
> My understanding is that PATH_INFO *should* be UTF-8 regardless of what
> encoding a page might be in. At least that's what I got when testing
> Firefox.  It might not be valid UTF-8 if it was manually constructed, but
> then there's little reason to think it is valid anything; only the bytes or
> REQUEST_URI are likely to be an accurate representation.

As I understood it, PJE was suggesting that wasn't the case.

For example, what about the case where the URL is the target of a form
POST and the page containing the form wasn't encoded as UTF-8? What is
the browser going to send in that case?

Or is this the sort of case you have tested and are covering when you
say that anything could happen if the URL was manually constructed?
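
To make the question concrete, here is a small Python 3 sketch of how the
percent-encoded bytes would differ if a client encoded a non-ASCII path
using the page's charset rather than UTF-8. The path '/café' and the choice
of ISO-8859-1 are illustrative assumptions only; this is not a record of
what any particular browser actually sends.

```python
from urllib.parse import quote, unquote_to_bytes

path = "/café"  # hypothetical path, for illustration only

# The same characters percent-encoded under two different charsets.
as_utf8 = quote(path, encoding="utf-8")
as_latin1 = quote(path, encoding="iso-8859-1")

print(as_utf8)    # /caf%C3%A9
print(as_latin1)  # /caf%E9

# A server that url-decodes the ISO-8859-1 form gets bytes that are
# not valid UTF-8, so a strict UTF-8 decode raises an error.
raw = unquote_to_bytes(as_latin1)
try:
    raw.decode("utf-8")
    decodes_as_utf8 = True
except UnicodeDecodeError:
    decodes_as_utf8 = False

print(decodes_as_utf8)  # False
```

So if that situation can arise, a server that assumes UTF-8 for PATH_INFO
needs some defined behaviour for the failure case.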

> (Frankly I wish
> PATH_INFO was not url-decoded, which would remove this issue entirely --
> REQUEST_URI, or any url-encoded value, should really be ASCII, and I don't
> know of reasonable cases where this wouldn't be true.)
> I suppose ISO-8859-1 is a reasonable fallback in this case, as it can be
> used to kind of reconstruct the original request path (the surrogateescape
> or whatever it is called would serve the same purpose, but is only available
> in Python 3).
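
For reference, the fallback being discussed could be sketched roughly as
below. This is only my reading of the description above, not code taken
from any server; decode_path is a hypothetical helper name. It tries the
configured charset first, falls back to ISO-8859-1, and reports which one
was used so that the original bytes can be recovered. The last two lines
show the Python 3 surrogateescape alternative.

```python
def decode_path(raw, charset="utf-8"):
    """Return (text, encoding_used) for a url-decoded bytes path.

    Hypothetical helper: try the configured charset, then fall back
    to ISO-8859-1, which can decode any byte sequence.
    """
    try:
        return raw.decode(charset), charset
    except UnicodeDecodeError:
        return raw.decode("iso-8859-1"), "iso-8859-1"


raw = b"/caf\xe9"  # not valid UTF-8
text, used = decode_path(raw)

# Knowing which encoding succeeded lets an app get the bytes back.
recovered = text.encode(used)

# Python 3 only: surrogateescape also round-trips undecodable bytes.
escaped = raw.decode("utf-8", "surrogateescape")
recovered_se = escaped.encode("utf-8", "surrogateescape")
```

In both cases the application can get back to the original bytes, which is
the property that matters if the decoded value turns out to be wrong.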

Graham