[Web-SIG] Request for Comments on upcoming WSGI Changes

Tue Sep 22 04:35:52 CEST 2009

Armin is fast asleep now, so it's my shift. :-)

He did point me to this specific email for closer attention,
indicating issues with QUERY_STRING and wsgi.uri_encoding due to
something mentioned here. I didn't quite get what he was talking
about, but then I believe he has some wrong statements in his PEP-XXX
about QUERY_STRING. I'll make a few of my own comments about this
email, and then maybe those who are still awake can help in
understanding the issues raised here.

2009/9/22 And Clover <and-py at doxdesk.com>:
> Armin Ronacher wrote:
>
>> The middleware can never know.
>
> It's much more likely than to know than the server though!
>
>> WSGI will demand UTF-8 URLs and only
>> provide iso-XXX support for backwards compatibility.
>
> It doesn't sound much like backwards compatibility to me if non-UTF-8 URLs
> break as soon as they coincidentally happen to be UTF-8 byte sequences. I'm
> as much an advocate of "UTF-8 for everything everywhere!" as anyone else,
> but unfortunately today there are still dark places where you need non-UTF-8
> URLs.

The URLs don't break. As mentioned elsewhere, though perhaps not
clearly enough, if it is known that an application or some subset of
its URLs will always receive requests encoded as something other than
UTF-8, then it should employ code in those cases to always transcode
to the required encoding. Thus something like:

    import codecs

    TARGET = 'iso-8859-7'

    def redecode(string, encoding):
        # Recover the original bytes, then decode them as TARGET.
        return string.encode(encoding).decode(TARGET)

    # Compare normalized codec names so that aliases and case differences
    # ('UTF8', 'utf-8') are treated as the same encoding.
    current = environ['wsgi.uri_encoding']
    if codecs.lookup(current).name != codecs.lookup(TARGET).name:
        environ['PATH_INFO'] = redecode(environ['PATH_INFO'], current)
        environ['SCRIPT_NAME'] = redecode(environ['SCRIPT_NAME'], current)
        environ['wsgi.uri_encoding'] = TARGET

This could be part of the actual application if it needs to be
selective based on URLs, or a WSGI middleware which wraps the WSGI
application and makes the adjustment for it.
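
To make the middleware variant concrete, here is a minimal sketch. The
class name RedecodeMiddleware and the helper same_codec are made up
for illustration, and wsgi.uri_encoding is the key proposed in this
thread rather than anything in WSGI 1.0. It assumes the server decoded
with an encoding that round-trips, which holds for the utf-8 and
iso-8859-1 choices discussed here.

```python
import codecs


def same_codec(a, b):
    """Compare codec names after normalization ('UTF8' == 'utf-8')."""
    return codecs.lookup(a).name == codecs.lookup(b).name


class RedecodeMiddleware:
    """Hypothetical wrapper that re-decodes the path variables."""

    def __init__(self, app, encoding='iso-8859-7'):
        self.app = app
        self.encoding = encoding

    def __call__(self, environ, start_response):
        current = environ.get('wsgi.uri_encoding')
        # Leave things alone if the encoding is unknown or already right.
        if current and not same_codec(current, self.encoding):
            for key in ('SCRIPT_NAME', 'PATH_INFO'):
                # Undo the server's decode, then decode as wanted.
                raw = environ.get(key, '').encode(current)
                environ[key] = raw.decode(self.encoding)
            environ['wsgi.uri_encoding'] = self.encoding
        return self.app(environ, start_response)
```

Wrapping is then just application = RedecodeMiddleware(application).
Note that the second decode can still raise UnicodeDecodeError for
bytes the target encoding does not define, so a real version would
need to decide what to do in that case.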

The other fallback is that a specific WSGI server could elect to
provide an option to not use 'UTF-8' as the first choice for decoding
and instead use a user supplied value from the WSGI server's
configuration. Robert already showed as pseudo code what the WSGI
server would do:

   try:
       decode_uri(userdefault or 'utf-8')
   except UnicodeDecodeError:
       decode_uri('iso-8859-1')
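
Spelled out as something runnable, that fallback might look like the
sketch below. The function name decode_uri_values and its arguments
are my own invention for illustration; it assumes the server holds the
raw values as bytes and decodes all of them with the same encoding.

```python
def decode_uri_values(raw_values, user_default=None):
    """Decode all values with one encoding.

    Returns (decoded dict, name of the encoding actually used).
    """
    first_choice = user_default or 'utf-8'
    try:
        decoded = {k: v.decode(first_choice) for k, v in raw_values.items()}
        return decoded, first_choice
    except UnicodeDecodeError:
        # iso-8859-1 maps every byte to a code point, so this cannot fail.
        decoded = {k: v.decode('iso-8859-1') for k, v in raw_values.items()}
        return decoded, 'iso-8859-1'
```

The second element of the result is what would end up in
wsgi.uri_encoding.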

For a pure Python WSGI server, which effectively only supports
mounting at the root of the site, this may apply to the whole site.
In Apache/mod_wsgi however, where one can easily apply configuration
to a subset of URLs using the Location directive, one could be more
selective. It gets more complicated when one talks about composition
of disparate WSGI components as part of an application stack.

Now, although having the configuration done outside of the WSGI
application and in the web server will not appeal to some, it may
still be a useful fallback where people don't want to have to fiddle
with WSGI middleware wrappers around their whole application or
around individual components to do it.

Anyway, there are multiple options here.

> Incidentally, if wsgi.uri_encoding is going to be the way to signal that the
> server has decoded bytes to characters using a known encoding, it should be
> stressed that this should only be set when that encoding is certain.
>
> That is, wsgi.uri_encoding should be omitted (or None?) in cases where
> another party has already decoded (and maybe mangled) the bytes using an
> unknown encoding. In particular, CGI.

Yes, it is known that CGI and Python 3.X will be a problem. There
have been a number of discussions which raised the CGI issues in the
past. This time around we were possibly ignoring it for the time being
so that CGI script compatibility wouldn't override our attempts to
make something that would work sanely for more up to date hosting
methods.

So, yes, having wsgi.uri_encoding be set to None where the encoding
cannot be determined would be sensible. It may be the case that in
such situations the only thing people can portably rely on is being
able to use ASCII. If they know for sure what is used, they could set
wsgi.uri_encoding themselves in a WSGI middleware wrapper around
their application. Alternatively, the CGI/WSGI adapter could provide
an option for the user to set it, in which case the adapter reports
the user's value but otherwise leaves the variables as they were.

> (In the case of Windows CGI the server will have decoded URI bytes into
> Unicode characters, using a charset which it is impossible to find out. In
> Apache it's iso-8859-1; in IIS it's UTF-8 as long as it was a valid UTF
> sequence, otherwise it's the system codepage. This problem affects the
> non-CGI implementation isapi_wsgi, too. Then the variables are read as
> environment variables, which for Python 2 means another encode/decode step
> on Windows using the system codepage, mangling non-codepage characters.
> Python 3 has the opposite problem reading byte envvars using UTF-8, which
> won't be how Apache put them there.)
>
> If wsgi.encoding is obligatory then in reality it will often be wrong,
> leaving us in the same pathetic predicament as with WSGI 1.0, where
> non-ASCII URIs don't work reliably at all.

I'll have to research this more, or at least the claims about Apache,
as I'm not entirely sure that is correct.

Whether surrogateescape gives a better solution I have no idea at this
point, as I haven't had a chance to delve into it enough to understand
it and no one has posted a good summary of it with actual descriptive
examples of how it would work for Python 2.X/3.X. The comments about
it have all assumed to a degree that you understand what it is in the
first place, which is slightly annoying. Can someone perhaps give such
a clear description with examples, or give a reference to where in the
long email chain in the Google Groups archive the dummies guide to
using surrogateescape in WSGI was posted?

Now, Armin was for some reason concerned about QUERY_STRING and
wsgi.uri_encoding after reading your email. I'm still not sure why.

In my original blog post I talked about QUERY_STRING being dealt with
along with SCRIPT_NAME and PATH_INFO as far as determining what
wsgi.uri_encoding would be. Armin pointed out that QUERY_STRING by
rights should only contain ASCII, and so doesn't need to come into
that and could be converted straight to unicode as ASCII, or possibly
ISO-8859-1, depending I think on which RFC you believe.

Even so, in PEP-XXX it says:

"""
For the keys ``SCRIPT_NAME``, ``PATH_INFO`` (and ``REQUEST_URI`` if
available but that variable will most likely only contain ASCII characters
because it is quoted) the server has to use the following algorithm for
decoding:

-   it decodes all values as `utf-8`.
-   if that fails, it decodes all values as `iso-8859-1`.

The encoding the server used to decode the
value is then stored in ``'wsgi.uri_encoding'``.  The application MUST use this
value to decode the ``'QUERY_STRING'`` as well."""

I.e., no mention of QUERY_STRING in the first part, but then it says
that QUERY_STRING must be decoded with that encoding as well. That
doesn't seem right, and in some respects QUERY_STRING can stand
distinct from SCRIPT_NAME and PATH_INFO, much as no special treatment
is being given to HTTP_COOKIE and HTTP_REFERER. If REQUEST_URI is
supposed to be ASCII as well, then shouldn't it be distinct too?

Thus, wsgi.uri_encoding would only apply to SCRIPT_NAME and PATH_INFO.
Although, when it comes down to just these two, also perhaps read my
concerns about different encodings being applied in each as per my
original blog post.

The problem which arises is that unquoting of URLs in the Python 3.X
stdlib can only be done on unicode strings. If the string contains
percent-encoded bytes that are not valid UTF-8, the unquoting can
go wrong.

>>> urllib.parse.parse_qsl('a=b%e0')
[('a', 'b�')]

Or at least it shoves in a replacement character to indicate that the
bytes were not valid UTF-8.

So, stdlib effectively forces UTF-8.

This seems to be a deficiency in the Python 3.X stdlib, and was
something I believed we already knew about.
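
As a rough sketch of how one could work around it (this is not
Robert's code, and parse_qsl_with_encoding is a name I made up): split
the query string by hand, unquote each part to bytes with
urllib.parse.unquote_to_bytes, and only then decode with whatever
encoding is wanted. It assumes '&' as the only separator and ignores
empty fields.

```python
from urllib.parse import unquote_to_bytes


def parse_qsl_with_encoding(query_string, encoding):
    """Split a query string and decode percent-escapes with `encoding`."""
    pairs = []
    for field in query_string.split('&'):
        if not field:
            continue
        name, _, value = field.partition('=')
        pairs.append(tuple(
            # '+' means space in form data; unquote to bytes before decoding.
            unquote_to_bytes(part.replace('+', ' ')).decode(encoding)
            for part in (name, value)
        ))
    return pairs
```

With this, parse_qsl_with_encoding('a=b%e0', 'iso-8859-1') yields the
pair ('a', 'bà') rather than a replacement character. Later Python 3
releases also accept encoding and errors keyword arguments on
urllib.parse.parse_qsl itself, which makes the hand-rolled version
unnecessary there.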

I think Robert said he already had some code to do this that would work.

Until Armin wakes up and explains what he saw about QUERY_STRING
that would break wsgi.uri_encoding, maybe someone can clarify how
QUERY_STRING is going to be handled if the stdlib doesn't work.

Graham