[Web-SIG] WSGI 2

Tue Aug 4 19:30:32 CEST 2009

On Mon, Aug 3, 2009 at 11:28 PM, Graham Dumpleton
<graham.dumpleton at gmail.com> wrote:
>> Mainly I'm wondering, what should the server do in the event they receive a
>> byte string which is not valid UTF-8?  (Latin-1 doesn't have this problem,
>> since there's no such thing as an invalid Latin-1 string, at least not at
>> the encoding level.)
>
> Can you clarify. We aren't talking about request content here. The
> wsgi.input stream is still binary and up to WSGI application to decode
> how it decides it should be decoded.

You could receive a request like
  GET /fran%E7ais
and if you then do:
  urllib.unquote('/fran%E7ais').decode('utf8')
you will get a UnicodeDecodeError, because %E7 is the Latin-1 byte for
"ç" and is not valid UTF-8 on its own.
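
For anyone reading this on Python 3, here is a rough equivalent of the
failure above. It is only a sketch: urllib.parse.unquote_to_bytes stands
in for the Python 2 urllib.unquote call, and the variable names are mine.
It also shows that the same bytes always decode as Latin-1.

```python
from urllib.parse import unquote_to_bytes

# Undo the %-escaping but keep the result as raw bytes.
raw = unquote_to_bytes('/fran%E7ais')

# 0xE7 starts a multi-byte UTF-8 sequence, but "a" is not a valid
# continuation byte, so strict UTF-8 decoding fails.
try:
    raw.decode('utf-8')
    utf8_ok = True
except UnicodeDecodeError:
    utf8_ok = False

# Latin-1 maps every byte to a code point, so this cannot fail.
latin1_path = raw.decode('latin-1')
```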

So what should the server do?  Obviously anyone at any time can embed
<a href="/fran%E7ais"> in a document, and the browser is not going to
try to figure out that encoding, it's just going to follow that URL.

From my testing (in
http://svn.colorstudy.com/home/ianb/wsgi-unicode-test.py) the browser
will be consistent about UTF8 when it does the encoding itself; but it
doesn't necessarily do the encoding itself.  QUERY_STRING will *not*
necessarily be UTF8, even when the path is UTF8 (but this doesn't
matter for us, because QUERY_STRING doesn't get url-decoded, so it's
just ASCII with %-encoding).
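
To illustrate why QUERY_STRING is less of a problem: since it arrives
still %-encoded, the application can pick the encoding when it parses it.
A minimal sketch using Python 3's urllib.parse.parse_qs; the parameter
name and value are made up for the example.

```python
from urllib.parse import parse_qs

# Still %-encoded, so the string itself is plain ASCII.
query_string = 'name=fran%E7ais'
is_ascii = all(ord(ch) < 128 for ch in query_string)

# The application chooses how to interpret the escaped bytes.
as_latin1 = parse_qs(query_string, encoding='latin-1')

# Guessing UTF-8 does not raise here, because errors='replace' swaps
# the undecodable byte for U+FFFD instead.
as_utf8 = parse_qs(query_string, encoding='utf-8', errors='replace')
```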

> The only related thing I can think you are talking about is the form
> target URL, which is an issue for GET and POST requests, or other
> method types, from a form.
>
>>> Also shown though that SCRIPT_NAME part has to be UTF-8
>>> and we would really be entering fantasy land if you were somehow going
>>> to cope with some different encoding for PATH_INFO and QUERY_STRING.
>>> Instead it is like the GPL, viral in nature. Use of UTF-8 in one
>>> particular area means you are effectively bound to use UTF-8
>>> everywhere else.
>>
>> I'm not clear on your logic here.  If I request foo/bar/baz (where baz
>> actually has an accent over the 'a') in latin-1 encoding, and foo/bar is the
>> script, then the (accented) baz is legitimate for pass-through to the
>> application, no?
>
> Technically, but what I am pointing out is that Apache pretty well
> says that foo/bar needs to be UTF-8. If you are going to have
> different parts of the one URL needing a different encoding to be
> understood, personally I would say you are asking for trouble. So, I am
> saying that UTF-8 really needs to apply more widely, for the sake of
> sanity and portability.

Apache's limitations can't be encoded into WSGI.  Yes, it won't work
with Apache (I guess, though with ProxyPass / or something, is this a
problem?) -- but the idea of mapping request paths to files has
nothing to do with WSGI.

>> I just tried testing this with Firefox and Apache, and found that you can in
>> fact pass such Latin-1 strings through to PATH_INFO, but at least in the
>> case of Firefox, you have to %-escape them.  However, they are seen by
>> Python (via os.environ) as latin-1 encoded byte strings.
>
> By using % escapes you are in practice overriding the encoding that
> the browser may be applying to URL if given raw character? What
> happens if you were to paste the accented character direct into the
> browser URL bar? Browsers I have played with would normally
> automatically translate that as UTF-8 and send it as such, with %
> encoding as necessary.

Correct; the browser encodes non-ASCII characters as UTF8, but does
not try to inspect the encoding of already %-encoded characters.
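
As a rough model of that behaviour (not how any particular browser is
implemented), urllib.parse.quote with "%" marked as safe does the same
thing: raw non-ASCII characters become UTF-8 escapes, and existing
escapes pass through untouched.

```python
from urllib.parse import quote

# A raw non-ASCII character is encoded as UTF-8 and then %-escaped.
typed = quote('/français', safe='/%')

# An existing %-escape is left alone, even though %E7 is Latin-1
# rather than UTF-8; nothing inspects what the escaped byte means.
already_escaped = quote('/fran%E7ais', safe='/%')
```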

> So I guess the problem is more where URLs are already % encoded when
> coming back as href or form action because they may be in an encoding
> incompatible with UTF-8 if it were to be clicked on.
>
>>> Further example of why UTF-8 reaches into everything is mod_rewrite
>>> module for Apache. This allows you to do stuff related to SCRIPT_NAME,
>>> PATH_INFO and QUERY_STRING parts of a URL. Already shown that Apache
>>> configuration file has to be UTF-8. If URL isn't, then wouldn't be
>>> possible to perform matches against non latin-1 characters in a
>>> rewrite condition or rule. This is because your match string would be
>>> in different encoded form to that in URL and so wouldn't match.
>>
>> Note that this still doesn't have any impact on the bytes that actually
>> reach the application, which can be non-UTF8.  At minimum, the proposal is
>> underspecified as to how to handle this case, which is as trivial to
>> generate as sticking a %-escape in the PATH_INFO or QUERY_STRING portion(s)
>> of a URL.
>
> The Apache server at least will decode those % escape sequences, and I
> believe it is the result of that which is used in stuff like rewrite
> rule matches, not the raw URL. The only exception would be if a rewrite
> rule explicitly matched against the REQUEST_URI variable, which still
> contains % escape sequences. So if they are not in UTF-8, that means
> effectively that you can't match them with Apache rewrite rules.