[Web-SIG] WSGI Amendments thoughts: the horror of charsets

Fri Nov 14 18:47:50 CET 2008

Andrew Clover wrote:
> Ian Bicking wrote:
> 
>> As it is (in Python 2), you should do something like 
>> environ['PATH_INFO'].decode('utf8') and it should work.
> 
> See the test cases in my original post: this doesn't work universally. 
> On WinNT platforms PATH_INFO has already gone through a decode/encode 
> cycle which almost always irretrievably mangles the value.

This is a problem with CGI on NT, with whatever server you are using, 
and perhaps with the CGI adapter (maybe there's a way to get the raw 
environment without any encoding, for example?) -- it's mostly 
irrelevant to WSGI itself.

>> My understanding of this suggestion is that latin-1 is a way of 
>> representing bytes as unicode. In other words, the values will be 
>> unicode, but that will simply be a lie.
> 
> Yes, that would be a sensible approach, but it is not what is actually 
> happening in any WSGI environment I have tested. For example 
> wsgiref.simple_server decodes using UTF-8 not 8859-1 — or would do, if 
> it were working. (It is currently broken in 3.0rc2; I put a hack in to 
> get it running but I'm not really sure what the current status of 
> simple_server in 3.0 is.)

As far as I know, PJE just made the suggestion about Latin-1; I don't 
know if anything has actually been done in wsgiref or elsewhere to 
implement it.  Honestly, I don't know if anyone is doing anything with 
WSGI and Python 3.
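
To make the Latin-1 suggestion concrete, here is a minimal sketch of 
what an application would do under it, assuming the server has decoded 
the raw bytes of PATH_INFO as Latin-1.  The helper name 
path_info_as_text and the default charset are hypothetical, not part 
of any spec.

```python
def path_info_as_text(environ, charset='utf-8'):
    """Recover real text from a Latin-1 'bytes as unicode' PATH_INFO.

    Assumes the server decoded the raw request bytes as Latin-1, so
    encoding back to Latin-1 returns exactly those bytes.
    """
    raw = environ.get('PATH_INFO', '').encode('latin-1')
    # Now decode with the charset the application actually expects.
    return raw.decode(charset)


# A UTF-8 path as it would appear under the Latin-1 convention.
environ = {'PATH_INFO': b'/caf\xc3\xa9'.decode('latin-1')}
text = path_info_as_text(environ)
```

The point is only that the Latin-1 step is reversible, so the choice 
of real charset stays with the application.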

>> A lot of what you write about has to do with CGI, which is the only 
>> place WSGI interacts with os.environ.  CGI is really an aspect of the 
>> CGI to WSGI adapter (like wsgiref.handlers.CGIHandler), and not the 
>> WSGI spec itself.
> 
> Indeed, but we naturally have to take into account implementability on 
> CGI. If a WSGI spec *requires* PATH_INFO to have been obtained using 
> 8859-1 decoding — or UTF-8, which is the other sensible option given 
> that most URIs today are UTF-8 — then there cannot be a fully-compliant 
> CGI-to-WSGI wrapper. Perhaps it's not the big issue it was when WSGI was 
> first getting off the ground, but IMO it's still important.

This will presumably require hacks that might be system-dependent. 
Probably the current CGI adapter will just have to be a bit more 
complicated.  Also, if Python is UTF-8-decoding the environment, we'll 
just have to bypass that decoding entirely, as you can't reliably undo 
a UTF-8 decode.  I assume there is some way to get at the bytes in the 
environment; if not, then that is a Python 3 bug.
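
A small illustration of why a UTF-8 decode can't be undone when the 
input wasn't valid UTF-8.  The byte string is an invented example, and 
the 'replace' error handler is an assumption about how a lenient 
decoder might behave.

```python
raw = b'/caf\xe9'  # a Latin-1 encoded path; not valid UTF-8

# A lenient UTF-8 decode substitutes U+FFFD for the bad byte,
# so the original byte is gone for good.
lossy = raw.decode('utf-8', 'replace')
utf8_round_trip = lossy.encode('utf-8')

# Latin-1 maps every byte to one code point, so it always round-trips.
latin1_round_trip = raw.decode('latin-1').encode('latin-1')
```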

>> Personally I'm more inclined to set up a policy on the WSGI server 
>> itself with respect to the encoding, and then use real unicode 
>> characters.
> 
> I think we are stuck with Unicode environ at this point, given the CGI 
> issue. But applications do need to know about the encoding in use, 
> because they will (typically) be generating their own links. So an 
> optional way to get that information to the application would be 
> advantageous.

The encoding of the operating system (which presumably informs the 
encoding of os.environ) has nothing to do with the encoding of the web 
application.  For the CGI adapter we simply need to find a way to ignore 
the system encoding.
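
One possible way for a CGI adapter to ignore the system encoding, 
sketched under the assumption of Python 3.2 or later, where 
os.fsencode() exists and reverses the decoding applied to os.environ. 
The function name is hypothetical.

```python
import os


def environ_value_as_latin1(value):
    """Re-express a decoded os.environ value as Latin-1 'bytes as unicode'.

    os.fsencode() turns the str back into the bytes the OS supplied;
    decoding those as Latin-1 keeps every byte intact.
    """
    raw = os.fsencode(value)
    return raw.decode('latin-1')
```

An adapter would apply this to values such as os.environ['PATH_INFO'] 
before building the WSGI environ.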

> I'm now of the opinion that the best way to do this is to standardise 
> Apache's ‘REQUEST_URI’ as an optional environ item. This header is 
> pre-URI-decoding, containing only %-sequences and not real high bytes, 
> so it can be decoded to Unicode using any old charset without worry.

Unfortunately REQUEST_URI doesn't map directly to SCRIPT_NAME/PATH_INFO. 
I think it might be feasible to support an encoded version of 
SCRIPT_NAME and PATH_INFO for WSGI 2.0 (creating entirely new key names, 
and I don't know of any particular standard to base those names on), 
but moving from the two keys to a single REQUEST_URI is not feasible.

It's not that trivial to figure out where in REQUEST_URI the 
SCRIPT_NAME/PATH_INFO boundary really is, as there are many ways the 
unencoded values could be encoded.  I guess you'd probably count 
segments, try to catch %2f (where the segments won't match up), and then 
double-check that the decoded REQUEST_URI matches SCRIPT_NAME+PATH_INFO.
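
A rough sketch of that segment-counting approach.  It assumes 
SCRIPT_NAME and PATH_INFO follow the Latin-1 "bytes as unicode" 
convention and that the server has not otherwise normalized the path 
(collapsed slashes, resolved "..", and so on).  The function name is 
hypothetical.

```python
from urllib.parse import unquote


def split_request_uri(request_uri, script_name, path_info):
    """Return (raw_script_name, raw_path_info), or None if unsure."""
    raw_path = request_uri.split('?', 1)[0]  # drop the query string

    # Double-check that the decoded URI matches SCRIPT_NAME+PATH_INFO.
    if unquote(raw_path, encoding='latin-1') != script_name + path_info:
        return None

    # Count segments: SCRIPT_NAME with n slashes covers n segments.
    n = script_name.count('/')
    raw_script = '/'.join(raw_path.split('/')[:n + 1])
    raw_info = raw_path[len(raw_script):]

    # An encoded slash (%2f) shifts the segments; catch the mismatch.
    if unquote(raw_script, encoding='latin-1') != script_name:
        return None
    return raw_script, raw_info


ok = split_request_uri('/app/caf%C3%A9?x=1', '/app', '/caf\xc3\xa9')
shifted = split_request_uri('/a%2Fb/c', '/a/b', '/c')
```

Returning None rather than guessing lets the caller fall back to the 
plain SCRIPT_NAME and PATH_INFO values.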

> An application wanting to support Unicode URIs (or encoded slashes in 
> URIs*) could then sniff for REQUEST_URI and use it in preference to 
> PATH_INFO where available. This is a bit more work for the application, 
> but it should generally be handled transparently by a library/framework 
> and supporting PATH_INFO in a portable fashion already has warts thanks 
> to IIS's bugs, so the situation is not much worse than it already is.

I use the distinction between SCRIPT_NAME and PATH_INFO extensively. 
And frankly IIS is probably less relevant to most developers than CGI. 
Anyway, any of these bugs are things that need to be fixed in the WSGI 
adapter; we must not let them propagate into the specification or 
applications.  So if IIS has problems with PATH_INFO, the WSGI adapter 
(be it CGI or otherwise) should be configured to fix those problems up 
front.
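
As a sketch of fixing things up front, here is a tiny wrapper an 
adapter could apply before the application sees the environ.  It 
assumes, purely for illustration, a server that repeats SCRIPT_NAME at 
the start of PATH_INFO; the wrapper name is hypothetical, and the real 
fix depends on the server's actual behavior.

```python
def fix_path_info(app):
    """Wrap a WSGI app so it sees a corrected PATH_INFO."""
    def wrapper(environ, start_response):
        script_name = environ.get('SCRIPT_NAME', '')
        path_info = environ.get('PATH_INFO', '')
        # Assumed quirk: PATH_INFO wrongly begins with SCRIPT_NAME.
        if script_name and path_info.startswith(script_name):
            environ['PATH_INFO'] = path_info[len(script_name):]
        return app(environ, start_response)
    return wrapper
```

This should only be enabled for servers known to have the quirk, since 
a legitimate PATH_INFO can also begin with the SCRIPT_NAME text.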

> And of course we get support through mod_cgi and mod_wsgi automatically, 
> so Graham doesn't have to do anything. :-)
> 
> Graham Dumpleton wrote:
> 
>> I can't really remember what the outcome of the discussion was.
> 
> Not too much outcome really, unfortunately! You concluded:
> 
>> there possibly still is an open question there on how
>> encoding of non ascii characters works in practice. We just need to
>> do some actual tests to see what happens and whether there is a problem. 
> 
> ...to which the answer is — judging by the results posted — probably 
> “yes”, I'm afraid!

-- 
Ian Bicking : ianb at colorstudy.com : http://blog.ianbicking.org