[Web-SIG] WSGI Amendments thoughts: the horror of charsets

Thu Nov 13 00:24:54 CET 2008

Andrew Clover wrote:
> If we could reliably read the bytes the browser sends to us in the GET 
> request that would be great, we could just decode those and be done with 
> it. Unfortunately, that's not reliable, because:
> 
> 1. thanks to an old wart in the CGI specification, %XX hex escapes are 
> decoded before the character is put into the PATH_INFO environment 
> variable;

I don't see a problem with this?  At least not a problem with respect to 
encoding.  As it is (in Python 2), you should do something like 
environ['PATH_INFO'].decode('utf8') and it should work.  It doesn't seem 
like there's any distinction between %-encoded characters and plain 
characters in this situation.
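
A minimal sketch of that point, written in Python 3 syntax with bytes 
standing in for the Python 2 str.  The two request paths are made-up 
examples, and urllib.parse.unquote_to_bytes only stands in for the 
unescaping a CGI-style gateway does before it fills in PATH_INFO:

```python
from urllib.parse import unquote_to_bytes

# Two hypothetical request paths that differ only in how the "e with
# acute accent" was sent: %-escaped, or as raw UTF-8 bytes.
escaped = b"/caf%C3%A9"
raw = b"/caf\xc3\xa9"

# Stand-in for the gateway removing %XX escapes before setting PATH_INFO.
path_a = unquote_to_bytes(escaped)
path_b = unquote_to_bytes(raw)

# After unescaping, the two are byte-for-byte identical, so a single
# decode step handles both.
assert path_a == path_b
decoded = path_a.decode("utf8")
```

Either way the application sees the same bytes, so the escaping itself 
adds no extra encoding ambiguity.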

> 2. the environment variables may be stored as Unicode.
> 
> (1) on its own gives us the problem of not being able to distinguish a 
> path-separator slash from an encoded %2F; a long-known problem but not 
> one that greatly affects most people.
> 
> But combined with (2) that means some other component must choose how to 
> decode the bytes into Unicode characters. No standard currently 
> specifies what encoding to use, it is not typically configurable, and 
> it's certainly not within reach of the WSGI application. My assumption 
> is that most applications will want to end up with UTF-8-encoded URLs; 
> other choices are certainly possible but as we move towards IRI they 
> become less likely.
> 
> 
> This situation previously affected only Windows users, because NT 
> environment variables are native Unicode. However, Python 3.0 specifies 
> all environment variable access is through a Unicode wrapper, and gives 
> no way to control how that automatic decoding is done, leaving everyone 
> in the same boat.
> 
> WSGI Amendments_1.0 includes a suggestion for Python 3.0 that environ 
> should be "decoded from the headers using HTTP standard encodings (i.e. 
> latin-1 + RFC 2047)", but unfortunately this doesn't quite work:

My understanding of this suggestion is that latin-1 is a way of 
representing bytes as unicode: each byte maps to the code point with the 
same value.  In other words, the values will be unicode, but that will 
simply be a lie.  So if you know you have UTF-8 paths, you'd do:

path_info = environ['PATH_INFO'].encode('latin-1').decode('utf8')

As far as I can tell this is simply to avoid having bytes in the 
environment, even though bytes are an accurate representation and 
unicode is not.
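
A rough illustration of that round trip, assuming a server that decodes 
the raw request bytes as latin-1 before putting them in environ.  The 
path and the environ dict here are invented for the example:

```python
# Hypothetical bytes as they arrived on the wire (UTF-8 encoded path).
wire_bytes = "/caf\u00e9".encode("utf8")

# What a latin-1-decoding server would hand to the application:
# a unicode string whose code points are really just byte values.
environ = {"PATH_INFO": wire_bytes.decode("latin-1")}

# The "lie": the value is unicode, but it is not the text that was sent.
assert environ["PATH_INFO"] != "/caf\u00e9"

# Undo the latin-1 step to get the bytes back, then decode them properly.
path_info = environ["PATH_INFO"].encode("latin-1").decode("utf8")
assert path_info == "/caf\u00e9"

# latin-1 covers all 256 byte values, so the round trip never loses data.
all_bytes = bytes(range(256))
assert all_bytes.decode("latin-1").encode("latin-1") == all_bytes
```

The round trip is lossless, but the application has to know to do it, 
and has to know which encoding the bytes were really in.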

A lot of what you write about has to do with CGI, which is the only 
place WSGI interacts with os.environ.  That is really a concern of the 
CGI-to-WSGI adapter (like wsgiref.handlers.CGIHandler), and not of the 
WSGI spec itself.

Personally I'm more inclined to set up a policy on the WSGI server 
itself with respect to the encoding, and then use real unicode 
characters.  Unfortunately that's not as flexible as bytes, as it 
doesn't make it very easy to sniff out the encoding in 
application-specific ways, or support different encodings in different 
parts of the server (which would be useful if, for instance, you were to 
proxy applications with unknown encodings).  So... maybe that's not the 
most feasible option.  But if it's not, then I'd rather stick with bytes.
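
To make the trade-off concrete, here is a sketch of the two approaches.  
Both function names and the candidate encoding list are hypothetical; 
nothing like them is part of WSGI or any server:

```python
def decode_with_policy(raw_path, encoding="utf8"):
    """Server-side policy: one configured encoding for every request."""
    return raw_path.decode(encoding)


def sniff_and_decode(raw_path, candidates=("utf8", "latin-1")):
    """Application-side sniffing, which is only possible with raw bytes.

    Tries each candidate encoding in order and returns the decoded text
    together with the encoding that worked.
    """
    for encoding in candidates:
        try:
            return raw_path.decode(encoding), encoding
        except UnicodeDecodeError:
            continue
    raise ValueError("no candidate encoding matched")


# A hypothetical path from a proxied application that uses latin-1.
legacy_path = b"/caf\xe9"

# The fixed policy cannot cope with it...
try:
    decode_with_policy(legacy_path)
    policy_failed = False
except UnicodeDecodeError:
    policy_failed = True

# ...while an application holding the bytes can fall back.
text, used = sniff_and_decode(legacy_path)
```

With a fixed server policy the failure happens before the application 
ever sees the request; with bytes the application can decide for itself.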

-- 
Ian Bicking : ianb at colorstudy.com : http://blog.ianbicking.org