[Web-SIG] WSGI Amendments thoughts: the horror of charsets
Ian Bicking
ianb at colorstudy.com
Thu Nov 13 00:24:54 CET 2008
Andrew Clover wrote:
> If we could reliably read the bytes the browser sends to us in the GET
> request that would be great, we could just decode those and be done with
> it. Unfortunately, that's not reliable, because:
>
> 1. thanks to an old wart in the CGI specification, %XX hex escapes are
> decoded before the character is put into the PATH_INFO environment
> variable;
I don't see a problem with this, at least not with respect to encoding.
As it is (in Python 2), you can do something like
environ['PATH_INFO'].decode('utf8') and it should work. There doesn't
seem to be any distinction between %-encoded characters and plain
characters in this situation.
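
To illustrate, here is a small Python 3 sketch (the sample paths are made up, and urllib.parse.unquote_to_bytes stands in for whatever unescaping the server does). Once the %XX escapes are undone, the bytes are the same whether the client escaped a character or sent it literally, so one UTF-8 decode covers both:

```python
from urllib.parse import unquote_to_bytes

# Two hypothetical request paths: one with the "é" percent-encoded,
# one with the raw UTF-8 bytes sent as-is.
escaped = b"/caf%C3%A9"
literal = "/café".encode("utf-8")

# Roughly what a CGI-style server does before filling in PATH_INFO.
path_a = unquote_to_bytes(escaped)
path_b = unquote_to_bytes(literal)

# Both forms collapse to identical bytes, so a single decode handles both.
same_bytes = path_a == path_b
decoded = path_a.decode("utf-8")
print(same_bytes, decoded)
```

The same collapse is what makes %2F indistinguishable from a real slash, but that is a path-structure problem rather than an encoding problem.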
> 2. the environment variables may be stored as Unicode.
>
> (1) on its own gives us the problem of not being able to distinguish a
> path-separator slash from an encoded %2F; a long-known problem but not
> one that greatly affects most people.
>
> But combined with (2) that means some other component must choose how to
> decode the bytes into Unicode characters. No standard currently
> specifies what encoding to use, it is not typically configurable, and
> it's certainly not within reach of the WSGI application. My assumption
> is that most applications will want to end up with UTF-8-encoded URLs;
> other choices are certainly possible but as we move towards IRI they
> become less likely.
>
>
> This situation previously affected only Windows users, because NT
> environment variables are native Unicode. However, Python 3.0 specifies
> all environment variable access is through a Unicode wrapper, and gives
> no way to control how that automatic decoding is done, leaving everyone
> in the same boat.
>
> WSGI Amendments_1.0 includes a suggestion for Python 3.0 that environ
> should be "decoded from the headers using HTTP standard encodings (i.e.
> latin-1 + RFC 2047)", but unfortunately this doesn't quite work:
My understanding of this suggestion is that latin-1 is a way of
representing bytes as unicode. In other words, the values will be
unicode, but that will simply be a lie. So if you know you have UTF8
paths, you'd do:
path_info = environ['PATH_INFO'].encode('latin-1').decode('utf8')
As far as I can tell this is simply to avoid having bytes in the
environment, even though bytes are an accurate representation and
unicode is not.
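
A rough Python 3 sketch of why this works, assuming a server that follows the suggestion and a UTF-8 path (the environ dict and sample path are invented for illustration). Latin-1 maps every byte value to the code point with the same number, so the round trip loses nothing:

```python
# Every byte value 0..255 maps to the code point with the same number
# under latin-1, so decoding and re-encoding is lossless.
all_bytes = bytes(range(256))
round_trip_ok = all_bytes.decode("latin-1").encode("latin-1") == all_bytes

# Hypothetical environ as such a server might build it: raw UTF-8 path
# bytes carried inside a str by way of latin-1.
raw_path = "/café".encode("utf-8")
environ = {"PATH_INFO": raw_path.decode("latin-1")}

# The str in environ is not the real text until the application
# reverses the latin-1 step and applies the encoding it actually expects.
mangled = environ["PATH_INFO"]
path_info = mangled.encode("latin-1").decode("utf-8")
print(repr(mangled), repr(path_info))
```

The intermediate value is a str, but its characters are only stand-ins for bytes; using it directly as text gives the wrong result for anything outside ASCII.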
A lot of what you write about has to do with CGI, which is the only
place WSGI interacts with os.environ. That is really a concern of the
CGI-to-WSGI adapter (like wsgiref.handlers.CGIHandler), and not of the
WSGI spec itself.
Personally I'm more inclined to set up a policy on the WSGI server
itself with respect to the encoding, and then use real unicode
characters. Unfortunately that's not as flexible as bytes, as it
doesn't make it very easy to sniff out the encoding in
application-specific ways, or support different encodings in different
parts of the server (which would be useful if, for instance, you were to
proxy applications with unknown encodings). So... maybe that's not the
most feasible option. But if it's not, then I'd rather stick with bytes.
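
To make the tradeoff concrete, a minimal Python 3 sketch. Both function names are hypothetical and not part of WSGI or any server; the candidate list is just an example. A fixed server-side policy fails on a path in another encoding, while an application that still has the bytes can try alternatives:

```python
def decode_with_policy(raw_path, encoding="utf-8"):
    """Server-side policy: one fixed encoding for every request."""
    return raw_path.decode(encoding)


def sniff_and_decode(raw_path, candidates=("utf-8", "latin-1")):
    """Application-side sniffing, possible only when bytes are available."""
    for encoding in candidates:
        try:
            return raw_path.decode(encoding), encoding
        except UnicodeDecodeError:
            continue
    # Not reachable with the default candidates, since latin-1 accepts
    # any byte sequence; kept for caller-supplied candidate lists.
    raise ValueError("no candidate encoding fits")


utf8_path = "/café".encode("utf-8")
legacy_path = "/café".encode("latin-1")  # e.g. from a proxied legacy app

policy_ok = decode_with_policy(utf8_path)
try:
    decode_with_policy(legacy_path)
    policy_failed = False
except UnicodeDecodeError:
    policy_failed = True

sniffed = sniff_and_decode(legacy_path)
print(policy_ok, policy_failed, sniffed)
```

Trying UTF-8 first and falling back is only a heuristic; it cannot tell apart encodings that both accept the same bytes, which is part of why a per-application choice matters.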
--
Ian Bicking : ianb at colorstudy.com : http://blog.ianbicking.org