[Web-SIG] WSGI Amendments thoughts: the horror of charsets

Andrew Clover and-py at doxdesk.com
Mon Nov 17 18:54:24 CET 2008


Mark Hammond wrote:

> I don't think Python explicitly converts it - the CRT's ANSI version
> of environ is used

Yes, it would be the CRT on Python 2.x. (Python 3.0 on non-NT always 
converts using UTF-8, if I'm reading convertenviron right.)

> so the resulting strings should be encoded using the 'mbcs' encoding.
> What mangling do you see?

Correct, it's characters unencodable in mbcs that are lost*. mbcs is 
never equivalent to UTF-8 (which would allow us to recover characters on 
IIS) or ISO-8859-1 (which would allow us to recover characters on 
Apache-for-Windows) so there's always heavy lossage.

(* - replaced with ? or Windows's attempt to substitute something that 
looks vaguely like the original character.)

> win32api and ctypes would both let you call the Windows API.

Ah! I had considered the win32 extensions but it's a bit of a 
dependency... I'd forgotten that we get ctypes for free in 2.5.

So we'd be looking at:

     ctypes.windll.kernel32.GetEnvironmentVariableW(u'PATH_INFO', ...)

when CPython 2.5+/NT is detected, right? That increases the number of 
situations in which we can feasibly recover URIs that are valid UTF-8 
sequences (modulo the slash anyway). Doing the actual recovery still 
requires some server-sniffing though.
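
A minimal sketch of that lookup, written for current Python rather than 
2.5. The function name get_unicode_environ and the fallback to os.environ 
off Windows are my own assumptions for illustration; only 
GetEnvironmentVariableW itself comes from the Windows API.

```python
import ctypes
import os
import sys


def get_unicode_environ(name):
    """Return an environment variable as text, or None if it is unset.

    Hypothetical helper: on Windows it asks kernel32 for the wide-character
    value directly, bypassing the CRT's ANSI copy of the environment.
    """
    if sys.platform != "win32":
        # Assumption: elsewhere there is no ANSI/wide split to work around,
        # so os.environ is used as-is.
        return os.environ.get(name)

    kernel32 = ctypes.WinDLL("kernel32", use_last_error=True)
    get_var = kernel32.GetEnvironmentVariableW
    get_var.argtypes = [ctypes.c_wchar_p, ctypes.c_wchar_p, ctypes.c_uint32]
    get_var.restype = ctypes.c_uint32

    # With a zero-length buffer the call reports the size needed, in
    # characters, including the terminating null.
    size = get_var(name, None, 0)
    if size == 0:
        # Treated as "unset" here; a stricter version would inspect
        # ctypes.get_last_error() to tell unset from empty.
        return None

    buf = ctypes.create_unicode_buffer(size)
    get_var(name, buf, size)
    return buf.value
```

Calling get_unicode_environ('PATH_INFO') would then stand in for 
os.environ.get('PATH_INFO') in a CGI handler. What the server put into 
that variable is still server-specific.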

> What is IIS doing wrong here?

It's not wrong as such. There are three reasonable choices for decoding 
header values before putting them in a Unicode environment, and the CGI 
spec, as it knows nothing about Unicode environment variables, fails to 
specify which:

     1. ISO-8859-1 (which ensures bytes can be recovered)
     2. UTF-8 (since most URIs are effectively UTF-8 today)
     3. Configured system codepage (mbcs)

Apache [with mod_cgi or mod_wsgi] decides on (1). IIS tries for (2), 
falling back to (3) on invalid sequences. The text concerning Python 3.0 
in the WSGI Amendments page could be read as blessing Apache's behaviour.

However wsgiref.simple_server currently also goes for (2), although that 
probably can't be considered canonical. I'd be interested to know what 
other WSGI servers do.
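
To illustrate why choice (1) keeps the bytes recoverable, here is a rough 
sketch of what an application could do with a value decoded that way. The 
name recover_path is hypothetical, and it assumes the server really did 
use ISO-8859-1; it is not taken from any WSGI server or from the 
Amendments page.

```python
def recover_path(value):
    """Re-decode a value that the server decoded as ISO-8859-1.

    Hypothetical helper: tries UTF-8 on the recovered bytes and keeps the
    ISO-8859-1 reading when they are not a valid UTF-8 sequence.
    """
    try:
        # ISO-8859-1 maps each byte to the code point of the same number,
        # so encoding gives back exactly the bytes the server received.
        raw = value.encode("iso-8859-1")
    except UnicodeEncodeError:
        # Characters above U+00FF mean the server did not use choice (1);
        # leave the value alone rather than guess.
        return value

    try:
        return raw.decode("utf-8")
    except UnicodeDecodeError:
        return raw.decode("iso-8859-1")


# A UTF-8 path as it would look after the server's ISO-8859-1 decoding.
as_seen_by_app = "/caf\u00e9".encode("utf-8").decode("iso-8859-1")
recovered = recover_path(as_seen_by_app)
```

The same trick does not work for (2) with an mbcs fallback or for (3), 
since characters that were replaced during decoding cannot be mapped back 
to the original bytes.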

-- 
And Clover
mailto:and at doxdesk.com
http://www.doxdesk.com/

