[Web-SIG] WSGI Amendments thoughts: the horror of charsets

Andrew Clover and-py at doxdesk.com
Fri Nov 14 22:23:35 CET 2008


Ian Bicking wrote:

> This is something messed up with CGI on NT, and whatever server you are 
> using, and perhaps the CGI adapter (maybe there's a way to get the raw 
> environment without any encoding, for example?)

Python decodes the environ to its own copy (wrapped in os.environ) at 
interpreter startup time; there's no way to query the real ‘live’ 
environment that I know of. It'd require a C extension.

> Honestly I don't know if anyone is doing anything with 
> WSGI and Python 3.

I know Graham has done some work on mod_wsgi for 3.0, but no, I don't 
know anyone using it in anger.

Is it worth submitting patches to simple_server to make it run on 3.0? 
Is it too late to include at this stage anyway? Shipping 3.0 with a 
non-functional wsgiref is a bit embarrassing.

> I assume there is some way to get at the bytes in the environment, if not 
> then that is a Python 3 bug.

There is not, and this appears to be deliberate.

> I think it might be feasible to support an encoded version of 
> SCRIPT_NAME and PATH_INFO for WSGI 2.0 (creating entirely new key names, 
> and I don't know of any particular standard to base those names on),
> moving from the two keys to a single REQUEST_URI is not feasible.

That's certainly a possibility, but I feel it's easier to hitch a ride 
on the existing REQUEST_URI variable, which despite being non-standard 
is still quite widely used.

> I guess you'd probably count segments, try to catch %2f (where the
> segments won't match up), and then double check that the decoded
> REQUEST_URI matches SCRIPT_NAME+PATH_INFO.

I'm currently testing with just the segment counting. It's only 
necessary that the segments from SCRIPT_NAME are matched and stripped, 
and those are extremely unlikely to contain ‘%2F’ because:

   - there aren't many filesystems that can accept ‘/’ as a filename
     character. RISC OS is the only one I can think of, and it by
     convention swaps ‘/’ and ‘.’ to compensate as it is, so even
     there you couldn't use ‘%2F’;
   - there aren't many webservers that can map a file or alias to a
     path containing ‘%2F’;
   - no-one wants to mount a webapp alias at such a weird name — it's
     only in the section corresponding to PATH_INFO that ‘%2F’ might
     ever be of use in practice.

In the worst case, many applications already know and can strip the URL 
at which they're mounted, but unless there's a legitimate ‘%2F’ in their 
SCRIPT_NAME it doesn't actually matter.
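
For illustration, the segment counting could look roughly like the sketch 
below. This is not the code I'm testing with, just a minimal version of the 
idea; the function name split_request_uri is made up, and it assumes 
REQUEST_URI holds the raw, still percent-encoded path (optionally followed 
by a query string) while SCRIPT_NAME is the already-decoded value from the 
server.

```python
def split_request_uri(request_uri, script_name):
    """Split a raw REQUEST_URI path by counting SCRIPT_NAME segments.

    Returns (raw_script_name, raw_path_info), both still percent-encoded.
    Hypothetical helper: assumes SCRIPT_NAME contains no encoded '/'.
    """
    # Drop the query string; only the path part matters here.
    raw_path = request_uri.split('?', 1)[0]

    # Number of '/'-separated segments in the decoded SCRIPT_NAME:
    # '' has 0, '/app' has 1, '/cgi-bin/app.py' has 2.
    n = script_name.count('/')

    # parts[0] is the empty string before the leading '/'.
    parts = raw_path.split('/')

    raw_script = '/'.join(parts[:n + 1])
    raw_info = '/'.join(parts[n + 1:])
    if len(parts) > n + 1:
        # Restore the '/' that separated the two halves.
        raw_info = '/' + raw_info
    return raw_script, raw_info


if __name__ == '__main__':
    print(split_request_uri('/cgi-bin/app.py/a%2Fb/c?x=1', '/cgi-bin/app.py'))
```

Because only the SCRIPT_NAME segments are counted, a ‘%2F’ in the PATH_INFO 
part survives untouched and the application can decode it however it likes.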

> frankly IIS is probably less relevant to most developers than CGI. 

Er... really?

You and I may not favour it, but it's ≈35% of the world out there, not 
something we can afford to ignore IMO.

> So if IIS has problems with PATH_INFO, the WSGI adapter 
> (be it CGI or otherwise) should be configured to fix those problems up 
> front.

What I'm saying is that neither Apache's nor IIS's behaviour can be 
considered clearly correct or wrong at this point, and there is no way a 
WSGI adapter living underneath them *can* fix up the differences.

(There is a problem with PATH_INFO that a WSGI adapter *could* clear 
up, which is that IIS makes PATH_INFO the entire path including 
SCRIPT_NAME. I'm not sure whether it's worth fixing that up in the 
adapter layer though... it's possible some frameworks are already 
dealing with it, and might even be relying on it!)
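
If an adapter did want to do that fix-up, a rough sketch might be as 
follows. The function name fix_iis_path_info is hypothetical, and it 
assumes IIS can be recognised by a SERVER_SOFTWARE value beginning with 
‘Microsoft-IIS’; whether doing this is wise at all is exactly the open 
question above.

```python
def fix_iis_path_info(environ):
    """Strip a leading SCRIPT_NAME from PATH_INFO under IIS.

    Hypothetical adapter-level fix-up; modifies and returns environ.
    """
    # Assumption: IIS identifies itself like 'Microsoft-IIS/6.0'.
    if not environ.get('SERVER_SOFTWARE', '').startswith('Microsoft-IIS'):
        return environ

    script_name = environ.get('SCRIPT_NAME', '')
    path_info = environ.get('PATH_INFO', '')

    if script_name and path_info.startswith(script_name):
        rest = path_info[len(script_name):]
        # Only strip on a segment boundary, so '/app.pyx' is left alone
        # when SCRIPT_NAME is '/app.py'.
        if rest == '' or rest.startswith('/'):
            environ['PATH_INFO'] = rest
    return environ


if __name__ == '__main__':
    env = {
        'SERVER_SOFTWARE': 'Microsoft-IIS/6.0',
        'SCRIPT_NAME': '/app.py',
        'PATH_INFO': '/app.py/foo',
    }
    print(fix_iis_path_info(env)['PATH_INFO'])
```

Note that a framework which already compensates for the IIS behaviour would 
end up stripping twice if it sat on top of an adapter doing this.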

-- 
And Clover
mailto:and at doxdesk.com
http://www.doxdesk.com/

