[Web-SIG] Request for Comments on upcoming WSGI Changes

Tue Sep 22 18:07:14 CEST 2009

Graham wrote:

 > Armin has fast asleep now, so my shift.

Heh. It's a multiple-man job keeping up with this monster thread!

> The URLs don't break.

Not in themselves. It's just that the language of the PEP implies that 
fixing them up would contravene the spec:

 >> The application MUST use [the encoding guess for PATH_INFO] to decode
 >> the ``'QUERY_STRING'`` as well.

This isn't appropriate even as a SHOULD: the guessed encoding for 
PATH_INFO is very likely to be wrong, in particular for cases where the 
path was purely ASCII.

The application (or a library/framework acting on its behalf) should be 
allowed to decode QUERY_STRING using whatever encoding it is expecting. 
Disallowing anything other than utf-8 (and iso-8859-1 in a very 
unreliable way) makes it impossible to have queries in any other 
encoding at all and still comply with the spec, which is undesirable.
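
As a minimal sketch of what that freedom looks like in practice, assuming
a Python 3 style environ where QUERY_STRING is a native string of
percent-escaped ASCII: the helper name `decode_query` and its default are
hypothetical, not anything the PEP defines.

```python
from urllib.parse import parse_qs


def decode_query(environ, encoding="utf-8"):
    """Parse QUERY_STRING with the encoding the application expects.

    `encoding` is the application's own choice; it is deliberately
    independent of whatever was guessed for PATH_INFO.
    """
    raw = environ.get("QUERY_STRING", "")
    # parse_qs turns %XX escapes into bytes and decodes them with `encoding`;
    # errors="strict" makes a wrong expectation fail loudly instead of
    # silently producing mojibake.
    return parse_qs(raw, keep_blank_values=True,
                    encoding=encoding, errors="strict")
```

An application expecting iso-8859-1 forms would call
`decode_query(environ, "iso-8859-1")`, whatever the path looked like.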

If this sentence is removed, and `wsgi.uri_encoding` is guaranteed to be 
one of:

   a. definitive and reliable, or
   b. missing/None

I'm pretty much happy. What I don't want is that half the future-WSGI 
servers/gateways decide they have to provide *some* value for 
`wsgi.uri_encoding` even if they're not quite sure if it's the right 
one. Then we're back to square one.
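
On the application side, the "definitive or missing/None" contract could
be consumed roughly like this. This is only a sketch: the function name
and the `fallback` parameter are hypothetical, and it assumes a gateway
that leaves the key out (or sets it to None) whenever it is unsure.

```python
def declared_uri_encoding(environ, fallback=None):
    """Return the gateway's URI encoding only if it actually declared one.

    A missing key and an explicit None are treated the same way: the
    gateway is saying "I don't know", so the caller's fallback applies.
    """
    encoding = environ.get("wsgi.uri_encoding")
    if not encoding:
        return fallback
    return encoding
```

The point of the contract is that the `if not encoding` branch is the only
place the application has to apply its own knowledge; a gateway that
guesses takes that branch away.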

> if it is known that an application or some subset of
> URLs will always be receiving a request as non UTF-8, then it should
> employ code in those cases to always transcode it to the required
> encoding.

Yep, agreed. I think the PEP should clarify that; at the moment it is 
saying that a transcode is something you should only do for the 
iso-8859-1 case, but if you actually followed that advice you'd get 
highly inconsistent results. Perhaps we're at cross-purposes as to what 
exactly constitutes 'middleware'...
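
For illustration, a transcode of this kind might look like the sketch
below. It assumes the server's decode was lossless (strict utf-8, or
iso-8859-1), so that re-encoding recovers the original bytes; the name
`transcode` and its parameters are hypothetical.

```python
import codecs


def transcode(value, server_encoding, wanted_encoding):
    """Undo the server's decode and redo it with the encoding the app needs."""
    # Compare canonical codec names so aliases such as "latin-1" and
    # "iso-8859-1" are recognised as the same encoding.
    if codecs.lookup(server_encoding).name == codecs.lookup(wanted_encoding).name:
        return value
    raw = value.encode(server_encoding)   # recover the bytes from the wire
    return raw.decode(wanted_encoding)    # raises UnicodeDecodeError if wrong
```

Applied unconditionally to the URLs known to arrive in a given encoding,
this gives the same result whichever encoding the server happened to pick
first.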

> The other fallback is that a specific WSGI server could elect to
> provide an option to not use 'UTF-8' as the first choice for decoding

I really, *really* hope this does not happen. That just brings us more 
deployment heartaches.

> Whether surrogateescape gives a better solution I have no idea at this
> point

Yeah... I'm highly suspicious of surrogateescape in a web context and 
personally my code will be deliberately filtering all such characters 
out. I can see it being a possible way to smuggle unwanted sequences 
(such as overlongs) through filters, potentially causing endless 
security problems. But we'll see...
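
A rough sketch of the kind of filtering meant here, assuming strings that
were decoded with Python's `surrogateescape` error handler (undecodable
bytes come back as lone surrogates in U+DC80..U+DCFF). Both helper names
are hypothetical, and the range check is deliberately wider than strictly
needed: it catches every surrogate code point.

```python
def has_surrogates(text):
    """True if text contains any surrogate code point (U+D800..U+DFFF)."""
    return any(0xD800 <= ord(ch) <= 0xDFFF for ch in text)


def strip_surrogates(text):
    """Drop every surrogate code point, escaped bytes included."""
    return "".join(ch for ch in text if not 0xD800 <= ord(ch) <= 0xDFFF)


# The overlong encoding of "/" (bytes C0 AF) is invalid UTF-8, so with
# surrogateescape it survives decoding as two lone surrogates that would
# encode back to the same two bytes.
smuggled = b"/a\xc0\xafb".decode("utf-8", "surrogateescape")
```

Whether to strip such characters or reject the whole request is a policy
choice; rejecting is arguably the safer default.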

-- 
And Clover
mailto:and at doxdesk.com
http://www.doxdesk.com/