[Web-SIG] CGI WSGI and Unicode

Tue Dec 8 16:27:41 CET 2009

Manlio Perillo wrote:

> In a CGI application, HTTP headers are Unicode strings, and are decoded
> using system default encoding.

> In a future WSGI application, HTTP headers are Unicode strings, and are
> decoded using latin-1 encoding.

Yes. As proposed, WSGI 1.1 would require a CGI-to-WSGI handler to undo 
the decode step that happens when environ is read using the default 
encoding. At least this is now reliably possible thanks to 
surrogateescape.
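
As a rough sketch of what that undo step might look like in a CGI adapter 
(the function name and the optional encoding parameter are made up for 
illustration, and this assumes a POSIX Python 3 where os.environ was 
decoded with the filesystem encoding and surrogateescape):

```python
import sys

def environ_to_wsgi(value, encoding=None):
    """Turn an os.environ string into a Latin-1 'bytes-in-Unicode' string.

    Hypothetical helper, not part of any WSGI proposal text.
    """
    if encoding is None:
        # Assumption: this is the encoding Python used to decode environ.
        encoding = sys.getfilesystemencoding()
    # surrogateescape restores the exact bytes the server passed in,
    # including any that were not valid in the default encoding.
    raw = value.encode(encoding, "surrogateescape")
    # Latin-1 maps each byte to the code point with the same value.
    return raw.decode("latin-1")
```

An adapter would apply something like this to PATH_INFO and SCRIPT_NAME 
(and, for consistency, the HTTP_* variables) before building the WSGI 
environ.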

PATH_INFO is the only really important HTTP-related environment variable 
for Unicode. Potentially SCRIPT_NAME could also be significant in 
relation to PATH_INFO. The HTTP headers don't massively matter because 
there are almost never any non-ASCII characters in them.

Previously the job of undoing an unwanted decode step was dumped on 
whatever read the PATH_INFO; usually a routing component, which would 
have to make guesses with typically poor results. The CGI adapter is in 
a much better place to do it, being closer to the server.

> The problem is that not all browsers use latin-1.

Not WSGI's problem. WSGI will deliver bytes encoded into Unicode 
strings, not ready-to-use Unicode strings. It is up to the application 
to decide how it wants to handle those bytes; maybe it wants Latin-1 
and can do nothing, maybe it wants to recode to UTF-8, maybe something 
else completely. No solution satisfies every app, so there is always 
going to have to be a recode step somewhere.
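
For illustration, the application-side recode could be as small as the 
following. The function name and the fall-back-to-Latin-1 policy are 
assumptions for the sake of the example; an app or framework might 
equally choose to reject undecodable paths.

```python
def recode_path(wsgi_str, encoding="utf-8"):
    """Reinterpret a WSGI bytes-in-Unicode string in the app's encoding.

    Hypothetical helper; the fallback policy is only one possible choice.
    """
    # Encoding as Latin-1 recovers the original request bytes exactly.
    raw = wsgi_str.encode("latin-1")
    try:
        return raw.decode(encoding)
    except UnicodeDecodeError:
        # Assumed policy: keep the Latin-1 reading if the bytes do not fit.
        return wsgi_str
```

An app that is happy with Latin-1 simply skips this step.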

An application that doesn't want to think about this will use a 
framework that does it for them.

> What about HTTP_COOKIE?

For what it's worth, the choice of Latin-1 here results in the 'right' 
Unicode string for more browsers than any other potential encoding.

In any case, as previously discussed, non-ASCII cookies are already 
totally broken everywhere and hence used by no-one.

-- 
And Clover
mailto:and at doxdesk.com
http://www.doxdesk.com/