[Web-SIG] CGI WSGI and Unicode

Mon Dec 7 12:19:42 CET 2009

2009/12/7 Manlio Perillo <manlio_perillo at libero.it>:
> Graham Dumpleton wrote:
>
> Note: I'm sending the entire message to the mailing list.
>
>> 2009/12/7 Manlio Perillo <manlio_perillo at libero.it>:
>>> Hi.
>>>
>>> I'm playing with Python 3.x, current revision.
>>>
>>> I have noted that the data in os.environ are now Unicode strings.
>>>
>>> In a CGI application, HTTP headers are Unicode strings, and are decoded
>>> using system default encoding.
>>> In a future WSGI application, HTTP headers are Unicode strings, and are
>>> decoded using latin-1 encoding.
>>>
>>> In both cases, 'surrogateescape' is used.
>>
>> No, 'surrogateescape' is not necessary when using latin-1, or at least
>> for variables which use latin-1.
>>
>
> The problem is that not all browsers use latin-1.
> As an example with HTTP Digest authentication.

You seem to miss one important point. When converting bytes to unicode
as latin-1, the surrogate escape mechanism never comes into play. This
is because every byte value can be represented in latin-1, it being a
single-byte encoding which preserves the original bytes intact.
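
To illustrate, here is a minimal sketch of the latin-1 round trip. The
variable names are mine, chosen only for the example:

```python
# Every possible byte value, 0x00 through 0xFF.
raw = bytes(range(256))

# latin-1 maps byte N to code point N, so decoding cannot fail and
# needs no error handler such as 'surrogateescape'.
text = raw.decode('latin-1')

# Check that no lone surrogates (U+DC80..U+DCFF) appear in the result.
has_surrogates = any(0xDC80 <= ord(ch) <= 0xDCFF for ch in text)

# Encoding back to latin-1 recovers the original bytes exactly.
round_trip = text.encode('latin-1')
```

Here has_surrogates is False and round_trip compares equal to raw.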

>> Use of 'surrogateescape' is only relevant in the context of some web
>> servers and only relevant for specific variables, some of which aren't
>> even part of the set of variables which are required by WSGI.
>>
>> For example, in Apache/mod_wsgi, 'surrogateescape' is used on
>> DOCUMENT_ROOT and SCRIPT_FILENAME.
>
> What about HTTP_COOKIE?

You trimmed part of my response which is very important.
DOCUMENT_ROOT and SCRIPT_FILENAME must be decoded using the
filesystem encoding and not latin-1. If they aren't, the result may not
be usable as input to the file system routines in Python 3.1, which
sort of expect the file system encoding plus surrogate escape.
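
A rough sketch of what that decoding looks like, assuming a UTF-8
filesystem encoding. The path is hypothetical, and a real adapter would
ask sys.getfilesystemencoding() rather than hard-code the encoding:

```python
# Hypothetical raw SCRIPT_FILENAME; the byte 0xE9 is not valid UTF-8
# on its own.
raw_path = b'/srv/www/caf\xe9.wsgi'

# Assumption: fixed to 'utf-8' so the example behaves the same on every
# system. A real adapter would use sys.getfilesystemencoding().
fs_encoding = 'utf-8'

# The undecodable byte 0xE9 is smuggled through as the lone surrogate
# U+DCE9 instead of raising UnicodeDecodeError.
path = raw_path.decode(fs_encoding, 'surrogateescape')

# Encoding with the same handler restores the original bytes, which is
# what the file system routines rely on.
restored = path.encode(fs_encoding, 'surrogateescape')
```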

As I say though, those variables aren't relevant to most WSGI hosting
mechanisms, and even where the web server does provide them, nearly all
WSGI applications will not care about them. In Apache/mod_wsgi I worry
about them because Apache/mod_wsgi provides features which allow one to
define Apache style handlers based on file type, where the handler for
the arbitrary file type is implemented as a WSGI application. In that
case the file the URL mapped to, i.e. SCRIPT_FILENAME, is an arbitrary
file and not a WSGI script file.

In the case of HTTP_COOKIE, the WSGI adapter just converts it to
unicode as latin-1. In doing so it is washing its hands of what to do
with it, because it cannot know the real encoding and only the WSGI
application can. Because latin-1 is used, no surrogate escape is
involved. The WSGI application, which knows what encoding may be in
use, can then convert back to bytes and decode with a different
encoding, using surrogate escape if it wants to ensure there is no
outright error when bytes can't be represented in that alternate
encoding.
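
As a sketch of that application-side step: recode_header is a
hypothetical helper, not part of any WSGI library, and the cookie value
is made up for the example.

```python
def recode_header(value, encoding='utf-8'):
    """Re-decode a latin-1 decoded header value using another encoding."""
    # Undo the adapter's latin-1 decoding to get the original bytes back.
    raw = value.encode('latin-1')
    # Decode with the encoding the application expects; bytes that do
    # not fit become lone surrogates instead of raising an error.
    return raw.decode(encoding, 'surrogateescape')

# Hypothetical cookie: the UTF-8 bytes for 'caf\u00e9', exposed the way
# the adapter would expose them, i.e. decoded as latin-1.
environ = {'HTTP_COOKIE': b'name=caf\xc3\xa9'.decode('latin-1')}
cookie = recode_header(environ['HTTP_COOKIE'])
```

If the application guesses the encoding wrongly, the value still decodes
without error and the original bytes stay recoverable by encoding again
with the same encoding and 'surrogateescape'.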

>> [...]
>>> Can this cause troubles and incompatibility problems?
>>> I'm interested in special header handling, like cookies, that contain
>>> opaque data.
>>
>> The issues with a CGI/WSGI bridge in Python 3.X have been discussed
>> previously on the list.
>
> It seems I missed it.
>
>> It is acknowledged that there are problems to
>> be solved there, at least to the extent that a CGI/WSGI bridge
>> implementation has to correct the encoding, and also that this may
>> only be solvable in Python 3.1 onwards due to not having access to
>> what encoding was used for environment variables in Python 3.0. Not
>> many people care about CGI these days and so no one has bothered to
>> come up with a working CGI/WSGI bridge for Python 3.X.
>>
>
> CGI is very important; there are some kinds of web applications that
> have problems when executing in a long running process.
>
> As an example, I prefer to run Trac and Mercurial instances as CGI.

Yes, I agree that there are some valid uses of a CGI/WSGI bridge,
although those two aren't the ones I would have in mind.

For the record, CGI/WSGI adapters should also protect the original
stdin/stdout so that the WSGI application doesn't cause problems by
using 'print' or doing other odd stuff with input. I haven't seen a
single CGI/WSGI adapter which does it in a way that I would say is
correct, or at least robust against users doing stupid things, so
encoding issues aren't the only thing where CGI/WSGI adapters need work.
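
One possible starting point is sketched below, assuming it is enough to
swap the sys level streams. protected_stdio is a hypothetical name, and
this does not guard against code that writes to file descriptor 1
directly, so it is an illustration rather than a robust solution:

```python
import contextlib
import io
import sys

@contextlib.contextmanager
def protected_stdio():
    """Hide the real stdin/stdout from the application while it runs."""
    # Keep private references for the adapter's own use; on a real CGI
    # process it would read and write bytes through their .buffer.
    cgi_in, cgi_out = sys.stdin, sys.stdout
    # Reads from sys.stdin now see an empty stream, not the request body.
    sys.stdin = io.StringIO('')
    # A stray print() now goes to stderr, not into the HTTP response.
    sys.stdout = sys.stderr
    try:
        yield cgi_in, cgi_out
    finally:
        # Restore the original streams once the application is done.
        sys.stdin, sys.stdout = cgi_in, cgi_out
```

The adapter would call the WSGI application inside the with block and
use only the two streams it was handed for the request and response.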

Graham