CGI and Unicode

Mon Jun 23 17:46:59 EDT 2003

Jim Hefferon wrote:
> I have been struggling with getting Unicode out of Python's cgi 
> module. I have a small script illustrating the problem at the bottom 
> but first I need to explain.

[...]

> But when I ask what is the type of the variable that I get from 
> the cgi module, it comes out as StringType, not UnicodeType.  My 
> browser is Galeon on the latest Debian and I've also tested it 
> with IE on NT.
> 
> What am I missing?  

The problem, I think, is the lack of consistency amongst browsers in
indicating the encoding of the submitted data.  For instance, when
responding to the form in your script, Opera includes a "Content-type"
header containing:

   application/x-www-form-urlencoded;charset=utf-8

whereas the "Content-type" header sent by Mozilla (and I suspect most
other browsers[0]) doesn't indicate the charset:

   application/x-www-form-urlencoded

If all browsers always included the charset, then the cgi module could
reliably detect the data encoding and store the parameters as Unicode
strings when appropriate.  As it stands, there's usually insufficient
information for cgi to detect when Unicode is being sent or what the
encoding is.  If /you/ can determine by other means that the submitted
data is UTF-8 encoded (which is probably the case if the form was part
of a UTF-8 encoded document) there's nothing stopping you from
decoding it yourself (using unicode(string, 'utf-8') or
codecs.utf_8_decode(string)[0], for example).
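
As a rough sketch of that approach (this is not something the cgi
module does for you): look for a charset in the Content-type value
and, when the browser didn't send one, fall back to an encoding you've
determined some other way.  The function names and the UTF-8 fallback
are my own assumptions, for illustration only.

```python
def charset_from_content_type(content_type, default=None):
    """Return the charset parameter of a Content-type value, or default."""
    # Skip the media type itself; look only at the ";name=value" parameters.
    for param in content_type.split(';')[1:]:
        name, _, value = param.partition('=')
        if name.strip().lower() == 'charset':
            return value.strip().strip('"') or default
    return default


def decode_field(raw, content_type, fallback='utf-8'):
    """Decode one raw form value (a byte string) to Unicode text.

    fallback is an assumption: pass whatever encoding the page that
    contained the form was served in.
    """
    charset = charset_from_content_type(content_type, fallback)
    return raw.decode(charset)


# Opera-style header: the charset is stated, so it gets used.
opera = 'application/x-www-form-urlencoded;charset=utf-8'
# Mozilla-style header: nothing stated, so the fallback applies.
mozilla = 'application/x-www-form-urlencoded'

text = decode_field(b'caf\xc3\xa9', mozilla)
```

If the guess is wrong, decode() raises UnicodeDecodeError (or
LookupError for a charset name Python doesn't know), so you'd probably
want to catch those in a real script.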

Oh, one last thing (which you probably know, but just in case...): you
can access the submitted headers through the environment variables of
the CGI process.

   import os
   # Dump every CGI environment variable, one per paragraph.
   for key, value in sorted(os.environ.items()):
      print('<p>%30s : %s</p>' % (key, value))
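
If you only want the request headers rather than the whole
environment, a filter along these lines may be handier.  It relies on
the CGI convention that Content-type and Content-length arrive as
CONTENT_TYPE and CONTENT_LENGTH, while other headers get an HTTP_
prefix; the function name is just for illustration.

```python
import os


def request_headers(environ):
    """Pick the HTTP request headers out of a CGI environment mapping."""
    headers = {}
    for key, value in environ.items():
        if key in ('CONTENT_TYPE', 'CONTENT_LENGTH'):
            # These two are passed through without the HTTP_ prefix.
            headers[key] = value
        elif key.startswith('HTTP_'):
            headers[key] = value
    return headers


if __name__ == '__main__':
    for key, value in sorted(request_headers(os.environ).items()):
        print('%s: %s' % (key, value))
```

The value under CONTENT_TYPE is the one to inspect for a charset.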

Hope this helps,

               Jeremy.

[0] A quick skim of RFC 1867 seems to indicate that the charset clause
isn't standard.