Unicode in cgi-script with apache2

Steven D'Aprano steve+comp.lang.python at pearwood.info
Sun Aug 17 09:00:53 EDT 2014


Dominique Ramaekers wrote:

> As I suspected, if I check the used encoding in wsgi I get:
> ANSI_X3.4-1968

That's another name for ASCII.
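
You can check this yourself. The following is only a small sketch using the standard codecs module; the variable names are mine, not anything from your script:

```python
import codecs

# Python resolves the locale's name through its codec alias table.
info = codecs.lookup('ANSI_X3.4-1968')
canonical_name = info.name
print(canonical_name)

# Being plain ASCII, the codec cannot encode accented characters.
try:
    '\u00e9'.encode('ANSI_X3.4-1968')
    encode_failed = False
except UnicodeEncodeError:
    encode_failed = True
print(encode_failed)
```

If that reports the ASCII codec, any non-ASCII text written through that encoding will either fail or be mangled.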


> I found you can define the coding of the script with a special comment:
> # -*- coding: utf-8 -*-

Be careful. That just tells Python what encoding the source code file is in.
It is not used by print(), or reading/writing files, just when the compiler
reads the source code.
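
To see that the comment only affects how the source bytes are decoded, here is a rough sketch that compiles the same bytes under two different declarations. The run() helper and the "<demo>" filename are made up for illustration:

```python
# The same source bytes, compiled under two different coding declarations.
body = b"s = '\xc3\xa9'\n"   # the UTF-8 bytes for e-acute inside a string literal

def run(declaration):
    """Compile declaration + body from raw bytes and return the value of s."""
    namespace = {}
    exec(compile(declaration + body, "<demo>", "exec"), namespace)
    return namespace["s"]

as_utf8 = run(b"# -*- coding: utf-8 -*-\n")
as_latin1 = run(b"# -*- coding: latin-1 -*-\n")

# ascii() keeps the output printable whatever encoding stdout uses.
print(ascii(as_utf8), ascii(as_latin1))
```

The declaration changes which string the compiler builds from the file. It has no say in how that string is later encoded on output.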


> Now I don't get an error but my special chars still doesn't display well.
> The script:
> # -*- coding: utf-8 -*-
> import sys
> def application(environ, start_response):
>      status = '200 OK'
>      output = 'Hello World! é ü à ũ'
>      #output = sys.getfilesystemencoding() #1
> 
>      response_headers = [('Content-type', 'text/plain'),
>                          ('Content-Length', str(len(output)))]
>      start_response(status, response_headers)
> 
>      return [output]
> 
> Gives in the browser as output:
> 
> Hello World! é ü à ũ

That looks like ordinary mojibake. Your Python script takes the text
string 'Hello World! é ü à ũ', which encoded as UTF-8 gives you these bytes:

py> 'Hello World! é ü à ũ'.encode('utf-8')
b'Hello World! \xc3\xa9 \xc3\xbc \xc3\xa0 \xc5\xa9'

Decoding those bytes as Latin-1 instead of UTF-8 gives:

py> 'Hello World! é ü à ũ'.encode('utf-8').decode('latin1')
'Hello World! é ü Ã\xa0 Å©'

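which appears to be exactly what you have. Why Latin-1 instead of ASCII?
Probably because whatever renders the page has to show *something* for
bytes above 127, which ASCII leaves undefined, and Latin-1 (sometimes
loosely called "extended ASCII") maps every byte to a character.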
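
One way to confirm the diagnosis is to run the mistake backwards. This is just a sketch, assuming the garbled text really is UTF-8 bytes misread as Latin-1; the variable names are mine:

```python
original = 'Hello World! \u00e9 \u00fc \u00e0 \u0169'

# Reproduce the damage: UTF-8 bytes misread as Latin-1.
garbled = original.encode('utf-8').decode('latin-1')

# Undo it: recover the raw bytes, then decode them as UTF-8.
repaired = garbled.encode('latin-1').decode('utf-8')

print(ascii(garbled))
print(repaired == original)
```

If the round trip gives back the original string, the bytes leaving your script are fine and the mix-up happens when they are interpreted.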


I'm starting to fear a bug in Python 3.4, but since I have almost no
knowledge about wsgi and cgi, I can't be sure that this isn't just normal
expected behaviour :-(
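
For what it's worth, here is something that may be worth trying. It is only a sketch, and it assumes your server follows PEP 3333, which expects the response body as bytes. The idea is to encode the text explicitly, declare the charset in the Content-Type header, and compute Content-Length from the encoded bytes rather than from the text:

```python
def application(environ, start_response):
    status = '200 OK'
    text = 'Hello World! é ü à ũ'

    # Encode explicitly instead of relying on any default encoding.
    body = text.encode('utf-8')

    # Tell the browser which encoding the bytes are in, and give the
    # length in bytes, not in characters.
    response_headers = [('Content-Type', 'text/plain; charset=utf-8'),
                        ('Content-Length', str(len(body)))]
    start_response(status, response_headers)

    return [body]
```

I haven't tested this under Apache, so treat it as a starting point and not as a fix.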


-- 
Steven



