Unicode in cgi-script with apache2

Steven D'Aprano steve+comp.lang.python at pearwood.info
Sun Aug 17 03:50:48 EDT 2014


Denis McMahon wrote:

> From your other message, the error appears to be a python error on
> reading the input file. For some reason python seems to be trying to
> interpret the file it is reading as ascii.

Oh!!! /facepalm

I think you've got it. I've been assuming the problem was on *writing* the
line. That's because the OP was insistent that the line failing was

    [quoting Dominique]
    The problem is, when python 'prints' to the apache interface, it
    translates the string to ascii.


but if you read the traceback, you're right, the problem is *reading* the
file, not printing:

[Sat Aug 16 23:12:42.158326 2014] [cgi:error] [pid 29327] [client 
119.63.193.196:11110] AH01215: Traceback (most recent call last):
[Sat Aug 16 23:12:42.158451 2014] [cgi:error] [pid 29327] [client 
119.63.193.196:11110] AH01215:   File "/var/www/cgi-python/index.html", 
line 12, in <module>
[Sat Aug 16 23:12:42.158473 2014] [cgi:error] [pid 29327] [client 
119.63.193.196:11110] AH01215:     for line in f:


That's the line which is failing: reading the file. What is read then has to
be *decoded*. Files contain bytes, which have to be decoded into text, and
the decode is assuming ASCII:


[Sat Aug 16 23:12:42.158526 2014] [cgi:error] [pid 29327] [client 
119.63.193.196:11110] AH01215:   File 
"/usr/lib/python3.4/encodings/ascii.py", line 26, in decode
[Sat Aug 16 23:12:42.158569 2014] [cgi:error] [pid 29327] [client 
119.63.193.196:11110] AH01215:     return codecs.ascii_decode(input, 
self.errors)[0]
[Sat Aug 16 23:12:42.158663 2014] [cgi:error] [pid 29327] [client 
119.63.193.196:11110] AH01215: UnicodeDecodeError: 'ascii' codec can't 
decode byte 0xc3 in position 1791: ordinal not in range(128)
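
To see the same failure in isolation, here is a minimal sketch. The sample
string is made up (it is not the OP's file); all it assumes is that the data
is UTF-8 containing a non-ASCII character, whose first byte is then 0xc3 just
as in the traceback:

```python
# Hypothetical sample text: "cafe" with an e-acute, encoded as UTF-8.
# The e-acute becomes the two bytes 0xc3 0xa9.
data = "caf\u00e9".encode("utf-8")

try:
    data.decode("ascii")
except UnicodeDecodeError as err:
    # err.start is the index of the first byte ASCII could not handle.
    pos = err.start
    failed_byte = data[pos]
    print("ascii failed on byte", hex(failed_byte), "at position", pos)

# Decoding the very same bytes as UTF-8 works.
text = data.decode("utf-8")
print(len(text), "characters decoded")
```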


> I wonder if specifying the binary data parameter and / or utf-8 encoding
> when opening the file might help.

We don't really know which encoding the index.html file uses. It
might be Latin-1, or cp1252, or some other legacy encoding. But let's
assume it's UTF-8.

So why is Dominique's script reading it as ASCII? That's the key question. I
have a sinking feeling that Apache may be running Python as a subprocess
with the C locale. I don't know enough about CGI to be more than
just guessing.
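
One way to test that guess is to have the script report which encoding
Python picks when open() is called without an encoding argument. This is
only a diagnostic sketch; I'm assuming the OP can run a few lines under
Apache and look at the output or the error log:

```python
import locale
import sys

# In text mode, open() without encoding= falls back to this value.
default_encoding = locale.getpreferredencoding(False)

print("open() default encoding:", default_encoding)
print("stdout encoding:", sys.stdout.encoding)
```

If that reports ASCII (or an alias of it such as ANSI_X3.4-1968), the locale
theory looks a lot more plausible.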

Dominique, if you write:

f = open("/var/www/cgi-data/index.html", "r", encoding='utf-8')

the problem should go away (assuming index.html is valid UTF-8). If it
doesn't, there's a very strange bug somewhere.

Please try that, and see if it fixes the problem, or if the error goes to a
different line.
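
Here is a small self-contained sketch of the same idea. It uses a temporary
file as a stand-in for index.html (the file name and contents are invented
for the example), writes some UTF-8 text to it, and reads it back with an
explicit encoding so the result no longer depends on the locale:

```python
import os
import tempfile

# Hypothetical stand-in for index.html: a temp file holding UTF-8 text.
sample = "<p>Hello \u00eb \u00fc world</p>\n"
fd, path = tempfile.mkstemp(suffix=".html")
with os.fdopen(fd, "w", encoding="utf-8") as out:
    out.write(sample)

# Explicit encoding: decoding no longer depends on the process locale.
with open(path, "r", encoding="utf-8") as f:
    lines = [line for line in f]

os.remove(path)
print(len(lines), "line(s) read without a UnicodeDecodeError")
```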


> eg:
> 
> f = open( "/var/www/cgi-data/index.html", "rb" )

No, you don't want that, since then reading the file will return bytes, not
text. Although I suppose the OP might just commit to using bytes
everywhere. Yuck.

> f = open( "/var/www/cgi-data/index.html", "rb", encoding="utf-8" )

That makes no sense. If you're reading in binary mode, there's no encoding.
Every byte represents itself.
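
A quick sketch of both points, again using a throwaway temporary file rather
than the OP's real one: asking for binary mode plus an encoding is rejected
outright, and plain binary mode hands back bytes, not text.

```python
import os
import tempfile

# Throwaway file containing the UTF-8 bytes for an e-diaeresis.
fd, path = tempfile.mkstemp()
with os.fdopen(fd, "wb") as out:
    out.write(b"\xc3\xab")

# Binary mode plus an encoding is refused with a ValueError.
try:
    open(path, "rb", encoding="utf-8")
except ValueError as err:
    message = str(err)
    print("ValueError:", message)

# Binary mode alone returns the raw bytes, undecoded.
with open(path, "rb") as f:
    raw = f.read()

os.remove(path)
print(type(raw).__name__, len(raw))
```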

> f = open( "/var/www/cgi-data/index.html", "r", encoding="utf-8" )

That's the bunny!

If you just want to hide the problem without fixing the underlying cause,
add an argument errors="replace", which is ugly but at least lets you move
on:

py> b = "Hello ë ü world".encode('utf-8')
py> print(b.decode('ascii', errors='replace'))
Hello �� �� world
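
The same argument can be passed straight to open(). A minimal sketch, using a
temporary file with made-up contents and deliberately reading it with the
wrong (ASCII) encoding to show the effect:

```python
import os
import tempfile

# Throwaway file holding "Hello e-diaeresis u-diaeresis world" as UTF-8.
fd, path = tempfile.mkstemp()
with os.fdopen(fd, "wb") as out:
    out.write("Hello \u00eb \u00fc world".encode("utf-8"))

# Wrong encoding on purpose; errors="replace" swaps each undecodable
# byte for U+FFFD instead of raising UnicodeDecodeError.
with open(path, "r", encoding="ascii", errors="replace") as f:
    text = f.read()

os.remove(path)
print(text.count("\ufffd"), "bytes were replaced")
```

Each non-ASCII character here is two bytes in UTF-8, so it turns into two
replacement characters, which matches the output above.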



-- 
Steven



