Unicode in cgi-script with apache2

Sun Aug 17 01:54:53 EDT 2014

Dominique Ramaekers wrote:

[...]
> 2) Your tip, to use 'encode' did not solve the problem and created a new
> one. My lines were encapsulated in quotes and I got a lot of \b's and
> \n's... and I still got the same error.

Just throwing random encode/decode calls into the mix is unlikely to fix
the problem. First, you need to find an Apache expert who can tell you what
encoding your Apache process is expecting. Hopefully it is UTF-8. Then you
need to confirm that your Python process is also using UTF-8. Nearly all
Unicode-related issues are due to mismatches between encodings in different
parts of the system. If only everyone could use UTF-8 for all storage and
transport layers, life would be so much simpler... but I digress.
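To make that concrete, here is a minimal sketch (Python 3, with a made-up
sample string, not taken from your script) of two things: why printing the
result of encode() gives you quotes, a leading b and literal \n's, and what
an encoding mismatch looks like when it fails.

```python
text = "ë ü\n"  # made-up sample line, standing in for a line of index.html

# encode() returns bytes, not str. Printing a bytes object shows its repr,
# which is where the b'...' quoting and the literal \n come from.
data = text.encode("utf-8")
shown = repr(data)
print(shown)

# A mismatch: the bytes were written as UTF-8 but are read back as ASCII.
try:
    data.decode("ascii")
    mismatch = None
except UnicodeDecodeError as err:
    mismatch = err.reason
print(mismatch)
```

Neither result tells you which part of your setup is choosing ASCII. It only
shows that the two sides have to agree.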

[...]
> What seems to be the problem:
> My Script was ok. I know this because in the terminal I got my expected
> output. 

Did you test it at the terminal with input including ë and ü?

> Python3 uses UTF-8 coding as a standard. The problem is, when 
> python 'prints' to the apache interface, it translates the string to
> ascii. (Why, I never found an answer).

Try putting the lines:

import sys
print(sys.getfilesystemencoding())

at the start of your program, and see what it prints at the terminal and
what it prints under Apache. I predict that under Apache, it will say
something like "C locale" or "US ASCII". If so, *that* is your problem.
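A slightly fuller version of that check, as a sketch only. The helper name
encoding_report is my own invention. It also reports sys.stdout.encoding,
which is the encoding print() actually uses, and it writes to stderr on the
assumption that your Apache sends a CGI script's stderr to its error log,
so that nothing is printed ahead of the HTTP headers.

```python
import locale
import sys


def encoding_report():
    """Collect the encodings Python settled on for this process."""
    return {
        "stdout": sys.stdout.encoding,
        "filesystem": sys.getfilesystemencoding(),
        "locale": locale.getpreferredencoding(False),
    }


report = encoding_report()
# Write to stderr so that, under CGI, the report stays out of the response.
for name, value in sorted(report.items()):
    sys.stderr.write("%s encoding: %s\n" % (name, value))
```

Run it once from the terminal and once through Apache, then compare the two
sets of lines.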

> Somewhere in the middle of my 
> index.html file, there are letters like ë and ü. If Python tries to
> translate these, Python throws an error. If I delete these letters in
> the file, the script works perfectly in a browser! In Python2.7 the
> script can easily be tweaked so the translation to ascii isn't done, 

Not quite. Under Python 2.7, you will likely get mojibake. For instance, if
your index.html contains "ë ü π" stored in UTF-8, Python 2.7 will throw its
hands in the air, say "I have no idea what ASCII characters they are, let's
pretend it's some sort of Latin-1" and you'll get:

Ã« Ã¼ Ï

instead. Or perhaps not. With Python 2.7, what you get is not quite random,
but it depends on the environment in some fairly obscure ways. Python 3 at
least raises an exception when there is a mismatch, instead of trying to
guess what you meant.
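You can reproduce that kind of wrong guess deliberately. This is a sketch in
Python 3 that imitates the Latin-1 misreading. It is not what Python 2.7
literally does internally, and the variable names are just for illustration.

```python
original = "ë ü π"
raw = original.encode("utf-8")    # the bytes as stored in the file
garbled = raw.decode("latin-1")   # wrong guess: one character per byte

# ascii() keeps the output printable even when stdout is ASCII-only.
print(ascii(original))
print(ascii(garbled))

# Latin-1 maps every byte to some character, so no error is raised and the
# mistake can be undone by reversing the two steps.
restored = garbled.encode("latin-1").decode("utf-8")
print(restored == original)
```

Note that the wrong decode succeeds silently. That is exactly what makes
mojibake harder to notice than an exception.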

> but 
> in Python3, it's a real pain in the a... I've read about people who
> managed to force Python3 to 'print' to apache in UTF-8, but none of
> their solutions worked for me.

There is very little point in throwing random solutions at a problem if you
don't understand the problem. First you need to find out why Python is
trying to convert to ASCII. That's probably because of something Apache is
doing. Do you have an Apache technician you can ask?

-- 
Steven