codecs.StreamRecoder not doing what I expected.

Sun Dec 13 05:23:03 EST 2015

D'Arcy J.M. Cain wrote:

> On Sat, 12 Dec 2015 21:35:36 +0100
> Peter Otten <__peter__ at web.de> wrote:
>> def read_file(filename):
>>     for encoding in ["utf-8", "iso-8859-1"]:
>>         try:
>>             with open(filename, encoding=encoding) as f:
>>                 return f.read()
>>         except UnicodeDecodeError:
>>             pass
>>     raise AssertionError("unreachable")
> 
> I replaced this in my test and it works.  However, I still have a
> problem with my actual code.  The point of this code was that I expect
> all the files that I am reading to be either ASCII, UTF-8 or LATIN-1
> and I want to normalize my input.  My problem may actually be elsewhere.
> 
> My application is a web page of my wife's recipes.  She has hundreds of
> files with a recipe in each one.  Often she simply typed them in but
> sometimes she cuts and pastes from another source and gets non-ASCII
> characters.  So far they seem to fit in the three categories above.
> 
> I added test prints to sys.stderr so that I can see what is happening.
> In one particular case I have this "73 61 75 74 c3 a9" in the file.
> When I open the file with
> "open(filename, "r", encoding="utf-8").read()" I get what appears to be
> a latin-1 string.  

No, you get unicode. The escape sequence that ascii() or repr() shows for
the 'LATIN SMALL LETTER E WITH ACUTE' code point (U+00E9) just happens to 
look the same as its latin-1 byte value:

>>> print(ascii("é"))
'\xe9'
>>> print("é".encode("latin1"))
b'\xe9'
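
A minimal sketch using the six bytes you quoted from your file, in case it
helps to see the whole round trip. The names raw and text are only for
illustration. Decoding as UTF-8 gives a five-character str; the \xe9 only
shows up when that str is displayed in escaped form:

```python
# The bytes quoted from the file: "73 61 75 74 c3 a9"
raw = bytes.fromhex("73 61 75 74 c3 a9")

# Decoding as UTF-8 turns the two bytes c3 a9 into one code point, U+00E9.
text = raw.decode("utf-8")

print(len(raw), len(text))      # 6 5
print(ascii(text))              # 'saut\xe9'  (escaped display of a str)
print(text.encode("latin-1"))   # b'saut\xe9' (one byte for the e-acute)
print(text.encode("utf-8"))     # b'saut\xc3\xa9' (two bytes again)
```

So seeing "saut\xe9" in the log means the decoding step worked.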

> I print it to stderr and view it in the web log.
> The above string prints as "saut\xe9".  The last is four actual
> characters in the file.
> 
> When I try to print it to the web page it fails because the \xe9
> character is not valid ASCII.  

Can you give some code that reproduces the error? What is the traceback?

> However, my default encoding is utf-8.

That doesn't matter. sys.stdout.encoding/sys.stderr.encoding are relevant.
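
To illustrate why the stream's encoding is what counts, here is a small
sketch that writes the same str to two in-memory text streams. The helper
try_write is made up for this example, and io.BytesIO merely stands in for
the real stdout; I am assuming your failure is a UnicodeEncodeError of this
kind, which the traceback would confirm:

```python
import io

def try_write(text, encoding):
    """Write text to an in-memory stream that uses the given encoding.

    Returns the bytes that were written, or None if the text cannot be
    encoded for that stream.
    """
    stream = io.TextIOWrapper(io.BytesIO(), encoding=encoding)
    try:
        stream.write(text)
        stream.flush()
    except UnicodeEncodeError:
        return None
    return stream.buffer.getvalue()

print(try_write("saut\xe9", "utf-8"))  # b'saut\xc3\xa9'
print(try_write("saut\xe9", "ascii"))  # None
```

The str is the same in both calls; only the encoding of the stream differs.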

> Other web pages on the same server display fine.
> 
> I have the following in the Apache config by the way.
> 
> SetEnv PYTHONIOENCODING utf8
> 
> So, my file is utf-8, I am reading it as utf-8, my Apache server output
> is set to utf-8.  How is ASCII sneaking in?

I don't know. Have you verified that Python "sees" the setting, e.g. with

import os
import sys
ioencoding = os.environ.get("PYTHONIOENCODING")
assert ioencoding == "utf8"
assert sys.stdout.encoding == ioencoding
assert sys.stderr.encoding == ioencoding
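
If a failing assert is inconvenient under Apache, a variant that only
reports what it finds may be easier to read in the web log. This is just a
sketch; io_encoding_report is a name I made up:

```python
import os
import sys

def io_encoding_report():
    """Collect the settings that decide how print() encodes its output."""
    return {
        "PYTHONIOENCODING": os.environ.get("PYTHONIOENCODING"),
        "LANG": os.environ.get("LANG"),
        # getattr() guards against a replaced stream without .encoding
        "stdout": getattr(sys.stdout, "encoding", None),
        "stderr": getattr(sys.stderr, "encoding", None),
    }

# Write one line per setting to stderr so it ends up in the server log.
for name, value in sorted(io_encoding_report().items()):
    print("{}: {!r}".format(name, value), file=sys.stderr)
```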

Have you tried setting LANG as Oscar suggested in the other thread?