codecs.StreamRecoder not doing what I expected.
D'Arcy J.M. Cain
darcy at VybeNetworks.com
Sun Dec 13 01:35:45 EST 2015
On Sat, 12 Dec 2015 21:35:36 +0100
Peter Otten <__peter__ at web.de> wrote:
> def read_file(filename):
>     for encoding in ["utf-8", "iso-8859-1"]:
>         try:
>             with open(filename, encoding=encoding) as f:
>                 return f.read()
>         except UnicodeDecodeError:
>             pass
>     raise AssertionError("unreachable")
I replaced this in my test and it works. However, I still have a
problem with my actual code. The point of this code was that I expect
all the files that I am reading to be either ASCII, UTF-8 or LATIN-1
and I want to normalize my input. My problem may actually be elsewhere.
My application is a web page of my wife's recipes. She has hundreds of
files with a recipe in each one. Often she simply typed them in but
sometimes she cuts and pastes from another source and gets non-ASCII
characters. So far they seem to fit in the three categories above.
I added test prints to sys.stderr so that I can see what is happening.
In one particular case I have this "73 61 75 74 c3 a9" in the file.
When I open the file with
"open(filename, "r", encoding="utf-8").read()" I get what appears to be
a latin-1 string. I print it to stderr and view it in the web log.
The above string prints as "saut\xe9". The trailing "\xe9" is four
literal characters (backslash, x, e, 9) in the log.
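
For reference, here is a minimal sketch of what those six bytes decode to.
It assumes nothing beyond the byte values quoted above; the variable names
are illustrative only. It suggests that a log line reading "saut\xe9" is
consistent with a correctly decoded string whose last character was escaped
on output, rather than with a bad read.

```python
# The six bytes quoted from the file, written as hex.
raw = bytes.fromhex("73 61 75 74 c3 a9")

# Decoding as UTF-8 turns the two bytes c3 a9 into one code point, U+00E9.
text = raw.decode("utf-8")
last_code_point = hex(ord(text[-1]))

# A stream that cannot encode U+00E9 but uses the "backslashreplace" error
# handler writes the character as a four-character escape instead.
escaped = text.encode("ascii", "backslashreplace").decode("ascii")

print(len(text), last_code_point, escaped)
```

Code point U+00E9 has the same number as the Latin-1 byte e9, which may be
why the decoded string looks like Latin-1 when it is escaped.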
When I try to print it to the web page it fails because the \xe9
character is not valid ASCII. However, my default encoding is utf-8.
Other web pages on the same server display fine.
I have the following in the Apache config by the way.
SetEnv PYTHONIOENCODING utf8
So, my file is utf-8, I am reading it as utf-8, my Apache server output
is set to utf-8. How is ASCII sneaking in?
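
One way to narrow this down is to log which encoding the output stream
actually ended up with inside the running script, since that may differ
from what the Apache config asks for. The following is only a sketch,
assuming the page is written through sys.stdout; the helper names
describe_stream and utf8_writer are hypothetical and not part of the
original code.

```python
import io
import sys


def describe_stream(stream):
    """Return the (encoding, errors) pair a text stream reports, if any."""
    return getattr(stream, "encoding", None), getattr(stream, "errors", None)


def utf8_writer(binary_stream):
    """Wrap a binary stream so that text written to it is encoded as UTF-8."""
    return io.TextIOWrapper(binary_stream, encoding="utf-8", newline="\n")


# Log what the interpreter really chose for stdout in this process.
print("stdout:", describe_stream(sys.stdout), file=sys.stderr)

# An ASCII-encoded stream rejects U+00E9, which matches the reported failure.
ascii_out = io.TextIOWrapper(io.BytesIO(), encoding="ascii")
try:
    ascii_out.write("saut\u00e9")
    ascii_out.flush()
    ascii_failed = False
except UnicodeEncodeError:
    ascii_failed = True

# A stream wrapped explicitly as UTF-8 accepts the same text.
buf = io.BytesIO()
out = utf8_writer(buf)
out.write("saut\u00e9\n")
out.flush()
```

If the logged encoding turns out to be ASCII, wrapping sys.stdout.buffer
with utf8_writer would be one possible workaround. It would not explain
why the PYTHONIOENCODING setting is not taking effect.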
--
D'Arcy J.M. Cain
Vybe Networks Inc.
http://www.VybeNetworks.com/
IM:darcy at Vex.Net VoIP: sip:darcy at VybeNetworks.com