encoding="utf8" ignored when parsing XML

Tue Dec 27 20:40:32 EST 2016

On Wed, 28 Dec 2016 02:05 am, Skip Montanaro wrote:

> I am trying to parse some XML which doesn't specify an encoding (Python
> 2.7.12 via Anaconda on RH Linux), so it barfs when it encounters non-ASCII
> data. No great surprise there, but I'm having trouble getting it to use
> another encoding. First, I tried specifying the encoding when opening the
> file:
> 
> f = io.open(fname, encoding="utf8")
> root = xml.etree.ElementTree.parse(f).getroot()

The documentation for ET.parse is pretty sparse

https://docs.python.org/2/library/xml.etree.elementtree.html#xml.etree.ElementTree.parse

but we can infer that it should take bytes as argument, not Unicode, since
it does its own charset processing. (The optional parser argument is an
XMLParser instance, which accepts an encoding argument; as far as I know it
falls back to UTF-8 when the document doesn't declare an encoding.)

So that means using the built-in open(), or io.open() in binary mode.

You open the file with io.open() in text mode, which reads bytes from disk
and *decodes* them from UTF-8 into a Unicode string. The ET parser, which
expects bytes, then tries to decode that Unicode string into Unicode. Python2
does that by first *encoding* it back to bytes using the default encoding
(namely ASCII), and that's where it blows up.

This particular error is a Python2-ism, since Python2 tries hard to let you
mix byte strings and unicode strings together, hence it will try implicitly
encoding/decoding strings to try to get them to fit together. Python3 does
not do this.

You can easily simulate this error at the REPL:

py> u"µ".decode('utf-8')
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/usr/local/lib/python2.7/encodings/utf_8.py", line 16, in decode
    return codecs.utf_8_decode(input, errors, True)
UnicodeEncodeError: 'ascii' codec can't encode character u'\xb5' in position 0: ordinal not in range(128)

The give-away is that you're intending to do a *decode* operation but get an
*encode* error. That tells you that Python2 is trying to be helpful :-)

(Remember: Unicode strings encode to bytes, and bytes decode back to
strings.)
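
To make the direction of each operation concrete, here is a tiny round trip.
It's only an illustration; the variable names are mine:

```python
text = u"\xb5"                        # MICRO SIGN as a Unicode string
raw = text.encode("utf-8")            # Unicode string -> bytes
round_trip = raw.decode("utf-8")      # bytes -> Unicode string

assert raw == b"\xc2\xb5"
assert round_trip == text
```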

You're trying to read bytes from a file on disk and get Unicode strings out:

bytes in file --> XML parser --> Unicode

so that counts as a decoding operation. But you're getting an encoding
error -- that's the smoking gun that suggests a dubious Unicode->bytes
step, using the default encoding (ASCII):

bytes in file --> io.open().read() --> Unicode --> XML Parser --> encode to
bytes using ASCII --> decode back to Unicode using UTF-8
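
If you spell out that hidden step by hand, you get the same kind of failure.
This is just a sketch of what I believe Python2 does implicitly, not the
parser's actual code, and the sample data and names are made up:

```python
raw = b"<r>\xc2\xb5</r>"        # bytes as they sit in the file (UTF-8)
text = raw.decode("utf-8")      # what a text-mode io.open() hands back

try:
    # The dubious step: Unicode -> bytes using ASCII.
    text.encode("ascii")
except UnicodeEncodeError as err:
    failed = True
    message = str(err)
else:
    failed = False
    message = ""
```

The non-ASCII character can't be represented in ASCII, so the encode step
fails before any decoding is attempted.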

And that suggests that the fix is to open the file without any charset
processing, i.e. use the builtin open() instead of io.open().

bytes in file --> builtin open().read() --> bytes --> XML Parser --> Unicode

I think you can even skip the 'rb' mode part: the real problem is that you
must not feed a Unicode string to the XML parser.
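
Something along these lines ought to work. It's a self-contained sketch: the
sample document and the temporary file are assumptions of mine standing in
for your real data, which I haven't seen:

```python
import os
import tempfile
import xml.etree.ElementTree as ET

# Hypothetical sample: UTF-8 bytes with no encoding declaration.
data = b"<root><item>\xc2\xb5</item></root>"

fd, fname = tempfile.mkstemp(suffix=".xml")
try:
    with os.fdopen(fd, "wb") as out:
        out.write(data)

    # Binary mode: the parser receives bytes and does the decoding itself.
    with open(fname, "rb") as f:
        root = ET.parse(f).getroot()
finally:
    os.remove(fname)

item_text = root.find("item").text    # the text comes back as Unicode
```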

> but that had no effect. Then, when calling xml.etree.ElementTree.parse I
> included an XMLParser object:
> 
> parser = xml.etree.ElementTree.XMLParser(encoding="utf8")
> root = xml.etree.ElementTree.parse(f, parser=parser).getroot()

That's the default, so there's no functional change here. Hence, the same
error.
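
A quick way to convince yourself, assuming the parser is fed bytes this time
(the sample data is made up, and I've used the spelling "utf-8"):

```python
import xml.etree.ElementTree as ET

data = b"<root>\xc2\xb5</root>"   # UTF-8 bytes, no encoding declaration

# Parse once with the default parser, once with an explicit encoding.
default_root = ET.fromstring(data)
explicit_root = ET.fromstring(data, parser=ET.XMLParser(encoding="utf-8"))

same = default_root.text == explicit_root.text
```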

-- 
Steve
“Cheer up,” they said, “things could be worse.” So I cheered up, and sure
enough, things got worse.