encoding="utf8" ignored when parsing XML

Skip Montanaro skip.montanaro at gmail.com
Tue Dec 27 10:05:37 EST 2016


I am trying to parse some XML which doesn't specify an encoding (Python 2.7.12 via Anaconda on RH Linux), so it barfs when it encounters non-ASCII data. No great surprise there, but I'm having trouble getting it to use another encoding. First, I tried specifying the encoding when opening the file:

import io
import xml.etree.ElementTree

f = io.open(fname, encoding="utf8")
root = xml.etree.ElementTree.parse(f).getroot()

but that had no effect. Then, when calling xml.etree.ElementTree.parse I included an XMLParser object:

parser = xml.etree.ElementTree.XMLParser(encoding="utf8")
root = xml.etree.ElementTree.parse(f, parser=parser).getroot()

Same old, same old:

unicode error 'ascii' codec can't encode characters in position 1113-1116: ordinal not in range(128)

So, why does it continue to insist on using an ASCII codec? My locale's preferred encoding is:

>>> locale.getpreferredencoding()
'ANSI_X3.4-1968'

which I presume is the official way to spell "ascii".
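That presumption can be checked against Python's codec registry. The sketch below only asks the standard codecs module what canonical codec the locale name resolves to; the variable names are illustrative and not from the original session.

```python
import codecs

# The name reported by locale.getpreferredencoding() in the session above.
locale_name = "ANSI_X3.4-1968"

# codecs.lookup() resolves aliases and returns a CodecInfo whose .name
# attribute is the canonical codec name.
canonical_name = codecs.lookup(locale_name).name
print(canonical_name)
```

If the canonical name comes back as ascii, the locale setting and the 'ascii' codec named in the error message are the same codec.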

The chardetect command (part of the chardet package) tells me it looks like utf8 with high confidence:

% chardetect < ~/tmp/trash
<stdin>: utf-8 with confidence 0.99

I took a look at the code, and tracked the encoding I specified all the way down to the creation of the expat parser. What am I missing?

Skip
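
A possible explanation, offered as an assumption rather than something established in the post: under Python 2, io.open(fname, encoding="utf8") hands the parser already-decoded unicode strings, and the expat binding may implicitly re-encode them with the default ASCII codec, which would account for an 'ascii' codec can't encode error regardless of the encoding given to XMLParser. Under that assumption, a minimal sketch is to give the parser undecoded bytes and let XMLParser(encoding=...) do the decoding. The helper name parse_utf8_xml and the sample document are hypothetical.

```python
import io
import xml.etree.ElementTree as ET


def parse_utf8_xml(binary_stream):
    """Parse XML from a stream of raw bytes, forcing UTF-8 decoding.

    Hypothetical helper: the stream must yield bytes, not decoded text,
    so that expat performs the decoding itself.
    """
    parser = ET.XMLParser(encoding="utf-8")
    return ET.parse(binary_stream, parser=parser).getroot()


# Sample document with no encoding declaration and one non-ASCII character.
sample = u"<doc><name>caf\u00e9</name></doc>".encode("utf-8")

# io.BytesIO stands in for a file opened in binary mode.
root = parse_utf8_xml(io.BytesIO(sample))
name = root.find("name").text
```

With a real file, the equivalent would be to open it in binary mode, for example open(fname, "rb"), instead of passing encoding="utf8" to io.open.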


