[Tutor] encoding question

Steven D'Aprano steve at pearwood.info
Sun Jan 5 11:55:38 CET 2014


On Sat, Jan 04, 2014 at 11:57:20PM -0800, Alex Kleider wrote:

> Well, I've tried the xml approach which seems promising but still I get 
> an encoding related error.
> Is there a bug in the xml.etree module (not very likely, me thinks) or 
> am I doing something wrong?

I'm no expert on XML, but it looks to me like it is a bug in 
ElementTree. It doesn't appear to handle unicode strings correctly 
(although perhaps it doesn't promise to).

A simple demonstration using Python 2.7:

py> import xml.etree.ElementTree as ET
py> ET.fromstring(u'<xml>a</xml>')
<Element 'xml' at 0xb7ca982c>

But:

py> ET.fromstring(u'<xml>á</xml>')
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/usr/local/lib/python2.7/xml/etree/ElementTree.py", line 1282, in XML
    parser.feed(text)
  File "/usr/local/lib/python2.7/xml/etree/ElementTree.py", line 1622, in feed
    self._parser.Parse(data, 0)
UnicodeEncodeError: 'ascii' codec can't encode character u'\xe1' in position 5: ordinal not in range(128)
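
The 'ascii' codec in that message suggests that somewhere along the way 
the unicode string is being implicitly encoded to ASCII before the parser 
sees it. I haven't traced exactly where, so treat that as an assumption, 
but the same error at the same position can be reproduced with no XML 
involved at all:

```python
# -*- coding: utf-8 -*-
# Sketch: reproduce the error from the traceback by encoding to ASCII
# explicitly, without going through ElementTree.
text = u'<xml>\xe1</xml>'   # same string as in the demonstration; \xe1 is á

try:
    text.encode('ascii')
    error_position = None
except UnicodeEncodeError as err:
    # err.start is the index of the first character that cannot be encoded
    error_position = err.start

print(error_position)
```

The position reported is that of the á, the first character outside the 
ASCII range.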

An easy work-around:

py> ET.fromstring(u'<xml>á</xml>'.encode('utf-8'))
<Element 'xml' at 0xb7ca9a8c>

although, as I said, I'm no expert on XML and this may lead to errors 
later on.
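
If you want the work-around in one place, it can be wrapped in a small 
function. This is only a minimal sketch, assuming the document carries no 
encoding declaration of its own (in which case an XML parser should treat 
the bytes as UTF-8). The name parse_xml_text is made up for illustration, 
it is not part of ElementTree:

```python
# -*- coding: utf-8 -*-
import xml.etree.ElementTree as ET


def parse_xml_text(text, encoding='utf-8'):
    """Parse XML held in a unicode string by encoding it to bytes first.

    Hypothetical helper, not part of ElementTree. Assumes the XML has no
    encoding declaration that disagrees with `encoding`.
    """
    return ET.fromstring(text.encode(encoding))


elem = parse_xml_text(u'<xml>\xe1</xml>')
print(elem.tag)
print(repr(elem.text))
```

The element's text comes back as a unicode string again, so the encoding 
step only affects what is handed to the parser.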


> There's no denying that the whole encoding issue is still not completely 
> clear to me in spite of having devoted a lot of time to trying to grasp 
> all that's involved.

Have you read Joel On Software's explanation?

http://www.joelonsoftware.com/articles/Unicode.html

It's well worth reading. Start with that, and then ask if you have any 
further questions.


> Here's what I've got:
> 
> alex at x301:~/Python/Parse$ cat ip_xml.py
> #!/usr/bin/env python
> # -*- coding : utf -8 -*-
> # file: 'ip_xml.py'
[...]
>     tree = ET.fromstring(xml)
>     root = tree.getroot()   # Here's where it blows up!!!

I reckon that what you need is to change the first of those two lines to:

    tree = ET.fromstring(xml.encode('latin-1'))

or whatever the encoding is meant to be.
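
One thing to watch for: as far as I can tell, the encoding you pass to 
encode() has to agree with what the XML declares about itself. With no 
declaration the parser assumes UTF-8, so Latin-1 bytes would be rejected. 
A small sketch, using made-up documents rather than your actual data:

```python
# -*- coding: utf-8 -*-
import xml.etree.ElementTree as ET

# Hypothetical document that declares ISO-8859-1 (Latin-1) for itself.
declared = u'<?xml version="1.0" encoding="iso-8859-1"?><xml>\xe1</xml>'

# Encode with the same encoding the declaration names, so the bytes match.
root = ET.fromstring(declared.encode('latin-1'))
print(repr(root.text))

# No declaration: the parser assumes UTF-8, and the Latin-1 byte for á
# is not valid UTF-8, so parsing fails.
try:
    ET.fromstring(u'<xml>\xe1</xml>'.encode('latin-1'))
    mismatch_rejected = False
except ET.ParseError:
    mismatch_rejected = True

print(mismatch_rejected)
```

So 'latin-1' is only the right choice if the XML itself says so. If there 
is no declaration, 'utf-8' is probably the safer thing to try first.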


-- 
Steven

