iterparse and unicode

Wed Aug 20 19:08:17 EDT 2008

On Aug 21, 8:36 am, George Sakkis <george.sak... at gmail.com> wrote:
> It seems xml.etree.cElementTree.iterparse() is not unicode aware:
>
> >>> from StringIO import StringIO
> >>> from xml.etree.cElementTree import iterparse
> >>> s = u'<name>\u03a0\u03b1\u03bd\u03b1\u03b3\u03b9\u03ce\u03c4\u03b7\u03c2</name>'
> >>> for event,elem in iterparse(StringIO(s)):
> ...     print elem.text
> ...
> Traceback (most recent call last):
>   File "<stdin>", line 1, in <module>
>   File "<string>", line 64, in __iter__
> UnicodeEncodeError: 'ascii' codec can't encode characters in position
> 6-15: ordinal not in range(128)
>
> Am I using it incorrectly or it doesn't currently support unicode ?

Hi George,
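I'm no XML guru by any means, but as far as I understand it, the parser
wants encoded bytes rather than a unicode object. StringIO(s) hands it
unicode, which Python then tries to encode as ASCII, hence the
UnicodeEncodeError. So you would need to encode your text first, UTF-8
say. You can also prepend something like
'<?xml version="1.0" encoding="UTF-8" standalone="yes"?>' to make the
encoding explicit, although XML assumes UTF-8 when no declaration is
present. This appears to be the way XML is, rather than an ElementTree
problem.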

E.g.

>>> from StringIO import StringIO
>>> from xml.etree.cElementTree import iterparse
>>> s = u'<wrapper><name>\u03a0\u03b1</name><digits>01234567</digits></wrapper>'

>>> h = '<?xml version="1.0" encoding="UTF-8" standalone="yes"?>'
>>> xml = h + s.encode('utf8')
>>> for event,elem in iterparse(StringIO(xml)):
...     print elem.tag, repr(elem.text)
...
name u'\u03a0\u03b1'
digits '01234567'
wrapper None
>>>
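
For anyone on Python 3, where cElementTree and the StringIO module are
gone, the same idea can be sketched roughly as below. This is only an
illustration under my own assumptions: the helper name
iter_tags_and_text and its declare_encoding flag are made up for the
example, and it uses xml.etree.ElementTree with io.BytesIO in place of
the modules in the session above. It encodes the text to UTF-8 bytes,
optionally prepends the declaration, and collects what iterparse
yields.

```python
import io
from xml.etree.ElementTree import iterparse

# Hypothetical helper for illustration; not part of ElementTree.
def iter_tags_and_text(text, declare_encoding=True):
    """Encode text as UTF-8 bytes and return (tag, text) pairs from iterparse."""
    payload = text.encode("utf-8")
    if declare_encoding:
        # Same declaration as in the session above, as a bytes literal.
        header = b'<?xml version="1.0" encoding="UTF-8" standalone="yes"?>'
        payload = header + payload
    # iterparse reads from a binary file-like object and by default
    # reports only "end" events, so children come before their parent.
    return [(elem.tag, elem.text)
            for _event, elem in iterparse(io.BytesIO(payload))]

s = "<wrapper><name>\u03a0\u03b1</name><digits>01234567</digits></wrapper>"
with_decl = iter_tags_and_text(s)
without_decl = iter_tags_and_text(s, declare_encoding=False)
print(with_decl)
print(without_decl == with_decl)
```

Both calls should give the same pairs, since UTF-8 is what the parser
assumes when there is no declaration.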

HTH,
John