iterparse and unicode

Sun Aug 24 01:29:09 EDT 2008

George Sakkis wrote:
> It seems xml.etree.cElementTree.iterparse() is not unicode aware:
> 
> >>> from StringIO import StringIO
> >>> from xml.etree.cElementTree import iterparse
> >>> s = u'<name>\u03a0\u03b1\u03bd\u03b1\u03b3\u03b9\u03ce\u03c4\u03b7\u03c2</name>'
> >>> for event, elem in iterparse(StringIO(s)):
> ...     print elem.text
> ...
> Traceback (most recent call last):
>   File "<stdin>", line 1, in <module>
>   File "<string>", line 64, in __iter__
> UnicodeEncodeError: 'ascii' codec can't encode characters in position
> 6-15: ordinal not in range(128)
> 
> Am I using it incorrectly, or does it not currently support unicode?

If you want to parse XML from Python unicode strings, you can use lxml.etree.
The XML specification allows transport protocols and other external sources
to supply encoding information. lxml treats the Python unicode type as such a
transport and reads the internal byte sequence of the unicode string directly.

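To be clear, this does not mean that the parsing happens at the unicode
character level. Parsing XML is about parsing bytes, not characters. That is
also where your traceback comes from: the parser expects bytes, so the unicode
string is implicitly encoded with the default 'ascii' codec first, which fails
for the non-ASCII characters.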
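
If you would rather stay with the standard library, one way around this is to
do the encoding yourself and hand the parser a byte stream. The following is
only a minimal sketch under a few assumptions of mine: it uses
xml.etree.ElementTree instead of cElementTree, io.BytesIO instead of StringIO,
and it picks UTF-8 because the document has no XML declaration, in which case
the parser assumes UTF-8. The names "data" and "texts" are just for
illustration.

```python
import io
from xml.etree.ElementTree import iterparse

# Same document as in the question, held as a text (unicode) string.
s = u'<name>\u03a0\u03b1\u03bd\u03b1\u03b3\u03b9\u03ce\u03c4\u03b7\u03c2</name>'

# Encode explicitly instead of relying on an implicit ASCII conversion.
# Assumption: UTF-8, which matches the parser's default when the document
# carries no XML declaration.
data = s.encode('utf-8')

# iterparse reads bytes from the stream and reports "end" events by default;
# the element text comes back decoded.
texts = [elem.text for event, elem in iterparse(io.BytesIO(data))]
```

If the string did carry an XML declaration naming a different encoding, you
would have to encode to that encoding instead, since the parser will trust the
declaration.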

Stefan