iterparse and unicode

Wed Aug 20 20:41:23 EDT 2008

On Wed, 2008-08-20 at 15:36 -0700, George Sakkis wrote:
> It seems xml.etree.cElementTree.iterparse() is not unicode aware:
> 
> >>> from StringIO import StringIO
> >>> from xml.etree.cElementTree import iterparse
> >>> s = u'<name>\u03a0\u03b1\u03bd\u03b1\u03b3\u03b9\u03ce\u03c4\u03b7\u03c2</name>'
> >>> for event,elem in iterparse(StringIO(s)):
> ...     print elem.text
> ...
> Traceback (most recent call last):
>   File "<stdin>", line 1, in <module>
>   File "<string>", line 64, in __iter__
> UnicodeEncodeError: 'ascii' codec can't encode characters in position
> 6-15: ordinal not in range(128)
> 
> Am I using it incorrectly or it doesn't currently support unicode ?
> 
> George

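As iterparse expects a file of encoded bytes as input, passing it a
unicode string is problematic: the text gets implicitly encoded as ASCII
along the way, which is the UnicodeEncodeError you're seeing. If you want
to use iterparse, the simplest way is to encode your string before putting
it into the StringIO object, like so: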

>>> for event,elem in iterparse(StringIO(s.encode('UTF8'))):
...     print elem.text
...

If you encode using UTF-8, you don't need to worry about adding the
<?xml ... ?> encoding declaration as suggested previously, since UTF-8 is
the default encoding for XML.
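
For anyone reading this on a newer Python, here is a rough sketch of the
same idea. It assumes Python 3, where the in-memory file is io.BytesIO and
the module is plain xml.etree.ElementTree; the element_texts helper is just
a name I made up for illustration. It parses the UTF-8 bytes once without
and once with the declaration, to show that both come out the same:

```python
import io
from xml.etree.ElementTree import iterparse

s = '<name>\u03a0\u03b1\u03bd\u03b1\u03b3\u03b9\u03ce\u03c4\u03b7\u03c2</name>'

def element_texts(xml_bytes):
    """Collect the text of each element as iterparse reports its end."""
    # iterparse reads from a file-like object, so wrap the bytes in BytesIO.
    return [elem.text for _event, elem in iterparse(io.BytesIO(xml_bytes))]

# No declaration: the parser falls back to UTF-8, the XML default.
bare = element_texts(s.encode('utf-8'))

# Explicit declaration: redundant for UTF-8, but harmless.
header = '<?xml version="1.0" encoding="UTF-8"?>'
declared = element_texts((header + s).encode('utf-8'))
```

Both lists should hold the single decoded name. If you encode with
anything other than UTF-8 or UTF-16, the declaration stops being optional.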

If you're using unicode extensively, you should consider using lxml, 
which implements the same interface as ElementTree, but handles unicode 
better (though it also doesn't run your example above without first 
encoding the string):
http://codespeak.net/lxml/parsing.html#python-unicode-strings
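
A minimal sketch of what that page describes, assuming Python 3 and that
lxml is installed (if it isn't, the import falls back to the standard
library, which accepts the same call). The string is handed to
fromstring() as text, with no encoding step:

```python
try:
    from lxml import etree  # third-party; assumed to be installed
except ImportError:
    # Fallback for illustration only: same fromstring() call in the stdlib.
    import xml.etree.ElementTree as etree

s = '<name>\u03a0\u03b1\u03bd\u03b1\u03b3\u03b9\u03ce\u03c4\u03b7\u03c2</name>'

# Parse the unicode string directly; note it carries no <?xml ... ?> header.
root = etree.fromstring(s)
name = root.text
```

As far as I know, lxml rejects a unicode string that carries its own
encoding declaration, so this only works for strings without the <?xml
header, as here.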

You may also find the target parser interface to be more accepting of 
unicode than iterparse, though it requires a different parsing interface:
http://codespeak.net/lxml/parsing.html#the-target-parser-interface
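
To give a feel for that interface, here is a small sketch, assuming
Python 3. It uses the standard library's XMLParser so it runs without
extra installs; lxml's etree.XMLParser takes a target object of the same
shape. The TextCollector class is a made-up example, not part of either
library:

```python
from xml.etree.ElementTree import XMLParser

class TextCollector:
    """Hypothetical target object: gathers all character data it is fed."""

    def __init__(self):
        self.chunks = []

    def start(self, tag, attrib):
        pass  # called for each opening tag; nothing to do here

    def end(self, tag):
        pass  # called for each closing tag; nothing to do here

    def data(self, text):
        # May be called several times per element, so collect the pieces.
        self.chunks.append(text)

    def close(self):
        # Whatever this returns is what parser.close() hands back.
        return ''.join(self.chunks)

s = '<name>\u03a0\u03b1\u03bd\u03b1\u03b3\u03b9\u03ce\u03c4\u03b7\u03c2</name>'

parser = XMLParser(target=TextCollector())
parser.feed(s)           # the unicode string goes in as-is
result = parser.close()  # joined text from TextCollector.close()
```

The trade-off is that you get callbacks instead of elements, so you have
to track any state you need (current tag, nesting depth) yourself.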

-- 
John Krukoff <jkrukoff at ltgc.com>
Land Title Guarantee Company