Is it possible to consume UTF-8 XML documents using xml.dom.pulldom?

Simon Willison simon at simonwillison.net
Wed Jul 30 10:32:16 EDT 2008


I'm having a horrible time trying to get xml.dom.pulldom to consume a
UTF-8 encoded XML file. Here's what I've tried so far:

>>> xml_utf8 = """<?xml version="1.0" encoding="UTF-8" ?>
<msg>Simon\xe2\x80\x99s XML nightmare</msg>
"""
>>> from xml.dom import pulldom
>>> parser = pulldom.parseString(xml_utf8)
>>> parser.next()
('START_DOCUMENT', <xml.dom.minidom.Document instance at 0x6f06c0>)
>>> parser.next()
('START_ELEMENT', <DOM Element: msg at 0x6f0710>)
>>> parser.next()
...
UnicodeEncodeError: 'ascii' codec can't encode character u'\u2019' in position 21: ordinal not in range(128)

xml.dom.minidom can handle the string just fine:

>>> from xml.dom import minidom
>>> dom = minidom.parseString(xml_utf8)
>>> dom.toxml()
u'<?xml version="1.0" ?><msg>Simon\u2019s XML nightmare</msg>'

If I pass a unicode string to pulldom instead of a UTF-8 encoded
bytestring, it still breaks:

>>> xml_unicode = u'<?xml version="1.0" ?><msg>Simon\u2019s XML nightmare</msg>'
>>> parser = pulldom.parseString(xml_unicode)
...
/System/Library/Frameworks/Python.framework/Versions/2.5/lib/python2.5/xml/dom/pulldom.py in parseString(string, parser)
    346
    347     bufsize = len(string)
--> 348     buf = StringIO(string)
    349     if not parser:
    350         parser = xml.sax.make_parser()
UnicodeEncodeError: 'ascii' codec can't encode character u'\u2019' in position 32: ordinal not in range(128)
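The traceback points at the StringIO(string) call, which suggests the unicode string is being implicitly encoded as ASCII there. A hedged sketch of one way around that is to encode the text to UTF-8 bytes yourself and hand pulldom a byte stream. It assumes Python 3, that the document carries no conflicting encoding declaration, and a hypothetical helper name events_from_text.

```python
import io
from xml.dom import pulldom

xml_unicode = u'<?xml version="1.0" ?><msg>Simon\u2019s XML nightmare</msg>'


def events_from_text(text):
    """Return a pulldom event stream over text encoded as UTF-8 bytes.

    Hypothetical helper; assumes the XML declaration does not name a
    different encoding (with none given, XML parsers default to UTF-8).
    """
    return pulldom.parse(io.BytesIO(text.encode("utf-8")))


messages = []
stream = events_from_text(xml_unicode)
for event, node in stream:
    if event == pulldom.START_ELEMENT and node.tagName == "msg":
        # expandNode pulls the rest of the element into a small DOM subtree.
        stream.expandNode(node)
        # Join the text children, since the text may arrive as several nodes.
        messages.append("".join(
            child.data for child in node.childNodes
            if child.nodeType == child.TEXT_NODE))
```

Encoding up front keeps the byte/text boundary explicit, so no implicit ASCII conversion is left to the library.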

Is it possible to consume UTF-8 or unicode input using xml.dom.pulldom, or
should I try something else?

Thanks,

Simon Willison


