elementtree XML() unicode

Tue Nov 3 20:27:12 EST 2009

On Nov 4, 11:01 am, Kee Nethery <k... at kagi.com> wrote:
> Having an issue with elementtree XML() in python 2.6.4.
>
> This code works fine:
>
>       from xml.etree import ElementTree as et
>       getResponse = u'''<?xml version="1.0" encoding="UTF-8"?>
> <customer><shipping><state>bobble</state><city>head</city><street>city</street></shipping></customer>'''
>       theResponseXml = et.XML(getResponse)
>
> This code errors out when it tries to do the et.XML()
>
>       from xml.etree import ElementTree as et
>       getResponse = u'''<?xml version="1.0" encoding="UTF-8"?>
> <customer><shipping><state>\ue58d83\ue89189\ue79c8C</state><city>\ue69f8f\ue5b882</city><street>\ue9ab98\ue58d97\ue58fb03</street></shipping></customer>'''
>       theResponseXml = et.XML(getResponse)
>
> In my real code, I'm pulling the getResponse data from a web page that  
> returns as XML and when I display it in the browser you can see the  
> Japanese characters in the data. I've removed all the stuff in my code  
> and tried to distill it down to just what is failing. Hopefully I have  
> not removed something essential.
>
> Why is this not working and what do I need to do to use Elementtree  
> with unicode?

What you need to do is NOT feed it unicode. Feed it a str object
instead; the parser decodes it according to the encoding declaration
found in the first line. So take the str object that you get from the
web (it should already be UTF-8-encoded unless the header is lying),
and throw that at ET ... like this:

| Python 2.6.4 (r264:75708, Oct 26 2009, 08:23:19) [MSC v.1500 32 bit (Intel)] on win32
| Type "help", "copyright", "credits" or "license" for more information.
| >>> from xml.etree import ElementTree as et
| >>> ucode = u'''<?xml version="1.0" encoding="UTF-8"?>
| ... <customer><shipping>
| ... <state>\ue58d83\ue89189\ue79c8C</state>
| ... <city>\ue69f8f\ue5b882</city>
| ... <street>\ue9ab98\ue58d97\ue58fb03</street>
| ... </shipping></customer>'''
| >>> xml= et.XML(ucode)
| Traceback (most recent call last):
|   File "<stdin>", line 1, in <module>
|   File "C:\python26\lib\xml\etree\ElementTree.py", line 963, in XML
|     parser.feed(text)
|   File "C:\python26\lib\xml\etree\ElementTree.py", line 1245, in feed
|     self._parser.Parse(data, 0)
| UnicodeEncodeError: 'ascii' codec can't encode character u'\ue58d' in position 69: ordinal not in range(128)
| # as expected
| >>> strg = ucode.encode('utf8')
| # encoding as utf8 is for DEMO purposes.
| # i.e. use the original web str object, don't convert it to unicode
| # and back to utf8.
| >>> xml2 = et.XML(strg)
| >>> xml2.tag
| 'customer'
| >>> for c in xml2.getchildren():
| ...    print c.tag, repr(c.text)
| ...
| shipping '\n'
| >>> for c in xml2[0].getchildren():
| ...    print c.tag, repr(c.text)
| ...
| state u'\ue58d83\ue89189\ue79c8C'
| city u'\ue69f8f\ue5b882'
| street u'\ue9ab98\ue58d97\ue58fb03'
| >>>
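
For anyone who wants the same idea as a standalone script rather than
an interpreter session, here is a minimal sketch: hand the parser the
undecoded bytes and let the encoding declaration drive the decoding.
The two sample documents are made up for illustration (they are not
the original poster's payload), and the variable names are mine; only
et.XML() and find() come from ElementTree itself.

```python
# -*- coding: utf-8 -*-
from xml.etree import ElementTree as et

# Stand-in for the raw body of a web response: encoded bytes that we
# never decode ourselves. (Illustrative data, assumed for this sketch.)
utf8_bytes = (
    u'<?xml version="1.0" encoding="UTF-8"?>'
    u'<customer><shipping><state>\u5343\u8449\u770c</state></shipping></customer>'
).encode('utf-8')

# The parser reads the declaration and decodes the bytes itself.
root = et.XML(utf8_bytes)
state = root.find('shipping/state').text

# Same idea with a different declared encoding: the declaration,
# not our code, tells the parser which codec to use.
latin1_bytes = (
    u'<?xml version="1.0" encoding="ISO-8859-1"?>'
    u'<city>Z\u00fcrich</city>'
).encode('latin-1')
city = et.XML(latin1_bytes).text

print(repr(state))
print(repr(city))
```

In both cases the element text comes back already decoded, so there is
no need to convert to unicode before parsing.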

By the way: (1) it usually helps to be more explicit than "errors
out"; preferably paste the exact output, as shown above (this is one
of the rare cases where the error message is predictable). (2) PLEASE
don't start a new topic in a reply in somebody else's thread.