elementtree XML() unicode

Tue Nov 3 21:06:58 EST 2009

On Nov 3, 2009, at 5:27 PM, John Machin wrote:

> On Nov 4, 11:01 am, Kee Nethery <k... at kagi.com> wrote:
>> Having an issue with elementtree XML() in python 2.6.4.
>>
>> This code works fine:
>>
>>       from xml.etree import ElementTree as et
>>       getResponse = u'''<?xml version="1.0" encoding="UTF-8"?>
>> <customer><shipping><state>bobble</state><city>head</city>
>> <street>city</street></shipping></customer>'''
>>       theResponseXml = et.XML(getResponse)
>>
>> This code errors out when it tries to do the et.XML()
>>
>>       from xml.etree import ElementTree as et
>>       getResponse = u'''<?xml version="1.0" encoding="UTF-8"?>
>> <customer><shipping><state>\ue58d83\ue89189\ue79c8C</state>
>> <city>\ue69f8f\ue5b882</city><street>\ue9ab98\ue58d97\ue58fb03</street>
>> </shipping></customer>'''
>>       theResponseXml = et.XML(getResponse)
>>
>> In my real code, I'm pulling the getResponse data from a web page that
>> returns as XML and when I display it in the browser you can see the
>> Japanese characters in the data. I've removed all the stuff in my code
>> and tried to distill it down to just what is failing. Hopefully I have
>> not removed something essential.
>>
>> Why is this not working and what do I need to do to use Elementtree
>> with unicode?
>
> What you need to do is NOT feed it unicode. You feed it a str object
> and it gets decoded according to the encoding declaration found in the
> first line.

That it uses "the encoding declaration found in the first line" is the  
nugget of data that is not in the documentation that has stymied me  
for days. Thank you!
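
For anyone reading this later, here is a minimal sketch of that advice. It assumes the response body arrives as UTF-8 encoded bytes that match the encoding named in the XML declaration. The name "raw" is a hypothetical stand-in for that body, and the city text is just the two characters that show up in my real data.

```python
from xml.etree import ElementTree as et

# Hypothetical stand-in for the raw response body: UTF-8 encoded bytes,
# matching the encoding named in the XML declaration.
raw = (
    u'<?xml version="1.0" encoding="UTF-8"?>'
    u'<customer><shipping><city>\u67cf\u5e02</city></shipping></customer>'
).encode('utf-8')

# Hand the parser the encoded bytes, not a unicode object; it decodes
# them according to the declaration in the first line.
root = et.XML(raw)

# Element text comes back as decoded text, not as UTF-8 bytes.
city = root.find('shipping/city')
print(city.text == u'\u67cf\u5e02')
```

In real code the bytes would come straight from the web response, with no trip through unicode and back.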

The other thing that has been confusing is that I've been using "dump"
to view what is in the elementtree instance, and the non-ASCII
characters have been displayed as "numbered entities"
(<city>&#26575;&#24066;</city>). I know that is not the representation
I want the data to be in. A co-worker suggested that instead of "dump"
I use "et.tostring(theResponseXml, encoding='utf-8')" and then print
the result to see the characters. Doing that causes the non-ASCII
characters to display as the glyphs I know them to be.
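
A small sketch of the difference, as far as I understand it: when no encoding is given, tostring() serializes to ASCII and writes each non-ASCII character as a numeric character reference, which is the same form I was seeing from dump in 2.6. Asking for UTF-8 writes the encoded characters themselves. The element here is built by hand purely for illustration.

```python
from xml.etree import ElementTree as et

# Build a one-element tree by hand; the text is two non-ASCII characters.
city = et.Element('city')
city.text = u'\u67cf\u5e02'

# No encoding argument: ASCII output, with non-ASCII characters written
# as numeric character references.
ascii_bytes = et.tostring(city)

# Explicit UTF-8: the characters are written as UTF-8 encoded bytes.
utf8_bytes = et.tostring(city, encoding='utf-8')

print(ascii_bytes)
# Decoding the UTF-8 result gives back text containing the real characters.
print(utf8_bytes.decode('utf-8') != ascii_bytes.decode('ascii'))
```

Both results are the same data. Only the serialization differs, so the numbered entities do not mean the parse went wrong.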

If there were a place in the official docs for me to append these
nuggets of information to the sections for
"xml.etree.ElementTree.XML(text)" and
"xml.etree.ElementTree.dump(elem)", I would absolutely do so.

Thank you!
Kee Nethery

> So take the str object that you get from the web (should
> be UTF8-encoded already unless the header is lying), and throw that at
> ET ... like this:
>
> | Python 2.6.4 (r264:75708, Oct 26 2009, 08:23:19) [MSC v.1500 32 bit (Intel)] on win32
> | Type "help", "copyright", "credits" or "license" for more information.
> | >>> from xml.etree import ElementTree as et
> | >>> ucode = u'''<?xml version="1.0" encoding="UTF-8"?>
> | ... <customer><shipping>
> | ... <state>\ue58d83\ue89189\ue79c8C</state>
> | ... <city>\ue69f8f\ue5b882</city>
> | ... <street>\ue9ab98\ue58d97\ue58fb03</street>
> | ... </shipping></customer>'''
> | >>> xml= et.XML(ucode)
> | Traceback (most recent call last):
> |   File "<stdin>", line 1, in <module>
> |   File "C:\python26\lib\xml\etree\ElementTree.py", line 963, in XML
> |     parser.feed(text)
> |   File "C:\python26\lib\xml\etree\ElementTree.py", line 1245, in feed
> |     self._parser.Parse(data, 0)
> | UnicodeEncodeError: 'ascii' codec can't encode character u'\ue58d' in position 69: ordinal not in range(128)
> | # as expected
> | >>> strg = ucode.encode('utf8')
> | # encoding as utf8 is for DEMO purposes.
> | # i.e. use the original web str object, don't convert it to unicode
> | # and back to utf8.
> | >>> xml2 = et.XML(strg)
> | >>> xml2.tag
> | 'customer'
> | >>> for c in xml2.getchildren():
> | ...    print c.tag, repr(c.text)
> | ...
> | shipping '\n'
> | >>> for c in xml2[0].getchildren():
> | ...    print c.tag, repr(c.text)
> | ...
> | state u'\ue58d83\ue89189\ue79c8C'
> | city u'\ue69f8f\ue5b882'
> | street u'\ue9ab98\ue58d97\ue58fb03'
> | >>>
>
> By the way: (1) it usually helps to be more explicit than "errors
> out", preferably the exact copied/pasted output as shown above; this
> is one of the rare cases where the error message is predictable (2)
> PLEASE don't start a new topic in a reply in somebody else's thread.
>
