elementtree XML() unicode

Tue Nov 3 19:44:54 EST 2009

On Tue, 03 Nov 2009 21:01:46 -0300, Kee Nethery <kee at kagi.com> wrote:

> Having an issue with elementtree XML() in python 2.6.4.
>
> This code works fine:
>
>       from xml.etree import ElementTree as et
>       getResponse = u'''<?xml version="1.0" encoding="UTF-8"?>
> <customer><shipping><state>bobble</state><city>head</city>
> <street>city</street></shipping></customer>'''
>       theResponseXml = et.XML(getResponse)
>
> This code errors out when it tries to do the et.XML()
>
>       from xml.etree import ElementTree as et
>       getResponse = u'''<?xml version="1.0" encoding="UTF-8"?>
> <customer><shipping><state>\ue58d83\ue89189\ue79c8C</state>
> <city>\ue69f8f\ue5b882</city>
> <street>\ue9ab98\ue58d97\ue58fb03</street></shipping></customer>'''
>       theResponseXml = et.XML(getResponse)
>
> In my real code, I'm pulling the getResponse data from a web page that  
> returns as XML and when I display it in the browser you can see the  
> Japanese characters in the data. I've removed all the stuff in my code  
> and tried to distill it down to just what is failing. Hopefully I have  
> not removed something essential.
>
> Why is this not working and what do I need to do to use Elementtree with  
> unicode?

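et expects bytes as input (a plain str in Python 2), not unicode. You're
decoding too early (decoding early is usually good, but not in this case,
because et decodes the document itself, using the encoding named in the
XML declaration). Your first example only works because its text is pure
ASCII. Either feed et.XML the raw bytes before decoding, or re-encode the
received XML text as UTF-8 (since this is the declared encoding).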
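
A minimal sketch of the second option, assuming the response has already
been decoded to unicode somewhere upstream. The sample document and the
names response_text and response_bytes are made up for illustration; they
are not taken from your real code.

```python
# -*- coding: utf-8 -*-
from xml.etree import ElementTree as et

# Hypothetical sample: stands in for XML text that was already decoded
# to unicode before reaching the parser.
response_text = (
    u'<?xml version="1.0" encoding="UTF-8"?>'
    u'<customer><shipping>'
    u'<state>\u5343\u8449\u770c</state>'
    u'<city>\u67cf\u5e02</city>'
    u'</shipping></customer>'
)

# Re-encode to the encoding named in the XML declaration, so the parser
# receives bytes and does the decoding itself.
response_bytes = response_text.encode('utf-8')
root = et.XML(response_bytes)

# Element text comes back as unicode, ready to use.
state = root.findtext('shipping/state')
city = root.findtext('shipping/city')
```

If you still have the raw bytes from the web request at hand, passing
those straight to et.XML() avoids the decode/encode round trip entirely.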

-- 
Gabriel Genellina