elementtree XML() unicode

John Machin sjmachin at lexicon.net
Tue Nov 3 20:56:38 EST 2009


On Nov 4, 12:14 pm, Kee Nethery <k... at kagi.com> wrote:
> On Nov 3, 2009, at 4:44 PM, Gabriel Genellina wrote:
>
> > En Tue, 03 Nov 2009 21:01:46 -0300, Kee Nethery <k... at kagi.com>  
> > escribió:
>
> >> I've removed all the stuff in my code and tried to distill it down  
> >> to just what is failing. Hopefully I have not removed something  
> >> essential.
>
> Sounds like I did remove something essential.

No, you added something that was not only inessential but caused
trouble.

> > et expects bytes as input, not unicode. You're decoding too early  
> > (decoding early is good, but not in this case, because et does the  
> > work for you). Either feed et.XML with the bytes before decoding, or  
> > reencode the received xml text in UTF-8 (since this is the declared  
> > encoding).
>
> Here is the code that hits the URL:
>         getResponse1 = urllib2.urlopen(theUrl)
>         getResponse2 = getResponse1.read()
>         getResponse3 = unicode(getResponse2,'UTF-8')
>         theResponseXml = et.XML(getResponse3)
>
> So are you saying I want to do:
>         getResponse1 = urllib2.urlopen(theUrl)
>         getResponse4 = getResponse1.read()
>         theResponseXml = et.XML(getResponse4)

You got the essence. Note that this in no way implies any approval of
your naming convention :-)
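
A minimal sketch of the two routes Gabriel described, assuming a
hard-coded byte string as a stand-in for what getResponse1.read() would
return. The sample document and the variable names are made up for
illustration; the code should run on Python 2.7 as well as Python 3:

```python
import xml.etree.ElementTree as et

# Hypothetical stand-in for getResponse1.read(): UTF-8 encoded bytes,
# exactly as they would arrive from the server.
raw_bytes = (b'<?xml version="1.0" encoding="UTF-8"?>'
             b'<address><state>\xe5\x8d\x83\xe8\x91\x89\xe7\x9c\x8c</state>'
             b'</address>')

# Route 1: hand the undecoded bytes straight to et.XML and let the
# parser decode them according to the XML declaration.
root_from_bytes = et.XML(raw_bytes)

# Route 2: the text was already decoded somewhere upstream, so encode it
# back into the declared encoding (UTF-8) before parsing.
decoded_text = raw_bytes.decode('utf-8')
root_from_reencoded = et.XML(decoded_text.encode('utf-8'))

state_a = root_from_bytes.find('state').text
state_b = root_from_reencoded.find('state').text
```

Both routes yield the same text; route 1 simply skips a decode/encode
round trip that buys nothing.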

> The reason I am confused is that getResponse2 is classified as an  
> "str" in the Komodo IDE. I want to make sure I don't lose the non-
> ASCII characters coming from the URL.

str is all about 8-bit bytes. Your data comes from the web in 8-bit
bytes. No problem. Just don't palpate it unnecessarily.

> If I do the second set of code,  
> does elementtree auto convert the str into unicode?

Yes. See the example I gave in my earlier posting:

| ...    print c.tag, repr(c.text)
| state u'\u5343\u8449\u770c'

That first u means the type is unicode.
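
The auto-conversion can be checked without a network connection. The
following is only a sketch: the sample bytes and the names sample,
text_type and rows are assumptions for illustration, not part of your
code.

```python
import xml.etree.ElementTree as et

# Hypothetical sample: UTF-8 bytes with one ASCII-only and one
# non-ASCII element.
sample = (b'<?xml version="1.0" encoding="UTF-8"?>'
          b'<address><city>Chiba</city>'
          b'<state>\xe5\x8d\x83\xe8\x91\x89\xe7\x9c\x8c</state></address>')

root = et.XML(sample)

# The text type: unicode on Python 2, str on Python 3.
text_type = type(u'')

# Walk the children and show each tag with the repr of its text.
rows = [(child.tag, child.text) for child in root]
for tag, text in rows:
    print("%s %r" % (tag, text))

state_text = root.find('state').text
is_text = isinstance(state_text, text_type)
```

The non-ASCII characters survive the trip: state_text compares equal to
the three-character string u'\u5343\u8449\u770c'.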

> How do I deal with  
> the XML as unicode when I put it into elementtree as a string?

That's unfortunately rather ambiguous: (1) is "put" past or present
tense? (2) does "string" mean unicode or str? (3) what is the referent
of "it"?

All text in what et returns is unicode [*], so you read it out as
unicode (see above example) and write it as unicode if you want to
change it:

    your_element.text = u'a unicode object'

[*] As an "optimisation", et stores strings as str objects if they
contain only ASCII bytes (and are thus losslessly convertible to
unicode). In preparation for running your code under Python 3.X, it's
best to ignore this and use unicode constants u'foo' (if you need text
constants at all) even if et would let you get away with 'foo'.
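
A short sketch of the write side, assuming a throwaway document and the
illustrative names serialized and reparsed_text: assign a unicode
object to .text, then ask et for encoded bytes only at the point where
the XML leaves your program.

```python
import xml.etree.ElementTree as et

# Hypothetical starting document; the element names are made up.
root = et.XML(b'<address><state>placeholder</state></address>')

# Change the text using a unicode constant.
root.find('state').text = u'\u6771\u4eac\u90fd'

# Encode once, on the way out. The result is a byte string.
serialized = et.tostring(root, encoding='utf-8')

# Parsing the bytes again gives back the same unicode text.
reparsed_text = et.XML(serialized).find('state').text
```

Inside the program everything stays unicode; bytes appear only at the
input and output boundaries.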

HTH,
John


