Parsing XML with ElementTree (unicode problem?)

Tue Jul 24 03:40:39 EDT 2007

oren.tsur at gmail.com wrote:
> On Jul 23, 4:46 pm, "Richard Brodie" <R.Bro... at rl.ac.uk> wrote:
>> <oren.t... at gmail.com> wrote in message
>>
>> news:1185200976.082516.105420 at 57g2000hsv.googlegroups.com...
>>
>>> So what's the difference? How come parsing works
>>> in the first case but fails in the second?
>> You may have guessed the encoding wrong. It probably
>> wasn't utf-8 to start with but iso8859-1 or similar.
>> What actual byte value is in the file?
> 
> I tried it with different encodings and it didn't work. Anyway, I
> would expect it to be UTF-8, since the XML response to the Amazon query
> declares UTF-8 (open
> http://ecs.amazonaws.com/onca/xml?Service=AWSECommerceService&AWSAccessKeyId=189P5TE3VP7N9MN0G302&Operation=ItemLookup&ItemId=1400079179&ResponseGroup=Reviews&ReviewPage=166
> 
> in your browser; the first line of the source is <?xml version="1.0"
> encoding="UTF-8"?>).
> 
> The thing is that the parser handles the Amazon response fine when it
> comes straight from the web, but fails to parse the locally saved file.

Then how did you save it to a file? Using your browser? Maybe that messed it
up? Or did you edit it with an editor that doesn't understand UTF-8?
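
To illustrate what I mean, here is a minimal sketch (modern Python 3, standard
library only) of how a re-encoded copy can break a document that parsed fine
off the wire. The sample XML and the try_parse() helper are made up for the
example; I'm only assuming that the saving tool rewrote the bytes as ISO-8859-1
while the declaration still says UTF-8, which may or may not be what happened
in your case.

```python
import xml.etree.ElementTree as ET

# Hypothetical stand-in for the Amazon response: one non-ASCII character (e-acute).
xml_text = ('<?xml version="1.0" encoding="UTF-8"?>\n'
            '<Review><Summary>caf\u00e9</Summary></Review>')

good_bytes = xml_text.encode("utf-8")       # bytes as the server would send them
bad_bytes = xml_text.encode("iso-8859-1")   # assumed: a tool re-encoded the file on save


def try_parse(data):
    """Return the Summary text, or a description of the parse error."""
    try:
        return ET.fromstring(data).findtext("Summary")
    except ET.ParseError as exc:
        return "ParseError: %s" % exc


# Look at the actual byte values around the special character.
print(repr(good_bytes[-30:]))   # contains b'\xc3\xa9' (two-byte UTF-8 sequence)
print(repr(bad_bytes[-29:]))    # contains a lone b'\xe9', invalid as UTF-8

print(try_parse(good_bytes))
print(try_parse(bad_bytes))
```

If the saved file shows a lone byte like 0xE9 where the web response has the
pair 0xC3 0xA9, the declaration and the bytes disagree, and the parser is
right to complain.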

If you want to extract the interesting parts programmatically, you can use
lxml.etree. It's ElementTree-compatible, but it can also parse directly from
HTTP URLs, and it supports XPath for selecting elements.

http://codespeak.net/lxml/
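
A rough sketch of what that could look like. The element names and the sample
document are invented for the example (the real Amazon response uses a
namespace and a different structure), and the URL in the comment is a
placeholder. The fallback import is only there so the sketch still runs
without lxml installed; the .xpath() method is lxml-only.

```python
from io import BytesIO

try:
    from lxml import etree                    # third-party, ElementTree-compatible
except ImportError:
    import xml.etree.ElementTree as etree     # stdlib fallback, no .xpath()

# Hypothetical, simplified stand-in for a review listing (UTF-8 encoded bytes).
SAMPLE = b'''<?xml version="1.0" encoding="UTF-8"?>
<Reviews>
  <Review><Rating>5</Rating><Summary>caf\xc3\xa9 classic</Summary></Review>
  <Review><Rating>2</Rating><Summary>too long</Summary></Review>
</Reviews>'''

# With lxml, etree.parse() also accepts an HTTP URL string directly, e.g.
#   tree = etree.parse("http://example.com/reviews.xml")   # placeholder URL
tree = etree.parse(BytesIO(SAMPLE))
root = tree.getroot()

# Simple path expression, understood by both lxml and the stdlib module.
summaries = [s.text for s in root.findall("Review/Summary")]

if hasattr(root, "xpath"):
    # Full XPath: numeric predicate on a child element, text() node selection.
    top = root.xpath("Review[Rating >= 4]/Summary/text()")
else:
    # Same selection done by hand when only the stdlib module is available.
    top = [r.findtext("Summary") for r in root.findall("Review")
           if int(r.findtext("Rating")) >= 4]

print(summaries)
print(list(top))
```

Parsing from bytes (or straight from the URL) lets the parser honour the
encoding declaration itself, so there is no intermediate save step that could
change the encoding.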

Stefan