ElementTree XML parsing problem

Wed Apr 27 16:43:20 EDT 2011

On 4/27/2011 12:24 PM, Neil Cerutti wrote:
> On 2011-04-27, Mike<Mike at invalid.invalid>  wrote:
>> I'm using ElementTree to parse an XML file, but it stops at the
>> second record (id = 002), which contains a non-standard ascii
>> character, ä. Here's the XML:
>>
>> <?xml version="1.0"?>
>> <snapshot time="Mon Apr 25 08:47:23 PDT 2011">
>> <records>
>> <record id="001" education="High School" employment="7 yrs" />
>> <record id="002" education="Universität Bremen" employment="3 years" />
>> <record id="003" education="River College" employment="5 yrs" />
>> </records>
>> </snapshot>
>>
>> The complaint offered up by the parser is
>>
>> Unexpected error opening simple_fail.xml: not well-formed
>> (invalid token): line 5, column 40
>
> It seems to be an invalid XML document, as another poster
> indicated.
>
>> and if I change the line to eliminate the ä, everything is
>> wonderful. The parser is perfectly happy with this
>> modification:
>>
>> <record id="002" education="University Bremen" employment="3
>> yrs" />
>>
>> I can't find anything in the ElementTree docs about allowing
>> additional text characters or coercing strange ascii to
>> Unicode.
>
> If you're not the one generating that bogus file, then you can
> specify the encoding yourself by creating an XMLParser.
>
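>    import xml.etree.ElementTree as etree
>    # Open in binary mode so the parser, not open(), decodes the bytes.
>    with open('file.xml', 'rb') as xml_file:
>      # Override the encoding that the file fails to declare.
>      parser = etree.XMLParser(encoding='ISO-8859-1')
>      root = etree.parse(xml_file, parser=parser).getroot()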
>

Thanks, Neil. I'm not generating the file, just trying to parse it. Your 
solution is precisely what I was looking for, even if I didn't quite ask 
correctly. I appreciate the help!
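
For the archive, here is a minimal self-contained sketch of the failure and
of Neil's fix. It assumes the file stores the ä as the single ISO-8859-1 byte
0xE4, which is only a guess, since the thread never confirms the real
encoding. With no encoding in the XML declaration the parser falls back to
UTF-8, and a lone 0xE4 is not valid UTF-8. The names XML_BYTES, parse_default
and parse_latin1 are made up for this illustration.

```python
import io
import xml.etree.ElementTree as etree

# Hypothetical stand-in for the file in the thread: the a-umlaut is stored as
# the single Latin-1 byte 0xE4 and the declaration names no encoding.
XML_BYTES = (
    '<?xml version="1.0"?>\n'
    '<snapshot time="Mon Apr 25 08:47:23 PDT 2011">\n'
    '<records>\n'
    '<record id="001" education="High School" employment="7 yrs" />\n'
    '<record id="002" education="Universit\u00e4t Bremen" employment="3 years" />\n'
    '<record id="003" education="River College" employment="5 yrs" />\n'
    '</records>\n'
    '</snapshot>\n'
).encode('iso-8859-1')


def parse_default(data):
    """Parse with the default parser, which assumes UTF-8 for this input."""
    return etree.parse(io.BytesIO(data)).getroot()


def parse_latin1(data):
    """Parse with the encoding overridden, as Neil suggested."""
    parser = etree.XMLParser(encoding='ISO-8859-1')
    return etree.parse(io.BytesIO(data), parser=parser).getroot()


# The default parser rejects the stray byte with a ParseError.
try:
    parse_default(XML_BYTES)
    default_error = None
except etree.ParseError as exc:
    default_error = exc

# With the override, all three records come through.
root = parse_latin1(XML_BYTES)
educations = [rec.get('education') for rec in root.iter('record')]
```

If the real file turns out to use some other encoding (Windows-1252, say),
the same override should work with that name in place of 'ISO-8859-1'.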

-- Mike --