ElementTree XML parsing problem

Wed Apr 27 15:24:52 EDT 2011

On 2011-04-27, Mike <Mike at invalid.invalid> wrote:
> I'm using ElementTree to parse an XML file, but it stops at the
> second record (id = 002), which contains a non-standard ascii
> character, ä. Here's the XML:
>
><?xml version="1.0"?>
><snapshot time="Mon Apr 25 08:47:23 PDT 2011">
><records>
><record id="001" education="High School" employment="7 yrs" />
><record id="002" education="Universität Bremen" employment="3 years" />
><record id="003" education="River College" employment="5 yrs" />
></records>
></snapshot>
>
> The complaint offered up by the parser is
>
> Unexpected error opening simple_fail.xml: not well-formed
> (invalid token): line 5, column 40

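The document isn't well-formed XML, as another poster indicated:
with no encoding declaration the parser assumes UTF-8, and the
non-ASCII character in record 002 apparently isn't valid UTF-8.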

> and if I change the line to eliminate the ä, everything is
> wonderful. The parser is perfectly happy with this
> modification:
>
> <record id="002" education="University Bremen" employment="3 yrs" />
>
> I can't find anything in the ElementTree docs about allowing
> additional text characters or coercing strange ascii to
> Unicode.

If you're not the one generating that bogus file, you can
override the encoding yourself by passing an XMLParser to parse().

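  import xml.etree.ElementTree as etree

  # Open in binary mode so the parser, not open(), decodes the bytes.
  with open('file.xml', 'rb') as xml_file:
    # Override the missing encoding declaration with ISO-8859-1.
    parser = etree.XMLParser(encoding='ISO-8859-1')
    root = etree.parse(xml_file, parser=parser).getroot()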
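
To check this without touching the real file, here is a minimal sketch
that reproduces the failure and the fix in memory. It assumes the file
is really ISO-8859-1 (a guess from the symptom, not something the
original post confirms). The names xml_bytes, default_parse_failed and
educations are made up for illustration, and the records are a trimmed
copy of the ones quoted above.

```python
import io
import xml.etree.ElementTree as etree

# Hypothetical stand-in for the file on disk: trimmed records encoded as
# ISO-8859-1 with no encoding declaration, so the a-umlaut ('\xe4') is
# stored as the single byte 0xE4.
xml_bytes = (
    '<?xml version="1.0"?>\n'
    '<records>\n'
    '<record id="001" education="High School" />\n'
    '<record id="002" education="Universit\xe4t Bremen" />\n'
    '</records>\n'
).encode('iso-8859-1')

# Without an override the parser assumes UTF-8 and rejects the 0xE4 byte.
try:
    etree.fromstring(xml_bytes)
    default_parse_failed = False
except etree.ParseError:
    default_parse_failed = True

# With the override the same bytes parse and the attribute decodes cleanly.
parser = etree.XMLParser(encoding='ISO-8859-1')
root = etree.parse(io.BytesIO(xml_bytes), parser=parser).getroot()
educations = [rec.get('education') for rec in root.iter('record')]
```

If the override produces mangled characters instead of an error, the
file is probably in some other single-byte encoding and the guess
needs revisiting.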

-- 
Neil Cerutti