How to get xml.etree.ElementTree not bomb on invalid characters in XML file ?

Barak, Ron Ron.Barak at lsi.com
Tue May 4 10:11:17 EDT 2010


 
> -----Original Message-----
> From: Stefan Behnel [mailto:stefan_ml at behnel.de] 
> Sent: Tuesday, May 04, 2010 10:24 AM
> To: python-list at python.org
> Subject: Re: How to get xml.etree.ElementTree not bomb on 
> invalid characters in XML file ?
> 
> Barak, Ron, 04.05.2010 09:01:
> >  I'm parsing XML files using ElementTree from xml.etree (see code 
> > below (and attached xml_parse_example.py)).
> >
> > However, I'm coming across input XML files (attached an example:
> > tmp.xml) which include invalid characters, that produce the following
> > traceback:
> >
> > $ python xml_parse_example.py
> > Traceback (most recent call last):
> > xml.parsers.expat.ExpatError: not well-formed (invalid token): line 6,
> > column 34
> 
> I hope you are aware that this means that the input you are 
> parsing is not XML. It's best to reject the file and tell the 
> producers that they are writing broken output files. You 
> should always fix the source, instead of trying to make sense 
> out of broken input in fragile ways.
> 
> 
> > I read the documentation for xml.etree.ElementTree and see that it may
> > take an optional parser parameter, but I don't know what this parser
> > should be - to ignore the invalid characters.
> >
> > Could you suggest a way to call ElementTree, so it won't bomb on these
> > invalid characters ?
> 
> No. The parser in lxml.etree has a 'recover' option that lets 
> it try to recover from input errors, but in general, XML 
> parsers are required to reject non well-formed input.
> 
> Stefan
> 
> 
> 

Hi Stefan,
The XML file seems to be valid XML (all the XML viewers I tried were able to read it).
You can verify this by trying to read the XML example I attached to the original message (attached again here).
In fact, when I view the file in an XML viewer, the offending characters are not shown.
It's just that some of the fields include characters that the parser used by ElementTree seems to choke on.
Bye,
Ron.
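
For anyone reading this thread later, here is a minimal sketch of one possible workaround, not a fix for the underlying problem. It assumes that the "invalid characters" are code points outside the XML 1.0 Char production (for example stray control characters such as 0x01), which the thread does not confirm. The helper name parse_lenient and its encoding parameter are made up for this illustration and are not part of ElementTree. Note that on current Python versions ElementTree raises xml.etree.ElementTree.ParseError rather than the ExpatError shown in the traceback above.

```python
import re
import xml.etree.ElementTree as ET

# Everything NOT allowed by the XML 1.0 "Char" production:
#   #x9 | #xA | #xD | [#x20-#xD7FF] | [#xE000-#xFFFD] | [#x10000-#x10FFFF]
_INVALID_XML_CHARS = re.compile(
    "[^\u0009\u000A\u000D\u0020-\uD7FF\uE000-\uFFFD\U00010000-\U0010FFFF]"
)


def parse_lenient(raw_bytes, encoding="utf-8"):
    """Hypothetical helper: drop characters XML 1.0 forbids, then parse.

    Returns (root_element, number_of_characters_dropped).
    The caller is assumed to know the file's real encoding.
    """
    # Undecodable bytes become U+FFFD, which is a legal XML character.
    text = raw_bytes.decode(encoding, errors="replace")
    # Remove the forbidden code points and count how many were removed.
    cleaned, dropped = _INVALID_XML_CHARS.subn("", text)
    # Still raises ET.ParseError if the document is broken in other ways.
    return ET.fromstring(cleaned), dropped
```

A call such as root, dropped = parse_lenient(open("tmp.xml", "rb").read()) would then return the parsed tree together with a count of removed characters, so the damage can at least be logged. This silently changes the data, so Stefan's advice to get the producer of the files fixed still applies; the sketch only helps when that is not possible.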
-------------- next part --------------
A non-text attachment was scrubbed...
Name: tmp_small.xml
Type: application/xml
Size: 637 bytes
Desc: tmp_small.xml
URL: <http://mail.python.org/pipermail/python-list/attachments/20100504/29577f47/attachment-0001.xml>

