Newbie XML SAX Parsing: How do I ignore an invalid token?

Chris Lambacher chris at kateandchris.net
Fri Jan 5 17:45:45 EST 2007


What exactly is invalid about the XML fragment you provided?
It seems to parse correctly with ElementTree:
>>> from xml.etree import ElementTree as ET
>>> e = ET.fromstring("""
... <cities>
...   <city>
...     <name>Tampa</name>
...     <description>A great city ^^ and place to live</description>
...   </city>
...   <city>
...     <name>Clearwater</name>
...     <description>Beautiful beaches</description>
...   </city>
... </cities>
... """)
>>> print ET.tostring(e)
<cities>
  <city>
    <name>Tampa</name>
    <description>A great city ^^ and place to live</description>
  </city>
  <city>
    <name>Clearwater</name>
    <description>Beautiful beaches</description>
  </city>
</cities>
>>>
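
Note that the "^^" in my copy is two literal caret characters, which are
perfectly legal XML. If your feed really contains a raw control byte there,
the parse will fail. A minimal sketch for finding out exactly where the parser
gives up is below. The helper name find_first_error and the 0x1E byte in the
sample are my own assumptions for illustration, not something taken from your
feed, and the code is written for Python 3.

```python
import io
import xml.sax
from xml.sax.handler import ContentHandler


def find_first_error(data):
    """Return (line, column, message) for the first fatal error, or None.

    `data` is the raw bytes of the document. The default SAX error handler
    raises SAXParseException on a fatal error, which carries the position.
    """
    try:
        xml.sax.parse(io.BytesIO(data), ContentHandler())
    except xml.sax.SAXParseException as exc:
        return exc.getLineNumber(), exc.getColumnNumber(), exc.getMessage()
    return None


# Hypothetical sample: 0x1E stands in for whatever byte the vendor sends.
sample = (b"<cities><city><name>Tampa</name>"
          b"<description>A great city \x1e and place to live</description>"
          b"</city></cities>")

print(find_first_error(sample))
```

Running this against a slice of the real feed should tell you whether you are
dealing with a bad character or a structural problem such as an unclosed tag.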

Do you have invalid characters? Unclosed tags? The solution to each of these
problems is different, so more detail will get you better answers.
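
If it does turn out to be stray control characters, one possible approach
(a sketch, not the only way) is to strip them chunk by chunk while feeding the
parser incrementally, so the file never has to sit in memory or be rewritten
on disk. The names parse_stripping_controls and TextCollector are made up for
this example. It assumes Python 3, a byte stream, and an ASCII-compatible
encoding such as UTF-8 or Latin-1; it would corrupt UTF-16 input. It also
assumes the reader returned by make_parser() supports feed() and close(),
which the standard expat-based one does.

```python
import io
import re
from xml.sax import make_parser
from xml.sax.handler import ContentHandler

# Control bytes that XML 1.0 does not allow: everything below 0x20
# except tab (0x09), line feed (0x0A) and carriage return (0x0D).
_INVALID = re.compile(rb"[\x00-\x08\x0b\x0c\x0e-\x1f]")


def parse_stripping_controls(stream, handler, chunk_size=64 * 1024):
    """Feed `stream` to a SAX parser in chunks, dropping invalid bytes."""
    parser = make_parser()
    parser.setContentHandler(handler)
    while True:
        chunk = stream.read(chunk_size)
        if not chunk:
            break
        # Single-byte removal is safe across chunk boundaries.
        parser.feed(_INVALID.sub(b"", chunk))
    parser.close()  # signals end of input to the parser


class TextCollector(ContentHandler):
    """Toy handler that just gathers character data."""

    def __init__(self):
        ContentHandler.__init__(self)
        self.text = []

    def characters(self, content):
        self.text.append(content)


collector = TextCollector()
data = b"<description>A great city \x1e and place to live</description>"
parse_stripping_controls(io.BytesIO(data), collector, chunk_size=8)
print("".join(collector.text))
```

For a real file you would pass open(path, "rb") instead of the BytesIO object.
This only helps with bad characters; it will not repair unclosed tags or other
structural errors.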

-Chris

On Fri, Jan 05, 2007 at 01:50:18PM -0800, scott at crybabymaternity.com wrote:
> I've got an XML feed from a vendor that is not well-formed, and having
> them change it is not an option.  I'm trying to figure out how to
> create an error-handler that will ignore the invalid token and continue
> on.
> 
> The file is large, so I'd prefer not to put it all in memory or save it
> off and strip out the bad characters before I parse it.
> 
> I've included one of the problematic characters in a small XML snippet
> below.
> 
> I'm new to Python, and I don't know how to accomplish this. Any help is
> greatly appreciated!
> 
> -----------------------------------------------------------------
> 
> Here is my code:
> 
> from xml.sax import make_parser
> from xml.sax.handler import ContentHandler
> import StringIO
> 
> class ErrorHandler:
>     def __init__(self, parser):
>         self.parser = parser
>     def warning(self, msg):
>         print '*** (ErrorHandler.warning) msg:', msg
>     def error(self, msg):
>         print '*** (ErrorHandler.error) msg:', msg
>     def fatalError(self, msg):
>         print msg
> 
> class ContentHandler(ContentHandler):
>     def __init__ (self):
>         pass
>     def startElement(self, name, attrs):
>         pass
>     def characters (self, ch):
>         pass
>     def endElement(self, name):
>         pass
> 
> xmlstr = """
> <cities>
>   <city>
>     <name>Tampa</name>
>     <description>A great city  and place to live</description>
>   </city>
>   <city>
>     <name>Clearwater</name>
>     <description>Beautiful beaches</description>
>   </city>
> </cities>
> """
> parser = make_parser()
> curHandler = ContentHandler()
> errorHandler = ErrorHandler(parser)
> parser.setContentHandler(curHandler)
> parser.setErrorHandler(errorHandler)
> parser.parse(StringIO.StringIO(xmlstr))