Newbie XML SAX Parsing: How do I ignore an invalid token?

scott at crybabymaternity.com scott at crybabymaternity.com
Fri Jan 5 22:30:26 EST 2007


My original posting has a funky line break character (it appears as an
ascii square) that blows up my program, but it may or may not show up
when you view my message.

I was afraid to use element tree, since my xml files can be very long,
and I was concerned about using memory structures to hold all the data.
It is my understanding that SAX reads the file line by line?

Is there a way to account for the invalid token in the error handler? I
don't mind parsing out the bad characters on a case-by-case basis. The
weather data I am ingesting only seems to have this line break
character that the parser doesn't like.

Thanks!

Scott

Chris Lambacher wrote:
> What exactly is invalid about the XML fragment you provided?
> It seems to parse correctly with ElementTree:
> >>> from xml.etree import ElementTree as ET
> >>> e =  ET.fromstring("""
> ... <cities>
> ...   <city>
> ...     <name>Tampa</name>
> ...     <description>A great city ^^ and place to live</description>
> ...   </city>
> ...   <city>
> ...     <name>Clearwater</name>
> ...     <description>Beautiful beaches</description>
> ...   </city>
> ... </cities>
> ... """)
> >>> print ET.tostring(e)
> <cities>
>   <city>
>     <name>Tampa</name>
>     <description>A great city ^^ and place to live</description>
>   </city>
>   <city>
>     <name>Clearwater</name>
>     <description>Beautiful beaches</description>
>   </city>
> </cities>
> >>>
>
> Do you have invalid characters? unclosed tags?  The solution to each of these
> problems is different.  More info will solicit better solutions.
>
> -Chris
>
> On Fri, Jan 05, 2007 at 01:50:18PM -0800, scott at crybabymaternity.com wrote:
> > I've got an XML feed from a vendor that is not well-formed, and having
> > them change it is not an option.  I'm trying to figure out how to
> > create an error-handler that will ignore the invalid token and continue
> > on.
> >
> > The file is large, so I'd prefer not to put it all in memory or save it
> > off and strip out the bad characters before I parse it.
> >
> > I've included one of the problematic characters in a small XML snippet
> > below.
> >
> > I'm new to Python, and I don't know how to accomplish this. Any help is
> > greatly appreciated!
> >
> > -----------------------------------------------------------------
> >
> > Here is my code:
> >
> > from xml.sax import make_parser
> > from xml.sax.handler import ContentHandler
> > import StringIO
> >
> > class ErrorHandler:
> >     def __init__(self, parser):
> >         self.parser = parser
> >     def warning(self, msg):
> >         print '*** (ErrorHandler.warning) msg:', msg
> >     def error(self, msg):
> >         print '*** (ErrorHandler.error) msg:', msg
> >     def fatalError(self, msg):
> >         print msg
> >
> > class ContentHandler(ContentHandler):
> >     def __init__ (self):
> >         pass
> >     def startElement(self, name, attrs):
> >         pass
> >     def characters (self, ch):
> >         pass
> >     def endElement(self, name):
> >         pass
> >
> > xmlstr = """
> > <cities>
> >   <city>
> >     <name>Tampa</name>
> >     <description>A great city  and place to live</description>
> >   </city>
> >   <city>
> >     <name>Clearwater</name>
> >     <description>Beautiful beaches</description>
> >   </city>
> > </cities>
> > """
> > parser = make_parser()
> > curHandler = ContentHandler()
> > errorHandler = ErrorHandler(parser)
> > parser.setContentHandler(curHandler)
> > parser.setErrorHandler(errorHandler)
> > parser.parse(StringIO.StringIO(xmlstr))
> > 
> > -- 
> > http://mail.python.org/mailman/listinfo/python-list




More information about the Python-list mailing list