[XML-SIG] parsing xml files delimited with non-xml text

Brian Birkinbine bbirkinbine@earthlink.net
Tue, 23 Apr 2002 15:58:21 -0500


I appreciate the feedback on this.  I'm sure my newness is showing here, but I am getting
my question answered, so I don't mind having others save me from myself. :)

Based on the comments I've received from this list, I've decided to pass only
xml-compliant data to the xml parser and rely on other functions to strip out
the non-xml data before xml processing.
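
A minimal sketch of that arrangement, under assumptions of my own: each
embedded document starts with an "<?xml" declaration and ends with the closing
tag of a known root element. The root name "report", the helper names and the
toy handler are hypothetical, chosen only for illustration.

```python
import re
import xml.sax


class NameRecorder(xml.sax.ContentHandler):
    """Toy handler: records the name of every element that is opened."""

    def __init__(self):
        super().__init__()
        self.names = []

    def startElement(self, name, attrs):
        self.names.append(name)


def extract_xml_chunks(text, root="report"):
    """Return every span running from '<?xml' to the next '</root>'.

    Assumption: the root element name is known in advance and is not
    nested inside itself; otherwise the non-greedy match stops too early.
    """
    pattern = re.compile(r"<\?xml.*?</%s\s*>" % re.escape(root), re.DOTALL)
    return pattern.findall(text)


def parse_chunks(text, root="report"):
    """Strip the non-xml text first, then hand each clean chunk to SAX."""
    results = []
    for chunk in extract_xml_chunks(text, root):
        handler = NameRecorder()
        # Assumes the declaration does not name an encoding other than UTF-8.
        xml.sax.parseString(chunk.encode("utf-8"), handler)
        results.append(handler.names)
    return results
```

The point of the split is that the ContentHandler only ever sees well-formed
input, so any exception it raises means a real problem in the xml itself
rather than in the surrounding text.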

I guess I started down this road because my xml handler was a subclass of
xml.sax.ContentHandler, which gave me normal xml processing, and I was hoping
for a way to start/stop the xml checking rather than write another parsing
routine to strip out non-xml data (which implies that the sanitizing routine
knows where the xml data starts and stops).
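
For reference, a small sketch of what the parser does when non-xml text is left
in place, assuming the default expat-based reader behind xml.sax. The class and
function names are hypothetical. Text after the root element raises
SAXParseException only after the root's events have been delivered, while text
before it fails before any event is seen.

```python
import xml.sax


class NameRecorder(xml.sax.ContentHandler):
    """Toy handler: records the name of every element that is opened."""

    def __init__(self):
        super().__init__()
        self.names = []

    def startElement(self, name, attrs):
        self.names.append(name)


def parse_until_error(data):
    """Parse bytes; return (element names seen, exception or None)."""
    handler = NameRecorder()
    error = None
    try:
        xml.sax.parseString(data, handler)
    except xml.sax.SAXParseException as exc:
        # Non-xml text outside the root element is a fatal error.
        error = exc
    return handler.names, error


# Trailing non-xml text: events arrive, then the parser gives up.
names, error = parse_until_error(b"<r><a/></r>\nnot xml at all")
print(names, error is not None)

# Leading non-xml text: the parser fails before any element is seen.
names, error = parse_until_error(b"uptime: 3 days\n<r/>")
print(names, error is not None)
```

So catching the exception can cope with text at the end of a file, but it does
nothing for text in front of the xml, which is why a separate stripping step is
still needed.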

This is a non-public program I am writing to help automate a lot of manual
processes for me on a large number of UNIX systems.

Thanks again,
---
Brian Birkinbine <bbirkinbine@earthlink.net>
GnuPG/PGP Key: 0x37D55FF6

> > > > I would prefer to use exception handling because my functions to strip out non-xml data
> > > > would have to recognize the start of an xml file, and the xml parser already knows
> > > > how to detect the start of xml data.
> > > 
> > > Not really. It *assumes* the input is well-formed XML. No XML parser I
> > > know of (except possibly MSXML) is designed to detect XML embedded in
> > > non-XML.
> > 
> >   Actually, the XML specification is relatively clear, the parser cannot
> > guess the end of the input:
> >     http://www.w3.org/TR/REC-xml#NT-document
> >     [1]    document    ::=    prolog element Misc*
> > 
> >  Misc* indicates that the parser cannot find by itself the end of the
> > content, the parser has to be informed of the end of the stream.
> > Anything after the root and till this point must conform to Misc*,
> > and if not it is actually a fatal error.
> 
> And Misc consists of processing instructions, comments, and white space.
> Okay ... I'm not sure exactly what your point is, but you seem to be
> responding to my statement that the parser *assumes* the input is well-
> formed XML. That was probably a poor choice of words: what I really meant
> was that XML parsers are designed to accept well-formed XML and reject
> anything else. Period.
> 
> As for Brian's idea, well, I don't see why he has to comply with the spec,
> as long as a) he is aware his application is non-compliant, and b) he isn't
> distributing it to the public. So it seems to me that assuming he doesn't
> care about whatever Misc* might be present, if the end tag of the root
> element is present and he codes the right exception-handling kludges to
> deal with the stuff at the end, his approach is feasible, is it not? Or
> should we be trying to save him from himself?