[XML-SIG] parsing xml files delimited with non-xml text

Matt Gushee <mgushee@havenrock.com>
Tue, 23 Apr 2002 12:16:07 -0600


On Tue, Apr 23, 2002 at 01:49:00PM -0400, Daniel Veillard wrote:
> On Tue, Apr 23, 2002 at 11:39:19AM -0600, Matt Gushee wrote:
> > On Tue, Apr 23, 2002 at 11:57:48AM -0500, Brian Birkinbine wrote:
> > > 
> > > I would prefer to use exception handling because my functions to strip out non-xml data
> > > would have to recognize the start of an xml file, and the xml parser already knows
> > > how to detect the start of xml data.
> > 
> > Not really. It *assumes* the input is well-formed XML. No XML parser I
> > know of (except possibly MSXML) is designed to detect XML embedded in
> > non-XML.
> 
>   Actually, the XML specification is relatively clear, the parser cannot
> guess the end of the input:
>     http://www.w3.org/TR/REC-xml#NT-document
>     [1]    document    ::=    prolog element Misc*
> 
>  Misc* indicates that the parser cannot find by itself the end of the
> content, the parser has to be informed of the end of the stream.
> Anything after the root and till this point must conform to Misc*,
> and if not it is actually a fatal error.

And Misc consists of processing instructions, comments, and white space.
Okay ... I'm not sure exactly what your point is, but you seem to be
responding to my statement that the parser *assumes* the input is well-
formed XML. That was probably a poor choice of words: what I really meant
was that XML parsers are designed to accept well-formed XML and reject
anything else. Period.
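
To make that concrete, here is a minimal sketch, assuming Python's
xml.parsers.expat module (the helper name is_well_formed is mine, purely
for illustration). Comments, PIs and white space after the root element
are accepted as Misc*; any other trailing text makes the parser raise:

```python
import xml.parsers.expat


def is_well_formed(data):
    """Return True if expat accepts `data` as one complete XML document.

    Hypothetical helper for illustration; it only checks well-formedness.
    """
    parser = xml.parsers.expat.ParserCreate()
    try:
        # The second argument tells expat that this is the final chunk,
        # i.e. the parser is informed of the end of the stream.
        parser.Parse(data, True)
    except xml.parsers.expat.ExpatError:
        return False
    return True


# Misc* after the root element: comment, PI, white space.
print(is_well_formed(b"<a/><!-- c --><?pi x?>\n"))
# Character data after the root element is not Misc.
print(is_well_formed(b"<a/>trailing text"))
```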

As for Brian's idea, well, I don't see why he has to comply with the spec,
as long as a) he is aware his application is non-compliant, and b) he isn't
distributing it to the public. So it seems to me that, assuming he doesn't
care about whatever Misc* might be present, if the end tag of the root
element is present and he codes the right exception-handling kludges to
deal with the stuff at the end, his approach is feasible, is it not? Or
should we be trying to save him from himself?

-- 
Matt Gushee
Englewood, Colorado, USA
mgushee@havenrock.com
http://www.havenrock.com/