[XML-SIG] parsing xml files delimited with non-xml text

Andy Robinson andy@reportlab.com
Tue, 23 Apr 2002 23:18:32 +0100


> Should I strip out the non-xml data separately into xml-compliant
> pieces before calling the parse routine, or can I use exception
> handling within the xml routines to ignore the non-xml data until
> I see valid xml data?

Does the non-xml data consist of HTML tags (i.e. you
have XML chunks embedded in web pages), or totally
unrelated stuff like
   ================xml begins here=============
?  

If the latter, the pragmatic answer is that string.split,
re and friends will do a pretty good job of cutting
it up.  If the former, I see the temptation to try to
get away with a single parser, but you may be better
off using sgmlop or another non-XML parser to break
things into chunks.  HTML parsers don't choke on singleton
tags, missing quotes and similar sloppiness, whereas an
XML parser has to reject anything that isn't well-formed.
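
For the plain-delimiter case, something along these lines might
be a starting point.  It is only a sketch: the delimiter pattern,
the parse_chunks name and the sample data are all made up for
illustration (I haven't seen the real files), and it uses re plus
the standard library's ElementTree rather than any particular
parser you may already have in place.

```python
import re
import xml.etree.ElementTree as ET

# Assumed delimiter: a whole line of '=' runs around an optional label,
# e.g. "=====xml begins here=====".  Adjust to match the real files.
DELIMITER = re.compile(r"^=+[^=\n]*=+[ \t]*$", re.MULTILINE)

def parse_chunks(text):
    """Split text on delimiter lines and parse each well-formed XML chunk."""
    roots = []
    for chunk in DELIMITER.split(text):
        chunk = chunk.strip()
        if not chunk:
            continue
        try:
            roots.append(ET.fromstring(chunk))
        except ET.ParseError:
            # Chunk is not well-formed XML (stray text and so on); skip it.
            continue
    return roots

# Hypothetical sample input with two XML documents.
sample = """\
================xml begins here=============
<order id="1"><item>apple</item></order>
================xml begins here=============
<order id="2"><item>pear</item></order>
"""

roots = parse_chunks(sample)
for root in roots:
    print(root.tag, root.get("id"))
```

The split does the real work here; the try/except only catches
chunks that turn out not to be XML at all.  Relying on exceptions
alone, without splitting first, would fail because the parser
stops at the first piece of non-XML text.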

Show us all a snippet and we'll be able to tell you the
most pragmatic route!

- Andy Robinson