handling xml embedded within xml

Avowkind Avowkind at gmail.com
Sun May 18 23:41:11 EDT 2008


I have a log file that contains dumps of XML messages mixed in with other output:

... rubbish
///asd laksj aslf
<nif_DEBUG time="Fri, 16 May 2008 13:40:17, 330">
<?xml version="1.0" encoding="UTF-8"?>
<ns>
    <PDQ Lang="fr-FR" ID="XM;1928">content</PDQ>
</ns>
</nif_DEBUG>
.. more junk
... then more xml
This example is, of course, abbreviated.

I want to write a streaming filter that throws out all the junk and
returns each complete XML message as a string.  Ideally I also want to
select only the messages I am interested in.

E.g. the output from the above would be:
<?xml version="1.0" encoding="UTF-8"?>
<ns>
    <PDQ Lang="fr-FR" ID="XM;1928">content</PDQ>
</ns>

Two problems:
1. Clearing away junk that is nothing like XML.
2. Handling the <?xml declaration that lies inside the other XML
tags.

The first I can handle relatively simply by reading through the string
until I reach what looks like a valid XML tag.  I can then pass the rest
on to an XML parser such as xml.sax.  However, the parser then raises:
XMLSyntaxError: XML declaration allowed only at the start of the
document
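
One way around that error, sketched here with the standard library's xml.etree.ElementTree rather than the parser that produced the message above, is to hand the parser only the text from the declaration up to the wrapper's closing tag, so that the declaration is the first thing it sees. The function name parse_embedded and the default close_tag are assumptions based on the sample.

```python
import xml.etree.ElementTree as ET


def parse_embedded(wrapped, close_tag="</nif_DEBUG>"):
    """Parse the XML document embedded inside a wrapper element.

    Everything before the "<?xml" declaration and everything from the
    wrapper's closing tag onwards is discarded before parsing.
    """
    start = wrapped.find("<?xml")
    if start == -1:
        raise ValueError("no XML declaration found")
    end = wrapped.find(close_tag, start)
    if end == -1:
        end = len(wrapped)
    inner = wrapped[start:end].strip()
    # Assumption: the text was decoded correctly and the declaration says UTF-8,
    # so re-encoding as UTF-8 matches what the declaration promises.
    return ET.fromstring(inner.encode("utf-8"))
```

The returned element is the root of the inner document (ns in the sample), so the wrapper never reaches the parser at all.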

I would like a more forgiving parser, one that reports bad XML through
a callback from which I can simply tell it to carry on.
Bear in mind also that I probably will not have reached the end of the
stream when I start processing.
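
A rough sketch of how both wishes might be combined: a small buffer that accepts text in arbitrary chunks, parses each message as soon as its wrapper closes, and reports bad XML to a callback before carrying on. MessageFeeder and on_error are hypothetical names, the nif_DEBUG wrapper is assumed from the sample, and the standard library's xml.etree.ElementTree stands in for whichever parser is preferred.

```python
import re
import xml.etree.ElementTree as ET

# Assumption: every message sits inside one <nif_DEBUG ...> ... </nif_DEBUG> pair.
_WRAPPED = re.compile(r"<nif_DEBUG\b[^>]*>(.*?)</nif_DEBUG>", re.DOTALL)
_OPEN = "<nif_DEBUG"


class MessageFeeder:
    """Buffer incoming text and emit parsed messages as they complete."""

    def __init__(self, on_error=None):
        self._buffer = ""
        self._on_error = on_error  # called as on_error(text, exception)

    def feed(self, chunk):
        """Add a chunk of text; return the root elements completed by it."""
        self._buffer += chunk
        done = []
        pos = 0
        for match in _WRAPPED.finditer(self._buffer):
            pos = match.end()
            text = match.group(1).strip()
            try:
                # Assumes the declared encoding is UTF-8, as in the sample.
                done.append(ET.fromstring(text.encode("utf-8")))
            except ET.ParseError as exc:
                if self._on_error is not None:
                    self._on_error(text, exc)  # report, then carry on
        rest = self._buffer[pos:]
        start = rest.find(_OPEN)
        if start == -1:
            # Only junk left: keep a short tail in case an opening tag
            # has been split across two chunks.
            rest = rest[-(len(_OPEN) - 1):]
        else:
            rest = rest[start:]  # keep the unfinished message
        self._buffer = rest
        return done
```

Junk is dropped as soon as it is known not to contain the start of a wrapper, so the buffer never grows beyond one unfinished message. The buffer is rescanned on every call, which is simple but could be slow for very large messages.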

All suggestions and pointers are welcome.
Andrew




