Parsing XML streams

Alan Kennedy alanmk at hotmail.com
Fri Sep 12 05:58:49 EDT 2003


Peter Scott wrote:
> I'm writing another program that should parse that sort of XML on its
> stdin, printing out a more user-friendly representation. For this, I
> need to parse the XML as it comes in, not all at once.

Peter,

Check out the IncrementalParser class in the library module

Lib/xml/sax/xmlreader.py

This extension of the standard XMLReader class acts just like a SAX
parser, in that it delivers SAX2 events to your ContentHandler as it
processes the tokens from the source XML document.

But rather than the parser itself controlling when and how it gets its
input, you control that through the use of the .feed() method. So you
can "drip feed" the parser with input if you wish.

Not all XML parsers support an IncrementalParser interface. In order
for an XML parser to support incremental parsing, it must have been
coded specifically to do so. Fortunately, the expat wrapper supplied
with the base distribution of Python does support incremental parsing.
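
If you want to confirm that the parser you were given can be drip fed, a quick check along these lines should do. It is only a sketch: it assumes make_parser() hands back the bundled expat driver, and the name supports_feed is mine.

```python
import xml.sax
from xml.sax.xmlreader import IncrementalParser

# make_parser() with no arguments returns the default driver,
# which is the expat wrapper on a standard installation.
parser = xml.sax.make_parser()

# A driver that supports .feed()/.close() subclasses IncrementalParser.
supports_feed = isinstance(parser, IncrementalParser)
print("incremental parsing supported:", supports_feed)
```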

I think that should solve your problem quite nicely. When you start
up your process for the first time, feed() the IncrementalParser the
open tag of a document element (every XML document must have exactly
one document element). Then simply feed the output of your logging
stream directly to the IncrementalParser, as and when you receive it.

You should not have any problems with XML tokens being split over two
different .feed() calls either. For example, this should work just
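fine:

import xml.sax

# IncrementalParser is abstract; make_parser() returns the expat driver
ip = xml.sax.make_parser()
ip.feed('<docu')
ip.feed('ment')
ip.feed('/>')
ip.close()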

When your logging stream is closing, simply feed a close tag for your
document element to your IncrementalParser and then call its .close()
method, and everything will clean up nicely.
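
To see the events arriving as the input is fed, something like the following may help. It is a sketch only: EventRecorder and the variable names are hypothetical, and exactly when the underlying expat library reports an event can vary a little between versions, so treat the intermediate snapshots as indicative.

```python
import xml.sax
from xml.sax.handler import ContentHandler

class EventRecorder(ContentHandler):
    """Hypothetical handler that records element events in order."""
    def __init__(self):
        super().__init__()
        self.events = []

    def startElement(self, name, attrs):
        self.events.append(("start", name))

    def endElement(self, name):
        self.events.append(("end", name))

recorder = EventRecorder()
parser = xml.sax.make_parser()
parser.setContentHandler(recorder)

# The <message ...> start tag is split across two feed() calls.
parser.feed("<mylogstream><mess")
seen_after_first_feed = list(recorder.events)

parser.feed("age user='PeterScott'>hi</message>")
seen_after_second_feed = list(recorder.events)

# Close the document element, then tell the parser the input is finished.
parser.feed("</mylogstream>")
parser.close()

print(seen_after_first_feed)
print(seen_after_second_feed)
print(recorder.events)
```

The split tag cannot be reported until its second half has arrived, so the first snapshot never mentions the message element.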

Here is some sample code:

#-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=
import xml.sax
from xml.sax.handler import ContentHandler

logentry = """
<channel name='#sandbox'>
    <message user='PeterScott'>Hello, my bot</message>
    <message user='PeterScott'>This is a message</message>
    <nickchange>
        <oldnick>PeterScott</oldnick>
        <newnick>PeterSc</newnick>
    </nickchange>
</channel>
"""

# make_parser() expects a list of driver module names, not a bare string
incr_parser = xml.sax.make_parser(['xml.sax.expatreader'])
incr_parser.setContentHandler(ContentHandler())
incr_parser.feed('<mylogstream>')
incr_parser.feed(logentry)
incr_parser.feed('</mylogstream>')
incr_parser.close()
#-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=
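
The bare ContentHandler above discards every event, so that sample prints nothing. For the user-friendly output you are after, a handler roughly like this might be a starting point. It is a sketch under assumptions: LogPrinter, parse_stream, chunk_size and the output format are all names of my own invention, and an in-memory stream stands in for stdin.

```python
import io
import xml.sax
from xml.sax.handler import ContentHandler

class LogPrinter(ContentHandler):
    """Hypothetical handler: emits one readable line per <message>."""
    def __init__(self):
        super().__init__()
        self.lines = []
        self._user = None   # set while we are inside a <message>
        self._text = []

    def startElement(self, name, attrs):
        if name == "message":
            self._user = attrs.get("user", "?")
            self._text = []

    def characters(self, content):
        # Text may arrive in several pieces, so collect and join later.
        if self._user is not None:
            self._text.append(content)

    def endElement(self, name):
        if name == "message":
            line = "<%s> %s" % (self._user, "".join(self._text))
            self.lines.append(line)
            print(line)
            self._user = None

def parse_stream(stream, handler, chunk_size=16):
    """Wrap the stream in a document element and feed it in chunks."""
    parser = xml.sax.make_parser()
    parser.setContentHandler(handler)
    parser.feed("<mylogstream>")
    while True:
        chunk = stream.read(chunk_size)
        if not chunk:
            break
        parser.feed(chunk)
    parser.feed("</mylogstream>")
    parser.close()

printer = LogPrinter()
source = io.StringIO(
    "<channel name='#sandbox'>"
    "<message user='PeterScott'>Hello, my bot</message>"
    "</channel>"
)
parse_stream(source, printer)
```

For a live pipe you would pass sys.stdin instead of the StringIO object, and probably feed it line by line rather than in fixed-size chunks, so that output is not held up waiting for a chunk to fill.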

regards,

-- 
alan kennedy
-----------------------------------------------------
check http headers here: http://xhaus.com/headers
email alan:              http://xhaus.com/mailto/alan



