SaxRecords.py (was Re: busting-out XML sections)

Tue Oct 10 02:30:44 EDT 2000

Thomas Gagne wrote:
>I think what I'm beginning to picture inside my head is a combination SAX/DOM
>parser.  Imagine how useful this would be for both large files and realtime
>data.  SAX would read the (unending) stream of data and my document handler
>would watch for the start and end tags of the useful subsections.  When the
>end-tag is reached it would somehow take the inbetween data and hand it off to
>a DOM parser where the individual transactions are taken care of.

Interestingly enough, I've been thinking about something similar,
especially since it should help simplify my Martel work (see
biopython.org/~dalke/Martel/).  I wrote up a first draft of the module and
made it available at http://www.biopython.org/~dalke/SaxRecords.py .  Here's
what it looks like to use it:

    import SaxRecords
    from xml.sax import saxexts
    from xml.dom import sax_builder
    from StringIO import StringIO

    parser = saxexts.make_parser()
    test_data = """<doc>
<record><f>Andrew</f><l>Dalke</l><city>Santa Fe</city></record>
<record><f>Bill</f><l>Clinton</l><city>Washington</city></record>
<record><f>Craig</f><l>Vance</l><city>New York</city></record>
</doc>"""

    record_parser = SaxRecords.Parser(parser, "record",
                                      sax_builder.SaxBuilder)
    for builder in record_parser.parseFile(StringIO(test_data)):
        doc = builder.document
        # ... work with the DOM document ...

As you might see, I turned the interface into a forward iterator by spawning
off a thread that handles the SAX callbacks and sends the results back to the
original thread.
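
For anyone who wants to see the thread trick in isolation, here is a minimal
sketch in current Python (3.x). It is not the SaxRecords code: the names
RecordHandler and iter_records are made up for illustration, and it builds
minidom documents instead of using sax_builder. A worker thread runs the SAX
parser, the handler puts each finished record on a queue, and the calling
thread pulls records off that queue from inside a generator.

```python
import queue
import threading
import xml.sax
from io import BytesIO
from xml.dom import minidom

_DONE = object()  # sentinel: the parser thread has finished


class RecordHandler(xml.sax.ContentHandler):
    """Hypothetical handler: builds one small DOM document per <tag> element."""

    def __init__(self, tag, out_queue):
        super().__init__()
        self.tag = tag
        self.out = out_queue
        self.doc = None   # DOM document for the record being built
        self.stack = []   # open DOM elements inside the current record

    def startElement(self, name, attrs):
        if self.doc is None:
            if name != self.tag:
                return    # outside any record: ignore
            self.doc = minidom.Document()
            parent = self.doc
        else:
            parent = self.stack[-1]
        node = self.doc.createElement(name)
        for key, value in attrs.items():
            node.setAttribute(key, value)
        parent.appendChild(node)
        self.stack.append(node)

    def characters(self, content):
        if self.stack:
            self.stack[-1].appendChild(self.doc.createTextNode(content))

    def endElement(self, name):
        if not self.stack:
            return
        self.stack.pop()
        if not self.stack:          # the record element itself just closed
            self.out.put(self.doc)  # blocks while the consumer is behind
            self.doc = None


def iter_records(source, tag, maxsize=8):
    """Yield one DOM document per <tag> element, parsed in a worker thread."""
    q = queue.Queue(maxsize=maxsize)

    def worker():
        try:
            xml.sax.parse(source, RecordHandler(tag, q))
            q.put(_DONE)
        except Exception as exc:    # hand parse errors to the consumer
            q.put(exc)

    threading.Thread(target=worker, daemon=True).start()
    while True:
        item = q.get()
        if item is _DONE:
            return
        if isinstance(item, Exception):
            raise item
        yield item


if __name__ == "__main__":
    data = b"<doc><record><f>Andrew</f></record><record><f>Bill</f></record></doc>"
    for doc in iter_records(BytesIO(data), "record"):
        print(doc.documentElement.toxml())
```

The bounded queue keeps the parser from running far ahead of the consumer on
a large or unending stream. One caveat of this sketch: if the consumer stops
iterating early, the worker stays blocked on the queue, which is why it is a
daemon thread.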

The package includes a slightly modified version of Sean McGrath's RAX
Record object as an alternative to producing DOM documents.

Also, it seems you'll have to tweak it a bit to work with PyXML-0.6.1, but
the basic concept should be viable.
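
As a point of comparison, the same stream-then-subtree idea can be sketched
with the standard library's xml.dom.pulldom in current Python (3.x), with no
extra thread. This is a different technique from SaxRecords, and the helper
names iter_record_nodes and text_of are made up for illustration. pulldom
streams SAX-style events, and expandNode() builds a DOM subtree for just the
element you are positioned on.

```python
from io import BytesIO
from xml.dom import pulldom


def iter_record_nodes(stream, tag):
    """Yield a fully built DOM element for each <tag> element in the stream."""
    events = pulldom.parse(stream)
    for event, node in events:
        if event == pulldom.START_ELEMENT and node.tagName == tag:
            events.expandNode(node)  # pull in the children of this element only
            yield node


def text_of(element, child_tag):
    """Concatenate the text nodes of the first <child_tag> under element."""
    child = element.getElementsByTagName(child_tag)[0]
    return "".join(n.data for n in child.childNodes if n.nodeType == n.TEXT_NODE)


if __name__ == "__main__":
    test_data = b"""<doc>
<record><f>Andrew</f><l>Dalke</l><city>Santa Fe</city></record>
<record><f>Bill</f><l>Clinton</l><city>Washington</city></record>
<record><f>Craig</f><l>Vance</l><city>New York</city></record>
</doc>"""
    for record in iter_record_nodes(BytesIO(test_data), "record"):
        print(text_of(record, "f"), text_of(record, "l"), text_of(record, "city"))
```

Only one record's subtree is expanded at a time, so this should cope with
large files in the same way as the SAX-plus-builder approach.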

                    Andrew Dalke
                    dalke at acm.org