[Tutor] SAX

Thu Dec 18 17:20:42 EST 2003

On Thu, 18 Dec 2003, Scott wrote:

> Where is the best down and dirty SAX tutorial on the web?  I've never
> fooled around with XML much.  Isn't SAX what I want if I may be
> confronted with HUGE files, that I can't possibly load totally.  And I
> can just prod around and poke them?

Hi Scott,

If you're going to handle huge files, SAX will work.  An alternative is to
use something that builds pieces of the XML tree structure on demand.
The 'pulldom' module is a good one to know about:

    http://www.python.org/doc/lib/module-xml.dom.pulldom.html

But the documentation on pulldom is full of little '...' "fill-me-in"
placeholders, which isn't much help.  *sigh* There's slightly
better documentation on pulldom here:

    http://www.prescod.net/python/pulldom.html

The pulldom approach is nice because it's a hybrid between the "stream"
approach of SAX and the structured approach of the DOM.

As a concrete example, let's say we have the following XML text:

###
xmlText = """
<TU>
    <FEAT_NAME>68414.t07192</FEAT_NAME>
    <CHROMO_LINK>51530.t00029</CHROMO_LINK>
    <DATE>Aug 23 2001 12:27AM</DATE>
    <GENE_INFO>
        <LOCUS>F24J5.21</LOCUS>
        <PUB_LOCUS>At1g68680</PUB_LOCUS>
        <COM_NAME CURATED="1">expressed protein</COM_NAME>
        <IS_PSEUDOGENE>0</IS_PSEUDOGENE>
        <FUNCT_ANNOT_EVIDENCE TYPE="CURATED"></FUNCT_ANNOT_EVIDENCE>
        <DATE>Aug 23 2001 12:27AM</DATE>
    </GENE_INFO>
    <COORDSET>
        <END5>25789169</END5>
        <END3>25790587</END3>
    </COORDSET>
</TU>
"""
###

We'd like to traverse through this XML using some kind of parser. Here's
one approach, using the 'pulldom' module:

###
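from xml.dom import pulldom
from io import StringIO    # on old Python 2: from StringIO import StringIO

xmlFile = StringIO(xmlText)
events = pulldom.parse(xmlFile)
# Each item is an (event type, node) pair, e.g. START_ELEMENT or CHARACTERS.
for event, node in events:
    print(event, node.nodeName)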
###

This has a sort of SAX-ish flavor: for every start tag, run of character
data, and end tag, we get back a 2-tuple holding the event type and the
associated node.  The node that we get back is mostly contentless (its
children haven't been built yet), but at least we can look at the nodeName.
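
For comparison, here's a rough sketch of the same sort of traversal in
plain SAX, written for Python 3.  The 'TagRecorder' class and the 'sample'
string are made up for illustration; only xml.sax.ContentHandler and
xml.sax.parseString come from the standard library.

```python
import xml.sax

# Made-up sample text; a real run would point at a file on disk instead.
sample = "<COORDSET><END5>25789169</END5><END3>25790587</END3></COORDSET>"


class TagRecorder(xml.sax.ContentHandler):
    """Hypothetical handler: records each SAX callback as (event, detail)."""

    def __init__(self):
        xml.sax.ContentHandler.__init__(self)
        self.seen = []

    def startElement(self, name, attrs):
        self.seen.append(("start", name))

    def characters(self, content):
        # SAX may deliver one run of text in several pieces, so a real
        # handler should accumulate them; here we just record each piece.
        if content.strip():
            self.seen.append(("text", content))

    def endElement(self, name):
        self.seen.append(("end", name))


handler = TagRecorder()
xml.sax.parseString(sample.encode("utf-8"), handler)
for event, detail in handler.seen:
    print(event, detail)
```

For a huge file, xml.sax.parse(filename, handler) reads the file in pieces
rather than all at once.  With SAX, all of the bookkeeping lives in the
handler; with pulldom, the same events come back from an ordinary loop.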

We can do more, though --- we can tell the system that we'd like to expand
a node so that we can get structural information from it.  For example:

###
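from xml.dom import pulldom
from io import StringIO    # on old Python 2: from StringIO import StringIO

xmlFile = StringIO(xmlText)
events = pulldom.parse(xmlFile)
for event, node in events:
    if event == pulldom.START_ELEMENT and node.nodeName == 'COORDSET':
        # Pull in everything up to the matching end tag as a DOM subtree.
        events.expandNode(node)
        print(node.toxml())
        # childNodes holds the Text nodes that sit inside each element.
        print(node.getElementsByTagName('END5')[0].childNodes)
        print(node.getElementsByTagName('END3')[0].childNodes)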
###

Once we've expanded a node, we can start dealing with it with DOMish glee.
*grin* The 'minidom' documentation and its discussion of the DOM apply
pretty well to an "expanded" node:

    http://www.python.org/doc/current/lib/dom-node-objects.html
    http://www.python.org/doc/current/lib/dom-example.html
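
As a small sketch of that DOMish access (Python 3 again), here's one way
to get at an element's text and attributes after expanding it.  The
'text_of' helper and the trimmed-down 'gene_xml' string are my own
inventions for the example; getElementsByTagName, getAttribute,
childNodes, and data are the standard DOM calls.

```python
from xml.dom import pulldom

# Trimmed-down copy of the GENE_INFO element, just for this example.
gene_xml = """<GENE_INFO>
    <LOCUS>F24J5.21</LOCUS>
    <COM_NAME CURATED="1">expressed protein</COM_NAME>
</GENE_INFO>"""


def text_of(element):
    """Hypothetical helper: join the text nodes directly under an element."""
    return "".join(child.data for child in element.childNodes
                   if child.nodeType == child.TEXT_NODE)


events = pulldom.parseString(gene_xml)
for event, node in events:
    if event == pulldom.START_ELEMENT and node.nodeName == "GENE_INFO":
        events.expandNode(node)
        com_name = node.getElementsByTagName("COM_NAME")[0]
        locus_id = text_of(node.getElementsByTagName("LOCUS")[0])
        description = text_of(com_name)
        curated = com_name.getAttribute("CURATED")   # attribute values are strings
        print(locus_id, description, curated)
```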

The key to keeping memory usage down is to expand only the nodes that
we're interested in.  Instead of instantiating the whole tree at once,
we can be content to play with the subtrees that we're interested in.
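
To make that idea concrete, here's a sketch (Python 3) of a generator that
walks a stream holding many records and expands only the COORDSET
elements.  The 'iter_coordsets' function, the ROOT wrapper element, and
the small numbers are all made up for illustration; a real file would be
passed in as an open file object instead of a StringIO.

```python
from io import StringIO
from xml.dom import pulldom


def iter_coordsets(stream):
    """Hypothetical helper: yield (end5, end3) for each COORDSET in stream."""
    events = pulldom.parse(stream)
    for event, node in events:
        if event == pulldom.START_ELEMENT and node.nodeName == "COORDSET":
            # Only this subtree gets built; everything else streams past.
            events.expandNode(node)
            values = []
            for tag in ("END5", "END3"):
                element = node.getElementsByTagName(tag)[0]
                values.append(int("".join(c.data for c in element.childNodes)))
            yield tuple(values)


# Made-up data: two tiny records inside an invented ROOT wrapper.
many_records = StringIO("""<ROOT>
<TU><FEAT_NAME>a</FEAT_NAME><COORDSET><END5>100</END5><END3>250</END3></COORDSET></TU>
<TU><FEAT_NAME>b</FEAT_NAME><COORDSET><END5>300</END5><END3>480</END3></COORDSET></TU>
</ROOT>""")

coords = list(iter_coordsets(many_records))
print(coords)
```

Because it's a generator, the caller can also loop over the results one at
a time rather than collecting them all into a list.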

Please feel free to ask more questions about this.  Good luck to you!