a few more questions on XML and python

Thu Jan 3 23:54:22 EST 2002

Rajarshi Guha <rxg218 at psu.edu> wrote:

> As you can see I'm pretty confused :(

  Why not use DOM instead of SAX?  XML is clearly tree-based, so
processing it in terms of trees (DOM) tends to be more easily
comprehensible than processing it in terms of streams and event
handlers (SAX).  SAX has definite resource utilization advantages over
DOM for read-only XML access, but I daresay that working code can eat
a great deal of memory and still perform better than a dysfunctional
collection of highly optimized fragments :)

  At its simplest, DOM parsing consists of setting up a series of
nested node iterations to reach the information you want, as in
(Python 2.2):
--------------------------------------------------------------
from StringIO import StringIO
import xml.dom.minidom as dom

theXML = """<?xml version="1.0"?>
<book edition="1">
    <title>The Fascist Menace</title>
    <authors>
        <author>
            <name_first>Josef</name_first>
            <name_first>Stalin</name_first>
            <email>unclejoe at kremlin.ru</email>
        </author>
        <author>
            <name_first>Alexei</name_first>
            <name_first>Voloshnikov</name_first>
            <email>just-another-tovarisch at kremlin.ru</email>
        </author>
    </authors>
    <description><![CDATA[Our fearless leader's electrifying call to
action.]]></description>
    <pages>1941</pages>
</book>
"""

doc = dom.parse(StringIO(theXML))

book = doc.documentElement
print 'edition:', book.attributes['edition'].value

# It is the first child (a text node) of the title element,
# rather than the title element itself, that contains the value.
title = book.getElementsByTagName('title')[0].firstChild.nodeValue
print 'title:', title

authors = book.getElementsByTagName('authors')[0]
for author in authors.getElementsByTagName('author'):
    print "an author's email:",
    print author.getElementsByTagName('email')[0].firstChild.nodeValue

description = book.getElementsByTagName('description')[0].firstChild.nodeValue
print "The critics say [well, aside from omygodisthatakgvbofficer?]:",
print description
--------------------------------------------------------------

  Although that code is totally lax about error handling and quite
inefficient (overuse of getElementsByTagName, etc.), it's also brief,
and hopefully more comprehensible that the SAX you've been trying to
write.

  If you're processing huge data sets, DOM isn't going to cut it.  DOM
builds an in-memory representation of the entire document, whereas SAX
handles a single element at a time.  But after you have a bit of
Python and XML parsing under your belt, you can always move on to SAX
if necessary, eh?