[XML-SIG] What's New section on XML

A.M. Kuchling akuchlin@mems-exchange.org
Wed, 11 Oct 2000 22:46:15 -0400


Here's draft text for a section that briefly discusses the new XML
support in Python 2.0.  Criticisms and comments, please...

--amk

                                13 XML Modules
                                       
   Python 1.5.2 included a simple XML parser in the form of the xmllib
   module, contributed by Sjoerd Mullender. Since 1.5.2's release, two
   different interfaces for processing XML have become common: SAX2
   (version 2 of the Simple API for XML) provides an event-driven
   interface with some similarities to xmllib, and the DOM (Document
   Object Model) provides a tree-based interface, transforming an XML
   document into a tree of nodes that can be traversed and modified.
   Python 2.0 includes a SAX2 interface and a stripped-down DOM interface
   as part of the xml package. Here we will give a brief overview of
   these new interfaces; consult the Python documentation or the source
   code for complete details. The Python XML SIG is also working on
   improved documentation.
   
13.1 SAX2 Support

   SAX defines an event-driven interface for parsing XML. To use SAX, you
   must write a SAX handler class. Handler classes inherit from various
   classes provided by SAX, and override various methods that will then
   be called by the XML parser. For example, the startElement and
   endElement methods are called for every starting and end tag
   encountered by the parser, the characters() method is called for every
   chunk of character data, and so forth.
   
   The advantage of the event-driven approach is that that the whole
   document doesn't have to be resident in memory at any one time, which
   matters if you are processing really huge documents. However, writing
   the SAX handler class can get very complicated if you're trying to
   modify the document structure in some elaborate way.
   
   For example, this little example program defines a handler that prints
   a message for every starting and ending tag, and then parses the file
   hamlet.xml using it:
   
from xml import sax

class SimpleHandler(sax.ContentHandler):
    def startElement(self, name, attrs):
        print 'Start of element:', name, attrs.keys()

    def endElement(self, name):
        print 'End of element:', name

# Create a parser object
parser = sax.make_parser()

# Tell it what handler to use
handler = SimpleHandler()
parser.setContentHandler( handler )

# Parse a file!
parser.parse( 'hamlet.xml' )

   For more information, consult the Python documentation, or the XML
   HOWTO at http://www.python.org/doc/howto/xml/.
   
13.2 DOM Support

   The Document Object Model is a tree-based representation for an XML
   document. A top-level Document instance is the root of the tree, and
   has a single child which is the top-level Element instance. This
   Element has children nodes representing character data and any
   sub-elements, which may have further children of their own, and so
   forth. Using the DOM you can traverse the resulting tree any way you
   like, access element and attribute values, insert and delete nodes,
   and convert the tree back into XML.
   
   The DOM is useful for modifying XML documents, because you can create
   a DOM tree, modify it by adding new nodes or rearranging subtrees, and
   then produce a new XML document as output. You can also construct a
   DOM tree manually and convert it to XML, which can be a more flexible
   way of producing XML output than simply writing <tag1>...</tag1> to a
   file.
   
   The DOM implementation included with Python lives in the
   xml.dom.minidom module. It's a lightweight implementation of the Level
   1 DOM with support for XML namespaces. The parse() and parseString()
   convenience functions are provided for generating a DOM tree:
   
from xml.dom import minidom
doc = minidom.parse('hamlet.xml')

   doc is a Document instance. Document, like all the other DOM classes
   such as Element and Text, is a subclass of the Node base class. All
   the nodes in a DOM tree therefore support certain common methods, such
   as toxml() which returns a string containing the XML representation of
   the node and its children. Each class also has special methods of its
   own; for example, Element and Document instances have a method to find
   all child elements with a given tag name. Continuing from the previous
   2-line example:
   
perslist = doc.getElementsByTagName( 'PERSONA' )
print perslist[0].toxml()
print perslist[1].toxml()

   For the Hamlet XML file, the above few lines output:
   
<PERSONA>CLAUDIUS, king of Denmark. </PERSONA>
<PERSONA>HAMLET, son to the late, and nephew to the present king.</PERSONA>

   The root element of the document is available as doc.documentElement,
   and its children can be easily modified by deleting, adding, or
   removing nodes:
   
root = doc.documentElement

# Remove the first child
root.removeChild( root.childNodes[0] )

# Move the new first child to the end
root.appendChild( root.childNodes[0] )

# Insert the new first child (originally,
# the third child) before the 20th child.
root.insertBefore( root.childNodes[0], root.childNodes[20] )

   Again, I will refer you to the Python documentation for a complete
   listing of the different Node classes and their various methods.
   
13.3 Relationship to PyXML

   The XML Special Interest Group has been working on XML-related Python
   code for a while. Its code distribution, called PyXML, is available
   from the SIG's Web pages at http://www.python.org/sigs/xml-sig/. The
   PyXML distribution also used the package name "xml". If you've written
   programs that used PyXML, you're probably wondering about its
   compatibility with the 2.0 xml package.
   
   The answer is that Python 2.0's xml package isn't compatible with
   PyXML, but can be made compatible by installing a recent version
   PyXML. Many applications can get by with the XML support that is
   included with Python 2.0, but more complicated applications will
   require that the full PyXML package will be installed. When installed,
   PyXML versions 0.6.0 or greater will replace the xml package shipped
   with Python, and will be a strict superset of the standard package,
   adding a bunch of additional features. Some of the additional features
   in PyXML include:
   
     * 4DOM, a full DOM implementation from FourThought LLC.
     * The xmlproc validating parser, written by Lars Marius Garshol.
     * The sgmlop parser accelerator module, written by Fredrik Lundh