[XML-SIG] What's New section on XML
A.M. Kuchling
akuchlin@mems-exchange.org
Wed, 11 Oct 2000 22:46:15 -0400
Here's draft text for a section that briefly discusses the new XML
support in Python 2.0. Criticisms and comments, please...
--amk
13 XML Modules
Python 1.5.2 included a simple XML parser in the form of the xmllib
module, contributed by Sjoerd Mullender. Since 1.5.2's release, two
different interfaces for processing XML have become common: SAX2
(version 2 of the Simple API for XML) provides an event-driven
interface with some similarities to xmllib, and the DOM (Document
Object Model) provides a tree-based interface, transforming an XML
document into a tree of nodes that can be traversed and modified.
Python 2.0 includes a SAX2 interface and a stripped-down DOM interface
as part of the xml package. Here we will give a brief overview of
these new interfaces; consult the Python documentation or the source
code for complete details. The Python XML SIG is also working on
improved documentation.
13.1 SAX2 Support
SAX defines an event-driven interface for parsing XML. To use SAX, you
must write a SAX handler class. Handler classes inherit from various
classes provided by SAX, and override various methods that will then
be called by the XML parser. For example, the startElement() and
endElement() methods are called for every start and end tag
encountered by the parser, the characters() method is called for every
chunk of character data, and so forth.
The advantage of the event-driven approach is that the whole
document doesn't have to be resident in memory at any one time, which
matters if you are processing really huge documents. However, writing
the SAX handler class can get very complicated if you're trying to
modify the document structure in some elaborate way.
For example, this small program defines a handler that prints a
message for every start and end tag, and then parses the file
hamlet.xml using it:
from xml import sax

class SimpleHandler(sax.ContentHandler):
    def startElement(self, name, attrs):
        print 'Start of element:', name, attrs.keys()

    def endElement(self, name):
        print 'End of element:', name

# Create a parser object
parser = sax.make_parser()

# Tell it what handler to use
handler = SimpleHandler()
parser.setContentHandler( handler )

# Parse a file!
parser.parse( 'hamlet.xml' )
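The characters() method mentioned earlier can be illustrated with a second, rough sketch. The TextCollector class, its attribute names, and the tiny SAMPLE document are all made up for this illustration; it parses from a string with sax.parseString() so that it runs without hamlet.xml, and it calls print as a function so that it also runs on current Python interpreters.

```python
from xml import sax

class TextCollector(sax.ContentHandler):
    """Hypothetical handler: gathers the text inside each PERSONA element."""

    def __init__(self):
        sax.ContentHandler.__init__(self)
        self.in_persona = False
        self.chunks = []      # pieces of text for the element being read
        self.personae = []    # one finished string per PERSONA element

    def startElement(self, name, attrs):
        if name == 'PERSONA':
            self.in_persona = True
            self.chunks = []

    def characters(self, content):
        # The parser may deliver one run of text in several chunks,
        # so the pieces are accumulated and joined in endElement().
        if self.in_persona:
            self.chunks.append(content)

    def endElement(self, name):
        if name == 'PERSONA':
            self.personae.append(''.join(self.chunks))
            self.in_persona = False

# Made-up sample data (an assumption, not taken from hamlet.xml)
SAMPLE = "<PLAY><PERSONA>HAMLET</PERSONA><PERSONA>OPHELIA</PERSONA></PLAY>"

collector = TextCollector()
sax.parseString(SAMPLE, collector)
print(collector.personae)
```

Only the handler's own lists are kept in memory here; the parser never builds a representation of the whole document.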
For more information, consult the Python documentation, or the XML
HOWTO at http://www.python.org/doc/howto/xml/.
13.2 DOM Support
The Document Object Model is a tree-based representation for an XML
document. A top-level Document instance is the root of the tree, and
has a single child which is the top-level Element instance. This
Element has children nodes representing character data and any
sub-elements, which may have further children of their own, and so
forth. Using the DOM you can traverse the resulting tree any way you
like, access element and attribute values, insert and delete nodes,
and convert the tree back into XML.
The DOM is useful for modifying XML documents, because you can create
a DOM tree, modify it by adding new nodes or rearranging subtrees, and
then produce a new XML document as output. You can also construct a
DOM tree manually and convert it to XML, which can be a more flexible
way of producing XML output than simply writing <tag1>...</tag1> to a
file.
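As a minimal sketch of building a tree by hand, the following uses the xml.dom.minidom module that ships with Python. The element names and text are invented for the example, and print is called as a function so the code also runs on current interpreters.

```python
from xml.dom import minidom

# Start from an empty Document and add a top-level element
doc = minidom.Document()
root = doc.createElement('PLAY')
doc.appendChild(root)

# Made-up content; note the ampersand in the second name
for name in ('HAMLET', 'ROSENCRANTZ & GUILDENSTERN'):
    persona = doc.createElement('PERSONA')
    persona.appendChild(doc.createTextNode(name))
    root.appendChild(persona)

# Serialize the element and everything below it
xml_text = root.toxml()
print(xml_text)
```

One practical benefit over writing tags to a file yourself is that the ampersand in the text is escaped as &amp;amp; on output, so the result stays well-formed.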
The DOM implementation included with Python lives in the
xml.dom.minidom module. It's a lightweight implementation of the Level
1 DOM with support for XML namespaces. The parse() and parseString()
convenience functions are provided for generating a DOM tree:
from xml.dom import minidom
doc = minidom.parse('hamlet.xml')
doc is a Document instance. Document, like all the other DOM classes
such as Element and Text, is a subclass of the Node base class. All
the nodes in a DOM tree therefore support certain common methods, such
as toxml() which returns a string containing the XML representation of
the node and its children. Each class also has special methods of its
own; for example, Element and Document instances have a method to find
all descendant elements with a given tag name. Continuing from the
previous 2-line example:
perslist = doc.getElementsByTagName( 'PERSONA' )
print perslist[0].toxml()
print perslist[1].toxml()
For the Hamlet XML file, the above few lines output:
<PERSONA>CLAUDIUS, king of Denmark. </PERSONA>
<PERSONA>HAMLET, son to the late, and nephew to the present king.</PERSONA>
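If you don't have hamlet.xml handy, a similar experiment can be run with parseString(). This is only a sketch: the SAMPLE string is a made-up stand-in, not the real file. Its PERSONA elements are nested one level below the root, and getElementsByTagName() still finds them.

```python
from xml.dom import minidom

# Made-up stand-in for hamlet.xml (an assumption for this example)
SAMPLE = ("<PLAY><PERSONAE>"
          "<PERSONA>CLAUDIUS</PERSONA>"
          "<PERSONA>HAMLET</PERSONA>"
          "</PERSONAE></PLAY>")

doc = minidom.parseString(SAMPLE)

# Called on the Document, this searches the whole tree
perslist = doc.getElementsByTagName('PERSONA')

# Each PERSONA element has a single Text child holding its name
names = [p.firstChild.data for p in perslist]
print(len(perslist))
print(names)
```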
The root element of the document is available as doc.documentElement,
and its children can be easily modified by deleting, adding, or
moving nodes:
root = doc.documentElement
# Remove the first child
root.removeChild( root.childNodes[0] )
# Move the new first child to the end
root.appendChild( root.childNodes[0] )
# Insert the new first child (originally,
# the third child) before the child at index 20.
root.insertBefore( root.childNodes[0], root.childNodes[20] )
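The effect of these calls is easier to follow on a tiny document. The sketch below uses a made-up four-element document and a hypothetical tags() helper to show the order of the children after each step. The point to notice is that appendChild() and insertBefore() move a node that is already in the tree rather than copying it.

```python
from xml.dom import minidom

# Made-up document with four empty child elements
doc = minidom.parseString("<r><a/><b/><c/><d/></r>")
root = doc.documentElement

def tags(node):
    """Hypothetical helper: tag names of a node's children, in order."""
    return [child.tagName for child in node.childNodes]

# Remove the first child: a is gone
root.removeChild(root.childNodes[0])
after_remove = tags(root)      # b, c, d

# Move the new first child to the end
root.appendChild(root.childNodes[0])
after_append = tags(root)      # c, d, b

# Move the new first child before the child at index 2
root.insertBefore(root.childNodes[0], root.childNodes[2])
after_insert = tags(root)      # d, c, b

print(after_remove)
print(after_append)
print(after_insert)
```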
Again, I will refer you to the Python documentation for a complete
listing of the different Node classes and their various methods.
13.3 Relationship to PyXML
The XML Special Interest Group has been working on XML-related Python
code for a while. Its code distribution, called PyXML, is available
from the SIG's Web pages at http://www.python.org/sigs/xml-sig/. The
PyXML distribution also used the package name "xml". If you've written
programs that used PyXML, you're probably wondering about its
compatibility with the 2.0 xml package.
The answer is that Python 2.0's xml package isn't compatible with
PyXML, but can be made compatible by installing a recent version of
PyXML. Many applications can get by with the XML support that is
included with Python 2.0, but more complicated applications will
require that the full PyXML package be installed. When installed,
PyXML versions 0.6.0 or greater will replace the xml package shipped
with Python, and will be a strict superset of the standard package,
adding a bunch of additional features. Some of the additional features
in PyXML include:
* 4DOM, a full DOM implementation from Fourthought, Inc.
* The xmlproc validating parser, written by Lars Marius Garshol.
* The sgmlop parser accelerator module, written by Fredrik Lundh.