[XML-SIG] dumping an XML parser skeleton from DTD input

Martin v. Loewis martin@loewis.home.cs.tu-berlin.de
Sat, 10 Mar 2001 14:24:02 +0100


> I looked at the way other people parse XML, and ran into DOM, which seemed
> to imply the company has reinvented the wheel. I'm trying to understand
> what Python DOM does (the regression test I ran yesterday did dump core, 
> so I don't have a working installation up yet).

What operating system, what version of Python and PyXML? Python should
*never* coredump; at worst, you might get Python exceptions.

> Excellent. So, DOM parses the XML file (any well-formed XML file).

Indeed. You have the choice of either a validating parser (one that
looks at the DOCTYPE declaration in the document, and complains when
elements are used incorrectly) or a non-validating parser, one that
looks only for well-formedness.

In either case, you get the same DOM tree (well, almost - a validating
parser may fill in DEFAULT values of attributes from the DTD; a
non-validating parser won't normally).

> Because it is agnostic of what tags might be coming (since, as you
> say, it doesn't need a DTD), it doesn't offer any hooks, calling a
> matching method if a given tag is encountered.

Yes and no. The DOM does not call any callbacks. Instead, you give the
parser the document URL, and it gives you back a DOM tree; no
application interaction during parsing.
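
As a minimal sketch of that "document in, tree out" model, assuming the
xml.dom.minidom module from the Python standard library (the sample
document and its element names are made up for illustration):

```python
from xml.dom import minidom

# Hypothetical sample document; any well-formed XML will do.
SAMPLE = '<table><atom symbol="H" atomWeight="1.008">Hydrogen</atom></table>'

# parseString takes the document text; minidom.parse takes a file name
# or an open file object instead.
doc = minidom.parseString(SAMPLE)

# No callbacks were involved: the whole tree is available afterwards.
root = doc.documentElement
print(root.tagName)
```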

If you want event-oriented XML processing, you should study the SAX
interface. It calls your callbacks for every start tag, end tag, text
node, and so on. It does not build any kind of tree. In many XML
libraries, it is possible to implement a "DOM builder" on top of a
"SAX parser"; this is in fact how PyXML operates.

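The event-oriented style might look roughly like this; a sketch only,
assuming the xml.sax module from the Python standard library. The class
name EventLogger and the sample document are hypothetical.

```python
import xml.sax


class EventLogger(xml.sax.ContentHandler):
    """Records each SAX event instead of building a tree."""

    def __init__(self):
        super().__init__()
        self.events = []

    def startElement(self, name, attrs):
        self.events.append(("start", name))

    def endElement(self, name):
        self.events.append(("end", name))

    def characters(self, content):
        # The parser may deliver text in several chunks; skip pure whitespace.
        if content.strip():
            self.events.append(("text", content))


# Hypothetical sample document.
SAMPLE = b'<table><atom symbol="H">Hydrogen</atom></table>'

handler = EventLogger()
xml.sax.parseString(SAMPLE, handler)
for event in handler.events:
    print(event)
```
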
> So essentially, I wind up with a representation of the XML file
> as tree of objects, which I process after the fact, right?

Exactly.

> Iirc, DOM offers some helpful routines, allowing me to parse the
> tree.

Yes, depending on what exactly you want to do with the tree; not all
routines are helpful for all applications.

> So, where do I put my handler, interpreting the stuff as it passes
> by? 

You don't, unless you implement your own SAX content handler - which
may or may not choose to build a DOM tree.
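
To illustrate a content handler that does choose to build a tree, here
is a rough sketch on top of xml.sax and xml.dom.minidom from the Python
standard library. It is not how PyXML's own builder is written; the
class name TreeBuilder and the sample document are hypothetical, and
comments, processing instructions and namespaces are ignored.

```python
import xml.sax
from xml.dom.minidom import getDOMImplementation


class TreeBuilder(xml.sax.ContentHandler):
    """Builds a DOM tree from SAX events (elements, attributes, text only)."""

    def startDocument(self):
        # Create an empty document without a root element.
        self.document = getDOMImplementation().createDocument(None, None, None)
        # The stack holds the chain of currently open nodes.
        self.stack = [self.document]

    def startElement(self, name, attrs):
        elem = self.document.createElement(name)
        for key, value in attrs.items():
            elem.setAttribute(key, value)
        self.stack[-1].appendChild(elem)
        self.stack.append(elem)

    def endElement(self, name):
        self.stack.pop()

    def characters(self, content):
        self.stack[-1].appendChild(self.document.createTextNode(content))


# Hypothetical sample document.
SAMPLE = b'<table><atom symbol="H">Hydrogen</atom></table>'

builder = TreeBuilder()
xml.sax.parseString(SAMPLE, builder)
root = builder.document.documentElement
print(root.tagName, root.firstChild.getAttribute("symbol"))
```

Your own processing could go into startElement and endElement as well,
so the handler interprets the data as it passes by and still leaves a
tree behind.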

> I want to transform this into a variety of formats: mapping the
> tree to a number of .png images layed out in a HTML table, or use a 
> Tree Widget to paint a large bitmap, potentially with server-side 
> clickable maps.
> 
> So, where does Python DOM offer me ways I can get at the data in
> the object tree? 

The DOM itself offers standard accessor functions - they are not only
standard across Python DOM implementations, but also standard across
programming languages.

The "DOM Core" interface only provides accessor functions to
"navigate" the tree: Give me the name of the element (elem.tagName);
give me all the children (elem.childNodes), give me the next sibling,
give me the attribute named "atomWeight". There are some query
functions: give me all element nodes with a certain element name, ...
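
A short sketch of these accessors, assuming xml.dom.minidom from the
Python standard library; the sample document is invented, and it
contains no whitespace between elements so that childNodes holds only
element nodes.

```python
from xml.dom import minidom

# Hypothetical sample document, written without whitespace between tags.
SAMPLE = (
    '<table>'
    '<atom symbol="H" atomWeight="1.008">Hydrogen</atom>'
    '<atom symbol="He" atomWeight="4.0026">Helium</atom>'
    '</table>'
)

doc = minidom.parseString(SAMPLE)
root = doc.documentElement

print(root.tagName)                       # name of the element
first = root.childNodes[0]                # all children, then the first one
print(first.getAttribute("atomWeight"))   # attribute by name
second = first.nextSibling                # the next sibling
print(second.getAttribute("symbol"))

# Query function: all element nodes with a certain element name.
atoms = doc.getElementsByTagName("atom")
print(len(atoms))
```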

"DOM 2 Navigation" offers traversal interfaces. You might be tempted
to use those, but I suggest working with the core interfaces only at
first; you'll find that it is quite easy to do your own traversal with
just the accessor functions.
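
For example, a do-it-yourself traversal needs little more than
nodeType and childNodes. This is a sketch assuming xml.dom.minidom from
the Python standard library; the function name walk and the sample
document are hypothetical.

```python
from xml.dom import Node, minidom


def walk(node, depth=0):
    """Yield (depth, element) for node and its descendants, in document order."""
    if node.nodeType == Node.ELEMENT_NODE:
        yield depth, node
    # Text nodes have no children, so the recursion stops there.
    for child in node.childNodes:
        yield from walk(child, depth + 1)


# Hypothetical sample document.
SAMPLE = '<table><row><cell>H</cell><cell>He</cell></row><row/></table>'

doc = minidom.parseString(SAMPLE)
for depth, elem in walk(doc.documentElement):
    print("  " * depth + elem.tagName)
```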

Depending on the output format, it might be easy to write a SAX
ContentHandler.
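
As a rough illustration for an HTML table as output format, assuming
xml.sax from the Python standard library: the handler below writes one
table row per element as the events arrive. The class name TableWriter
and the element names "table" and "atom" are made up for this sketch.

```python
import io
import xml.sax
from xml.sax.saxutils import escape


class TableWriter(xml.sax.ContentHandler):
    """Writes one HTML table row per <atom> element to the given stream."""

    def __init__(self, out):
        super().__init__()
        self.out = out
        self.text = []

    def startElement(self, name, attrs):
        if name == "table":
            self.out.write("<table>\n")
        elif name == "atom":
            self.text = []

    def characters(self, content):
        # Text may arrive in chunks; collect it until the end tag.
        self.text.append(content)

    def endElement(self, name):
        if name == "atom":
            cell = escape("".join(self.text))
            self.out.write("<tr><td>%s</td></tr>\n" % cell)
        elif name == "table":
            self.out.write("</table>\n")


out = io.StringIO()
xml.sax.parseString(
    b"<table><atom>Hydrogen</atom><atom>Helium</atom></table>",
    TableWriter(out),
)
html = out.getvalue()
print(html)
```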

Alternatively, if you can describe the output in terms of "for every
foo element write bar, then go to the child nodes, then write foobar",
it might be that XSLT is the right transformation language. There is
no single best way to process XML - the only rule is that nobody ever
writes his own parser, since that's already done.

> Good, now I only need to get Python DOM pass the regression tests,
> and find out how I can get at the data.

I'd rather recommend looking at the demos. It may indeed be that some
tests fail, e.g. when running PyXML on Python 1.5, which does not
support Unicode strings.

Regards,
Martin