[XML-SIG] dumping an XML parser skeleton from DTD input

Eugene.Leitl@lrz.uni-muenchen.de Eugene.Leitl@lrz.uni-muenchen.de
Sat, 10 Mar 2001 11:29:34 +0100


"Martin v. Loewis" wrote:

> I get the feeling of being dumb here, since I still cannot understand
> what you are asking for. Let me interpret it word-by-word.

That's highly unlikely. I'm just trolling for clue, being forced
to learn XML in the course of a few days.

The company I'm with has the following ad hoc approach to XML: 
whip up some XML fitting the problem, don't bother with writing 
a DTD, code up a parser in an OO language, which recursively 
reads the tags into memory, creating a hierarchy/tree of objects. 
Fill in methods to deal with the data sitting in the tree, finis.

I looked at the way other people parse XML, and ran into DOM, which seemed
to imply the company has reinvented the wheel. I'm trying to understand
what Python DOM does (the regression test I ran yesterday did dump core, 
so I don't have a working installation up yet).
 
> You want to program that parses XML files: Well, there are plenty of
> XML parsers, I can recommend PyXML. It shall create a tree of objects
> ...  I recommend to use a parser that creates a DOM tree: That is a
> tree of objects.

Excellent. So, DOM parses the XML file (any well-formed XML file).
Because it is agnostic of what tags might be coming (since, as you
say, it doesn't need a DTD), it doesn't offer any hooks, calling a
matching method if a given tag is encountered.

So essentially, I wind up with a representation of the XML file
as tree of objects, which I process after the fact, right? Iirc,
DOM offers some helpful routines, allowing me to parse the tree.

So, where do I put my handler, interpreting the stuff as it passes
by? 

Let's say I have a reaction tree (molecule A is precursor of molecule B
is precursor of molecule C is educt of product Z) as result of a query. 
So building XML as a representation of it is quite natural, as it *is* 
a tree. I want to transform this into a variety of formats: mapping the
tree to a number of .png images layed out in a HTML table, or use a 
Tree Widget to paint a large bitmap, potentially with server-side 
clickable maps.

So, where does Python DOM offer me ways I can get at the data in
the object tree? 

> ... representing the structure of the XML file. That I cannot
> understand: Do you want the content of the XML file being represented
> by the tree of objects (i.e. the tag names of the elements, their
> attributes and attribute values, and strings for the text fragments in
> the elements)? That is what the DOM does. If this is not what you

This is what I need, yes.

> > It is a skeleton because it just does that, as lacking true
> > understanding of my further intentions it has no clue as what I'm
> > going to do with the data created from the parsing of the document,
> > so it has to leave the action field blank, to be filled out by me.
> 
> The DOM tree is good for that - it has no understanding of your plans
> to process the document.

Ok, very good, but where can I get at the data sitting there?
  
> from xml.dom.ext.reader import Sax2
> import sys
> doc = Sax.FromXmlFile(sys.argv[1])
> 
> This is a program that can read an XML document and build a tree of
> objects. The tree of objects is stored in a variable named doc. You
> can give a DTD to the first program, but it is ignored as it is not
> needed.
> 
> > No, I'm asking for a program that will dump a (skeleton of a, to be
> > filled in at earliest convenience) parser program, when supplied
> > with the DTD of the XML document.
> 
> The nice thing about XML is that you can parse it without a DTD, and
> that you can furthermore use the same parser for all XML documents.

Good, now I only need to get Python DOM pass the regression tests,
and find out how I can get at the data.