[XML-SIG] Showing off the power of Python for XML processing

Stefane Fermigier fermigie@math.jussieu.fr
Tue, 17 Mar 1998 14:08:42 +0100


On Tue, Mar 17, 1998 at 12:29:40PM +0000, Sean Mc Grath wrote:
> 
> As anyone who has read the article in Dobbs in Feb. will know,

I did, and your book (Parseme.1st) too.

> I made a stab at inventing a native Python data structure for
> representing the tree structure of an XML document. I would
> like to see some discussion as to how best to expose this
> tree structure for Python applications. Obviously, DOM will
> be one interface but should we limit ourselves to it?

In my DOM package, I represented the tree structure a bit differently than
you did (I use lists where you use pointers to previous and next siblings).
I guess that once an API has been chosen, the difference is minor.
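
To make the comparison concrete, here is a minimal sketch of the two
representations behind one common API. Both classes (LinkedNode, ListNode)
and the children() method are hypothetical names made up for illustration;
they are not the actual code of either package.

```python
class LinkedNode:
    """Hypothetical node whose children are chained by sibling pointers."""

    def __init__(self, name):
        self.name = name
        self.first_child = None
        self.last_child = None
        self.previous_sibling = None
        self.next_sibling = None

    def append(self, child):
        # Hook the new child onto the end of the sibling chain.
        if self.last_child is None:
            self.first_child = child
        else:
            self.last_child.next_sibling = child
            child.previous_sibling = self.last_child
        self.last_child = child

    def children(self):
        child = self.first_child
        while child is not None:
            yield child
            child = child.next_sibling


class ListNode:
    """Hypothetical node that keeps its children in a plain Python list."""

    def __init__(self, name):
        self.name = name
        self.child_list = []

    def append(self, child):
        self.child_list.append(child)

    def children(self):
        return iter(self.child_list)


def child_names(node):
    # Client code only sees children(), so either representation works.
    return [child.name for child in node.children()]


linked = LinkedNode("CHAP")
listed = ListNode("CHAP")
for name in ("TITLE", "SECT", "SECT"):
    linked.append(LinkedNode(name))
    listed.append(ListNode(name))
```

Once callers go through children(), the storage choice is invisible to them,
which is the sense in which the difference is minor.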

> Whether we like it or not, developers selecting scripting
> languages for XML processing are going to perform line count
> comparisons. I think it would be great to be able to show
> how Python code can be a) succinct, b) understandable and
> c) maintainable for XML processing.

Absolutely.

> Some arbitrary notions:-
> 
> 1) Iterators
> 
> In the Python article for Dobbs I provided a __getitem__
> at the XML tree level to allow:-
> 
>         for ANode in ATree:
>                 Do something
> 
> Good or bad? Would it be better Pythoneze to create a list
> of nodes and iterate that?
>         MyNodeList = ATree.GetDescendants()
>         for n in MyNodeList:
>                 Do something

The order in which you traverse a tree is significant, and there is no reason
(I guess) to promote one order over another. I would rather use an explicit iterator:

	for node in tree.top_down_iterator():
		do_something(node)
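
A minimal sketch of such iterators, written with generators. The Node class
and the bottom_up_iterator() name are assumptions made for illustration, not
the classes of my DOM package.

```python
class Node:
    """Hypothetical tree node: a name plus a list of child nodes."""

    def __init__(self, name, children=None):
        self.name = name
        self.children = list(children or [])

    def top_down_iterator(self):
        """Yield this node first, then each subtree (pre-order)."""
        yield self
        for child in self.children:
            yield from child.top_down_iterator()

    def bottom_up_iterator(self):
        """Yield each subtree first, then this node (post-order)."""
        for child in self.children:
            yield from child.bottom_up_iterator()
        yield self


tree = Node("BOOK", [Node("CHAP", [Node("SECT"), Node("SECT")]), Node("INDEX")])
top_down = [node.name for node in tree.top_down_iterator()]
bottom_up = [node.name for node in tree.bottom_up_iterator()]
```

Naming the traversal in the method keeps the order explicit at the call site,
instead of hiding one arbitrary order behind __getitem__.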

> 2) Slice operations
> 
> I think this is one of the areas where Python can really
> shine for XML processing. I write a lot of XML processing
> apps and a lot of the processing is driven by context:
> 
> "if my parent is a SECT and my grandparent is a CHAP":
> 
> if GetAncestors()[1:3] == ("SECT","CHAP"):
>         do something
> 
> A healthy collection of primitives for creating such lists
> combined with Python's list processing and slicing functionality
> is *mouth watering*.

Do you mean: 
	node.GetAncestors()
or:
	GetAncestors(node)
?

(the difference is only syntactical, but means 

> 3) Collection Processing
> 
> Rarely do any of my XML processing apps stand alone. By
> that I mean that they tend to process a collection of XML docs.
> 
> // Process all chap*.xml docs. Print data content of 
> // foo elements
> 
> for f in glob.glob ("chap*.xml"):
>         for t in LoadXML(f):
>                 if t.AtElement ("FOO"):
>                         print GetDataDescendants()
> 
> How best to do it?

Here's how you would do it in my current framework:

for f in glob.glob ("chap*.xml"):
	p = XmlParser()
	document = p.parse('', f)
	# or: p.parse('', f); document = p.document
	t = MyTranformer()
	t.tranform(document)
	# or: document = t.tranform(document)
	etc...
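
For comparison, the same loop can be sketched with a different toolkit, the
xml.dom.minidom module of the Python standard library, rather than with my
framework. The foo_text() helper and the two throwaway files are assumptions
made up so that the example runs on its own.

```python
import glob
import os
import tempfile
from xml.dom import minidom


def foo_text(pattern):
    """Return (file name, text) for every FOO element in matching files."""
    results = []
    for filename in sorted(glob.glob(pattern)):
        document = minidom.parse(filename)
        for element in document.getElementsByTagName("FOO"):
            # Concatenate the text nodes directly under the element.
            text = "".join(
                child.data
                for child in element.childNodes
                if child.nodeType == child.TEXT_NODE
            )
            results.append((os.path.basename(filename), text))
    return results


# Demo on two throwaway files so that the sketch is self-contained.
with tempfile.TemporaryDirectory() as tmp:
    for number, word in ((1, "alpha"), (2, "beta")):
        path = os.path.join(tmp, "chap%d.xml" % number)
        with open(path, "w", encoding="utf-8") as handle:
            handle.write("<CHAP><FOO>%s</FOO><BAR>skip</BAR></CHAP>" % word)
    found = foo_text(os.path.join(tmp, "chap*.xml"))
```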

> 4) TreeApply
> 
> One trick that I have found very useful is an apply() style
> helper function for trees.
> 
> I have an XMLTreeApply helper function that walks an XML
> tree applying the supplied function to all the nodes in the tree.
> It proves particularly useful for throwaway lambda functions
> 
> XMLTreeApply (lambda x:if x.AtElement("FOO"): print GetDataDescendants())

Yes, that's cool, but it exposes one of the drawbacks of Python with respect
to Scheme: you can't use statements in lambda expressions.
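
The usual workaround is to pass a named function, which may contain
statements. A minimal sketch, assuming a hypothetical Node class and a
tree_apply() helper (neither is the real XMLTreeApply):

```python
class Node:
    """Hypothetical node: element name, character data and children."""

    def __init__(self, name, data="", children=None):
        self.name = name
        self.data = data
        self.children = list(children or [])


def tree_apply(node, function):
    """Call function(node) on the node and on every descendant, top-down."""
    function(node)
    for child in node.children:
        tree_apply(child, function)


collected = []


def collect_foo(node):
    # A named function may contain an if statement, unlike a lambda.
    if node.name == "FOO":
        collected.append(node.data)


tree = Node("DOC", children=[
    Node("FOO", "one"),
    Node("BAR", "x", [Node("FOO", "two")]),
])
tree_apply(tree, collect_foo)

# A lambda only works if the test is rewritten as a single expression.
collected_by_lambda = []
tree_apply(tree, lambda n: n.name == "FOO" and collected_by_lambda.append(n.data))
```

The expression-only version works but is harder to read, which is exactly
the drawback.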

Another option is to use a query function:

for node in document.query_descendants('this.GI == "FOO"'):
	...

or:

for node in document.query_descendants(lambda x: x.GI == 'FOO'):
	...

(not implemented currently)

or:

for node in document.getElementsByTagName('FOO'):
	...

^-- this one is in the DOM core specs, and I guess the W3C is working on
an extension of this mechanism.
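
A minimal sketch of the predicate-based query next to getElementsByTagName.
It runs on xml.dom.minidom from the Python standard library, not on my DOM
package, so query_descendants() is a free function written here for
illustration and the test uses minidom's tagName where my package uses GI.

```python
from xml.dom import minidom


def query_descendants(node, predicate):
    """Yield descendant elements, in document order, that satisfy predicate."""
    for child in node.childNodes:
        if child.nodeType == child.ELEMENT_NODE:
            if predicate(child):
                yield child
            yield from query_descendants(child, predicate)


document = minidom.parseString(
    "<DOC><FOO id='1'/><SECT><FOO id='2'/><BAR/></SECT></DOC>"
)

by_predicate = [
    element.getAttribute("id")
    for element in query_descendants(document, lambda x: x.tagName == "FOO")
]
by_tag_name = [
    element.getAttribute("id")
    for element in document.getElementsByTagName("FOO")
]
```

The predicate form is more general (any test on the node will do), while
getElementsByTagName only matches on the element name.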

> 5) Exposing XML from non-XML data sources
> 
> It is only a matter of time before relational databases and so on natively
> provide functionality to expose their data as XML. In the mean-time
> wouldn't it be useful if dbm, glob, pstats and even calendar exposed
> XML?

Probably.


One related point: I'm still working on the transformation engine included
in my DOM package. I have two options: use XSL (i.e. compile XSL stylesheets
into Tranformer classes), but I personally dislike XSL, or invent a new
transformation language.

What do you think? (I can guess the answer: XSL is a standard, blah, blah,
but ECMAScript is the standard scripting language of XSL, and there is no
current implementation of ECMAScript in Python that I'm aware of. So even
if we can compile basic XSL in Python, we can't cope with embedded JavaScript
without hand-translation).

Cheers,

	S.

-- 
Stéfane Fermigier, MdC à l'Université Paris 7. Tel: 01.44.27.61.01 (Bureau).
Mathematician, hacker, bassist.  http://www.math.jussieu.fr/~fermigie/
"Life is good for only two things, discovering mathematics and teaching 
mathematics." Siméon Poisson.