python xml DOM? pulldom? SAX?

Michael Ekstrand mekstran at scl.ameslab.gov
Mon Aug 29 12:44:50 EDT 2005


On 29 Aug 2005 08:17:04 -0700
"jog" <jo at johannageiss.de> wrote:
> I want to get text out of some nodes of a huge XML file (1.5 GB). The
> structure of the XML file is something like this
> [structure snipped]
> I want to combine the text from page:title and page:revision:text
> for every single page element. One by one I want to index these
> combined texts (so one index for each page).
> What is the most efficient API for that: SAX (I don't think so), DOM,
> or pulldom?

Definitely SAX IMHO, or xml.parsers.expat. For what you're doing, an
event-driven interface is ideal. DOM parses the *entire* XML tree into
memory before you can do anything with it - highly inefficient for a
data set this large. I've never used pulldom; it might have potential,
but from my (limited and flawed) understanding of it, I think it may
also wind up loading most of the file into memory by the time you're
done.
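
To make the event-driven idea concrete, here is a minimal skeleton
using xml.parsers.expat (the file name is a placeholder): the parser
invokes your handlers as it streams through the document, so memory
use stays flat no matter how big the file is.

    import xml.parsers.expat

    # Handlers are called as the parser streams through the document;
    # nothing is kept in memory unless you store it yourself.
    def start_element(name, attrs):
        print 'start:', name

    def end_element(name):
        print 'end:', name

    def char_data(data):
        pass  # collect text here if the current element interests you

    parser = xml.parsers.expat.ParserCreate()
    parser.StartElementHandler = start_element
    parser.EndElementHandler = end_element
    parser.CharacterDataHandler = char_data

    parser.ParseFile(open('pages.xml', 'rb'))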

SAX will not build any in-memory structures beyond the ones you
explicitly create (indeed, SAX is commonly used to build DOM trees).
With SAX, you can just watch for the tags of interest (and perhaps
some surrounding tags to provide context) and extract the desired
data, all very efficiently.
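
As a rough sketch of what that can look like - assuming a
Wikipedia-style dump where each <page> contains a <title> and a
<revision> with a <text> child, and with index_page standing in for
whatever indexing step you use - a handler might be:

    import xml.sax

    class PageHandler(xml.sax.ContentHandler):
        """Combine <title> and <revision>/<text> for each <page>."""

        def __init__(self, index_page):
            xml.sax.ContentHandler.__init__(self)
            self.index_page = index_page  # callback: index_page(title, text)
            self.in_revision = False
            self.capture = None           # 'title' or 'text' while collecting
            self.title = []
            self.text = []

        def startElement(self, name, attrs):
            if name == 'page':
                self.title = []
                self.text = []
            elif name == 'revision':
                self.in_revision = True
            elif name == 'title':
                self.capture = 'title'
            elif name == 'text' and self.in_revision:
                self.capture = 'text'

        def characters(self, content):
            # characters() may arrive in several chunks per text node,
            # so accumulate the pieces and join them at the end.
            if self.capture == 'title':
                self.title.append(content)
            elif self.capture == 'text':
                self.text.append(content)

        def endElement(self, name):
            if name in ('title', 'text'):
                self.capture = None
            elif name == 'revision':
                self.in_revision = False
            elif name == 'page':
                # One page is complete; hand its combined text onward.
                self.index_page(''.join(self.title), ''.join(self.text))

    def index_page(title, text):
        print title  # hypothetical: replace with your real indexing step

    xml.sax.parse('pages.xml', PageHandler(index_page))

Only the text of the page currently being parsed is held in memory,
which is what makes this workable on a 1.5 GB file.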

It took me a bit to get the hang of SAX, but once I did, I never
looked back. Event-driven parsing is a brilliant fit for this kind of
problem.

> Or should I just use XPath somehow?

XPath usually requires a DOM tree on which to operate; the Python
XPath implementation (in PyXML) requires DOM objects. For a file this
size, that would be highly inefficient.

Another potential solution, if the data file carries a lot of
extraneous information: run the source file through an XSLT transform
that strips it down to only the data you need, and then parse the
result with SAX.
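
If you had a stylesheet that does the stripping - strip.xsl and the
file names below are made up for illustration - a command-line
processor such as xsltproc could produce the reduced file:

    xsltproc strip.xsl pages.xml > stripped.xml

Bear in mind that most XSLT processors build the whole input document
in memory themselves, so this only pays off if you can afford that one
up-front pass.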

- Michael


