python xml DOM? pulldom? SAX?

Alan Kennedy alanmk at hotmail.com
Mon Aug 29 14:58:09 EDT 2005


[Alan Kennedy]
 >>SAX is perfect for the job. See code below.

[Fredrik Lundh]
 > depends on your definition of perfect...

Obviously, perfect is the eye of the beholder ;-)

[Fredrik Lundh]
> using a 20 MB version of jog's sample, and having replaced
> the print statements with local variable assignments, I get the
> following timings:
> 
> 5 lines of cElementTree code: 7.2 seconds
> 60+ lines of xml.sax code: 63 seconds
> 
> (Python 2.4.1, Windows XP, Pentium 3 GHz)

Impressive!

At first, I thought your code sample was building a tree for the entire 
document, so I checked the API docs. It appeared to me that an event 
processing model *couldn't* obtain the text for the node when notified 
of the node: the text node is still in the future.

That's when I understood the nature of iterparse, which must generate an 
event *after* the node is complete, and it's subdocument reified. That's 
also when I understood the meaning of the "elem.clear()" call at the 
end. Only the required section of the tree is modelled in memory at any 
given time. Nice.

There are some minor inefficiencies in my pure python sax code, e.g. 
building the stack expression for every evaluation, but I left them in 
for didactic reasons. But even if every possible speed optimisation was 
added to my python code, I doubt it would be able to match your code.

I'm guessing that a lot of the reason why cElementTree performs best is 
because the model-building is primarily implemented in C: Both of our 
solutions run python code for every node in the tree, i.e. are O(N). But 
yours also avoids the overhead of having function-calls/stack-frames for 
every single node event, by processing all events inside a single function.

If the SAX algorithm were implemented in C (or Java) for that matter, I 
wonder if it might give comparable performance to the cElementTree code, 
primarily because the data structures it is building are simpler, 
compared to the tree-subsections being reified and discarded by 
cElementTree. But that's not of relevance, because we're looking for 
python solutions. (Aside: I can't wait to run my solution on a 
fully-optimising PyPy :-)

That's another nice thing I didn't know (c)ElementTree could do.

enlightened-ly'yrs,

-- 
alan kennedy
------------------------------------------------------
email alan:              http://xhaus.com/contact/alan



More information about the Python-list mailing list