Python parsing XML file problem with SAX

Stefan Behnel stefan_ml at behnel.de
Tue Aug 10 02:35:15 EDT 2010


Christian Heimes, 10.08.2010 01:39:
> On 10.08.2010 01:20, Aahz wrote:
>> The docs say, "Parses an XML section into an element tree incrementally".
>> Sure sounds like it retains the entire parsed tree in RAM.  Not good.
>> Again, how do you parse an XML file larger than your available memory
>> using something other than SAX?
>
> The document at
> http://www.ibm.com/developerworks/xml/library/x-hiperfparse/ explains it
> one way.
>
> The iterparser approach is ingenious but it doesn't work for every XML
> format. Let's say you have a 10 GB XML file with one million <part/>
> tags. An iterparser doesn't load the entire document. Instead it
> iterates over the file and yields (for example) one million ElementTrees
> for each <part/> tag and its children. You can get the nice API of
> ElementTree with the memory efficiency of a SAX parser if you obey
> "Listing 4".

In the very common case that you are interested in all children of the root
element, it is even enough to filter on the specific tag name (lxml.etree
has a 'tag' option for that, but an 'if' block will do just fine in ET) and
simply call ".clear()" on the child element at the end of the loop body.
That results in very fast and simple code, but it leaves the emptied tags in
the tree, removing only their content and attributes. This usually works
well enough for several tens of thousands of elements, especially when
using cElementTree.
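
A minimal sketch of that loop with the standard library's ElementTree,
assuming the elements of interest are <part/> children of the root (as in
the example quoted above). The function name and the "id" attribute are
made up for illustration only.

```python
import io
import xml.etree.ElementTree as ET


def collect_part_ids(source, tag="part"):
    """Stream over `source` and handle every finished <part> element.

    `tag` and the "id" attribute are illustrative assumptions.
    """
    ids = []
    # "end" events fire once an element has been parsed completely.
    for event, elem in ET.iterparse(source, events=("end",)):
        if elem.tag == tag:
            ids.append(elem.get("id"))  # stand-in for the real work
            # Drop text, attributes and children. The now-empty element
            # itself stays attached to its parent.
            elem.clear()
    return ids


sample = io.BytesIO(
    b"<root>"
    b"<part id='1'><name>bolt</name></part>"
    b"<part id='2'><name>nut</name></part>"
    b"</root>"
)
part_ids = collect_part_ids(sample)
```

For a real file you would pass the file name or an open binary file
instead of the BytesIO object.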

As usual, a bit of benchmarking will uncover the right way to do it in your 
case. That's also a huge advantage over SAX: iterparse code is much easier 
to tune into a streamlined loop body when you need it.
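
If the leftover empty tags do show up in your measurements, one variant
worth benchmarking is to keep a reference to the root element and clear
the root itself after each child. This is only a sketch: it assumes the
<part/> elements are direct children of the root and that you don't need
the root's own attributes or text afterwards. The function name and the
"id" attribute are again hypothetical.

```python
import io
import xml.etree.ElementTree as ET


def iter_part_ids(source, tag="part"):
    """Yield the "id" of each <part>, removing handled children from the root.

    Assumes `tag` elements are direct children of the root element.
    """
    context = ET.iterparse(source, events=("start", "end"))
    # The first event is always the "start" of the root element.
    _, root = next(context)
    for event, elem in context:
        if event == "end" and elem.tag == tag:
            yield elem.get("id")  # stand-in for the real work
            # Detach everything collected under the root so far, so the
            # tree does not grow with the number of <part> elements.
            # Note: this also clears the root's own attributes and text.
            root.clear()


sample = io.BytesIO(
    b"<root>"
    b"<part id='1'><name>bolt</name></part>"
    b"<part id='2'><name>nut</name></part>"
    b"<part id='3'><name>washer</name></part>"
    b"</root>"
)
streamed_ids = list(iter_part_ids(sample))
```

Whether this beats a plain ".clear()" on the child depends on your data,
so time both on a realistic input before settling on one.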

Stefan



