Python parsing XML file problem with SAX

Aahz aahz at pythoncraft.com
Tue Aug 24 11:37:59 EDT 2010


In article <mailman.1895.1281422126.1673.python-list at python.org>,
Stefan Behnel  <stefan_ml at behnel.de> wrote:
>Christian Heimes, 10.08.2010 01:39:
>> On 10.08.2010 01:20, Aahz wrote:
>>> The docs say, "Parses an XML section into an element tree incrementally".
>>> Sure sounds like it retains the entire parsed tree in RAM.  Not good.
>>> Again, how do you parse an XML file larger than your available memory
>>> using something other than SAX?
>>
>> The document at
>> http://www.ibm.com/developerworks/xml/library/x-hiperfparse/ explains it
>> one way.
>>
>> The iterparser approach is ingenious, but it doesn't work for every XML
>> format. Let's say you have a 10 GB XML file with one million <part/>
>> tags. An iterparser doesn't load the entire document. Instead, it
>> iterates over the file and yields (for example) one million elements,
>> one for each <part/> tag and its children. You can get the nice API of
>> ElementTree with the memory efficiency of a SAX parser if you follow
>> "Listing 4".
>
>In the very common case that you are interested in all children of the root
>element, it's enough to filter on the specific tag name (lxml.etree
>has an option for that, but an 'if' block will do just fine in ET) and
>call ".clear()" on the child element at the end of the loop body. That
>results in very fast and simple code, but it leaves the empty tags in the
>tree, removing only their content and attributes. This usually works well
>enough for several tens of thousands of elements, especially when using
>cElementTree.
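
A minimal sketch of the approach Stefan describes, for anyone following
along. It assumes the standard-library xml.etree.ElementTree module; the
helper name iter_parts, the <part/> tag name, and the tiny sample document
are made up for illustration and are not from the thread.

```python
import io
import xml.etree.ElementTree as ET


def iter_parts(source, tag="part"):
    """Yield each fully parsed <tag> element, then clear it.

    Hypothetical helper: 'source' is a filename or a binary file object.
    """
    # Only "end" events are needed: an element is complete when it closes.
    for event, elem in ET.iterparse(source, events=("end",)):
        if elem.tag == tag:
            yield elem
            # Runs once the caller asks for the next item. Drops the
            # element's children, text, and attributes, but the empty
            # tag itself stays attached to the root.
            elem.clear()


# Tiny made-up stand-in for a multi-gigabyte file.
sample = io.BytesIO(
    b"<parts>"
    b"<part id='1'><name>bolt</name></part>"
    b"<part id='2'><name>nut</name></part>"
    b"</parts>"
)

names = [part.findtext("name") for part in iter_parts(sample)]
print(names)
```

Because the cleared elements remain as empty children of the root, memory
still grows slowly with the number of elements, which is the limit Stefan
mentions. Use each element inside the loop; it is emptied as soon as the
next one is requested.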

Thanks to both of you!
-- 
Aahz (aahz at pythoncraft.com)           <*>         http://www.pythoncraft.com/

"...if I were on life-support, I'd rather have it run by a Gameboy than a
Windows box."  --Cliff Wells


