Trying to parse a HUGE(1gb) xml file

Stefan Behnel stefan_ml at behnel.de
Tue Dec 28 02:22:28 EST 2010


Alan Meyer, 28.12.2010 01:29:
> On 12/27/2010 4:55 PM, Stefan Behnel wrote:
>> From my experience, SAX is only practical for very simple cases where
>> little state is involved when extracting information from the parse
>> events. A typical example is gathering statistics based on single tags -
>> not a very common use case. Anything that involves knowing where in the
>> XML tree you are to figure out what to do with the event is already too
>> complicated. The main drawback of SAX is that the callbacks are spread
>> across separate method calls, so you have to do all the state keeping
>> manually through fields of the SAX handler instance.
>>
>> My serious advice is: don't waste your time learning SAX. It's simply
>> too frustrating to debug SAX extraction code into existence. Given how
>> simple and fast it is to extract data with ElementTree's iterparse() in
>> a memory efficient way, there is really no reason to write complicated
>> SAX code instead.
>
> I confess that I hadn't been thinking about iterparse(). I presume that
> clear() is required with iterparse() if we're going to process files of
> arbitrary length.
>
> I should think that this approach provides an intermediate solution. It's
> more work than building the full tree in memory because the programmer has
> to do some additional housekeeping to call clear() at the right time and
> place. But it's less housekeeping than SAX.

The iterparse() implementation in lxml.etree allows you to intercept on a 
specific tag name, which is especially useful for large XML documents that 
are basically an endless sequence of (however deeply structured) top-level 
elements - arguably the most common format for gigabyte-sized XML files. So 
what I usually do here is intercept on the top-level tag name, clear() that 
tag after use and leave it dangling around, like this:

     import lxml.etree as ET

     for _, element in ET.iterparse(source, tag='toptagname'):
         # ... work on the element and its subtree
         element.clear()

That allows you to write simple in-memory tree handling code (iteration, 
XPath, XSLT, whatever), while pushing the performance up (compared to ET's 
iterparse that returns all elements) and keeping the total amount of memory 
usage reasonably low. Even several hundred thousand empty top-level tags 
don't add up to anything that would truly hurt a decent machine.
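The same pattern also works with the stdlib's xml.etree.ElementTree, which 
doesn't accept lxml's tag= keyword, so you filter on the tag name yourself. 
A minimal sketch, with inline sample data standing in for a large file:

```python
import io
import xml.etree.ElementTree as ET

# Stand-in for a huge file of repeated top-level elements.
xml_data = b"""<root>
  <item><name>a</name></item>
  <item><name>b</name></item>
</root>"""

names = []
for event, element in ET.iterparse(io.BytesIO(xml_data), events=("end",)):
    if element.tag == "item":
        # Work on the fully parsed subtree, then clear() it.
        names.append(element.findtext("name"))
        element.clear()  # drop children/text; the empty element dangles

print(names)  # -> ['a', 'b']
```

The end event for a tag fires once its subtree is complete, so the element 
can be handled as a normal in-memory tree before being cleared.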

In many cases where I know that the XML file easily fits into memory 
anyway, I don't even do any housekeeping at all. And the true advantage is: 
if you ever find that it's needed because the file sizes grow beyond your 
initial expectations, you don't have to touch your tested and readily 
debugged data extraction code, just add a suitable bit of cleanup code, or 
even switch from the initial all-in-memory parse() solution to an 
event-driven iterparse()+cleanup solution.
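To illustrate that switch: the extraction code can stay byte-for-byte the 
same while only the parsing strategy changes. A sketch, where 
handle_record() is a hypothetical extraction routine and the element name 
"rec" is made up for the example:

```python
import io
import xml.etree.ElementTree as ET

xml_data = b"<db><rec id='1'/><rec id='2'/></db>"

def handle_record(element):
    # Hypothetical extraction logic; untouched in both variants.
    return element.get("id")

# Variant 1: all-in-memory parse().
ids_memory = [handle_record(e)
              for e in ET.parse(io.BytesIO(xml_data)).iter("rec")]

# Variant 2: event-driven iterparse() plus cleanup.
ids_stream = []
for _, element in ET.iterparse(io.BytesIO(xml_data)):
    if element.tag == "rec":
        ids_stream.append(handle_record(element))
        element.clear()

print(ids_memory == ids_stream)  # -> True
```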


> I guess I've done enough SAX, in enough different languages, that I don't
> find it that onerous to use. When I need an element stack to keep track of
> things I can usually re-use code I've written for other applications. But
> for a programmer that doesn't do a lot of this stuff, I agree, the learning
> curve with lxml will be shorter and the programming and debugging can be
> faster.

I'm aware that SAX has the advantage of being available for more languages. 
But if you are in the lucky position to use Python for XML processing, why 
not just use the tools that it makes available?

Stefan




More information about the Python-list mailing list