[Tutor] Trying to parse a HUGE(1gb) xml file in python

Stefan Behnel stefan_ml at behnel.de
Tue Dec 21 09:44:10 CET 2010


[note that this has also been posted to comp.lang.python and discussed 
separately over there]

Steven D'Aprano, 20.12.2010 22:19:
> ashish makani wrote:
>
>> Goal : I am trying to parse a ginormous ( ~ 1gb) xml file.
>
> I sympathize with you. I wonder who thought that building a 1GB XML file
> was a good thing.
>
> Forget about using any XML parser that reads the entire file into memory.
> By the time that 1GB of text is read and parsed, you will probably have
> something about 6-8GB (estimated) in size.

The in-memory size depends heavily on the data, specifically on the
text-to-structure ratio. If the file is mostly text content, the tree will
not be much larger than the serialised file. If it is mostly structure with
tiny bits of text content, the in-memory tree will be a lot larger.
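
To make the ratio argument concrete, here is a rough sketch that compares
the two extremes. It assumes CPython and uses tracemalloc as an
approximation of what the tree holds; memory_ratio() and the two sample
documents are made up for illustration, and the exact numbers will vary
between Python versions.

```python
import tracemalloc
import xml.etree.ElementTree as ET


def memory_ratio(xml_bytes):
    """Return (traced bytes held after parsing) / (serialised size).

    This is only an approximation: tracemalloc sees allocations made
    through Python's allocators, which is where ElementTree's objects live.
    """
    tracemalloc.start()
    before, _ = tracemalloc.get_traced_memory()
    root = ET.fromstring(xml_bytes)  # builds the complete tree in memory
    after, _ = tracemalloc.get_traced_memory()
    tracemalloc.stop()
    del root  # the tree was kept alive until the measurement was taken
    return (after - before) / len(xml_bytes)


# Hypothetical sample data: one big text node vs. many tiny empty elements.
text_heavy = b"<doc>" + b"x" * 1_000_000 + b"</doc>"
structure_heavy = b"<doc>" + b"<a/>" * 50_000 + b"</doc>"

print("text-heavy ratio:      %.1f" % memory_ratio(text_heavy))
print("structure-heavy ratio: %.1f" % memory_ratio(structure_heavy))
```

The structure-heavy document should come out with a much larger ratio,
because every four-byte "<a/>" turns into a full element object.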


>> I am guessing, as this happens (over the course of 20-30 mins), the tree
>> representing is being slowly built in memory, but even after 30-40 mins,
>> nothing happens.
>
> It's probably not finished. Leave it another hour or so and you'll get an
> out of memory error.

Right. Once the machine starts swapping heavily, parsing can slow down
almost to a halt, even though the XML parsing itself tends to have pretty
good memory locality (the ever growing in-memory tree obviously doesn't).


>> 4. I then investigated some streaming libraries, but am confused - there is
>> SAX[http://en.wikipedia.org/wiki/Simple_API_for_XML] , the iterparse
>> interface[http://effbot.org/zone/element-iterparse.htm], & several other
>> options ( minidom)
>>
>> Which one is the best for my situation ?
>
> You absolutely need to use a streaming library. element-iterparse still
> builds the tree, so that's no use to you.

Wrong. iterparse() allows you to cut branches off the tree while it's
growing; that's exactly what it's there for.
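
A minimal sketch of that pattern, using the standard library's
xml.etree.ElementTree. The element names "record" and "value", the function
name and the sample data are assumptions made up for illustration, and the
records are assumed to be direct children of the root element.

```python
import io
import xml.etree.ElementTree as ET


def sum_values_iterparse(source, record_tag="record"):
    """Sum the integer <value> child of each <record>, pruning as we go."""
    total = 0
    # Request "start" events as well, so the very first event hands us
    # the root element before any of its children are complete.
    context = ET.iterparse(source, events=("start", "end"))
    _, root = next(context)
    for event, elem in context:
        if event == "end" and elem.tag == record_tag:
            # The record is complete here, so its subtree can be used.
            total += int(elem.findtext("value"))
            # Cut the finished branch: drop everything collected under
            # the root so far, so the tree never grows beyond one record.
            root.clear()
    return total


# Hypothetical sample input; a real file name or file object works too.
sample = io.BytesIO(
    b"<data>"
    b"<record><value>1</value></record>"
    b"<record><value>2</value></record>"
    b"<record><value>3</value></record>"
    b"</data>"
)
print(sum_values_iterparse(sample))  # prints 6
```

If the records sit deeper in the tree, clearing the root is not enough on
its own; you would then clear the processed element (and possibly its
already handled siblings) instead.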


> I believe you should use SAX or
> minidom, but that's about my limit of knowledge of streaming XML parsers.

With "minidom" being even worse advice than SAX - SAX would at least solve
the problem, whereas minidom wouldn't, because of its intolerable memory
requirements.
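
For comparison, this is roughly what the SAX route looks like with the
standard library's xml.sax. It never builds a tree, but you have to track
the parsing state yourself. The "value" element, the ValueSummer class and
the helper function are hypothetical names chosen for this sketch.

```python
import io
import xml.sax


class ValueSummer(xml.sax.ContentHandler):
    """Add up the integer text of every <value> element in the stream."""

    def __init__(self):
        super().__init__()
        self.total = 0
        self._in_value = False
        self._chunks = []

    def startElement(self, name, attrs):
        if name == "value":
            self._in_value = True
            self._chunks = []

    def characters(self, content):
        # Text may arrive in several pieces, so collect it until the
        # element is closed.
        if self._in_value:
            self._chunks.append(content)

    def endElement(self, name):
        if name == "value":
            self.total += int("".join(self._chunks))
            self._in_value = False


def sum_values_sax(source):
    """Parse a file name or file object and return the summed values."""
    handler = ValueSummer()
    xml.sax.parse(source, handler)
    return handler.total


# Hypothetical sample input.
sample = io.BytesIO(
    b"<data><record><value>4</value></record>"
    b"<record><value>5</value></record></data>"
)
print(sum_values_sax(sample))  # prints 9
```

Memory use stays flat regardless of the file size, at the price of the
manual bookkeeping in the handler.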

Stefan


