[Tutor] Trying to parse a HUGE(1gb) xml file in python

Steven D'Aprano steve at pearwood.info
Mon Dec 20 22:19:51 CET 2010


ashish makani wrote:

> Goal : I am trying to parse a ginormous ( ~ 1gb) xml file.

I sympathize with you. I wonder who thought that building a 1GB XML file 
was a good idea.

Forget about using any XML parser that reads the entire file into 
memory. By the time that 1GB of text has been read and parsed, the 
resulting tree will probably be somewhere around 6-8GB in size (a rough 
estimate).


[...]
> My hardware setup : I have a win7 pro box with 8gb of RAM & intel core2 quad
> CPU Q9400.

In order to access 8GB of RAM, you'll be running a 64-bit OS, correct? 
In that case object overhead roughly doubles again, so expect the 
in-memory XML object to reach an estimated 12-16GB.


> I am guessing that, as this happens (over the course of 20-30 mins), the tree
> representing the XML is being slowly built in memory, but even after 30-40 mins,
> nothing happens.

It's probably not finished. Leave it another hour or so and you'll get 
an out of memory error.


> 4. I then investigated some streaming libraries, but am confused - there is
> SAX[http://en.wikipedia.org/wiki/Simple_API_for_XML] , the iterparse
> interface[http://effbot.org/zone/element-iterparse.htm], & several other
> options (minidom)
> 
> Which one is the best for my situation ?

You absolutely need to use a streaming approach. minidom is a DOM 
parser: it builds the whole tree in memory, so it's no use to you. SAX 
is a genuine streaming parser and will work. iterparse also builds a 
tree by default, but it hands you each element as soon as it has been 
parsed, so if you clear elements once you're done with them the memory 
use stays small. That's about the limit of my knowledge of streaming 
XML parsers.
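
For example, here is a rough sketch of the iterparse approach (untested, 
and "record" is just a placeholder for whichever element you actually 
need):

    import xml.etree.ElementTree as ET

    # Ask for start events too, just so we can grab the root element.
    context = ET.iterparse("huge.xml", events=("start", "end"))
    event, root = next(context)

    for event, elem in context:
        if event == "end" and elem.tag == "record":
            # ... pull whatever you need out of elem here ...
            # then throw away everything parsed so far,
            # so memory use stays roughly flat
            root.clear()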


> Should I instead just open the file, & use regex to look for the element I
> need?

That's likely to need less memory than building a parse tree, but it can 
still be a huge amount of memory. Besides, you don't know how complex 
the XML is: in general you *can't* correctly parse arbitrary XML with 
regular expressions (although you can get away with it for simple 
cases). Stick with the right tool for the job: a streaming XML library.
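
If you go with SAX instead, the shape of the code is roughly this 
(again untested, and "record" is only a placeholder tag name):

    import xml.sax

    class RecordHandler(xml.sax.ContentHandler):
        """Collects the text inside each <record> element as it streams past."""

        def __init__(self):
            xml.sax.ContentHandler.__init__(self)
            self.in_record = False
            self.chunks = []

        def startElement(self, name, attrs):
            if name == "record":
                self.in_record = True
                self.chunks = []

        def characters(self, content):
            if self.in_record:
                self.chunks.append(content)

        def endElement(self, name):
            if name == "record":
                text = "".join(self.chunks)
                # ... do whatever you need with 'text' here ...
                self.in_record = False

    xml.sax.parse("huge.xml", RecordHandler())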


-- 
Steven

