Trying to parse a HUGE(1gb) xml file

Tim Harig usernet at ilthio.net
Mon Dec 20 15:09:01 EST 2010


[Wrapped to meet RFC1855 Netiquette Guidelines]
On 2010-12-20, spaceman-spiff <ashish.makani at gmail.com> wrote:
> This is a rather long post, but i wanted to include all the details &
> everything i have tried so far myself, so please bear with me & read
> the entire boringly long post.
> 
> I am trying to parse a ginormous ( ~ 1gb) xml file.
[SNIP]
> 4. I then investigated some streaming libraries, but am confused - there
> is SAX[http://en.wikipedia.org/wiki/Simple_API_for_XML] , the iterparse
> interface[http://effbot.org/zone/element-iterparse.htm]

I have made extensive use of SAX and it will certainly work for
low-memory parsing of XML.  I have never used "iterparse"; so, I cannot
make an informed comparison between them.
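For reference, the iterparse approach linked above also streams the file;
the usual pattern is to clear each element once you are done with it so
memory stays flat.  A minimal sketch (the <record> tag and id attribute
are made-up example names, not from the poster's data):

```python
import io
import xml.etree.ElementTree as ET

# Stand-in for a huge file opened in binary mode
data = io.BytesIO(b"<root><record id='1'/><record id='2'/></root>")

ids = []
# iterparse yields (event, element) pairs as the document is read
for event, elem in ET.iterparse(data, events=("end",)):
    if elem.tag == "record":
        ids.append(elem.get("id"))
        elem.clear()  # discard the element's children to free memory

print(ids)  # ['1', '2']
```

(To keep memory truly flat on a 1gb file you would also periodically
clear the root element, since cleared children still leave empty
elements attached to it.)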

> Which one is the best for my situation ?

Your post was long but it failed to tell us the most important piece
of information:  What does your data look like and what are you trying
to do with it?

SAX is a low level API that provides a callback interface allowing you to
process various elements as they are encountered.  You can therefore
do anything you want with the information, as you encounter it, including
outputting and discarding small chunks as you process them; ignoring
most of it and saving only what you want to in-memory data structures;
or saving all of it to a more random-access database or on-disk data
structure that you can load and process as required.
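To make the callback idea concrete, here is a minimal sketch using the
standard library's xml.sax module.  The <record> element name is an
invented example, not from the poster's file; the handler keeps only the
text it cares about and lets everything else stream past:

```python
import io
import xml.sax

class RecordHandler(xml.sax.ContentHandler):
    """Collect the text of each <record> element, discarding the rest."""
    def __init__(self):
        super().__init__()
        self.in_record = False
        self.chunks = []
        self.records = []

    def startElement(self, name, attrs):
        if name == "record":
            self.in_record = True
            self.chunks = []

    def characters(self, content):
        # characters() may fire several times per element, so accumulate
        if self.in_record:
            self.chunks.append(content)

    def endElement(self, name):
        if name == "record":
            self.records.append("".join(self.chunks))
            self.in_record = False

handler = RecordHandler()
# Stand-in for open("huge.xml", "rb"); parse() accepts any file-like object
xml.sax.parse(io.BytesIO(b"<root><record>a</record><record>b</record></root>"),
              handler)
print(handler.records)  # ['a', 'b']
```

Because only the current element's text is held at any moment, memory
use is independent of the file size.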

What you need to do will depend on what you are actually trying to
accomplish.  Without knowing that, I can only affirm that SAX will work
for your needs without providing any information about how you should
be using it.



More information about the Python-list mailing list