Trying to parse a HUGE(1gb) xml file

Fri Dec 31 19:55:05 EST 2010

You should look into VTD-XML, which is available in C, C++, Java and C#.

On Dec 20, 11:34 am, spaceman-spiff <ashish.mak... at gmail.com> wrote:
> Hi c.l.p folks
>
> This is a rather long post, but i wanted to include all the details & everything i have tried so far myself, so please bear with me & read the entire boringly long post.
>
> I am trying to parse a ginormous ( ~ 1gb) xml file.
>
> 0. I am a Python & XML n00b, & have been relying on the excellent beginner book DIP (Dive Into Python 3 by MP (Mark Pilgrim)). Mark, if u are reading this, you are AWESOME & so is your witty & humorous writing style.
>
> 1. Almost all examples of parsing xml in python that i have seen start off with these 4 lines of code.
>
> import xml.etree.ElementTree as etree
> tree = etree.parse('*path_to_ginormous_xml*')
> root = tree.getroot()  #my huge xml has 1 root at the top level
> print(root)
>
> 2. In the 2nd line of code above, as Mark explains in DIP, the parse function builds & returns a tree object, in memory (RAM), which represents the entire document.
> I tried this code, which works fine for a small file (~ 1MB), but when i run this simple 4-line py code in a terminal for my HUGE target file (1GB), nothing happens.
> In a separate terminal, i run the top command, & i can see a python process, with memory (the VIRT column) increasing from 100MB all the way up to 2100MB.
>
> I am guessing, as this happens (over the course of 20-30 mins), the tree representing the document is being slowly built in memory, but even after 30-40 mins, nothing happens.
> I don't get an error, seg fault or out_of_memory exception.
>
> My hardware setup: I have a win7 pro box with 8gb of RAM & an Intel Core 2 Quad Q9400 cpu.
> On this i am running sun virtualbox (3.2.12), with ubuntu 10.10 as guest os, with 23gb disk space & 2gb (2048mb) ram assigned to the guest ubuntu os.
>
> 3. I also tried using lxml, but an lxml tree is much more expensive, as it retains more info about a node's context, including references to its parent.
> [http://www.ibm.com/developerworks/xml/library/x-hiperfparse/]
>
> When i ran the same 4-line code above, but with lxml's etree (using the import below as line 1 of the code above)
> import lxml.etree as lxml_etree
>
> i can see the memory consumption of the python process (which is running the code) shoot up to ~ 2700mb & then python (or the os?) kills the process as it nears the total system memory (2gb)
>
> I ran the code from 1 terminal window (screenshot: http://imgur.com/ozLkB.png)
> & ran top from another terminal (http://imgur.com/HAoHA.png)
>
> 4. I then investigated some streaming libraries, but am confused - there is SAX [http://en.wikipedia.org/wiki/Simple_API_for_XML] & the iterparse interface [http://effbot.org/zone/element-iterparse.htm]
>
> Which one is the best for my situation ?
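For a file too large to hold as a tree, the iterparse interface mentioned above may be the easier starting point, since it keeps the ElementTree API while letting you throw elements away as soon as you are done with them. A minimal sketch, assuming Python 3 and assuming the records of interest all share one tag name; iter_records, the "item" tag and the tiny in-memory sample are hypothetical stand-ins, not taken from the actual 1GB file:

```python
import io
import xml.etree.ElementTree as etree


def iter_records(source, tag):
    """Yield each <tag> element once it is fully parsed, then free it."""
    # Ask for "start" events too, so the first event hands us the root element.
    context = etree.iterparse(source, events=("start", "end"))
    _, root = next(context)
    for event, elem in context:
        if event == "end" and elem.tag == tag:
            yield elem
            # Drop everything attached to the root so far, so the
            # in-memory tree stays small however large the file is.
            root.clear()


# Hypothetical stand-in for the real file; a path string works here too.
sample = io.BytesIO(b"<root><item id='1'>a</item><item id='2'>b</item></root>")
ids = [elem.get("id") for elem in iter_records(sample, "item")]
print(ids)  # ['1', '2']
```

Memory use should stay roughly flat as long as each record is processed before the next one is requested; keeping references to the yielded elements would defeat the purpose.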
>
> Any & all code_snippets/wisdom/thoughts/ideas/suggestions/feedback/comments/ of the c.l.p community would be greatly appreciated.
> Plz feel free to email me directly too.
>
> thanks a ton
>
> cheers
> ashish
>
> email :
> ashish.makani
> domain:gmail.com
>
> p.s.
> Other useful links on xml parsing in python
> 0. http://diveintopython3.org/xml.html
> 1. http://stackoverflow.com/questions/1513592/python-is-there-an-xml-par...
> 2. http://codespeak.net/lxml/tutorial.html
> 3. https://groups.google.com/forum/?hl=en&lnk=gst&q=parsing+a+huge+xml#!...
> 4. http://www.ibm.com/developerworks/xml/library/x-hiperfparse/
> 5. http://effbot.org/zone/element-index.htm http://effbot.org/zone/element-iterparse.htm
> 6. SAX: http://en.wikipedia.org/wiki/Simple_API_for_XML
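
For comparison, the SAX route from link 6 never builds a tree at all: the parser calls methods on a handler object as it reads the input. A rough sketch, assuming Python 3 and the standard-library xml.sax module; the TagCounter class and the sample document are made up for illustration:

```python
import io
import xml.sax


class TagCounter(xml.sax.ContentHandler):
    """Count how many times each element name opens (hypothetical example)."""

    def __init__(self):
        super().__init__()
        self.counts = {}

    def startElement(self, name, attrs):
        # Called once per opening tag; nothing is kept but the running counts.
        self.counts[name] = self.counts.get(name, 0) + 1


handler = TagCounter()
# A filename can be passed instead of the in-memory sample used here.
xml.sax.parse(io.BytesIO(b"<root><item/><item/></root>"), handler)
print(handler.counts)  # {'root': 1, 'item': 2}
```

The trade-off is that a SAX handler has to track its own state (which element it is inside, what text it has seen), which iterparse mostly spares you.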