Trying to parse a HUGE(1gb) xml file

Stefan Sonnenberg-Carstens stefan.sonnenberg at pythonmeister.com
Wed Dec 22 17:54:34 EST 2010


Am 20.12.2010 20:34, schrieb spaceman-spiff:
> Hi c.l.p folks
>
> This is a rather long post, but i wanted to include all the details & everything i have tried so far myself, so please bear with me & read the entire boringly long post.
>
> I am trying to parse a ginormous ( ~ 1gb) xml file.
>
>
> 0. I am a python & xml n00b, so i have been relying on the excellent beginner book DIP (Dive Into Python 3) by MP (Mark Pilgrim).... Mark, if u are reading this, you are AWESOME & so is your witty & humorous writing style.
>
>
> 1. Almost all examples of parsing xml in python i have seen start off with these 4 lines of code.
>
> import xml.etree.ElementTree as etree
> tree = etree.parse('*path_to_ginormous_xml*')
> root = tree.getroot()  #my huge xml has 1 root at the top level
> print(root)
>
> 2. In the 2nd line of code above, as Mark explains in DIP, the parse function builds & returns a tree object, in-memory (RAM), which represents the entire document.
> I tried this code, which works fine for a small file (~1MB), but when i run this simple 4 line py code in a terminal for my HUGE target file (1GB), nothing happens.
> In a separate terminal, i run the top command, & i can see a python process, with memory (the VIRT column) increasing from 100MB all the way up to 2100MB.
>
> I am guessing, as this happens (over the course of 20-30 mins), the tree representing the document is being slowly built in memory, but even after 30-40 mins, nothing happens.
> I don't get an error, seg fault or out-of-memory exception.
>
> My hardware setup: I have a win7 pro box with 8gb of RAM & an intel core2 quad cpu q9400.
> On this i am running sun virtualbox (3.2.12), with ubuntu 10.10 as guest os, with 23gb disk space & 2gb (2048mb) ram assigned to the guest ubuntu os.
>
> 3. I also tried using lxml, but an lxml tree is much more expensive, as it retains more info about a node's context, including references to its parent.
> [http://www.ibm.com/developerworks/xml/library/x-hiperfparse/]
>
> When i ran the same 4-line code above, but with lxml's ElementTree (using the import below in line 1 of the code above)
> import lxml.etree as lxml_etree
>
> i can see the memory consumption of the python process (which is running the code) shoot up to ~2700mb & then python (or the os?) kills the process as it nears the total system memory (2gb).
>
> I ran the code from 1 terminal window (screenshot: http://imgur.com/ozLkB.png)
> & ran top from another terminal (http://imgur.com/HAoHA.png)
>
> 4. I then investigated some streaming libraries, but am confused - there is SAX [http://en.wikipedia.org/wiki/Simple_API_for_XML] and the iterparse interface [http://effbot.org/zone/element-iterparse.htm]
>
> Which one is the best for my situation ?
>
> Any & all code snippets/wisdom/thoughts/ideas/suggestions/feedback/comments of the c.l.p community would be greatly appreciated.
> Plz feel free to email me directly too.
>
> thanks a ton
>
> cheers
> ashish
>
> email :
> ashish.makani
> domain:gmail.com
>
> p.s.
> Other useful links on xml parsing in python
> 0. http://diveintopython3.org/xml.html
> 1. http://stackoverflow.com/questions/1513592/python-is-there-an-xml-parser-implemented-as-a-generator
> 2. http://codespeak.net/lxml/tutorial.html
> 3. https://groups.google.com/forum/?hl=en&lnk=gst&q=parsing+a+huge+xml#!topic/comp.lang.python/CMgToEnjZBk
> 4. http://www.ibm.com/developerworks/xml/library/x-hiperfparse/
> 5. http://effbot.org/zone/element-index.htm
>    http://effbot.org/zone/element-iterparse.htm
> 6. SAX : http://en.wikipedia.org/wiki/Simple_API_for_XML
>
>
Normally (what is normal, anyway?) such files are auto-generated,
and bear an apparent similarity to a database query
result, encapsulated in xml.
Most of the time the structure is the same for every "row" that's in there.
So a very unpythonic, but fast, way would be to let awk reassemble the
records and write them in csv format to stdout,
then pipe that to your python cruncher of choice and let it do the hard
work.
The awk part can be done in python anyway, so you could skip that step.
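A minimal sketch of that pure-python route, using ElementTree's iterparse to stream the records out as csv. The record tag 'row' and the field names here are assumptions for illustration; adjust them to whatever the repeated element in your document actually is:

```python
import csv
import sys
import xml.etree.ElementTree as etree

def xml_to_csv(xml_path, record_tag, fields, out=sys.stdout):
    """Stream repeated record elements from a big XML file out as CSV rows."""
    writer = csv.writer(out)
    writer.writerow(fields)
    # iterparse emits each element as soon as its end tag is read,
    # so we never need the whole document in memory at once
    for event, elem in etree.iterparse(xml_path, events=('end',)):
        # strip a namespace prefix like '{http://...}row', if any
        if elem.tag.rsplit('}', 1)[-1] == record_tag:
            writer.writerow([elem.findtext(field, '') for field in fields])
            # free the finished record so memory use stays roughly flat
            elem.clear()

# hypothetical usage -- adjust 'row' and the field names to your file:
# xml_to_csv('huge.xml', 'row', ['id', 'name', 'value'])
```

The elem.clear() call is the important part: iterparse still builds the tree incrementally, so without clearing each finished record the memory would grow just like with parse(). For truly enormous inputs you may also want to drop the cleared children from the root element, as the IBM article linked above describes.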

And take a look at xmlsh.org; they offer tools for the command line,
like xml2csv. (Needs Java, btw.)
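On the SAX vs. iterparse question: SAX never builds a tree at all, you just get callbacks as tags go by, which makes it the most frugal option but also the most work to use. A tiny sketch that only counts records (the tag name 'row' is again an assumption about your document):

```python
import xml.sax

class RecordCounter(xml.sax.ContentHandler):
    """Count occurrences of one record element without building any tree."""
    def __init__(self, record_tag):
        super().__init__()
        self.record_tag = record_tag
        self.count = 0

    def startElement(self, name, attrs):
        # called once per opening tag as the parser streams through the file
        if name == self.record_tag:
            self.count += 1

def count_records(xml_path, record_tag):
    handler = RecordCounter(record_tag)
    # xml.sax.parse streams the file through the handler; memory use
    # stays constant no matter how large the input is
    xml.sax.parse(xml_path, handler)
    return handler.count
```

For extracting actual field values you would also implement characters() and endElement() to accumulate text, which is where iterparse starts looking more convenient.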

Cheers



