Trying to parse a HUGE(1gb) xml file

Mon Dec 20 14:49:46 EST 2010

On Mon, 2010-12-20 at 11:34 -0800, spaceman-spiff wrote:
> Hi c.l.p folks
> This is a rather long post, but i wanted to include all the details &
> everything i have tried so far myself, so please bear with me & read
> the entire boringly long post.
> I am trying to parse a ginormous ( ~ 1gb) xml file.

Do that hundreds of times a day.

> 0. I am a python & xml n00b, s& have been relying on the excellent
> beginner book DIP(Dive_Into_Python3 by MP(Mark Pilgrim).... Mark , if
> u are readng this, you are AWESOME & so is your witty & humorous
> writing style)
> 1. Almost all exmaples pf parsing xml in python, i have seen, start off with these 4 lines of code.
> import xml.etree.ElementTree as etree
> tree = etree.parse('*path_to_ginormous_xml*')
> root = tree.getroot()  #my huge xml has 1 root at the top level
> print root

Yes, this is a terrible technique;  most examples are crap.

> 2. In the 2nd line of code above, as Mark explains in DIP, the parse
> function builds & returns a tree object, in-memory(RAM), which
> represents the entire document.
> I tried this code, which works fine for a small ( ~ 1MB), but when i
> run this simple 4 line py code in a terminal for my HUGE target file
> (1GB), nothing happens.
> In a separate terminal, i run the top command, & i can see a python
> process, with memory (the VIRT column) increasing from 100MB , all the
> way upto 2100MB.

Yes, this is using DOM.  DOM is evil and the enemy, full-stop.

> I am guessing, as this happens (over the course of 20-30 mins), the
> tree representing is being slowly built in memory, but even after
> 30-40 mins, nothing happens.
> I dont get an error, seg fault or out_of_memory exception.

You need to process the document as a stream of elements; aka SAX.

> 3. I also tried using lxml, but an lxml tree is much more expensive,
> as it retains more info about a node's context, including references
> to it's parent.
> [http://www.ibm.com/developerworks/xml/library/x-hiperfparse/]
> When i ran the same 4line code above, but with lxml's elementree
> ( using the import below in line1of the code above)
> import lxml.etree as lxml_etree

You're still using DOM; DOM is evil.

> Which one is the best for my situation ?
> Any & all
> code_snippets/wisdom/thoughts/ideas/suggestions/feedback/comments/ of
> the c.l.p community would be greatly appreciated.
> Plz feel free to email me directly too.

<http://docs.python.org/library/xml.sax.html>

<http://coils.hg.sourceforge.net/hgweb/coils/coils/file/62335a211fda/src/coils/foundation/standard_xml.py>