Trying to parse a HUGE(1gb) xml file

Stefan Behnel stefan_ml at behnel.de
Tue Dec 21 03:31:50 EST 2010


spaceman-spiff, 20.12.2010 21:29:
> I am sorry i left out what exactly i am trying to do.
>
> 0. Goal :I am looking for a specific element..there are several 10s/100s occurrences of that element in the 1gb xml file.
> The contents of the xml, is just a dump of config parameters from a packet switch( although imho, the contents of the xml dont matter)
>
> I need to detect them & then for each 1, i need to copy all the content b/w the element's start & end tags & create a smaller xml file.

Then cElementTree's iterparse() is your friend. It lets you iterate over the 
XML elements while it is building an in-memory tree from them. That way, you 
can either remove subtrees from the tree if you don't need them (to save 
memory), or handle them in any way you like, such as serialising them into a 
new file (and then deleting them).
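To make that concrete, here is a minimal sketch of the iterparse() pattern (in current Python the cElementTree accelerator is used automatically by xml.etree.ElementTree). The "port" element name and the sample document are my own placeholders standing in for the 1 GB switch-config dump:

```python
import io
import xml.etree.ElementTree as ET  # uses the C accelerator where available

# Hypothetical sample standing in for the big config dump; the element
# name "port" is an assumption, not taken from the original post.
SAMPLE = b"""<config>
  <port><name>eth0</name></port>
  <misc>unrelated data</misc>
  <port><name>eth1</name></port>
</config>"""

def extract_subtrees(source, tag):
    """Serialise every matching subtree, pruning the tree as we go."""
    chunks = []
    context = ET.iterparse(source, events=("start", "end"))
    _, root = next(context)          # first event: "start" of the root element
    for event, elem in context:
        if event == "end" and elem.tag == tag:
            chunks.append(ET.tostring(elem))  # could also write to its own file
            root.clear()             # drop everything parsed so far: keeps memory flat
    return chunks

ports = extract_subtrees(io.BytesIO(SAMPLE), "port")
```

In a real run you would pass a filename (or open file) instead of the BytesIO object, and write each chunk out to its own smaller XML file instead of collecting them in a list.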

Also note that the iterparse implementation in lxml.etree lets you 
specify a tag name to restrict the iterator to those tags. That's usually a 
lot faster, but it also means that you need to take more care to clean up 
the parts of the tree that the iterator stepped over. Depending on your 
requirements and the amount of manual code optimisation that you want to 
invest, either cElementTree or lxml.etree may perform better for you.

It seems that you already found the article by Liza Daly about high 
performance XML processing with Python. Give it another read, it has a 
couple of good hints and examples that will help you here.

Stefan



