Trying to parse a HUGE (1 GB) XML file

Adam Tauno Williams awilliam at whitemice.org
Mon Dec 27 16:58:53 EST 2010


On Mon, 2010-12-27 at 22:55 +0100, Stefan Behnel wrote:
> Alan Meyer, 27.12.2010 21:40:
> > On 12/21/2010 3:16 AM, Stefan Behnel wrote:
> >> Adam Tauno Williams, 20.12.2010 20:49:
> > ...
> >>> You need to process the document as a stream of elements; aka SAX.
> >> IMHO, this is the worst advice you can give.
> > Why do you say that? I would have thought that using SAX in this
> > application is an excellent idea.
>  From my experience, SAX is only practical for very simple cases where 
> little state is involved when extracting information from the parse events. 
> A typical example is gathering statistics based on single tags - not a very 
> common use case. Anything that involves knowing where in the XML tree you 
> are to figure out what to do with the event is already too complicated.

I've found that using a stack model makes traversing complex documents
with SAX quite manageable.  For example, I parse BPML files with SAX.
If the document is nested and context-sensitive, then I really don't see
how iterparse differs all that much.
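
To make the stack model concrete, here is a minimal sketch.  It is not
the actual BPML code; the element names (process, step, name) and the
helper step_names() are made up for illustration.  The handler keeps a
stack of the currently open tags, so each event can be judged by where
it sits in the tree.

```python
import io
import xml.sax


class PathHandler(xml.sax.ContentHandler):
    """Track the current position in the tree with a stack of tag names."""

    def __init__(self):
        super().__init__()
        self.stack = []   # names of the open elements, root first
        self.text = []    # character chunks seen since the last tag event
        self.names = []   # collected text of every process/step/name

    def startElement(self, name, attrs):
        self.stack.append(name)
        self.text = []

    def characters(self, content):
        # SAX may deliver the text of one element in several chunks.
        self.text.append(content)

    def endElement(self, name):
        # Context check: only a <name> directly inside <step> inside <process>.
        if self.stack[-3:] == ["process", "step", "name"]:
            self.names.append("".join(self.text).strip())
        self.stack.pop()
        self.text = []


def step_names(source):
    """Hypothetical helper: return the step names found in `source`."""
    handler = PathHandler()
    xml.sax.parse(source, handler)
    return handler.names


# Made-up sample document; the <name> under <process> is deliberately skipped.
SAMPLE = (b"<process><name>outer</name>"
          b"<step><name>first</name></step>"
          b"<step><name>second</name></step></process>")
print(step_names(io.BytesIO(SAMPLE)))
```

Memory use stays bounded by the nesting depth plus whatever the handler
chooses to keep, not by the size of the file.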

> My serious advice is: don't waste your time learning SAX. It's simply too 
> frustrating to debug SAX extraction code into existence. Given how simple 
> and fast it is to extract data with ElementTree's iterparse() in a memory 
> efficient way, there is really no reason to write complicated SAX code instead.




