Trying to parse a HUGE(1gb) xml file

Alan Meyer ameyer2 at yahoo.com
Mon Dec 27 15:40:32 EST 2010


On 12/21/2010 3:16 AM, Stefan Behnel wrote:
> Adam Tauno Williams, 20.12.2010 20:49:
...
>> You need to process the document as a stream of elements; aka SAX.
>
> IMHO, this is the worst advice you can give.

Why do you say that?  I would have thought that using SAX in this 
application is an excellent idea.

I agree that for applications where performance is not a problem, 
and where we need to examine more than one or a few element types, a 
tree implementation is more functional, less programmer-intensive, and 
provides an easier-to-understand view of the data.  But with huge 
amounts of data, where performance is a problem, SAX will be far more 
practical.  In the special case where only a few elements are of 
interest in a complex tree, SAX can sometimes also be more natural and 
easier to use.
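To illustrate that special case, here is a minimal SAX sketch.  The 
file layout (<record> elements containing a <title>) is hypothetical; 
the point is that the handler touches only the one element type it 
cares about and ignores the rest of the tree, so memory use stays flat 
no matter how large the input is.

```python
import io
import xml.sax

class TitleHandler(xml.sax.ContentHandler):
    """Collect the text of every <title> element, ignoring everything else."""
    def __init__(self):
        super().__init__()
        self.titles = []
        self._buf = None  # accumulates text only while inside a <title>

    def startElement(self, name, attrs):
        if name == "title":
            self._buf = []

    def characters(self, content):
        if self._buf is not None:
            self._buf.append(content)

    def endElement(self, name):
        if name == "title":
            self.titles.append("".join(self._buf))
            self._buf = None

# A tiny stand-in for the 1 GB file; in practice you would pass a filename.
XML = b"""<records>
  <record><title>First</title><body>...</body></record>
  <record><title>Second</title><body>...</body></record>
</records>"""

handler = TitleHandler()
xml.sax.parse(io.BytesIO(XML), handler)
print(handler.titles)
```

The same handler works unchanged whether the input is two records or 
two million, which is the whole attraction of the streaming approach.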

SAX might also be more natural for this application.  The O.P. could 
tell us for sure, but I wonder whether his 1 GB XML file is really a 
single logical record.  You can store an entire text encyclopedia in 
less than one GB.  What he may have is a large number of logically 
distinct individual records, each stored as a node in an 
all-encompassing wrapper element.  Building a tree for each record could 
make sense but, if I'm right about the nature of the data, building a 
tree for the wrapper gives very little return for the high cost.

If that's so, then I'd recommend one of two approaches:

1. Use SAX, or

2. Parse out individual logical records using string manipulation on an 
input stream, then build a tree for one individual record in memory 
using one of the DOM or ElementTree implementations.  After each record 
is processed, discard its tree and start on the next record.
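For approach 2, you don't necessarily need hand-rolled string 
manipulation: ElementTree's iterparse (a swap for the string-splitting 
step, not what the post literally describes) can deliver one complete 
subtree per record from a stream.  A sketch, again assuming a 
hypothetical <record>-in-a-wrapper layout:

```python
import io
import xml.etree.ElementTree as ET

# Stand-in for a huge file with many records under one wrapper element.
XML = b"""<wrapper>
  <record id="1"><title>First</title></record>
  <record id="2"><title>Second</title></record>
</wrapper>"""

titles = []
for event, elem in ET.iterparse(io.BytesIO(XML), events=("end",)):
    if elem.tag == "record":
        # The full subtree for this one record is in memory here,
        # so normal tree navigation works on it.
        titles.append(elem.findtext("title"))
        elem.clear()  # discard this record's tree before the next one

print(titles)
```

Each record gets the convenience of a tree, but only one record's tree 
exists at a time, so the wrapper never has to be built in full.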

     Alan
