Trying to parse a HUGE(1gb) xml file

Alan Meyer ameyer2 at yahoo.com
Mon Dec 27 15:40:32 EST 2010


On 12/21/2010 3:16 AM, Stefan Behnel wrote:
> Adam Tauno Williams, 20.12.2010 20:49:
...
>> You need to process the document as a stream of elements; aka SAX.
>
> IMHO, this is the worst advice you can give.

Why do you say that?  I would have thought that using SAX in this 
application is an excellent idea.

I agree that for applications where performance is not a problem, 
and where we need to examine more than one or a few element types, a 
tree implementation is more functional, less programmer-intensive, and 
provides an easier-to-understand view of the data.  But with huge 
amounts of data, where performance is a problem, SAX will be far more 
practical.  In the special case where only a few elements are of 
interest in a complex tree, SAX can sometimes also be more natural and 
easier to use.
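To illustrate that special case, here is a minimal SAX sketch.  The 
file layout (<record> elements containing a <title>) is hypothetical; 
the point is that the handler touches only the one element type it 
cares about and ignores the rest of the tree, so memory use stays flat 
no matter how large the input is.

```python
import io
import xml.sax

class TitleHandler(xml.sax.ContentHandler):
    """Collect the text of every <title> element, ignoring everything else."""
    def __init__(self):
        super().__init__()
        self.titles = []
        self._buf = None  # accumulates text only while inside a <title>

    def startElement(self, name, attrs):
        if name == "title":
            self._buf = []

    def characters(self, content):
        if self._buf is not None:
            self._buf.append(content)

    def endElement(self, name):
        if name == "title":
            self.titles.append("".join(self._buf))
            self._buf = None

# A tiny stand-in for the 1 GB file; in practice you would pass a filename.
XML = b"""<records>
  <record><title>First</title><body>...</body></record>
  <record><title>Second</title><body>...</body></record>
</records>"""

handler = TitleHandler()
xml.sax.parse(io.BytesIO(XML), handler)
print(handler.titles)
```

The same handler works unchanged whether the input is two records or 
two million, which is the whole attraction of the streaming approach.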

SAX might also be more natural for this application.  The O.P. could 
tell us for sure, but I wonder whether his 1 GB XML file is really a 
single logical record.  You can store an entire text encyclopedia in 
less than one GB.  What he may have is a large number of logically 
distinct individual records, each stored as a node in an 
all-encompassing wrapper element.  Building a tree for each record could 
make sense but, if I'm right about the nature of the data, building a 
tree for the wrapper gives very little return for the high cost.

If that's so, then I'd recommend one of two approaches:

1. Use SAX, or

2. Parse out individual logical records using string manipulation on an 
input stream, then build a tree for one individual record in memory 
using one of the DOM or ElementTree implementations.  After each record 
is processed, discard its tree and start on the next record.
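For approach 2, you don't necessarily need hand-rolled string 
manipulation: ElementTree's iterparse (a swap for the string-splitting 
step, not what the post literally describes) can deliver one complete 
subtree per record from a stream.  A sketch, again assuming a 
hypothetical <record>-in-a-wrapper layout:

```python
import io
import xml.etree.ElementTree as ET

# Stand-in for a huge file with many records under one wrapper element.
XML = b"""<wrapper>
  <record id="1"><title>First</title></record>
  <record id="2"><title>Second</title></record>
</wrapper>"""

titles = []
for event, elem in ET.iterparse(io.BytesIO(XML), events=("end",)):
    if elem.tag == "record":
        # The full subtree for this one record is in memory here,
        # so normal tree navigation works on it.
        titles.append(elem.findtext("title"))
        elem.clear()  # discard this record's tree before the next one

print(titles)
```

Each record gets the convenience of a tree, but only one record's tree 
exists at a time, so the wrapper never has to be built in full.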

     Alan
