Trying to parse a HUGE(1gb) xml file

Nobody nobody at nowhere.com
Sat Dec 25 21:59:10 EST 2010


On Sun, 26 Dec 2010 01:05:53 +0000, Tim Harig wrote:

>> XML is typically processed sequentially, so you don't need to create a
>> decompressed copy of the file before you start processing it.
> 
> Sometimes XML is processed sequentially.  When the markup footprint is
> large enough it must be.  Quite often, as in the case of the OP, you only
> want to extract a small piece out of the total data.  In those cases,
> being forced to read all of the data sequentially is both inconvenient
> and a performance penalty unless there is some way to address the data you
> want directly.

Actually, I should have said "must be processed sequentially". Even if you
only care about a small portion of the data, you have to read the file
sequentially to locate that portion. IOW, since a compressed stream can be
decompressed on the fly as it is read, anything you can do with
uncompressed XML can be done with compressed XML; you can't do random
access with either.
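
For instance, a stream parser like ElementTree's iterparse() will read
from any file-like object, including a gzip stream, so you can pull one
small piece out of a compressed file without ever writing an uncompressed
copy to disk. Something along these lines (untested sketch; the file name
"data.xml.gz" and the <item>/"id" names are made up for the example):

    import gzip
    import xml.etree.ElementTree as ET

    def find_item(path, wanted_id):
        # gzip.open() decompresses on the fly as iterparse() reads from it,
        # so only a small window of the document is in memory at once.
        with gzip.open(path, "rb") as f:
            for event, elem in ET.iterparse(f, events=("end",)):
                if elem.tag == "item" and elem.get("id") == wanted_id:
                    return ET.tostring(elem)
                elem.clear()  # discard elements we've already passed
        return None

    print(find_item("data.xml.gz", "42"))

It's still a sequential scan, of course -- the parser reads everything up
to the element you asked for -- which is the point.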

If XML has a drawback compared to application-specific formats, it's the
sequential nature of XML rather than its (uncompressed) size.

OTOH, formats designed for random access tend to be more limited in their
utility. You can only perform random access based upon criteria which
match the format's indexing. Once you step outside that, you often have to
walk the entire file anyhow.
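
For example (again just a sketch, with made-up names: a shelve file
"records.db" whose values are dicts), a lookup by the key the file is
indexed on is direct, but a query on any other field degenerates into a
scan of every record:

    import shelve

    def lookup_by_id(db_path, record_id):
        # The criterion the format indexes on: one direct lookup.
        with shelve.open(db_path, flag="r") as db:
            return db[record_id]

    def lookup_by_name(db_path, name):
        # Any other criterion: walk every record, much like reading
        # the whole XML file anyway.
        with shelve.open(db_path, flag="r") as db:
            for key in db:
                rec = db[key]
                if rec.get("name") == name:
                    return rec
        return None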



