Trying to parse a HUGE(1gb) xml file

Tim Harig usernet at ilthio.net
Sat Dec 25 20:05:53 EST 2010


On 2010-12-25, Nobody <nobody at nowhere.com> wrote:
> On Sat, 25 Dec 2010 14:41:29 -0500, Roy Smith wrote:
>>> XML works extremely well for large datasets.
> One advantage it has over many legacy formats is that there are no
> inherent 2^31/2^32 limitations. Many binary formats inherently cannot
> support files larger than 2GiB or 4GiB due to the use of 32-bit offsets in
> indices.

That is probably true of many older binary formats; but XML is
certainly not the only format that supports arbitrary sizes, and
nothing prevents another format with better handling of large data
sets from being developed.  XML's primary benefit is its ubiquity.
While it is an excellent format for a number of uses, I don't accept
ubiquity as the only, or even the preeminent, metric when choosing a
data format.
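
For what it's worth, the 2^32 ceiling mentioned above is easy to
demonstrate from Python; a throwaway sketch (the exact error text
varies by interpreter version):

    import struct

    # A 32-bit unsigned offset tops out at 2**32 - 1, one byte short
    # of 4GiB; a position at or past that mark simply will not pack.
    struct.pack("<I", 2**32 - 1)   # the largest representable offset
    try:
        struct.pack("<I", 2**32)   # one byte past the 4GiB boundary
    except struct.error as err:
        print(err)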

>> Of course, one advantage of XML is that with so much redundant text, it 
>> compresses well.  We typically see gzip compression ratios of 20:1.  
>> But, that just means you can archive them efficiently; you can't do 
>> anything useful until you unzip them.
>
> XML is typically processed sequentially, so you don't need to create a
> decompressed copy of the file before you start processing it.

Sometimes XML is processed sequentially; when the markup footprint is
large enough, it must be.  Quite often, though, as in the case of the
OP, you only want to extract a small piece of the total data.  In
those cases, being forced to read all of the data sequentially is
both an inconvenience and a performance penalty unless there is some
way to address the data you want directly.
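
Concretely: with a stream parser, the best you can do to pull one
record out of a 1GB file is scan from the top until it turns up.
Another made-up sketch along the same lines:

    import xml.etree.ElementTree as ET

    # Fetch a single record by its id attribute.  There is no way to
    # seek to it; the parser must chew through everything before it.
    wanted = None
    for event, elem in ET.iterparse("huge.xml", events=("end",)):
        if elem.tag == "record" and elem.get("id") == "42":
            wanted = elem
            break               # best case: stop once it is found
        elem.clear()
    print(ET.tostring(wanted) if wanted is not None else "no match")

A format that carried an index could instead seek directly to the
record in question.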


