Trying to parse a HUGE(1gb) xml file

Stefan Behnel stefan_ml at behnel.de
Sun Dec 26 03:32:04 EST 2010


Tim Harig, 26.12.2010 02:05:
>> On 2010-12-25, Nobody <nobody at nowhere.com> wrote:
>> On Sat, 25 Dec 2010 14:41:29 -0500, Roy Smith wrote:
>>> Of course, one advantage of XML is that with so much redundant text, it
>>> compresses well.  We typically see gzip compression ratios of 20:1.
>>> But, that just means you can archive them efficiently; you can't do
>>> anything useful until you unzip them.
>>
>> XML is typically processed sequentially, so you don't need to create a
>> decompressed copy of the file before you start processing it.
>
> Sometimes XML is processed sequentially.  When the markup footprint is
> large enough it must be.  Quite often, as in the case of the OP, you only
> want to extract a small piece out of the total data.  In those cases, being
> forced to read all of the data sequentially is both inconvenient and a
> performance penalty unless there is some way to address the data you want
> directly.

So what? If you only have to do that once, it hardly matters whether you 
read the whole file or just a part of it. At worst, that's a difference of 
a couple of minutes.

If you do it a lot, you will have to find a way to make the access 
efficient for your specific use case. Then the file format doesn't matter 
either, because the data will most likely end up in a fast database after 
being read in sequentially *once*, just as in the case above.
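
To make that concrete, the same pass can feed the data straight into 
SQLite (database file, schema and tag names are again made up for the 
example), after which lookups are cheap, indexed random access:

import gzip
import sqlite3
import xml.etree.ElementTree as ET

# One sequential pass: extract the interesting fields and load them into
# a database that supports fast random access afterwards.
db = sqlite3.connect("records.db")
db.execute("CREATE TABLE IF NOT EXISTS records (id TEXT PRIMARY KEY, value TEXT)")

with gzip.open("huge.xml.gz", "rb") as stream:
    context = ET.iterparse(stream, events=("start", "end"))
    _, root = next(context)
    for event, elem in context:
        if event == "end" and elem.tag == "record":
            db.execute("INSERT OR REPLACE INTO records VALUES (?, ?)",
                       (elem.findtext("id"), elem.findtext("value")))
            root.clear()
db.commit()

# from here on, every lookup is a cheap indexed query:
print(db.execute("SELECT value FROM records WHERE id = ?", ("42",)).fetchone())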

I really don't think there are many important use cases where you need fast 
random access to large data sets and cannot afford to adapt the storage 
layout beforehand.

Stefan



