Trying to parse a HUGE(1gb) xml file

Stefan Behnel stefan_ml at behnel.de
Sun Dec 26 03:32:04 EST 2010


Tim Harig, 26.12.2010 02:05:
>> On 2010-12-25, Nobody <nobody at nowhere.com> wrote:
>> On Sat, 25 Dec 2010 14:41:29 -0500, Roy Smith wrote:
>>> Of course, one advantage of XML is that with so much redundant text, it
>>> compresses well.  We typically see gzip compression ratios of 20:1.
>>> But, that just means you can archive them efficiently; you can't do
>>> anything useful until you unzip them.
>>
>> XML is typically processed sequentially, so you don't need to create a
>> decompressed copy of the file before you start processing it.
>
> Sometimes XML is processed sequentially.  When the markup footprint is
> large enough it must be.  Quite often, as in the case of the OP, you only
> want to extract a small piece out of the total data.  In those cases, being
> forced to read all of the data sequentially is both inconvenient and a
> performance penalty unless there is some way to address the data you want
> directly.

So what? If you only have to do that once, it hardly matters whether you 
read the whole file or just a part of it. At worst, that's a difference of 
a couple of minutes.

If you do it a lot, you will have to find a way to make the access 
efficient for your specific use case. Then the file format doesn't matter 
either, because the data will most likely end up in a fast database after 
being read in sequentially *once*, just as in the case above.
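
To make that concrete, the same pass can feed the data straight into 
SQLite (database file, schema and tag names are again made up for the 
example), after which lookups are cheap, indexed random access:

import gzip
import sqlite3
import xml.etree.ElementTree as ET

# One sequential pass: extract the interesting fields and load them into
# a database that supports fast random access afterwards.
db = sqlite3.connect("records.db")
db.execute("CREATE TABLE IF NOT EXISTS records (id TEXT PRIMARY KEY, value TEXT)")

with gzip.open("huge.xml.gz", "rb") as stream:
    context = ET.iterparse(stream, events=("start", "end"))
    _, root = next(context)
    for event, elem in context:
        if event == "end" and elem.tag == "record":
            db.execute("INSERT OR REPLACE INTO records VALUES (?, ?)",
                       (elem.findtext("id"), elem.findtext("value")))
            root.clear()
db.commit()

# from here on, every lookup is a cheap indexed query:
print(db.execute("SELECT value FROM records WHERE id = ?", ("42",)).fetchone())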

I really don't think there are many important use cases where you need fast 
random access to large data sets and cannot afford to adapt the storage 
layout beforehand.

Stefan



