Trying to parse a HUGE(1gb) xml file

Tim Harig usernet at ilthio.net
Sun Dec 26 04:22:10 EST 2010


On 2010-12-26, Stefan Behnel <stefan_ml at behnel.de> wrote:
> Tim Harig, 26.12.2010 02:05:
>>> On 2010-12-25, Nobody <nobody at nowhere.com> wrote:
>>> On Sat, 25 Dec 2010 14:41:29 -0500, Roy Smith wrote:
>>>> Of course, one advantage of XML is that with so much redundant text, it
>>>> compresses well.  We typically see gzip compression ratios of 20:1.
>>>> But, that just means you can archive them efficiently; you can't do
>>>> anything useful until you unzip them.
>>>
>>> XML is typically processed sequentially, so you don't need to create a
>>> decompressed copy of the file before you start processing it.
>>
>> Sometimes XML is processed sequentially.  When the markup footprint is
>> large enough it must be.  Quite often, as in the case of the OP, you only
>> want to extract a small piece out of the total data.  In those cases, being
>> forced to read all of the data sequentially is both inconvenient and a
>> performance penalty unless there is some way to address the data you want
>> directly.
>
> So what? If you only have to do that once, it doesn't matter whether you 
> read the whole file or just a part of it. It should only make a difference 
> of a couple of minutes.

Much agreed.  I assume that the process needs to be repeated; otherwise it
would probably be simpler just to rip out what is wanted using regular
expressions and shell utilities.
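For the one-shot case, something along these lines would do it (an untested
sketch; the file name and the 'record'/'id'/'wanted-id' names are invented
for illustration):

import gzip
import xml.etree.ElementTree as ET

# gzip.open() returns a file-like object, so iterparse() can stream the
# compressed file directly; no decompressed copy ever hits the disk.
with gzip.open('huge.xml.gz', 'rb') as f:
    for event, elem in ET.iterparse(f, events=('end',)):
        if elem.tag == 'record' and elem.get('id') == 'wanted-id':
            print(ET.tostring(elem))
            break
        # Drop each finished element so memory stays flat on a 1GB file.
        elem.clear()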

> If you do it a lot, you will have to find a way to make the access 
> efficient for your specific use case. So the file format doesn't matter 
> either, because the data will most likely end up in a fast data base after 
> reading it in sequentially *once*, just as in the case above.

If the data is just going to end up in a database anyway, then why not
send it as a database to begin with and save the trouble of converting
it?
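Barring that, the read-it-sequentially-once approach you describe is at
least straightforward.  A rough sketch of the idea, with SQLite picked
arbitrarily and the schema and tag names made up:

import gzip
import sqlite3
import xml.etree.ElementTree as ET

# One sequential pass loads everything into SQLite; after that, repeat
# lookups hit an index instead of re-reading a gigabyte of markup.
conn = sqlite3.connect('records.db')
conn.execute('CREATE TABLE IF NOT EXISTS records '
             '(id TEXT PRIMARY KEY, value TEXT)')

with gzip.open('huge.xml.gz', 'rb') as f:
    for event, elem in ET.iterparse(f, events=('end',)):
        if elem.tag == 'record':
            conn.execute('INSERT OR REPLACE INTO records VALUES (?, ?)',
                         (elem.get('id'), elem.findtext('value')))
            elem.clear()
conn.commit()

# Later runs skip the XML entirely:
row = conn.execute('SELECT value FROM records WHERE id = ?',
                   ('wanted-id',)).fetchone()
print(row)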


