[Tutor] Trying to parse a HUGE(1gb) xml file in python

David Hutto smokefloat at gmail.com
Tue Dec 21 16:11:14 CET 2010


On Tue, Dec 21, 2010 at 10:03 AM, Stefan Behnel <stefan_ml at behnel.de> wrote:
> Alan Gauld, 21.12.2010 15:11:
>>
>> "Stefan Behnel" wrote
>>>>
>>>> And I thought a 1G file was extreme... Do these people stop to think
>>>> that
>>>> with XML as much as 80% of their "data" is just description (ie the
>>>> tags).
>>>
>>> As I already said, it compresses well. In run-length compressed XML
>>> files, the tags can easily take up a negligible amount of space compared
>>> to the more widely varying data content
>>
>> I understand how compression helps with the data transmission aspect.
>>
>>> compress rather well). And depending on how fast your underlying storage
>>> is, decompressing and parsing the file may still be faster than parsing a
>>> huge uncompressed file directly.
>>
>> But I don't understand how uncompressing a file before parsing it can
>> be faster than parsing the original uncompressed file?
>
> I didn't say "uncompressing a file *before* parsing it".

He didn't say he was utilizing code below Python either, but others
will argue that the microseconds matter, and if that's YOUR standard,
then keep it for both client and self.

> I meant
> uncompressing the data *while* parsing it. Just like you have to decode it
> for parsing, it's just an additional step to decompress it before decoding.
> Depending on the performance relation between I/O speed and decompression
> speed, it can be faster to load the compressed data and decompress it into
> the parser on the fly. lxml.etree (or rather libxml2) internally does that
> for you, for example, if it detects compressed input when parsing from a
> file.
>
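As a concrete sketch of that on-the-fly approach, the stdlib can do the same
thing: hand a gzip file object straight to an incremental parser, and the data
is decompressed chunk by chunk as the parser consumes it. The function and tag
names below are just placeholders, not anything from the thread:

```python
import gzip
import xml.etree.ElementTree as ET

def iter_records(path, tag):
    """Stream the text of <tag> elements out of a gzip-compressed XML file.

    gzip.open() decompresses lazily as iterparse() reads from it, so
    neither the compressed nor the uncompressed document is ever held
    in memory in full.
    """
    with gzip.open(path, "rb") as f:
        for event, elem in ET.iterparse(f, events=("end",)):
            if elem.tag == tag:
                yield elem.text
                elem.clear()  # discard the element's contents to keep memory flat
```

For a truly huge file you would combine this with clearing already-seen
siblings from the root element as well, but the pattern above is the core of
decompress-while-parsing.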
> Note that these performance differences are tricky to prove in benchmarks,

Tricky, yet proven? Then tell me, and this is in reference to a recent
C++ discussion, where Python is used in real time, and how it could be
utilized in, say, an aviation system to avoid a collision when
milliseconds are on the line?

> as repeating the benchmark usually means that the file is already cached in
> memory after the first run, so the decompression overhead will dominate in
> the second run. That's not what you will see in a clean run or for huge
> files, though.
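The caching effect described above is easy to trip over. A minimal timing
harness (the helper name and any paths are placeholders) makes the pitfall
explicit: only a cold-cache run reflects real I/O cost, so comparing a second
run of the uncompressed file against a first run of the compressed one is not
a fair benchmark.

```python
import time
import xml.etree.ElementTree as ET

def time_parse(opener, path):
    """Wall-clock a single full parse of `path` using `opener`
    (e.g. the builtin open, or gzip.open for compressed input).

    Once the OS has the file in its page cache, a repeat run skips the
    disk entirely and decompression overhead dominates the measurement.
    """
    start = time.perf_counter()
    with opener(path, "rb") as f:
        ET.parse(f)
    return time.perf_counter() - start
```

For a clean comparison you would drop the page cache between runs (on Linux,
for example, by writing to /proc/sys/vm/drop_caches as root) or use files
larger than available RAM, as Stefan suggests.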
>
> Stefan
>
> _______________________________________________
> Tutor maillist  -  Tutor at python.org
> To unsubscribe or change subscription options:
> http://mail.python.org/mailman/listinfo/tutor
>



-- 
They're installing the breathalyzer on my email account next week.

