[Tutor] Trying to parse a HUGE(1gb) xml file in python

David Hutto smokefloat at gmail.com
Tue Dec 21 12:08:43 CET 2010


On Tue, Dec 21, 2010 at 5:49 AM, Stefan Behnel <stefan_ml at behnel.de> wrote:
> David Hutto, 21.12.2010 11:29:
>>
>> On Tue, Dec 21, 2010 at 5:19 AM, Stefan Behnel wrote:
>>>
>>> Alan Gauld, 21.12.2010 10:58:
>>>>>
>>>>> 22 Jan 2009 ... Stripping Illegal Characters from XML in Python>>
>>>>
>>>> ... I'd be asking Python to process 6.4 gigabytes of CSV into
>>>> 6.5 gigabytes of XML 1. ..... In fact, what happened was that
>>>> the parsing didn't work and the whole db was ...
>>>>
>>>> And I thought a 1G file was extreme... Do these people stop to think
>>>> that
>>>> with XML as much as 80% of their "data" is just description (ie the
>>>> tags).
>>>
>>> As I already said, it compresses well. In run-length compressed XML
>>> files,
>>> the tags can easily take up a negligible amount of space compared to the
>>> more widely varying data content (although that also commonly tends to
>>> compress rather well). And depending on how fast your underlying storage
>>> is,
>>> decompressing and parsing the file may still be faster than parsing a
>>> huge
>>> uncompressed file directly. So, again, the sheer uncompressed file size
>>> is
>>> *not* a very interesting argument.
>>
>> However, could they (as mentioned elsewhere, and by others in another
>> form) mitigate the damage by using smaller tags exclusively?
>
> Why should that have a (noticeable) impact on the compressed file? It's the
> inherent nature of compression to reduce redundancy, which in XML files
> usually includes the redundancy of repeated tag names (even if the
> compression is not specifically XML aware).
>
> It's a very bad idea to use short and obfuscated tag names to reduce the
> storage size.
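
One way to check the point about tag-name redundancy is to compress the same
data twice, once with long tag names and once with one-letter names. The
sketch below is only an illustration: the tag vocabularies, the random "words"
and the choice of zlib are assumptions, not anything taken from the files
discussed in this thread. The expectation is that the gap between the two
compressed sizes is much smaller than the gap between the raw sizes.

```python
import random
import zlib

# Hypothetical tag vocabularies: same structure, long vs. one-letter names.
LONG_TAGS = {"root": "dictionary", "entry": "entry", "word": "word", "antonym": "antonym"}
SHORT_TAGS = {"root": "d", "entry": "e", "word": "w", "antonym": "a"}


def random_word(rng):
    """Return a random lowercase 'word' of 3 to 10 letters."""
    return "".join(rng.choice("abcdefghijklmnopqrstuvwxyz")
                   for _ in range(rng.randint(3, 10)))


def build_xml(tags, n_entries=2000, seed=0):
    """Build a synthetic XML document as bytes, using the given tag names."""
    rng = random.Random(seed)  # same seed -> same data for both vocabularies
    parts = ["<%s>" % tags["root"]]
    for _ in range(n_entries):
        parts.append("<{e}><{w}>{word}</{w}><{a}>{ant}</{a}></{e}>".format(
            e=tags["entry"], w=tags["word"], a=tags["antonym"],
            word=random_word(rng), ant=random_word(rng)))
    parts.append("</%s>" % tags["root"])
    return "".join(parts).encode("ascii")


def sizes(tags):
    """Return (raw size, zlib-compressed size) in bytes."""
    raw = build_xml(tags)
    return len(raw), len(zlib.compress(raw, 9))


for name, tags in (("long", LONG_TAGS), ("short", SHORT_TAGS)):
    raw_size, packed_size = sizes(tags)
    print("%-5s tags: %7d bytes raw, %7d bytes compressed"
          % (name, raw_size, packed_size))
```

Real data will behave differently from random letters, so treat the numbers
as a rough indication only.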


Maybe my style is an example of bad coding in some areas (present
company excepted). For example, I have a dictionary that keeps codes
in a text file, and those codes point to other lines for verbs,
adjectives, nouns, etc.
So <a> doesn't have to mean "a"; it could stand for <antonym>. But would
that make the initial handling of <a> in the XML file faster or slower,
given that the parser finds <a> and then has to relate it to <antonym>?
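
To make the question concrete, here is a minimal sketch of what "relating <a>
to <antonym>" could look like with the standard library's iterparse. The code
table, the tag names and the sample document are all made up for illustration.
The translation step is one dictionary lookup per element; whether the shorter
tags make any measurable difference overall would have to be timed on real
data.

```python
import io
import xml.etree.ElementTree as ET

# Hypothetical code table: one-letter tags and the names they stand for.
TAG_NAMES = {"a": "antonym", "s": "synonym", "w": "word"}


def iter_named(source):
    """Yield (full_name, text) for each element with a known short tag."""
    for _event, elem in ET.iterparse(source, events=("end",)):
        full_name = TAG_NAMES.get(elem.tag)  # one dict lookup per element
        if full_name is not None:
            yield full_name, elem.text
        elem.clear()  # drop the element's text and children once handled


sample = io.BytesIO(b"<e><w>hot</w><a>cold</a><s>warm</s></e>")
for name, text in iter_named(sample):
    print(name, text)
```

Anyone reading the raw file still has to know the code table, which is the
readability cost of the short names.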


> That's like coding in assembler to reduce the size of the
> source code.

Haven't gotten to assembler yet, almost there.


> Just use compression for storage, or buy a larger hard disk for
> your NAS.
>
>
>> And also compressed is formatted, even for the tags, correct?
>
> The (lossless) compression doesn't change the content.

I'll do a Google search later, I promise.
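
In the meantime, a small sketch of the "lossless" point, assuming gzip as the
compressor and a made-up sample document: the decompressed bytes are identical
to the original, tags included, and the parser can read straight from the
decompressing stream without the file being unpacked on disk first.
count_elements is a hypothetical helper, not part of any library.

```python
import gzip
import io
import xml.etree.ElementTree as ET

# Made-up sample document with many repeated tags.
xml_bytes = b"<d>" + b"<entry><word>hot</word></entry>" * 1000 + b"</d>"

packed = gzip.compress(xml_bytes)
assert gzip.decompress(packed) == xml_bytes  # byte-for-byte identical


def count_elements(fileobj, tag):
    """Stream-parse XML from a file object and count elements named tag."""
    count = 0
    for _event, elem in ET.iterparse(fileobj, events=("end",)):
        if elem.tag == tag:
            count += 1
        elem.clear()  # drop the element's text and children once counted
    return count


# GzipFile decompresses on the fly, so the parser only sees plain XML.
with gzip.GzipFile(fileobj=io.BytesIO(packed)) as stream:
    print(count_elements(stream, "entry"))
```

For a file on disk, gzip.open("some_file.xml.gz", "rb") (file name
hypothetical) can be passed to the same function.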


>
> Stefan
>
> _______________________________________________
> Tutor maillist  -  Tutor at python.org
> To unsubscribe or change subscription options:
> http://mail.python.org/mailman/listinfo/tutor
>



-- 
They're installing the breathalyzer on my email account next week.

