10GB XML Blows out Memory, Suggestions?

axwack at gmail.com axwack at gmail.com
Tue Jun 6 15:53:58 EDT 2006


Paul,

This is interesting. Unfortunately, I have no control over the XML
output. The file is from Goldmine. However, you have given me an
idea...

Is it possible to read an XML document in compressed format?
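
A minimal sketch of the idea, assuming the export were gzip-compressed and that the records are direct children of the root element. The function name, file name and record tag are hypothetical, since the actual Goldmine layout is not shown here. The standard gzip module decompresses on the fly, and ElementTree's iterparse reads the stream incrementally, so neither the uncompressed file nor the whole tree has to exist in memory at once:

```python
import gzip
import xml.etree.ElementTree as ET


def count_records(path, tag):
    """Stream a gzip-compressed XML file and count elements named `tag`."""
    count = 0
    # gzip.open yields decompressed bytes as they are requested.
    with gzip.open(path, "rb") as stream:
        context = ET.iterparse(stream, events=("start", "end"))
        _, root = next(context)  # first event is the start of the root element
        for event, elem in context:
            if event == "end" and elem.tag == tag:
                count += 1  # real per-record processing would go here
                root.clear()  # drop records already handled
    return count
```

Called as count_records("export.xml.gz", "ThreeDimPoint"), this would walk the whole file one record at a time. The root.clear() call is what keeps memory use roughly constant; without it the tree still grows as the file is read.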

Paul McGuire wrote:
> <axwack at gmail.com> wrote in message
> news:1149594519.098115.8980 at u72g2000cwu.googlegroups.com...
> > I wrote a program that takes an XML file into memory using minidom. I
> > found out that the XML document is 10 GB.
> >
> > I clearly need SAX or something else?
> >
>
> You clearly need something instead of XML.
>
> This sounds like a case where a prototype, which worked for the developer's
> simple test data set, blows up in the face of real user/production data.
> XML adds lots of overhead for nested structures, when in fact, the actual
> meat of the data can be relatively small.  Note also that this XML overhead
> is directly related to the verbosity of the XML designer's choice of tag
> names, and whether the designer was predisposed to using XML elements over
> attributes.  Imagine a record structure for a 3D coordinate point (described
> here in no particular coding language):
>
> struct ThreeDimPoint:
>     xValue : integer,
>     yValue : integer,
>     zValue : integer
>
> Directly translated to XML gives:
>
> <ThreeDimPoint>
>     <xValue>4</xValue>
>     <yValue>5</yValue>
>     <zValue>6</zValue>
> </ThreeDimPoint>
>
> This expands 3 integers to a whopping 101 characters.  Throw in namespaces
> for good measure, and you inflate the data even more.
>
> Many Java folks treat XML attributes as anathema, but look how this cuts
> down the data inflation:
>
> <ThreeDimPoint xValue="4" yValue="5" zValue="6"/>
>
> This is only 50 characters, or *only* 4 times the size of the contained data
> (assuming 4-byte integers).
>
> Try zipping your 10 GB file, and see what kind of compression you get - I'll
> bet it's close to 30:1.  If so, convert the data to a real data storage
> medium.  Even a SQLite database table should do better, and you can ship it
> around just like a file (just can't open it up like a text file).
> 
> -- Paul
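
The SQLite suggestion above could be sketched roughly as follows. This is only an illustration under assumptions: it uses the ThreeDimPoint element layout from Paul's example rather than the real Goldmine schema, the table and function names are hypothetical, and it relies on the standard sqlite3 and xml.etree.ElementTree modules. Records are parsed one at a time and discarded after insertion, so the 10 GB document is never held in memory:

```python
import sqlite3
import xml.etree.ElementTree as ET


def load_points(xml_source, db_path):
    """Stream ThreeDimPoint records from XML into a SQLite table."""
    conn = sqlite3.connect(db_path)
    conn.execute(
        "CREATE TABLE IF NOT EXISTS points (x INTEGER, y INTEGER, z INTEGER)"
    )
    # xml_source may be a file name or an open binary file object.
    context = ET.iterparse(xml_source, events=("start", "end"))
    _, root = next(context)  # first event is the start of the root element
    inserted = 0
    for event, elem in context:
        if event == "end" and elem.tag == "ThreeDimPoint":
            row = tuple(
                int(elem.findtext(name))
                for name in ("xValue", "yValue", "zValue")
            )
            conn.execute("INSERT INTO points VALUES (?, ?, ?)", row)
            inserted += 1
            root.clear()  # discard parsed records so memory stays flat
    conn.commit()
    conn.close()
    return inserted
```

The resulting database file can be queried with ordinary SQL, for example SELECT COUNT(*) FROM points, and copied around like any other file.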



