10GB XML Blows out Memory, Suggestions?

Ralf Muschall rmuschall at tecont.de
Wed Jun 7 14:10:48 EDT 2006


Paul McGuire wrote:

> meat of the data can be relatively small.  Note also that this XML overhead
> is directly related to the verbosity of the XML designer's choice of tag
> names, and whether the designer was predisposed to using XML elements over
> attributes.  Imagine a record structure for a 3D coordinate point (described
> here in no particular coding language):

> struct ThreeDimPoint:
>     xValue : integer,
>     yValue : integer,
>     zValue : integer

> Directly translated to XML gives:

> <ThreeDimPoint>
>     <xValue>4</xValue>
>     <yValue>5</yValue>
>     <zValue>6</zValue>
> </ThreeDimPoint>

This is essentially true, but should not cause the OP's problem.
After parsing, the overhead of XML is gone, and long tag names
are nothing but pointers to a string which happens to be long
(unless *all* tags in the XML are differently named, which would
cause a huge DTD/XSD as well).

> This expands 3 integers to a whopping 101 characters.  Throw in namespaces
> for good measure, and you inflate the data even more.

In the DOM, it contracts to 3 integers and a few pointers -
essentially what a reasonably written data structure would
need anyway.
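
A minimal sketch of that contraction, assuming Python's standard
xml.etree.ElementTree and the element names from Paul's example. The
<Points> wrapper element and the helper name iter_points are made up
here for illustration; they are not from the OP's data:

```python
import io
import xml.etree.ElementTree as ET

# Paul's record exactly as quoted above (no trailing newline).
RECORD = (
    "<ThreeDimPoint>\n"
    "    <xValue>4</xValue>\n"
    "    <yValue>5</yValue>\n"
    "    <zValue>6</zValue>\n"
    "</ThreeDimPoint>"
)


def iter_points(source):
    """Yield one (x, y, z) tuple of ints per <ThreeDimPoint> element."""
    for _event, elem in ET.iterparse(source, events=("end",)):
        if elem.tag == "ThreeDimPoint":
            point = (
                int(elem.findtext("xValue")),
                int(elem.findtext("yValue")),
                int(elem.findtext("zValue")),
            )
            # Drop the element's children and text once the values are out.
            elem.clear()
            yield point


# Hypothetical wrapper document holding two copies of the record.
document = "<Points>\n" + RECORD + "\n" + RECORD + "\n</Points>"
points = list(iter_points(io.BytesIO(document.encode("utf-8"))))

print(len(RECORD), points)  # 101 [(4, 5, 6), (4, 5, 6)]
```

The 101 characters of markup per record end up as a tuple of three
ints. Whether a full DOM is equally compact depends on the
implementation, so treat this only as an illustration of the
"reasonably written data structure" case.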

> Try zipping your 10Gb file, and see what kind of compression you get - I'll
> bet it's close to 30:1.  If so, convert the data to a real data storage

In this case, his DOM (or whatever equivalent data structure, i.e.
whatever he *must* process) would be about 300 MB plus pointers.
I'd even go so far as to say that the best thing that can happen to
him is a huge overhead - this would mean he has little data
in a rather spongy file (which collapses on parsing).
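
The 30:1 guess is easy to test on a sample before touching the real
file. A rough sketch, assuming synthetic records shaped like Paul's
example and Python's standard zlib module. The make_xml helper, the
<Points> wrapper and the 0..999 value range are invented here, so the
ratio it measures says nothing about the OP's actual data:

```python
import random
import zlib


def make_xml(n_points, seed=0):
    """Build a synthetic XML document of n_points ThreeDimPoint records."""
    rng = random.Random(seed)
    parts = ["<Points>"]
    for _ in range(n_points):
        x, y, z = (rng.randint(0, 999) for _ in range(3))
        parts.append(
            "<ThreeDimPoint>"
            f"<xValue>{x}</xValue><yValue>{y}</yValue><zValue>{z}</zValue>"
            "</ThreeDimPoint>"
        )
    parts.append("</Points>")
    return "".join(parts).encode("utf-8")


raw = make_xml(10_000)
packed = zlib.compress(raw, 9)
ratio = len(raw) / len(packed)
print(f"{len(raw)} bytes raw, {len(packed)} bytes packed, {ratio:.1f}:1")

# The estimate above: a 10 GB file (taken as 10,000 MB) at 30:1.
estimated_payload_mb = 10 * 1000 / 30
print(f"about {estimated_payload_mb:.0f} MB of payload")
```

The higher the measured ratio on a real sample, the spongier the file
and the smaller the structure that has to live in memory after parsing.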

> medium.  Even a SQLite database table should do better, and you can ship it
> around just like a file (just can't open it up like a text file).

A table helps only if the data is tabular (i.e. a single relation),
which is probably never the case (otherwise the sending side would
have shipped something like CSV).

Ralf


