10GB XML Blows out Memory, Suggestions?
Ralf Muschall
rmuschall at tecont.de
Wed Jun 7 14:10:48 EDT 2006
Paul McGuire wrote:
> meat of the data can be relatively small. Note also that this XML overhead
> is directly related to the verbosity of the XML designer's choice of tag
> names, and whether the designer was predisposed to using XML elements over
> attributes. Imagine a record structure for a 3D coordinate point (described
> here in no particular coding language):
> struct ThreeDimPoint:
>     xValue : integer,
>     yValue : integer,
>     zValue : integer
> Directly translated to XML gives:
> <ThreeDimPoint>
>     <xValue>4</xValue>
>     <yValue>5</yValue>
>     <zValue>6</zValue>
> </ThreeDimPoint>
This is essentially true, but should not cause the OP's problem.
After parsing, the overhead of XML is gone, and long tag names
are nothing but pointers to a string which happens to be long
(unless *all* tags in the XML are differently named, which would
cause a huge DTD/XSD as well).
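
To illustrate the "pointers to a string" point, here is a minimal sketch. It assumes the parser interns tag names, which not every parser does; `make_tag` and the long tag name are hypothetical, made up for this example.

```python
import sys


def make_tag(name: str) -> str:
    """Return the single shared string object for this tag name."""
    return sys.intern(name)


# Build the same long name twice at runtime, the way a parser would see it
# at two different places in the input (hypothetical tag name).
pieces = ["ThreeDimPoint", "WithAVeryVerboseTagName"]
raw_a = "".join(pieces)
raw_b = "".join(pieces)

tag_a = make_tag(raw_a)
tag_b = make_tag(raw_b)

# Both elements now reference one string object, so the length of the
# name is paid once, not once per element.
same_object = tag_a is tag_b
```

Each further element with that tag then costs one reference, however long the name is.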
> This expands 3 integers to a whopping 101 characters. Throw in namespaces
> for good measure, and you inflate the data even more.
In the DOM, it contracts to 3 integers and a few pointers -
essentially the same as needed in a reasonably written
data structure.
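
As a rough sketch of that contraction, assuming the record looks exactly like the quoted example and using the standard library's `xml.etree.ElementTree` (the `ThreeDimPoint` class and `parse_point` helper are my own names, not anything from the OP's code):

```python
import xml.etree.ElementTree as ET
from typing import NamedTuple


class ThreeDimPoint(NamedTuple):
    """What is left of one record after parsing: three integers."""
    x: int
    y: int
    z: int


XML_TEXT = """<ThreeDimPoint>
<xValue>4</xValue>
<yValue>5</yValue>
<zValue>6</zValue>
</ThreeDimPoint>"""


def parse_point(xml_text: str) -> ThreeDimPoint:
    """Parse one record and keep only the numeric payload."""
    root = ET.fromstring(xml_text)
    return ThreeDimPoint(
        int(root.findtext("xValue")),
        int(root.findtext("yValue")),
        int(root.findtext("zValue")),
    )


point = parse_point(XML_TEXT)
```

The tag names are needed only to locate the values; nothing of the markup has to survive in the resulting structure.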
> Try zipping your 10GB file, and see what kind of compression you get - I'll
> bet it's close to 30:1. If so, convert the data to a real data storage
In this case, his DOM (or whatever equivalent data structure, i.e.
whatever he *must* process) would be roughly 300 MB plus pointers.
I'd even go so far as to say that the best thing that can happen to
him is a huge overhead - this would mean he has little data
in a rather spongy file (which collapses on parsing).
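
The 300 MB figure is just the file size divided by the guessed 30:1 ratio. A small sketch of that arithmetic, plus a way to measure the ratio on a sample instead of guessing it; the helper names and the synthetic sample records are assumptions for illustration, and `zlib` stands in for "zipping":

```python
import zlib


def compression_ratio(data: bytes) -> float:
    """Uncompressed size divided by zlib-compressed size."""
    return len(data) / len(zlib.compress(data))


def estimated_payload_bytes(file_size_bytes: int, ratio: float) -> float:
    """Crude estimate of the real information content of the file."""
    return file_size_bytes / ratio


# Synthetic sample: 1000 records shaped like the quoted example.
RECORD = (
    "<ThreeDimPoint><xValue>{0}</xValue><yValue>{1}</yValue>"
    "<zValue>{2}</zValue></ThreeDimPoint>\n"
)
sample = "".join(RECORD.format(i, i + 1, i + 2) for i in range(1000)).encode("ascii")

# Measured on the synthetic sample; the OP's real data will differ.
ratio = compression_ratio(sample)

# The estimate from the post: 10 GB at an assumed 30:1 ratio, in MB.
estimate_mb = estimated_payload_bytes(10 * 10**9, 30) / 10**6
```

The compressed size is only a rough proxy for what must be held in memory, but it shows whether the file is mostly markup or mostly data.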
> medium. Even a SQLite database table should do better, and you can ship it
> around just like a file (just can't open it up like a text file).
A table helps only if the data is tabular (i.e. a single relation),
which is probably never the case here (otherwise the sending side
would have shipped something like CSV).
Ralf