10GB XML Blows out Memory, Suggestions?

fuzzylollipop jarrod.roberson at gmail.com
Thu Jun 8 09:29:59 EDT 2006


axwack at gmail.com wrote:
> Thanks guys for all your posts...
>
> So I am a bit confused....Fuzzy, the code I saw looks like it
> decompresses as a stream (i.e. per byte). Is this the case or are you
> just compressing for file storage but the actual data set has to be
> exploded in memory?
>

It wasn't my code.

If you zip the 10GB file and read from the zip into a DOM-style tree,
you haven't gained anything except the extra CPU cost of doing the
decompression. You still have to load the entire thing into memory.
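
Decompressing as a stream only helps if the parsing is incremental as
well. A minimal sketch of that combination, assuming the export is
gzip-compressed and made of repeated <record> elements (the function
name and the tag name are made up for illustration):

```python
import gzip
import xml.etree.ElementTree as ET


def count_records(source, tag="record"):
    """Count <tag> elements in gzip-compressed XML without building the full tree.

    `source` may be a file name or an open binary file object.
    The tag name "record" is an assumption, not taken from the real data.
    """
    count = 0
    with gzip.open(source, "rb") as xml_stream:
        # Ask for "start" events too, so the first event hands us the root.
        context = ET.iterparse(xml_stream, events=("start", "end"))
        _, root = next(context)
        for event, elem in context:
            if event == "end" and elem.tag == tag:
                count += 1
                # Drop the children parsed so far so memory use stays flat.
                root.clear()
    return count
```

Both the decompression and the parse work on a small window of the
input here, so memory use should not grow with the size of the file.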

XML parsers differ, but IN EVERY LANGUAGE a poorly written parser is a
poorly written parser. Using the wrong IDIOM is more of a problem than
anything else. DOM parsers are good when you need to read and process
every element and attribute and the data is "small". Granted, "small"
is relative, but nobody will consider 10GB "small".

A SAX-style parser or a pull parser has to be used when the data is
"large" or when you don't really need to process every element and
attribute.
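
As a rough illustration of the SAX idiom, here is a sketch using the
standard xml.sax module. The handler only looks at one field and
ignores everything else. The class name, the function name and the
field name passed in are hypothetical:

```python
import xml.sax


class FieldCollector(xml.sax.ContentHandler):
    """Collect the text of one named element and ignore the rest."""

    def __init__(self, field):
        super().__init__()
        self.field = field
        self.values = []
        self._buffer = None  # a list only while inside the wanted element

    def startElement(self, name, attrs):
        if name == self.field:
            self._buffer = []

    def characters(self, content):
        # The parser may deliver text in several pieces, so accumulate them.
        if self._buffer is not None:
            self._buffer.append(content)

    def endElement(self, name):
        if name == self.field and self._buffer is not None:
            self.values.append("".join(self._buffer))
            self._buffer = None


def collect_field(source, field):
    """Parse `source` (file name or binary stream) and return the field texts."""
    handler = FieldCollector(field)
    xml.sax.parse(source, handler)
    return handler.values
```

For a really large input you would process each value inside
endElement (write it out, insert it into a database) instead of
keeping them all in a list.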

This problem looks like it is just a data export / import problem. In
that case you will either have to use a SAX-style parser to parse the
10GB file or, as I suggested in another reply, export the data in
smaller chunks and process them separately, which in almost EVERY case
is the better way to do batch processing.

You should always break processing up into as many discrete steps as
possible. It makes for easier debugging, and you can start over in the
middle much more easily.

Even if you just write a simple SAX-style parser to break the file up
into smaller pieces before actually processing it, you will be ahead
of the game.
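
A splitter along those lines could look something like the sketch
below. It uses ElementTree's iterparse rather than a hand-written SAX
handler, and it assumes the file is a flat sequence of <record>
elements that are not nested inside each other. The tag name, the
<chunk> wrapper element and the chunk_NNNNN.xml file names are all
invented for the example:

```python
import os
import xml.etree.ElementTree as ET


def split_xml(source, out_dir, record_tag="record", per_file=1000):
    """Write groups of `per_file` records into numbered chunk files.

    Returns the list of paths written. Each chunk wraps its records in
    a hypothetical <chunk> root element so it is well-formed on its own.
    """
    paths = []
    batch = []

    def flush():
        if not batch:
            return
        path = os.path.join(out_dir, "chunk_%05d.xml" % len(paths))
        with open(path, "wb") as out:
            out.write(b"<chunk>")
            for raw in batch:
                out.write(raw)
            out.write(b"</chunk>")
        paths.append(path)
        del batch[:]

    context = ET.iterparse(source, events=("start", "end"))
    _, root = next(context)  # first event is the start of the root element
    for event, elem in context:
        if event == "end" and elem.tag == record_tag:
            elem.tail = None  # do not carry inter-record whitespace along
            batch.append(ET.tostring(elem))
            root.clear()  # release records that are already serialized
            if len(batch) >= per_file:
                flush()
    flush()  # write the final, possibly short, batch
    return paths
```

Each chunk can then be handled by whatever parser you like, and a
failed chunk can be rerun without starting over from the beginning.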

We have systems that process streaming data coming from sockets in XML
format, that run in Java with very little memory footprint and very
little CPU usage. At 50 megabits a sec, that is about 540GB a day. C
wouldn't read from a socket any faster than the NBIO; actually, it
would be harder to get the same performance in C because we would have
to duplicate all the SEDA-style NBIO.



