10GB XML Blows out Memory, Suggestions?

gregarican greg.kujawa at gmail.com
Wed Jun 7 12:59:48 EDT 2006


Point for Fredrik. Anyone who doesn't recognize the inherent
performance differences between XML parsers hasn't experienced the
pain (and eventual victory) of trying to optimize their techniques for
working with the albatross that XML can be :-)
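
For the archives, here's a minimal sketch of the incremental-parsing
pattern that keeps memory flat on a file that size. The <item> tag and
the assumption that records sit directly under the document root are
illustrative, not from the original post; it uses the stdlib's
xml.etree.ElementTree (the modern home of /F's ElementTree):

     import xml.etree.ElementTree as ET

     def process_items(path):
         # iterparse() streams events instead of building the whole tree.
         context = ET.iterparse(path, events=("start", "end"))
         # The first event is the 'start' of the document root; keep a
         # reference so finished records can be pruned from it.
         event, root = next(context)
         count = 0
         for event, elem in context:
             if event == "end" and elem.tag == "item":
                 count += 1    # real per-record work goes here
                 root.clear()  # drop finished children to keep memory flat
         return count

With each record cleared as it completes, peak memory stays
proportional to a single record rather than to the whole 10GB document.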

Fredrik Lundh wrote:
> fuzzylollipop wrote:
>
> > depends on the CODE and the SIZE of the file. In this case,
> > processing a 10GB file, unless that file is heavily encrypted or
> > compressed, the process will be IO bound PERIOD!
>
> so the fact that
>
>      from xml.dom import pulldom
>
>      for token, node in pulldom.parse(file):
>          pass
>
> is 50-200% slower than
>
>      import xml.etree.ElementTree as ET
>
>      for event, elem in ET.iterparse(file):
>          if elem.tag == "item":
>              elem.clear()
>
> when reading a gigabyte-sized XML file, is due to an unexpected slowdown
> in the I/O subsystem after importing xml.dom?
>
> > I work with terabytes of files, and all our Python code is just as
> > fast as equivalent C code for IO bound processes.
>
> so how large are the things that you're actually *processing* in your
> Python code?  megabyte blobs or 100-1000 byte records?  or even smaller
> things?
> 
> </F>
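
For anyone who wants to reproduce the comparison /F is making, a rough
self-contained harness along these lines works; sample.xml is a
placeholder path and <item> is an assumed record tag, so adjust both
for your data:

     import time
     import xml.etree.ElementTree as ET
     from xml.dom import pulldom

     XML_FILE = "sample.xml"  # placeholder: point at a large test file

     def time_pulldom(path):
         start = time.perf_counter()
         for token, node in pulldom.parse(path):
             pass                    # consume every event, do no work
         return time.perf_counter() - start

     def time_iterparse(path):
         start = time.perf_counter()
         for event, elem in ET.iterparse(path):
             if elem.tag == "item":  # assumed record element
                 elem.clear()        # free each record once complete
         return time.perf_counter() - start

     if __name__ == "__main__":
         print("pulldom:   %.2fs" % time_pulldom(XML_FILE))
         print("iterparse: %.2fs" % time_iterparse(XML_FILE))

Both loops read the same bytes off disk, so any consistent gap between
the two numbers is parser overhead, not I/O.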



