10GB XML Blows out Memory, Suggestions?

Fredrik Lundh fredrik at pythonware.com
Wed Jun 7 12:30:07 EDT 2006


fuzzylollipop wrote:

> depends on the CODE and the SIZE of the file; in this case,
> processing a 10GB file, unless that file is heavily encrypted or
> compressed, the process will be IO bound PERIOD!

so the fact that

     for token, node in pulldom.parse(file):
         pass

is 50-200% slower than

     for event, elem in ET.iterparse(file):
         if elem.tag == "item":
             elem.clear()

when reading a gigabyte-sized XML file, is due to an unexpected slowdown 
in the I/O subsystem after importing xml.dom?
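
(to make that comparison reproducible, a minimal self-contained sketch;
the "data.xml" filename is a placeholder, and the import assumes the
standard library's xml.etree.ElementTree, which has the same API as the
cElementTree module used above:)

     import time
     from xml.dom import pulldom
     import xml.etree.ElementTree as ET   # cElementTree has the same API

     filename = "data.xml"                # placeholder; point at the big file

     t0 = time.time()
     for token, node in pulldom.parse(filename):
         pass                             # pull DOM events, do no other work
     t1 = time.time()

     for event, elem in ET.iterparse(filename):
         if elem.tag == "item":
             elem.clear()                 # release finished subtrees so memory
                                          # stays flat even for multi-GB input
     t2 = time.time()

     print("pulldom: %.1fs  iterparse: %.1fs" % (t1 - t0, t2 - t1))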

> I work with terabytes of files, and all our Python code is just as fast
> as equivalent C code for IO bound processes.

so how large are the things that you're actually *processing* in your 
Python code?  megabyte blobs or 100-1000 byte records?  or even smaller 
things?

</F>



