10GB XML Blows out Memory, Suggestions?

K.S.Sreeram sreeram at tachyontech.net
Tue Jun 6 14:23:21 EDT 2006


Diez B. Roggisch wrote:
> What the OP needs is a different approach to XML-documents that won't
> parse the whole file into one giant tree - but I'm pretty sure that
> (c)ElementTree will do the job as well as expat. And I don't recall the
> OP musing about performances woes, btw.


There's just NO WAY that the 10 GB XML file can be loaded into memory as
a tree on any normal machine, irrespective of whether we use C or
Python. So the *only* way is to perform some kind of 'stream' processing
on the file, perhaps using a SAX-like API. So (c)ElementTree is ruled
out for this.
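
A minimal sketch of what such stream processing could look like, assuming
the standard library's xml.sax module is acceptable. The handler name
RecordCounter, the helper count_elements, and the sample document are made
up for illustration; the point is only that each element is seen once and
then discarded, so memory use does not grow with the size of the file.

```python
import io
import xml.sax


class RecordCounter(xml.sax.ContentHandler):
    """Counts elements by tag name without building a tree."""

    def __init__(self):
        super().__init__()
        self.counts = {}

    def startElement(self, name, attrs):
        # Called once per opening tag; nothing is retained afterwards.
        self.counts[name] = self.counts.get(name, 0) + 1


def count_elements(source):
    """source may be a filename or a binary file-like object."""
    handler = RecordCounter()
    xml.sax.parse(source, handler)
    return handler.counts


# Tiny stand-in for the real 10 GB file (hypothetical content).
sample = io.BytesIO(b"<log><entry id='1'/><entry id='2'/><note/></log>")
print(count_elements(sample))
```

For the real file, passing the filename to count_elements() lets the parser
read it incrementally instead of loading it all at once.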

Diez B. Roggisch wrote:
> No what exactly makes C grok a 10Gb file where python will fail to do so?

In most typical cases where there's any kind of significant Python code,
it's possible to achieve a *minimum* of a 10x speedup by using C. In most
cases, the speedup is not worth it and we just trade it for the
increased flexibility/power of the Python language. But in this situation
using a bit of tight C code could make the difference between the
process taking just 15 minutes or taking a few hours!

Of course, I'm not asking him to write the entire application in C. It
makes sense to just write the performance-critical sections in C,
wrap them in Python, and write the rest of the application in Python.
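
One existing instance of that split is the standard library's
xml.parsers.expat module, which is a thin Python wrapper around the expat
parser written in C. The sketch below is only an illustration of the
division of labour, not a benchmark: the function name count_start_tags and
the chunk size are assumptions, and how much time the Python-level callback
costs on a real 10 GB file would have to be measured.

```python
import io
import xml.parsers.expat


def count_start_tags(stream, chunk_size=64 * 1024):
    """Feed a binary stream to expat in fixed-size chunks and count tags."""
    counts = {}

    def on_start(name, attrs):
        # The tokenizing happens in C; only this callback runs in Python.
        counts[name] = counts.get(name, 0) + 1

    parser = xml.parsers.expat.ParserCreate()
    parser.StartElementHandler = on_start

    while True:
        chunk = stream.read(chunk_size)
        if not chunk:
            break
        parser.Parse(chunk, False)  # more data still to come
    parser.Parse(b"", True)  # signal end of input
    return counts


# Tiny stand-in document; a small chunk size forces tags to span chunks.
sample = io.BytesIO(b"<log><entry/><entry/><note/></log>")
print(count_start_tags(sample, chunk_size=4))
```

If the callback itself turns out to be the bottleneck, that is the part that
would be a candidate for moving into C.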




