10GB XML Blows out Memory, Suggestions?

uche.ogbuji at gmail.com
Sun Jun 11 11:02:50 EDT 2006


K.S.Sreeram wrote:
> Fredrik Lundh wrote:
> > both ElementTree and cElementTree support "sax-style" event generation
> > (through XMLTreeBuilder/XMLParser) and incremental parsing (through
> > iterparse).  the cElementTree versions of these are even faster than
> > pyexpat.
> >
> > the iterparse interface is described here:
> >
> >      http://effbot.org/zone/element-iterparse.htm
> >
> That's cool! Thanks for the info!
>
> For a multi-gigabyte file, I would still recommend C/C++, because the
> processing code which sits on top of the XML library needs to be Python,
> and that could turn out to be a significant overhead in such extreme cases.
>
> Of course, the exact strategy to follow would depend on the specifics of
> the case, and all this speculation may not really apply! :)

Honestly, I think that legitimate use-cases for multi-gigabyte XML are
very rare.  Many people abuse XML as some sort of DBMS replacement.
This abuse is part of the reason why so many developers are hostile to
XML.  XML is best for documents, and documents can get to the
multi-gigabyte range, but rarely do.  Usually, when they do, there is a
logical way to decompose them, process them, and re-compose them,
whereas with XML used as a DBMS replacement, relations and datatyping
complicate such natural divide-and-conquer techniques.

I always say that if you're dealing with gigabyte XML, it's well worth
considering whether you're not using a hammer to screw in a bolt.

If monster XML is inevitable, then I'll extend Fredrik's earlier mention
of Amara to say that Pushdom allows you to pre-declare the chunks of
XML you're interested in, and then it processes the XML in streaming
mode, only instantiating the chunks of interest one at a time.  This
allows for handling of huge files with a very simple programming idiom.

http://uche.ogbuji.net/tech/4suite/amara/
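
For anyone who wants to try the one-chunk-at-a-time idiom without
installing anything, here is a minimal sketch built on the iterparse
interface Fredrik mentioned, using the standard library's
xml.etree.ElementTree.  To be clear, this is not Pushdom and not Amara's
API; it only imitates the idea.  The function name iter_chunks and the
"entry" tag in the sample are made up for illustration, and the sketch
assumes the chunks of interest are direct children of the root element.

```python
import io
import xml.etree.ElementTree as ET


def iter_chunks(source, tag):
    """Yield each element named `tag` from `source`, one at a time.

    Assumes the wanted elements are direct children of the root, so that
    clearing the root after each one discards everything already seen.
    """
    context = ET.iterparse(source, events=("start", "end"))
    _, root = next(context)  # the first event is the start of the root
    for event, elem in context:
        if event == "end" and elem.tag == tag:
            yield elem    # the chunk is complete at its "end" event
            root.clear()  # drop processed children to keep memory low


# Hypothetical sample document standing in for a huge file.
sample = b"<log><entry id='1'>a</entry><other/><entry id='2'>b</entry></log>"

for entry in iter_chunks(io.BytesIO(sample), "entry"):
    print(entry.get("id"), entry.text)
```

Passing a filename or an open file instead of the BytesIO object should
work the same way.  If the chunks sit deeper in the tree, clearing only
the root is no longer enough to keep memory down, and the bookkeeping
gets more involved.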

--
Uche Ogbuji                               Fourthought, Inc.
http://uche.ogbuji.net                    http://fourthought.com
http://copia.ogbuji.net                   http://4Suite.org
Articles: http://uche.ogbuji.net/tech/publications/




More information about the Python-list mailing list