lisp is winner in DOM parsing contest! 8-]

Uche Ogbuji uche at ogbuji.net
Fri Jul 16 12:40:59 EDT 2004


"Alex Mizrahi" <udodenko at hotmail.com> wrote in message news:<2le3nlFb82reU1 at uni-berlin.de>...
> Hello, All!
> 
> i have 3mb long XML document with about 150000 lines (i think it has about
> 200000 elements there) which i want to parse to DOM to work with.
> first i thought there will be no problems, but there were..
> first i tried Python.. there's special interest group that wants to "make
> Python become the premier language for XML processing" so i thought there
> will be no problems with this document..
> i used xml.dom.minidom to parse it.. after it ate 400 meg of RAM i killed
> it - i don't want such processing.. i think this is because of that fat
> class implementation - possibly Python had some significant overhead for each
> object instance, or something like this..

minidom has about a 60X - 80X load factor on average (memory working
set divided by XML file size).  400 meg for a 3MB file means you're
claiming a load factor of roughly 130X.  That sounds odd.  Are there
special characteristics of your document you're not mentioning?
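
To make the arithmetic concrete, here is a quick sketch in Python.  The
function and variable names are mine, purely for illustration; the only
inputs are the figures quoted above (3MB file, 400 meg killed process,
60X - 80X typical range).

```python
def load_factor(memory_mb, file_mb):
    """Memory working set divided by XML file size."""
    return memory_mb / file_mb

FILE_MB = 3          # document size reported in the original post
OBSERVED_MB = 400    # memory in use when the process was killed

observed = load_factor(OBSERVED_MB, FILE_MB)

# Working set you would expect from the typical 60X - 80X range.
typical_low_mb = 60 * FILE_MB
typical_high_mb = 80 * FILE_MB

print(round(observed), typical_low_mb, typical_high_mb)
```

So the typical range predicts somewhere around 180MB - 240MB for this
file, noticeably less than the 400 meg reported.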

cDomlette, part of 4Suite, has a load factor of only about 10X, so I'd
guess your example would end up with roughly a 30MB memory working set.
String interning is one example of the optimization techniques
cDomlette uses.  4Suite also provides XSLT, XPath, RELAX NG and some
other processing goodies.
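
For anyone unfamiliar with string interning, here is a minimal sketch of
the general idea in plain Python.  It is not cDomlette's code (that is
implemented in C); read_tag_name is a hypothetical stand-in for a parser
that produces a fresh string each time it sees a tag name.

```python
import sys

def read_tag_name(raw):
    # Hypothetical stand-in for a parser: decoding builds a new string
    # object on each call, even when the tag name repeats.
    return raw.decode("utf-8")

raw_tags = [b"record", b"name", b"record", b"name", b"record"]

# sys.intern returns one shared object per distinct string value, so
# repeated tag names cost one string instead of one per element.
interned = [sys.intern(read_tag_name(t)) for t in raw_tags]

distinct_objects = len({id(s) for s in interned})
print(distinct_objects)
```

In a document with a couple of hundred thousand elements but only a
handful of distinct element names, sharing those name strings adds up.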

See:

http://4suite.org/
http://www.xml.com/pub/a/2002/10/16/py-xml.html
http://uche.ogbuji.net/akara/nodes/2003-01-01/domlettes?xslt=/akara/akara.xslt

As you can see, cDomlette is as DOM-like as minidom, so it is very easy
to use (for the record, neither is a compliant W3C DOM implementation).

Also, in my next XML.com article I cover a technique that uses SAX to
break large XML documents into a series of small DOMs, one after the
other, so that the memory penalty is *very* low, depending on your
document structure.  It works with any DOM implementation that conforms
to the Python DOM binding, including minidom and cDomlette.
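
If you want to try the general idea right away, the standard library's
xml.dom.pulldom does something similar: it is driven by SAX events and
builds a DOM subtree only for the elements you ask for.  Note this is
not the code from the article, just a rough sketch with pulldom swapped
in; the sample document and the iter_chunks helper are made up for
illustration, and pulldom builds minidom nodes.

```python
from xml.dom import pulldom

# Made-up sample; a real document would be far larger.
SAMPLE = """<records>
  <record id="1"><name>alpha</name></record>
  <record id="2"><name>beta</name></record>
  <record id="3"><name>gamma</name></record>
</records>"""

def iter_chunks(source, tag):
    """Yield one small DOM subtree per <tag> element in the stream."""
    events = pulldom.parseString(source)
    for event, node in events:
        if event == pulldom.START_ELEMENT and node.tagName == tag:
            # Pull events up to the matching end tag and attach them
            # under this node only; the rest of the document is never
            # held as a tree.
            events.expandNode(node)
            yield node

def text_of(element):
    """Concatenate the text node children of an element."""
    return "".join(child.data for child in element.childNodes
                   if child.nodeType == child.TEXT_NODE)

names = []
for record in iter_chunks(SAMPLE, "record"):
    name = record.getElementsByTagName("name")[0]
    names.append(text_of(name))
    record.unlink()  # drop the chunk before moving on to the next one

print(names)
```

Only one record's worth of DOM is alive at a time, which is why the
memory cost depends on your document structure: it works best when the
document is a long run of modest-sized sibling elements.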

-- 
Uche
http://uche.ogbuji.net


