lisp is winner in DOM parsing contest! 8-]

R. Mattes ralf at mh-freiburg.de
Mon Jul 12 08:43:10 EDT 2004


On Mon, 12 Jul 2004 02:19:03 +0200, Alex Mizrahi wrote:

> Hello, All!
> 
> i have 3mb long XML document with about 150000 lines (i think it has
> about 200000 elements there) which i want to parse to DOM to work with.
> first i thought there will be no problems, but there were.. first i
> tried Python.. there's special interest group that wants to "make Python
> become the premier language for XML processing" so i thought there will
> be no problems with this document.. i used xml.dom.minidom to parse it..
> after it ate 400 meg of RAM i killed it - i don't want such processing..
> i think this is because of that fat class implementation - possibly
> Python had some significant overhead for each object instance, or
> something like this..

First of all: which parser did you actually use? There are quite a number
of XML parsers for Python. I personally use the libxml2 one and have never
had memory problems like the ones you describe.

> then i asdf-installed s-xml package and tried it with it. it ate only 25
> megs for lxml representation. i think interning element names helped a
> lot.. it was CLISP that has unicode inside, so i think it could be even
> less without unicode..

Hmmm. Hmmm ... I guess you know that you are comparing apples with pears?
S-XML is a nice, small parser but nowhere near a standards-conformant
XML parser. Have a closer look at the webpage: no handling of character
encoding, no CDATA, can't handle what the author calls "special tags"
(like processing instructions), no schema/DTD support, and, most
importantly, no namespace support!
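
To make that list concrete, here is a small sketch in Python using the
standard library's xml.dom.minidom. The sample document, the "urn:x"
namespace and the variable names are made up for illustration; it only
shows what a parser has to keep track of once processing instructions,
CDATA and namespaces are in play.

```python
from xml.dom import Node
from xml.dom.minidom import parseString

# Made-up sample: encoding declaration, a processing instruction,
# a namespace prefix and a CDATA section.
SAMPLE = (
    b'<?xml version="1.0" encoding="UTF-8"?>'
    b'<?target data?>'
    b'<root xmlns:a="urn:x"><a:item><![CDATA[<raw> & text]]></a:item></root>'
)

doc = parseString(SAMPLE)

pi = doc.firstChild                     # the processing instruction node
item = doc.documentElement.firstChild   # the <a:item> element
text = item.firstChild                  # the CDATA section inside it

print(pi.target, pi.data)                # PI target and its data
print(item.namespaceURI, item.localName) # resolved namespace + local name
print(text.nodeType == Node.CDATA_SECTION_NODE, text.data)
```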

> then i tried C++ - TinyXML. it was fast, but ate 65 megs.. ye, looks
> like interning helps a lot 8-]

Interning is _much_ easier without namespaces.
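
A rough Python sketch of why (not how any of the parsers mentioned here
actually do it): without namespaces the tag string is the whole name, so
one intern call is enough. With namespaces the identity of a name is the
(namespace URI, local name) pair, and the prefix has to be resolved
against whatever is in scope first. The helper names and the prefix_map
argument are hypothetical.

```python
import sys

def intern_plain(tag):
    """No namespaces: the tag string itself is the name."""
    return sys.intern(tag)

# Hypothetical table of canonical (namespace URI, local name) pairs.
_qname_table = {}

def intern_qname(tag, prefix_map):
    """With namespaces: resolve the prefix, then intern the pair.

    prefix_map is an assumed dict of the prefixes in scope at this
    element; the "" key stands for the default namespace.
    """
    prefix, _, local = tag.rpartition(":")
    key = (prefix_map.get(prefix, ""), local)
    # Return the first tuple ever stored for this key.
    return _qname_table.setdefault(key, key)

# Two different prefixes bound to the same URI give the same name object.
a = intern_qname("a:item", {"a": "urn:x"})
b = intern_qname("b:item", {"b": "urn:x"})
print(a is b)
```

Note that the parser also has to maintain prefix_map per element scope,
which is bookkeeping a namespace-unaware parser never does.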
 
> then i tried Perl XML::DOM.. it was better than python - about 180megs,
> but it was slowest.. at least it consumed mem slower than python 8-]
> 
> and java.. with default parser it took 45mbs.. maybe it interned
> strings, but there was overhead from classes - storing trees is
> definitely what's lisp optimized for 8-]

But you never got to a _full_ DOM with your lxml parsing. What you got was
a list-of-lists. There's no 'parent' implementation for your lxml
elements (which means that you might need to pass the whole thing
around all the time).
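
To illustrate the difference, a minimal Python sketch. The nested-list
layout is only loosely in the spirit of a list-of-lists representation
(it is not S-XML's exact format), and the sample data and function name
are invented. With plain lists a search has to carry the ancestor chain
along itself; with a DOM node you can just walk parentNode.

```python
from xml.dom.minidom import parseString

# Assumed nested-list form: [tag, child, child, ...]; strings are text.
tree = ["library", ["shelf", ["book", "SICP"]]]

def find_with_path(node, tag, path=()):
    """Depth-first search that passes the ancestor chain along explicitly."""
    if isinstance(node, str):
        return None
    here = path + (node[0],)
    if node[0] == tag:
        return here
    for child in node[1:]:
        found = find_with_path(child, tag, here)
        if found:
            return found
    return None

list_path = find_with_path(tree, "book")

# The same document as a DOM: every node knows its parent.
doc = parseString("<library><shelf><book>SICP</book></shelf></library>")
book = doc.getElementsByTagName("book")[0]

ancestors = []
node = book.parentNode
while node is not None and node.nodeType == node.ELEMENT_NODE:
    ancestors.append(node.tagName)
    node = node.parentNode

print(list_path)   # ancestor chain collected during the search
print(ancestors)   # ancestor chain recovered from the node alone
```

Those parent links (plus siblings, owner document and so on) are part of
what a full DOM pays for in memory per node.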

If you want a serious comparison you either need to compare s-xml with
similar "lightweight" parsers in Perl/Python/Ruby etc. or write your own
fully DOM-compliant parser in LISP (or is there one already? I'm still
looking for a good one).

 Just my 0.02 $

   Ralf Mattes


> so lisp is winner.. but it has not standard way (even no non-standard
> but simple) way to write binary IEEE floating point representation, so
> common lisp suck and i will use c++ for my task.. 8-]]]
> 
> With best regards, Alex 'killer_storm' Mizrahi.



More information about the Python-list mailing list