Why is xml.dom.minidom so slow?

Thu Jan 2 18:01:10 EST 2003

"Bjorn Pettersen" <BPettersen at NAREX.com> writes:

> If I'm reading the minidom/pulldom files correctly this should use
> Expat(?)

Yes, that is the only possible interpretation if no other parsers are
available.

> As a test, I tried building my own tree directly from the Expat
> events. This was about 4 times faster (2.89 accts/sec), but still
> far from fast enough... I'm starting to think a custom C++ parser
> might be the way to go (and here I was having such a nice day
> <sigh>).

I see. Then I would suggest that mere parsing speed is not the
issue - this uses roughly all the tricks we can think of. It would
still be interesting to find out where the computation time is spent. If
these are complicated documents (i.e. many elements and attributes,
short PCDATA), then memory allocation is surely an issue - you could
also try Python 2.3a1 as a test (pymalloc should give some
improvements when there are many memory allocations).
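
One way to see where the time goes is to run the minidom parse under
a profiler. The following is only a sketch: the SAMPLE document and
the profile_parse helper are made up for illustration, and it uses
the cProfile module found in current Python versions, so your real
documents will give different numbers.

```python
import cProfile
import io
import pstats
from xml.dom import minidom

# Hypothetical test document: many small elements, attributes, short text.
SAMPLE = "<accounts>" + "".join(
    '<account id="%d"><name>n%d</name></account>' % (i, i)
    for i in range(2000)
) + "</accounts>"


def profile_parse(xml_text, top=10):
    """Parse xml_text with minidom under cProfile; return the report text."""
    profiler = cProfile.Profile()
    profiler.enable()
    doc = minidom.parseString(xml_text)
    profiler.disable()
    doc.unlink()  # break reference cycles in the DOM tree
    out = io.StringIO()
    stats = pstats.Stats(profiler, stream=out)
    stats.sort_stats("cumulative").print_stats(top)
    return out.getvalue()


if __name__ == "__main__":
    print(profile_parse(SAMPLE))
```

If most of the cumulative time shows up in the node-creation
callbacks rather than in the Expat Parse call itself, that would
point at object allocation rather than parsing.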

I doubt that a custom parser can do much better, unless it allows you
to drop data you are not interested in.

What *has* been demonstrated to be a speed-up over minidom is to use
4Suite's cDomlette. It is faster, because:
- it allocates fewer objects: many things are stored in the elements
  themselves, instead of in dictionaries, as Python classic classes
  do.
- object creation is through C, with no need to look up Python methods
  over and over again.

When completed, it still gives you a Python-conforming DOM tree. That
DOM tree lacks some of the DOM functionality, though; that's why they
call it a Domlette.

> :-) Unfortunately they're not my requirements. (They go something
> like: "we will eventually need all the data, so put them in a form
> that the next step can traverse to put into a DB".) If you think a
> different approach is better I'm all ears :-)

The stream-processing approaches are *much* faster, in all
languages. They don't create intermediate objects, but present you
with just the strings that the parser had to extract from the
document, anyway.
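
To illustrate the stream-processing idea, here is a minimal sketch of
a SAX content handler that hands each record to a callback as soon as
it is complete, without ever building a tree. The element names
("account", "name"), the "id" attribute, and the stream_accounts
helper are assumptions for the example, not your actual format; the
callback is where a DB insert could go.

```python
import xml.sax
from xml.sax.handler import ContentHandler


class AccountHandler(ContentHandler):
    """Emit one (id, name) row per <account> element (hypothetical schema)."""

    def __init__(self, emit):
        super().__init__()
        self.emit = emit          # callback that receives each finished row
        self.current_id = None
        self.in_name = False
        self.text = []

    def startElement(self, name, attrs):
        if name == "account":
            self.current_id = attrs.get("id")
        elif name == "name":
            self.in_name = True
            self.text = []

    def characters(self, content):
        # Character data may arrive in several chunks; collect them all.
        if self.in_name:
            self.text.append(content)

    def endElement(self, name):
        if name == "name":
            self.in_name = False
        elif name == "account":
            self.emit((self.current_id, "".join(self.text)))
            self.text = []


def stream_accounts(xml_bytes):
    """Parse xml_bytes and return the collected rows as a list."""
    rows = []
    xml.sax.parseString(xml_bytes, AccountHandler(rows.append))
    return rows
```

The only objects created per record are the strings the parser
extracts anyway, plus one small tuple.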

In order of increasing speed, decreasing standards conformance:
- SAX: depending on how you design the content handler, you can be
  much faster than a DOM builder already. As a test, you might want to
  plug in an empty ContentHandler, and see how many documents you
  can parse without processing in a certain time.
- Expat raw interface: parsing is XML-conforming, but the API of
  Expat is proprietary. This saves indirections, and is again faster.
  You can apply the same benchmark with little effort.
- PyXML//F sgmlop: to my knowledge, the fastest for-Python XML
  parser, but it misses a number of XML features (e.g. it won't
  do entity expansion).
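
The benchmark suggested for the first two items above could be
sketched roughly as follows. The helper names and the time-boxed loop
are my own choices for illustration; sgmlop is left out because it is
not part of the standard library. Neither parse does any processing,
so the result is an upper bound on what each interface can deliver.

```python
import time
import xml.sax
from xml.parsers import expat
from xml.sax.handler import ContentHandler


def parse_with_sax(xml_bytes):
    # The base ContentHandler ignores every event it receives.
    xml.sax.parseString(xml_bytes, ContentHandler())


def parse_with_expat(xml_bytes):
    # No handlers are set, so Expat parses and discards everything.
    parser = expat.ParserCreate()
    parser.Parse(xml_bytes, True)


def docs_per_second(parse, xml_bytes, seconds=1.0):
    """Call parse(xml_bytes) repeatedly for about `seconds`; return the rate."""
    count = 0
    start = time.perf_counter()
    while time.perf_counter() - start < seconds:
        parse(xml_bytes)
        count += 1
    elapsed = time.perf_counter() - start
    return count / elapsed


if __name__ == "__main__":
    doc = b"<accounts>" + b'<account id="1"><name>x</name></account>' * 1000 + b"</accounts>"
    print("SAX:  ", docs_per_second(parse_with_sax, doc))
    print("Expat:", docs_per_second(parse_with_expat, doc))
```

Run it on one of your real documents instead of the synthetic one to
get a meaningful comparison with your DOM numbers.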

In any case, please report what your findings are and what technology
you eventually use.

Regards,
Martin