Why is xml.dom.minidom so slow?

Thu Jan 2 17:29:21 EST 2003

> From: Martin v. Löwis [mailto:martin at v.loewis.de] 
> 
> "Bjorn Pettersen" <BPettersen at NAREX.com> writes:
> 
> > All I'm doing boils down to:
> > 
> >   response = rf.nextResponse()
> >   dom = parseString(response)
> > 
> > in a loop. Am I doing something wrong?
> 
> You have to give more details. What Python version? PyXML or 
> stock Python? One traditional reason is that people, not 
> knowingly, have used PyXML xmlproc, which is a pure-Python 
> parser, instead of Expat.

Python 2.2.1 without PyXML. The full code looks like:

import sys
import time

def test():
  from xml.dom.minidom import parseString
  # ResponseFile is my own reader class (not shown here).
  rf = ResponseFile('c:/data/Testoutput.xml')
  count = 0

  start = time.time()
  try:
      while 1:
          # nextResponse() returns a complete xml 
          # document as a string (throws at eof).
          response = rf.nextResponse()
          dom = parseString(response)
          count += 1
          sys.stdout.write('.')
  except:
      # EOF ends the loop; note this also hides any parse errors.
      pass
  stop = time.time()
  return count, stop-start

If I'm reading the minidom/pulldom source files correctly, this should already be using Expat(?)

> PyXML 0.8.x has a number of speed improvements for 
> minidom-with-expat (such as eliminating the SAX driver), and 
> memory usage improvements (such as interning element and 
> attribute names).

As a test, I tried building my own tree directly from the Expat events. This was about 4 times faster (2.89 accts/sec), but still far from fast enough... I'm starting to think a custom C++ parser might be the way to go (and here I was having such a nice day <sigh>).
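
A minimal sketch of what building a tree directly from Expat events can look like. The Node class and the build_tree function are illustrative names only, not the code that was measured above, and it is written for a current Python rather than 2.2:

```python
import xml.parsers.expat


class Node:
    """Bare-bones tree node: tag name, attributes, text, children."""

    def __init__(self, name, attrs):
        self.name = name
        self.attrs = attrs
        self.text = ''
        self.children = []


def build_tree(xml_string):
    """Parse one complete XML document and return its root Node."""
    roots = []   # will hold the single root element
    stack = []   # currently open elements, innermost last

    def start(name, attrs):
        node = Node(name, attrs)
        if stack:
            stack[-1].children.append(node)
        else:
            roots.append(node)
        stack.append(node)

    def end(name):
        stack.pop()

    def chars(data):
        # Text is attached to the innermost open element.
        if stack:
            stack[-1].text += data

    parser = xml.parsers.expat.ParserCreate()
    parser.StartElementHandler = start
    parser.EndElementHandler = end
    parser.CharacterDataHandler = chars
    parser.Parse(xml_string, True)   # True: this is the final chunk
    return roots[0]
```

Calling build_tree(response) in place of parseString(response) in the loop above yields a plain Python object tree with none of the DOM API, which is where any saving would come from.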

> > Is there a faster way when all I need is a traversable tree 
> > structure as the result?
> 
> "All I need" reads quite funny in this context, as producing 
> a traversable tree is one of the more expensive ways for XML 
> processing. There are certainly faster ways if you *don't* 
> need a traversable tree.

:-)  Unfortunately they're not my requirements. (They go something like: "we will eventually need all the data, so put them in a form that the next step can traverse to put into a DB".) If you think a different approach is better I'm all ears :-)
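
For comparison, a sketch of the tree-less alternative: let the Expat handlers emit flat records that a DB-loading step could consume directly. Everything here is an assumption for illustration, including the extract_records name and the idea that each record is a single element whose direct children hold one field each; the real response format is not shown in this thread.

```python
import xml.parsers.expat


def extract_records(xml_string, record_tag):
    """Return one dict per <record_tag> element, without building a tree.

    Attributes of the record element and the text of its child elements
    become dict entries. Deeper nesting is flattened by tag name.
    """
    records = []
    state = {'record': None, 'text': []}

    def start(name, attrs):
        if name == record_tag:
            state['record'] = dict(attrs)
        state['text'] = []

    def end(name):
        record = state['record']
        if record is None:
            return                      # outside any record element
        if name == record_tag:
            records.append(record)      # record complete
            state['record'] = None
        else:
            record[name] = ''.join(state['text']).strip()
        state['text'] = []

    def chars(data):
        state['text'].append(data)

    parser = xml.parsers.expat.ParserCreate()
    parser.StartElementHandler = start
    parser.EndElementHandler = end
    parser.CharacterDataHandler = chars
    parser.Parse(xml_string, True)
    return records
```

Whether this fits depends on what the next step really needs to traverse; it only helps if flat rows are enough.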

Thanks for the interest.

-- bjorn