Why is xml.dom.minidom so slow?
Bjorn Pettersen
BPettersen at NAREX.com
Thu Jan 2 17:29:21 EST 2003
> From: Martin v. Löwis [mailto:martin at v.loewis.de]
>
> "Bjorn Pettersen" <BPettersen at NAREX.com> writes:
>
> > All I'm doing boils down to:
> >
> > response = rf.nextResponse()
> > dom = parseString(response)
> >
> > in a loop. Am I doing something wrong?
>
> You have to give more details. What Python version? PyXML or
> stock Python? One traditional reason is that people, not
> knowingly, have used PyXML xmlproc, which is a pure-Python
> parser, instead of Expat.
Python 2.2.1 without PyXML. The full code looks like:
    import sys
    import time

    def test():
        from xml.dom.minidom import parseString
        rf = ResponseFile('c:/data/Testoutput.xml')
        count = 0
        start = time.time()
        try:
            while 1:
                # nextResponse() returns a complete xml
                # document as a string (throws at eof).
                response = rf.nextResponse()
                dom = parseString(response)
                count += 1
                sys.stdout.write('.')
        except:
            # The eof exception ends the loop; note that a bare
            # except also hides any parse errors.
            pass
        stop = time.time()
        return count, stop - start
If I'm reading the minidom/pulldom files correctly, this should use Expat(?)
> PyXML 0.8.x has a number of speed improvements for
> minidom-with-expat (such as eliminating the SAX driver), and
> memory usage improvements (such as interning element and
> attribute names).
As a test, I tried building my own tree directly from the Expat events. This was about 4 times faster (2.89 accts/sec), but still far from fast enough... I'm starting to think a custom C++ parser might be the way to go (and here I was having such a nice day <sigh>).
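For reference, building a tree directly from Expat events can look roughly like the sketch below. This is not the exact test code: the Node class and build_tree() are names made up for illustration, and it is written for a current Python 3 (buffer_text is not available in 2.2). It keeps only tag, attributes, text and children, which is much less than a DOM node carries.

```python
import xml.parsers.expat


class Node:
    """Hypothetical lightweight element: tag, attributes, text, children."""
    __slots__ = ("tag", "attrs", "text", "children")

    def __init__(self, tag, attrs):
        self.tag = tag
        self.attrs = attrs
        self.text = ""
        self.children = []


def build_tree(data):
    """Parse one complete XML document and return its root Node."""
    roots = []   # receives the single root element
    stack = []   # currently open elements, innermost last

    def start(name, attrs):
        node = Node(name, attrs)
        if stack:
            stack[-1].children.append(node)
        else:
            roots.append(node)
        stack.append(node)

    def end(name):
        stack.pop()

    def chars(text):
        # All character data directly inside an element is concatenated,
        # including text that follows a child element.
        if stack:
            stack[-1].text += text

    parser = xml.parsers.expat.ParserCreate()
    parser.buffer_text = True   # deliver text in fewer, larger chunks
    parser.StartElementHandler = start
    parser.EndElementHandler = end
    parser.CharacterDataHandler = chars
    parser.Parse(data, True)    # True: this is the final chunk of input
    return roots[0]
```

In the loop above, build_tree(response) would take the place of parseString(response).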
> > Is there a faster way when all I need is a traversable tree
> > structure as the result?
>
> "All I need" reads quite funny in this context, as producing
> a traversable tree is one of the more expensive ways for XML
> processing. There are certainly faster ways if you *don't*
> need a traversable tree.
:-) Unfortunately they're not my requirements. (They go something like: "we will eventually need all the data, so put them in a form that the next step can traverse to put into a DB".) If you think a different approach is better I'm all ears :-)
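One shape a tree-less approach could take, purely as a sketch under assumptions: flatten each response into (path, text) rows straight from the Expat callbacks, so the DB step gets rows instead of a tree. flatten() and the element names in the usage example are hypothetical, attributes are ignored, and this is written for a current Python 3.

```python
import xml.parsers.expat


def flatten(data):
    """Return [(path, text), ...] for every element with non-blank text."""
    rows = []
    path = []   # names of the currently open elements
    text = []   # character data seen since the last start or end tag

    def start(name, attrs):
        path.append(name)
        del text[:]   # drop whitespace seen before this tag

    def end(name):
        value = "".join(text).strip()
        if value:
            rows.append(("/".join(path), value))
        del text[:]
        path.pop()

    def chars(chunk):
        text.append(chunk)

    parser = xml.parsers.expat.ParserCreate()
    parser.StartElementHandler = start
    parser.EndElementHandler = end
    parser.CharacterDataHandler = chars
    parser.Parse(data, True)   # True: this is the final chunk of input
    return rows


# Hypothetical usage: element names are invented for illustration.
rows = flatten("<acct><id>7</id><name>Ann</name></acct>")
```

Whether this fits depends on whether the next step really needs parent/child navigation or only the leaf values.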
Thanks for the interest.
-- bjorn