xml processing : too slow...

Alex Martelli aleax at aleax.it
Wed Jul 24 11:23:09 EDT 2002

Shagshag13 wrote:

> hello,
> i need to process *each line* of many huge files (> 2 million lines) with
> xml processing, for now i do it with parseString from xml.dom.minidom and
> it works.

*shudder* don't use minidom except on SMALL files!!!

> i do xml validation, extract attributes and tags.

minidom is not validating anyway.  You probably just check for
well-formedness (validation means adherence to a DTD etc).
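
A minimal sketch of a well-formedness check on a single line, assuming
Python 3 and that each line is a complete XML document on its own. The
function name is_well_formed is hypothetical, not something from the
original exchange. It only checks well-formedness; it does not validate
against a DTD.

```python
import xml.sax


def is_well_formed(line):
    """Return True if `line` parses as a well-formed XML document.

    This is a well-formedness check only, not validation against a DTD.
    """
    if isinstance(line, str):
        # Assumption: the text is meant to be UTF-8 encoded XML.
        line = line.encode("utf-8")
    try:
        # A bare ContentHandler ignores every event; we only care
        # whether the parser raises on malformed input.
        xml.sax.parseString(line, xml.sax.ContentHandler())
    except xml.sax.SAXParseException:
        return False
    return True
```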

> that's really slow, how do you think i can speed it up ? i'm thinking of
> writing an xml mini wrapper using things like string.find or even regular
> expressions, do you think that's a good idea ?

No, it's a wretched idea.  Use xml.sax or xml.dom.pulldom, the tools
meant to be used with large XML files.  minidom has to build up
in-memory structures, typically several times larger than the input
XML file, so your virtual memory is probably thrashing badly -- eep.
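
As a rough illustration of the SAX approach, here is a minimal sketch
that pulls out tag names and attributes without building a tree. It
assumes Python 3; the names TagAttrHandler and extract are made up for
this example.

```python
import xml.sax


class TagAttrHandler(xml.sax.ContentHandler):
    """Record (tag name, attribute dict) for every start tag seen."""

    def __init__(self):
        super().__init__()
        self.elements = []

    def startElement(self, name, attrs):
        # attrs is a SAX attributes object; copy it into a plain dict
        # so the data outlives the callback.
        self.elements.append((name, dict(attrs.items())))


def extract(xml_bytes):
    """Parse one XML document (as bytes) and return its tags and attributes."""
    handler = TagAttrHandler()
    xml.sax.parseString(xml_bytes, handler)
    return handler.elements
```

The handler keeps only what it is told to keep, so memory use depends on
what you choose to collect rather than on the size of the input.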

Actually I hadn't used xml.dom.minidom in AGES (had to refresh my
dim memories to write the XML chapter for the Nutshell) -- SAX is
just TOO much better for typical XML-parsing tasks in my opinion.

Clearly not everybody agrees, or Paul Prescod, a great XML as well
as Python expert, wouldn't have developed xml.dom.pulldom, but then
I guess variety is the spice of life.  I'm very used to thinking in
event-driven terms (the only way to go for GUIs, the fastest for
networks, the one supported by SGMLParser and HTMLParser too) so
SAX feels as comfortable as a pair of Mephisto shoes to me (I'd
wear no other brand under any condition I can imagine -- I don't
want my feet to hate me, after all).

Just one performance tip: if you need to buffer incoming characters,
and you typically do since the parser can split them, DON'T use
a string attribute doing something like self.data += moredata --
that's a performance disaster too.  Use a LIST attribute, do
self.data.append(moredata), and ''.join(self.data) if you need
the data in string form when you know you have collected all
you need for now.
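
The buffering tip above might look like this in a SAX handler. This is
only a sketch, assuming Python 3; the class name TextCollector and its
tag parameter are invented for illustration, and it does not try to
handle the target element being nested inside itself.

```python
import xml.sax


class TextCollector(xml.sax.ContentHandler):
    """Collect the text content of every element named `tag`."""

    def __init__(self, tag):
        super().__init__()
        self.tag = tag
        self.data = None   # list of text chunks while inside `tag`
        self.texts = []    # finished strings, one per element

    def startElement(self, name, attrs):
        if name == self.tag:
            self.data = []  # start a fresh buffer

    def characters(self, content):
        # The parser may deliver one run of text in several pieces,
        # so append each piece to the list instead of using +=.
        if self.data is not None:
            self.data.append(content)

    def endElement(self, name):
        if name == self.tag and self.data is not None:
            # Join once, when all the pieces have been collected.
            self.texts.append("".join(self.data))
            self.data = None


def collect_text(xml_bytes, tag):
    handler = TextCollector(tag)
    xml.sax.parseString(xml_bytes, handler)
    return handler.texts
```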

