xml processing : too slow...

Alex Martelli aleax at aleax.it
Wed Jul 24 11:23:09 EDT 2002


Shagshag13 wrote:

> hello,
> 
> i need to process *each line* of many huge files (> 2 million lines) with
> xml processing, for now i do it with parseString from xml.dom.minidom and
> it works.

*shudder* don't use minidom except on SMALL files!!!


> i do xml validation, extract attributes and tags.

minidom is not validating anyway.  You probably just check for
well-formedness (validation means adherence to a DTD etc).
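
To make the distinction concrete, here is a minimal sketch (the helper
name is_well_formed and the sample documents are made up for
illustration).  The first document breaks its own DTD, which asks for an
<item> child, yet minidom parses it without complaint; only the second,
with its mismatched tags, is rejected.

```python
from xml.dom import minidom
from xml.parsers.expat import ExpatError

# Well-formed, but NOT valid: the DTD requires an <item> child of <doc>.
INVALID_BUT_WELL_FORMED = b'<!DOCTYPE doc [<!ELEMENT doc (item)>]><doc><other/></doc>'

# Not even well-formed: <other> is never closed.
NOT_WELL_FORMED = b'<doc><other></doc>'

def is_well_formed(data):
    """Hypothetical helper: True if minidom can parse data at all."""
    try:
        minidom.parseString(data)
    except ExpatError:
        return False
    return True
```

So is_well_formed(INVALID_BUT_WELL_FORMED) is true even though a
validating parser would reject that document.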


> that's really slow, how do you think i can speed it up ? i'm thinking of
> writing a xml mini wrapper using things like string.find or even regular
> expressions, do you think that's a good idea ?

No, it's a wretched idea.  Use xml.sax or xml.dom.pulldom, the tools
meant to be used with large XML files.  minidom has to build up
in-memory structures, typically several times larger than the input
XML file, so your virtual memory is probably thrashing badly -- eep.
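
A rough sketch of the SAX route, assuming (as your description suggests)
that each line is a small XML document of its own.  The names
TagCollector and scan_line are made up for illustration; only
xml.sax.parseString and ContentHandler come from the standard library.

```python
import xml.sax

class TagCollector(xml.sax.ContentHandler):
    """Records every element name with its attributes as events arrive."""

    def __init__(self):
        super().__init__()
        self.seen = []

    def startElement(self, name, attrs):
        # attrs is a mapping-like object; copy it into a plain dict
        self.seen.append((name, dict(attrs.items())))

def scan_line(line):
    """Hypothetical helper: parse one line (bytes), return (tag, attributes) pairs.

    Raises xml.sax.SAXParseException if the line is not well-formed.
    """
    handler = TagCollector()
    xml.sax.parseString(line, handler)
    return handler.seen
```

Nothing is kept in memory beyond what the handler chooses to store, so
looping over the lines of a file opened in binary mode and calling
scan_line on each one never builds a tree.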

Actually I hadn't used xml.dom.minidom in AGES (had to refresh my
dim memories to write the XML chapter for the Nutshell) -- SAX is
just TOO much better for typical XML-parsing tasks in my opinion.

Clearly not everybody agrees, or Paul Prescod, a great XML as well
as Python expert, wouldn't have developed xml.dom.pulldom, but then
I guess variety is the spice of life.  I'm very used to thinking in
event-driven terms (the only way to go for GUIs, the fastest for
networks, the one supported by SGMLParser and HTMLParser too) so
SAX feels as comfortable as a pair of Mephisto shoes to me (I'd
wear no other brand under any condition I can imagine -- I don't
want my feet to hate me, after all).


Just one performance tip: if you need to buffer incoming characters,
and you typically do since the parser can split them, DON'T use
a string attribute doing something like self.data += moredata --
that's a performance disaster too.  Use a LIST attribute, do
self.data.append(moredata), and ''.join(self.data) if you need
the data in string form when you know you have collected all
you need for now.
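
A minimal sketch of that buffering pattern in a SAX handler (the names
TextCollector and collect_text are made up for illustration, and this
simple version only gives sensible text for leaf elements, since the
buffer is reset at every start tag):

```python
import xml.sax

class TextCollector(xml.sax.ContentHandler):
    """Collects the text of each element, buffering chunks in a list."""

    def __init__(self):
        super().__init__()
        self.data = []    # pieces of character data seen so far
        self.texts = []   # (tag, text) pairs, filled in as elements end

    def startElement(self, name, attrs):
        self.data = []    # start a fresh buffer for this element

    def characters(self, content):
        # The parser may deliver one run of text in several calls,
        # so just append each piece; no string concatenation here.
        self.data.append(content)

    def endElement(self, name):
        # Join once, when everything for this element has arrived.
        self.texts.append((name, ''.join(self.data)))
        self.data = []

def collect_text(line):
    """Hypothetical helper: return (tag, text) pairs for one line (bytes)."""
    handler = TextCollector()
    xml.sax.parseString(line, handler)
    return handler.texts
```

Whether the parser hands over 'wor&amp;ld' as one piece or three, the
joined result for that element is the single string 'wor&ld'.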


Alex



