[XML-SIG] speed question re DOM parsing

Walter Underwood wunder@ultraseek.com
Thu, 22 Jun 2000 09:37:56 -0700


I know this is a reply to a really old post, but the talk about
speedup reminded me that I hadn't answered it.

This is worth posting generally, because it is applicable to
lots of SAX handlers (add it to the documentation?).

--On Wednesday, May 31, 2000 8:58 PM -0600 Bjorn Pettersen
<bjorn@roguewave.com> wrote:
> 
> After some profiling, I found that most of the time was going into the
> else branch in the cdata method.  This branch is growing a string
> character by character by saying:
> 
>   elem.first_cdata = elem.first_cdata + data

I had one of those in my character data handler too. Parsing the
Old Testament took about 45 min, as I remember. The copies and
reallocs in concatenation are O(n**2). Save all the strings in
a list, then use string.join at the end. This is linear.

Here are the relevant fragments of the class with the handlers:

class XMLToText:
    def __init__(self):
        self.text = []

    def cdata(self, data):
        self.text.append(data)

    def finish(self):
        self.text = string.join(self.text,u'')

Note the Unicode string constant -- remove that for Python 1.5,
and add code to handle the UTF-8, if necessary.

wunder
--
Walter R. Underwood
Senior Staff Engineer, Ultraseek Corp.
http://www.ultraseek.com/