String concatenation performance with +=
Sammo
sammo2828 at gmail.com
Sat Feb 14 07:15:19 EST 2009
On Feb 14, 5:33 pm, Steven D'Aprano <st... at pearwood.info> wrote:
> > AFAIK, using list mutation and "".join only improves performance if
> > the "".join is executed outside of the loop.
>
> Naturally. If you needlessly join over and over again, instead of delaying
> until the end, then you might as well do string concatenation without the
> optimization.
>
> join() isn't some magical function that makes your bugs go away. You have to
> use it sensibly. What you've done is a variant of Shlemiel the
> road-painter's algorithm:
>
> http://www.joelonsoftware.com/articles/fog0000000319.html
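[To make the quadratic cost concrete, here's a quick sketch of my own (not from the article) contrasting a join inside the loop with a single join at the end:]

```python
def join_every_iteration(blocks):
    # Shlemiel style: re-join the whole accumulated list on every pass,
    # so iteration i re-copies all i blocks -> O(N**2) total work.
    s = ""
    acc = []
    for b in blocks:
        acc.append(b)
        s = "".join(acc)  # needless repeated join
    return s

def join_once(blocks):
    # Append cheaply, join a single time after the loop -> O(N) total work.
    acc = []
    for b in blocks:
        acc.append(b)
    return "".join(acc)

blocks = ["abc"] * 100
assert join_every_iteration(blocks) == join_once(blocks) == "abc" * 100
```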
>
> Perhaps you have to repeatedly do work on the *temporary* results in between
> loops? Something like this toy example?
>
> s = ""
> block = "abcdefghijklmnopqrstuvwxyz"*1000
> for i in xrange(1000):
>     s += block
>     # do something with the partially concatenated string
>     print "partial length is", len(s)
> # all finished
> do_something(s)
>
> You've written that using join:
>
> L = []
> block = "abcdefghijklmnopqrstuvwxyz"*1000
> for i in xrange(1000):
>     L.append(block)
>     # do something with the partially concatenated string
>     L = [''.join(L)]
>     print "partial length is", len(L[0])
> # all finished
> s = ''.join(L)
> do_something(s)
>
> Okay, but we can re-write that more sensibly:
>
> L = []
> block = "abcdefghijklmnopqrstuvwxyz"*1000
> for i in xrange(1000):
>     L.append(block)
>     # do something with the partially concatenated string
>     print "partial length is", sum(len(item) for item in L)
> # all finished
> s = ''.join(L)
> do_something(s)
>
> There's still a Shlemiel algorithm in there, but it's executed in fast C
> instead of slow Python and it doesn't involve copying large blocks of
> memory, only walking them, so you won't notice as much. Can we be smarter?
>
> L = []
> block = "abcdefghijklmnopqrstuvwxyz"*1000
> partial_length = 0
> for i in xrange(1000):
>     L.append(block)
>     partial_length += len(block)
>     # do something with the partially concatenated string
>     print "partial length is", partial_length
> # all finished
> s = ''.join(L)
> do_something(s)
>
> Naturally this is a toy example, but I think it illustrates one technique
> for avoiding turning an O(N) algorithm into an O(N**2) algorithm.
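That last version's bookkeeping could also be packaged up as a small helper class. A sketch of my own (the names are my invention, not from your post):

```python
class StringAccumulator(object):
    """Sketch of a helper: append cheaply, track length, join once."""

    def __init__(self):
        self._parts = []
        self._length = 0

    def append(self, s):
        # O(1) amortized: just remember the piece and its length.
        self._parts.append(s)
        self._length += len(s)

    def __len__(self):
        # Running total, so no walk over the parts is needed.
        return self._length

    def result(self):
        # Collapse to a single string; cache it so repeated calls
        # don't re-join from scratch.
        joined = "".join(self._parts)
        self._parts = [joined]
        return joined

acc = StringAccumulator()
for i in range(1000):
    acc.append("abcdefghijklmnopqrstuvwxyz")
assert len(acc) == 26000
assert acc.result() == "abcdefghijklmnopqrstuvwxyz" * 1000
```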
So even though I can't delay the "".join until after the loop, I can
still improve performance by keeping each "".join inside the loop
small. In my case, I only need to work temporarily on the latest two
blocks.
L = []
block = "abcdefghijklmnopqrstuvwxyz"*1000
for i in xrange(1000):
    L.append(block)
    # do something with the latest two blocks
    tmp = "".join(L[-2:])
# all finished
s = ''.join(L)
do_something(s)
The downside is that the code becomes more difficult to read than the
obvious +=. If only the optimization worked for attributes ...
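As I understand it, CPython's in-place += optimization only fires when the
target is a plain name, because the resize requires the string to have no
other references; an attribute target keeps a reference alive in the
instance dict. One workaround (a sketch of my own; the class and method
names are invented, and the speedup is a CPython detail, not a language
guarantee -- ''.join remains the portable fix):

```python
class Buffer(object):
    """Hypothetical example: accumulating into an attribute vs a local."""

    def __init__(self):
        self.data = ""

    def fill_via_attribute(self, block, n):
        # self.data += block: the instance dict holds a reference to the
        # string, so CPython cannot resize it in place -> O(N**2) copying.
        for i in range(n):
            self.data += block

    def fill_via_local(self, block, n):
        # Bind the attribute to a local first; with only the local
        # referencing the string, the in-place += optimization can apply.
        data = self.data
        for i in range(n):
            data += block
        self.data = data

a, b = Buffer(), Buffer()
a.fill_via_attribute("xy", 100)
b.fill_via_local("xy", 100)
assert a.data == b.data == "xy" * 100
```

Both methods produce the same result; only the copying behaviour differs.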