String concatenation benchmarking weirdness

Rotwang sg552 at hotmail.co.uk
Fri Jan 11 14:03:41 EST 2013


Hi all,

the other day I 2to3'ed some code and found it ran much slower in 3.3.0 
than 2.7.2. I fixed the problem but in the process of trying to diagnose 
it I've stumbled upon something weird that I hope someone here can 
explain to me. In what follows I'm using Python 2.7.2 on 64-bit Windows 
7. Suppose I do this:

from timeit import timeit

# find out how the time taken to append a character to the end of a byte
# string depends on the size of the string

results = []
for size in range(0, 10000001, 100000):
     results.append(timeit("y = x + 'a'",
     setup = "x = 'a' * %i" % size, number = 1))

If I plot results against size, what I see is that the time taken 
increases approximately linearly with the size of the string, with the 
string of length 10000000 taking about 4 milliseconds. On the other 
hand, if I replace the statement to be timed with "x = x + 'a'" instead 
of "y = x + 'a'", the time taken seems to be pretty much independent of 
size, apart from a few spikes; the string of length 10000000 takes about 
4 microseconds.

I get similar results with strings (but not bytes) in 3.3.0. My guess is 
that this is some kind of optimisation that treats strings as mutable 
when carrying out operations that result in the original string being 
discarded. If so it's jolly clever, since it knows when there are other 
references to the same string:

timeit("x = x + 'a'", setup = "x = y = 'a' * %i" % size, number = 1)
     # grows linearly with size

timeit("x = x + 'a'", setup = "x, y = 'a' * %i", 'a' * %i"
% (size, size), number = 1)
     # stays approximately constant

It also can see through some attempts to fool it:

timeit("x = ('' + x) + 'a'", setup = "x = 'a' * %i" % size, number = 1)
     # stays approximately constant

timeit("x = x*1 + 'a'", setup = "x = 'a' * %i" % size, number = 1)
     # stays approximately constant

Is my guess correct? If not, what is going on? If so, is it possible to 
explain to a programming noob how the interpreter does this? And is 
there a reason why it doesn't work with bytes in 3.3?


-- 
I have made a thing that superficially resembles music:

http://soundcloud.com/eroneity/we-berated-our-own-crapiness



More information about the Python-list mailing list