String concatenation benchmarking weirdness

Fri Jan 11 15:16:48 EST 2013

On Fri, Jan 11, 2013 at 12:03 PM, Rotwang <sg552 at hotmail.co.uk> wrote:
> Hi all,
>
> the other day I 2to3'ed some code and found it ran much slower in 3.3.0 than
> 2.7.2. I fixed the problem but in the process of trying to diagnose it I've
> stumbled upon something weird that I hope someone here can explain to me. In
> what follows I'm using Python 2.7.2 on 64-bit Windows 7. Suppose I do this:
>
> from timeit import timeit
>
> # find out how the time taken to append a character to the end of a byte
> # string depends on the size of the string
>
> results = []
> for size in range(0, 10000001, 100000):
>     results.append(timeit("y = x + 'a'",
>     setup = "x = 'a' * %i" % size, number = 1))
>
> If I plot results against size, what I see is that the time taken increases
> approximately linearly with the size of the string, with the string of
> length 10000000 taking about 4 milliseconds. On the other hand, if I replace
> the statement to be timed with "x = x + 'a'" instead of "y = x + 'a'", the
> time taken seems to be pretty much independent of size, apart from a few
> spikes; the string of length 10000000 takes about 4 microseconds.
>
> I get similar results with strings (but not bytes) in 3.3.0. My guess is
> that this is some kind of optimisation that treats strings as mutable when
> carrying out operations that result in the original string being discarded.
> If so it's jolly clever, since it knows when there are other references to
> the same string:
>
> timeit("x = x + 'a'", setup = "x = y = 'a' * %i" % size, number = 1)
>     # grows linearly with size
>
> timeit("x = x + 'a'", setup = "x, y = 'a' * %i", 'a' * %i"
> % (size, size), number = 1)
>     # stays approximately constant
>
> It also can see through some attempts to fool it:
>
> timeit("x = ('' + x) + 'a'", setup = "x = 'a' * %i" % size, number = 1)
>     # stays approximately constant
>
> timeit("x = x*1 + 'a'", setup = "x = 'a' * %i" % size, number = 1)
>     # stays approximately constant
>
> Is my guess correct? If not, what is going on? If so, is it possible to
> explain to a programming noob how the interpreter does this?

Basically, yes.  You can find the discussion behind that optimization at:

http://bugs.python.org/issue980695

It knows when there are other references to the string because all
objects in CPython are reference-counted.  It also works despite your
attempts to "fool" it because after evaluating the first operation
(which is easily optimized to return the string itself in both cases),
the remaining part of the expression is essentially "x = TOS + 'a'",
where x and the top of the stack are the same string object, which is
the same state the original code reaches after evaluating just the x.

The stated use case for this optimization is to make repeated
concatenation more efficient, but note that it is still generally
preferable to use the ''.join() construct, because the optimization is
specific to CPython and may not exist for other Python
implementations.

> And is there a reason why it doesn't work with bytes in 3.3?

No idea. Probably just never got implemented due to a lack of demand.