[Python-Dev] sum(...) limitation

Steven D'Aprano steve at pearwood.info
Mon Aug 4 20:10:18 CEST 2014


On Mon, Aug 04, 2014 at 09:25:12AM -0700, Chris Barker wrote:

> Good point -- I was trying to make the point about .join() vs + for strings
> in an intro python class last year, and made the mistake of having the
> students test the performance.
> 
> You need to concatenate a LOT of strings to see any difference at all --  I
> know that O() of algorithms is unavoidable, but between efficient python
> optimizations and a an apparently good memory allocator, it's really a
> practical non-issue.

If only that were the case, but it isn't. Here's a cautionary tale for 
how using string concatenation can blow up in your face:

Chris Withers asks for help debugging HTTP slowness:
https://mail.python.org/pipermail/python-dev/2009-August/091125.html

and publishes some times:
https://mail.python.org/pipermail/python-dev/2009-September/091581.html

(notice that Python was SIX HUNDRED times slower than wget or IE)

and Simon Cross identified the problem:
https://mail.python.org/pipermail/python-dev/2009-September/091582.html

leading Guido to describe the offending code as an embarrassment.

It shouldn't be hard to demonstrate the difference between repeated 
string concatenation and join, all you need do is defeat sum()'s 
prohibition against strings. Run this bit of code, and you'll see a 
significant difference in performance, even with CPython's optimized 
concatenation:

# --- cut ---
class Faker:
    def __add__(self, other):
            return other

x = Faker()
strings = list("Hello World!")
assert ''.join(strings) == sum(strings, x)

from timeit import Timer
setup = "from __main__ import x, strings"
t1 = Timer("''.join(strings)", setup)
t2 = Timer("sum(strings, x)", setup)

print (min(t1.repeat()))
print (min(t2.repeat()))
# --- cut ---


On my computer, using Python 2.7, I find the version using sum is nearly 
4.5 times slower, and with 3.3 about 4.2 times slower. That's with a 
mere twelve substrings, hardly "a lot". I tried running it on IronPython 
with a slightly larger list of substrings, but I got sick of waiting for 
it to finish.

If you want to argue that microbenchmarks aren't important, well, I 
might agree with you in general, but in the specific case of string 
concatenation there's that pesky factor of 600 slowdown in real world 
code to argue with.


> Blocking sum( some_strings) because it _might_ have poor performance seems
> awfully pedantic.

The rationale for explicitly prohibiting strings while merely implicitly 
discouraging other non-numeric types is that beginners, who are least 
likely to understand why their code occasionally and unpredictably 
becomes catastrophically slow, are far more likely to sum strings than 
sum tuples or lists.

(I don't entirely agree with this rationale, I'd prefer a warning rather 
than an exception.)



-- 
Steven


More information about the Python-Dev mailing list