[Python-Dev] sum(...) limitation
Steven D'Aprano
steve at pearwood.info
Mon Aug 4 20:10:18 CEST 2014
On Mon, Aug 04, 2014 at 09:25:12AM -0700, Chris Barker wrote:
> Good point -- I was trying to make the point about .join() vs + for strings
> in an intro python class last year, and made the mistake of having the
> students test the performance.
>
> You need to concatenate a LOT of strings to see any difference at all -- I
> know that O() of algorithms is unavoidable, but between efficient python
> optimizations and a an apparently good memory allocator, it's really a
> practical non-issue.
If only that were the case, but it isn't. Here's a cautionary tale for
how using string concatenation can blow up in your face:
Chris Withers asks for help debugging HTTP slowness:
https://mail.python.org/pipermail/python-dev/2009-August/091125.html
and publishes some times:
https://mail.python.org/pipermail/python-dev/2009-September/091581.html
(notice that Python was SIX HUNDRED times slower than wget or IE)
and Simon Cross identified the problem:
https://mail.python.org/pipermail/python-dev/2009-September/091582.html
leading Guido to describe the offending code as an embarrassment.
It shouldn't be hard to demonstrate the difference between repeated
string concatenation and join, all you need do is defeat sum()'s
prohibition against strings. Run this bit of code, and you'll see a
significant difference in performance, even with CPython's optimized
concatenation:
# --- cut ---
class Faker:
def __add__(self, other):
return other
x = Faker()
strings = list("Hello World!")
assert ''.join(strings) == sum(strings, x)
from timeit import Timer
setup = "from __main__ import x, strings"
t1 = Timer("''.join(strings)", setup)
t2 = Timer("sum(strings, x)", setup)
print (min(t1.repeat()))
print (min(t2.repeat()))
# --- cut ---
On my computer, using Python 2.7, I find the version using sum is nearly
4.5 times slower, and with 3.3 about 4.2 times slower. That's with a
mere twelve substrings, hardly "a lot". I tried running it on IronPython
with a slightly larger list of substrings, but I got sick of waiting for
it to finish.
If you want to argue that microbenchmarks aren't important, well, I
might agree with you in general, but in the specific case of string
concatenation there's that pesky factor of 600 slowdown in real world
code to argue with.
> Blocking sum( some_strings) because it _might_ have poor performance seems
> awfully pedantic.
The rationale for explicitly prohibiting strings while merely implicitly
discouraging other non-numeric types is that beginners, who are least
likely to understand why their code occasionally and unpredictably
becomes catastrophically slow, are far more likely to sum strings than
sum tuples or lists.
(I don't entirely agree with this rationale, I'd prefer a warning rather
than an exception.)
--
Steven
More information about the Python-Dev
mailing list