[Python-Dev] sum(...) limitation

Chris Barker chris.barker at noaa.gov
Fri Aug 8 00:06:18 CEST 2014


On Mon, Aug 4, 2014 at 11:10 AM, Steven D'Aprano <steve at pearwood.info>
wrote:

> On Mon, Aug 04, 2014 at 09:25:12AM -0700, Chris Barker wrote:
>
> > Good point -- I was trying to make the point about .join() vs + for
> strings
> > in an intro python class last year, and made the mistake of having the
> > students test the performance.
> >
> > You need to concatenate a LOT of strings to see any difference at all
>


> If only that were the case, but it isn't. Here's a cautionary tale for
> how using string concatenation can blow up in your face:
>
> Chris Withers asks for help debugging HTTP slowness:
> https://mail.python.org/pipermail/python-dev/2009-August/091125.html



Thanks for that -- interesting story. note that that was not suing sum() in
that case though, which is really the issue at hand.

It shouldn't be hard to demonstrate the difference between repeated
> string concatenation and join, all you need do is defeat sum()'s
> prohibition against strings. Run this bit of code, and you'll see a
> significant difference in performance, even with CPython's optimized
> concatenation:
>

well, that does look compelling, but what it shows is that
sum(a_list_of_strings) is slow compared to ''.join(a_list_of_stings). That
doesn't surprise me a bit -- this is really similar to why:

a_numpy_array.sum()

is going to be a lot faster than:

sum(a_numpy_array)

and why I'll tell everyone that is working with lots of numbers to use
numpy. ndarray.sum know what data type it's deaing with,a nd can do the
loop in C. similarly with ''.join() (though not as optimized.

But I'm not sure we're seeing the big O difference here at all -- but
rather the extra calls though each element in the list's __add__ method.

In the case where you already HAVE a big list of strings, then yes, ''.join
is the clear winner.

But I think the case we're often talking about, and I've tested with
students, is when you are building up a long string on the fly out of
little strings. In that case, you need to profile the full "append to list,
then call join()", not just the join() call:

# continued adding of strings ( O(n^2)? )
In [6]: def add_strings(l):
   ...:     s = ''
   ...:     for i in l:
   ...:         s+=i
   ...:     return s

Using append and then join ( O(n)? )
In [14]: def join_strings(list_of_strings):
   ....:     l = []
   ....:     for i in list_of_strings:
   ....:         l.append(i)
   ....:     return ''.join(l)

In [23]: timeit add_strings(strings)
1000000 loops, best of 3: 831 ns per loop

In [24]: timeit join_strings(strings)
100000 loops, best of 3: 1.87 µs per loop

## hmm -- concatenating is faster for a small list of tiny strings....

In [31]: strings = list('Hello World')* 1000

strings *= 1000
In [26]: timeit add_strings(strings)
1000 loops, best of 3: 932 µs per loop

In [27]: timeit join_strings(strings)
1000 loops, best of 3: 967 µs per loop

## now about the same.

In [31]: strings = list('Hello World')* 10000

In [29]: timeit add_strings(strings)
100 loops, best of 3: 9.44 ms per loop

In [30]: timeit join_strings(strings)
100 loops, best of 3: 10.1 ms per loop

still about he same?

In [31]: strings = list('Hello World')* 1000000

In [32]: timeit add_strings(strings)
1 loops, best of 3: 1.27 s per loop

In [33]: timeit join_strings(strings)
1 loops, best of 3: 1.05 s per loop

there we go -- slight advantage to joining.....

So this is why we've said that the common wisdom about string concatenating
isn't really a practical issue.

But if you already have the strings all in a list, then yes, join() is a
major win over sum()

In fact, I tried the above with sum() -- and it was really, really slow. So
slow I didn't have the patience to wait for it.

Here is a smaller example:

In [22]: strings = list('Hello World')* 10000

In [23]: timeit add_strings(strings)
100 loops, best of 3: 9.61 ms per loop

In [24]: timeit sum( strings, Faker() )
1 loops, best of 3: 246 ms per loop

So why is sum() so darn slow with strings compared to a simple loop with +=
?

(and if I try it with a list 10 times as long it takes "forever")

Perhaps the http issue cited was before some nifty optimizations in current
CPython?

-Chris





-- 

Christopher Barker, Ph.D.
Oceanographer

Emergency Response Division
NOAA/NOS/OR&R            (206) 526-6959   voice
7600 Sand Point Way NE   (206) 526-6329   fax
Seattle, WA  98115       (206) 526-6317   main reception

Chris.Barker at noaa.gov
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://mail.python.org/pipermail/python-dev/attachments/20140807/83fcb573/attachment.html>


More information about the Python-Dev mailing list