[Python-ideas] Fast sum() for non-numbers

Steven D'Aprano steve at pearwood.info
Wed Jul 3 14:43:46 CEST 2013


On 03/07/13 04:12, Sergey wrote:
> Hello, python-ideas. Trying to cover everything in one shot, so
> the message is long, sorry.
>
> sum() is a great function. It is the "obvious way" to add things.
> Unfortunately sometimes it's slower than it could be.
>
> The problem is that code:
>    sum([[1,2,3]]*1000000, [])
> takes forever to complete. Let's fix that!

I'm not sure that sum() is the Obvious Way to concatenate lists, and I don't think that concatenating many lists is a common thing to do. Traditionally, sum() works only on numbers, and I think we wouldn't be having this discussion if Python used & for concatenation instead of +. So I don't care that sum() has quadratic performance on lists (and tuples), and I must admit that having a simple quadratic algorithm in the built-ins is sometimes useful for teaching purposes, so I'm -0 on optimizing this case.
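For readers unfamiliar with the idiom: sum() does concatenate when given a list as the start value, and each + builds a brand-new list containing everything accumulated so far, which is exactly where the quadratic cost comes from. A small illustration (mine, not from the original thread):

```python
# Summing n lists of length k with + copies O(n**2 * k) elements in
# total, because every step copies the whole accumulated result.
parts = [[1, 2, 3]] * 5
total = sum(parts, [])  # the [] start value is required for non-numbers
assert total == [1, 2, 3] * 5
```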


[...]
> When people look at sum(seq,start=0) signature they most probably
> expect it to be like this:
>    def sum(seq, start = 0):
>      for item in seq:
>        start += item
>      return start

That's not what I expect, since += risks modifying its argument in place.
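For the record, here is why that signature would be a trap: with a mutable start value, += mutates the caller's object. A minimal sketch (not the real sum):

```python
def naive_sum(seq, start=0):
    # += on a list calls list.__iadd__, which extends in place,
    # so it mutates the `start` object the caller passed in.
    for item in seq:
        start += item
    return start

start = []
result = naive_sum([[1], [2]], start)
assert result is start      # the very same object is returned
assert start == [1, 2]      # the caller's list has been mutated
```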


[...]
> What I suggest is instead of making a copy for every item make just one
> copy and reuse it as many times as needed. For example:
>    def sum(seq, start = 0):
>      start = start + seq[0]
>      for item in seq[1:]:
>        start += item
>      return start
[...]
> Can this implementation break anything?

That will still have quadratic performance for tuples, and it will break if seq is an iterator.
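To spell the iterator problem out: a generator supports neither indexing nor slicing, so the proposed version fails immediately. A sketch of the failure mode, using the implementation as quoted above:

```python
# Sergey's proposed implementation (as quoted), for illustration.
def patched_sum(seq, start=0):
    start = start + seq[0]    # works only if seq supports indexing...
    for item in seq[1:]:      # ...and slicing, which copies the sequence
        start += item
    return start

gen = (i for i in range(10))  # a generator has no __getitem__
try:
    patched_sum(gen)
except TypeError:
    pass  # 'generator' object is not subscriptable
```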

It will also be expensive to make a slice of seq if it is a huge list, and could even run out of memory to hold both the original and the slice. I would be annoyed if sum started failing with a MemoryError here:

big_seq = list(range(1000))*1000000  # assume this succeeds
total = sum(big_seq)  # seq[1:] would copy the whole list; could fail with MemoryError


[...]
> Alternatives to sum() for this use-case:
> * list comprehension is 270% slower than patched sum [2]
> * itertools.chain is 50%-100% slower than patched sum [3]
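For reference, those two alternatives presumably look something like this (my reconstruction, not the code that was benchmarked):

```python
from itertools import chain

parts = [[1, 2, 3]] * 1000

# Nested list comprehension: flattens one level in a single pass.
flat_comp = [item for sub in parts for item in sub]

# itertools.chain: lazy concatenation, materialized with list().
flat_chain = list(chain.from_iterable(parts))

assert flat_comp == flat_chain == [1, 2, 3] * 1000
```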

How about the "Obvious Way" to concatenate lists?

new = []
for x in seq:
    new.extend(x)
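Wrapped in a function and checked against sum's result, as a sketch (my packaging, not part of the original post):

```python
def concat(lists):
    # Linear time: list.extend appends in amortized O(1) per element,
    # with no per-step copying of the accumulated result.
    new = []
    for x in lists:
        new.extend(x)
    return new

parts = [[1, 2, 3]] * 1000
assert concat(parts) == sum(parts, [])
```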



> Main questions:
> * Whether to accept the patch
> * If yes, whether it should go to the new version or to a bugfix release

-0 on the general idea, -1 on the specific implementation. I'd rather have sum of lists be slow than risk having sum of numbers raise MemoryError.



-- 
Steven