[Python-Dev] sum(...) limitation

Sat Aug 9 07:08:45 CEST 2014

On Fri, Aug 08, 2014 at 10:20:37PM -0400, Alexander Belopolsky wrote:
> On Fri, Aug 8, 2014 at 8:56 PM, Ethan Furman <ethan at stoneleaf.us> wrote:
> 
> > I don't use sum at all, or at least very rarely, and it still irritates me.
> 
> 
> You are not alone.  When I see sum([a, b, c]), I think it is a + b + c, but
> in Python it is 0 + a + b + c.  If we had a "join" operator for strings
> that is different from + - then sure, I would not try to use sum to join
> strings, but we don't.

I've long believed that + is the wrong operator for concatenating 
strings, and that & makes a much better operator. We wouldn't be having 
these interminable arguments about using sum() to concatenate strings 
(and lists, and tuples) if the & operator was used for concatenation and 
+ was only used for numeric addition.

> I have always thought that sum(x) is just a
> shorthand for reduce(operator.add, x), but again it is not so in Python.

The signature of reduce is:

reduce(...)
    reduce(function, sequence[, initial]) -> value

so sum() is (at least conceptually) a shorthand for reduce:

from functools import reduce  # reduce is a builtin in Python 2
import operator

def sum(values, initial=0):
    return reduce(operator.add, values, initial)

but that's an implementation detail, not a language promise, and sum() 
is free to differ from that simple version. Indeed, even the public 
interface is different, since sum() prohibits using a string as the 
initial value and only promises to work with numbers. The fact that it 
happens to work with lists and tuples is somewhat of an accident of 
implementation.
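
To make the difference concrete, here is a minimal sketch comparing the
conceptual reduce version with the real built-in. The name reduce_sum is
made up for illustration, and the exact wording of the error messages
may vary between Python versions:

```python
from functools import reduce
import operator

def reduce_sum(values, initial=0):
    # Hypothetical helper: the "conceptual" sum() described above.
    return reduce(operator.add, values, initial)

print(reduce_sum([1, 2, 3]))        # computed as 0 + 1 + 2 + 3
print(reduce_sum(["a", "b"], ""))   # reduce has no objection to strings
print(sum([[1], [2, 3]], []))       # lists happen to work with the built-in

# The built-in refuses a string as the initial value...
try:
    sum(["a", "b"], "")
except TypeError as exc:
    print("sum() refused:", exc)

# ...and without one, it starts from 0 and fails on 0 + "a".
try:
    sum(["a", "b"])
except TypeError as exc:
    print("sum() failed:", exc)
```

The last case is the "0 + a + b + c" behaviour complained about above:
reduce(operator.add, x) with no initial value would simply return the
concatenation.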

> While "sum should only be used for numbers,"  it turns out it is not a
> good choice for floats - use math.fsum.

Correct. And if you (generic you, not you personally) do not understand 
why simple-minded addition of floats is troublesome, then you're going 
to have a world of trouble. Anyone who is disturbed by the question of 
"should I use sum or math.fsum?" probably shouldn't be writing serious 
floating point code at all. Floating point computations are hard, and 
there is simply no escaping this fact.
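
For anyone who hasn't seen the problem, a small sketch follows. I use an
explicit loop (naive_sum, a name made up here) rather than the built-in,
so that the result does not depend on how a particular interpreter
version happens to implement sum() for floats:

```python
import math

def naive_sum(values):
    # Hypothetical helper: plain left-to-right floating point addition.
    total = 0.0
    for v in values:
        total += v
    return total

tenths = [0.1] * 10
print(naive_sum(tenths))      # rounding error accumulates; not exactly 1.0
print(math.fsum(tenths))      # fsum tracks the lost low-order bits

cancelling = [1e100, 1.0, -1e100]
print(naive_sum(cancelling))  # the 1.0 is swallowed by 1e100 and lost
print(math.fsum(cancelling))  # fsum recovers it
```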

> While "strings are blocked because
> sum is slow," numpy arrays with millions of elements are not.

That's not a good example. Strings are potentially O(N**2), which means 
not just "slow" but *agonisingly* slow, as in taking a week -- no 
exaggeration -- to concat a million strings. If it takes a microsecond to 
concat two strings, then 1e6**2 such concatenations could take over 
eleven days. Slowness of such magnitude might as well be "the process 
has locked up".
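
The arithmetic is easy to check. The sketch below is a rough cost model,
not a benchmark: chars_copied is a made-up helper which assumes that
every + copies the whole accumulated string (no in-place optimisation),
and the eleven-day figure corresponds to a cost of about one microsecond
per step.

```python
def chars_copied(n_strings, length=1):
    # Hypothetical cost model: characters copied when building a string
    # by repeated +, assuming each step copies the entire result so far.
    total = 0
    accumulated = 0
    for _ in range(n_strings):
        accumulated += length
        total += accumulated
    return total

for n in (10, 100, 1000):
    print(n, chars_copied(n))   # grows like n**2 / 2

# 1e6**2 steps at an assumed one microsecond each, converted to days.
seconds = 1e6 ** 2 * 1e-6
days = seconds / 86400
print(round(days, 1))

# The linear-time tool for the job is str.join.
parts = ["spam"] * 1000
joined = "".join(parts)
```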

In comparison, summing a numpy array with a million entries is not 
really slow in that sense. The time taken is proportional to the number 
of entries, and differs from summing a list only by a constant factor.

Besides, in the case of strings it is quite simple to decide "is the 
initial value a string?", whereas with lists or numpy arrays it's quite 
hard to decide "is the list or array so huge that the user will consider 
this too slow?". What counts as "too slow" depends on the machine it is 
running on, what other processes are running, and the user's mood, and 
leads to the silly result that summing an array of N items succeeds but 
N+1 items doesn't. So in the case of strings, it is easy to make a
blanket prohibition, but in the case of lists or arrays, there is no 
reasonable place to draw the line.

> And try to
> explain to someone that sum(x) is bad on a numpy array, but abs(x) is fine.

I think that's because sum() has to box up each and every element in the 
array into an object, which is wasteful, while abs() can delegate to a 
specialist array.__abs__ method. Although that's not something beginners 
should be expected to understand, no serious Python programmer should be 
confused by this. As programmers, we should expect to have some 
understanding of our tools, how they work, their limitations, and when 
to use a different tool. That's why numpy has its own version of sum 
which is designed to work specifically on numpy arrays. Use a specialist 
tool for a specialist job:

py> with Stopwatch():
...     sum(carray)  # carray is a numpy array of 75000000 floats.
...
112500000.0
time taken: 52.659770 seconds
py> with Stopwatch():
...     numpy.sum(carray)
...
112500000.0
time taken: 0.161263 seconds
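
The mechanism can be sketched without numpy at all. TinyArray below is a
made-up stand-in, not how numpy is implemented; it only illustrates that
the built-in sum() goes through the iterator protocol one element at a
time, while abs() makes a single call to the type's __abs__ method and
lets the container do the work internally.

```python
class TinyArray:
    """Hypothetical stand-in for an array type (not numpy's design)."""

    def __init__(self, values):
        self._values = [float(v) for v in values]
        self.elements_handed_out = 0  # per-element trips through __iter__

    def __iter__(self):
        # Builtin sum() only knows the iterator protocol, so it pulls
        # the elements out one by one as ordinary Python objects.
        for v in self._values:
            self.elements_handed_out += 1
            yield v

    def __abs__(self):
        # abs(obj) calls type(obj).__abs__(obj) once; the container can
        # then work directly on its own internal storage.
        return TinyArray(abs(v) for v in self._values)


arr = TinyArray([-1.5, 2.0, -3.5])

total = sum(arr)                 # iterates: one hand-off per element
print(total, arr.elements_handed_out)

positive = abs(arr)              # delegates: no trips through __iter__
print(list(positive), arr.elements_handed_out)
```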

>  Why have builtin sum at all if its use comes with so many caveats?

Because sum() is a perfectly reasonable general purpose tool for adding 
up small amounts of numbers where high floating point precision is not 
required. It has been included as a built-in because Python comes with 
"batteries included", and a basic function for adding up a few numbers 
is an obvious, simple battery. But serious programmers should be 
comfortable with the idea that you use the right tool for the right job.

If you visit a hardware store, you will find that even something as 
simple as the hammer exists in many specialist varieties. There are tack 
hammers, claw hammers, framing hammers, lump hammers, rubber and wooden 
mallets, "brass" non-sparking hammers, carpet hammers, brick hammers, 
ball-peen and cross-peen hammers, and even more specialist versions like 
geologist's hammers. Bashing an object with something hard is remarkably 
complicated, and there are literally dozens of types and sizes of "the 
hammer".  Why should it be a surprise that there are a handful of 
different ways to sum items?

-- 
Steven