[Numpy-discussion] odd performance of sum?

eat e.antero.tammi at gmail.com
Sat Feb 12 19:02:03 EST 2011


Hi Sturla,

On Sat, Feb 12, 2011 at 5:38 PM, Sturla Molden <sturla at molden.no> wrote:

> On 10.02.2011 16:29, eat wrote:
> > One would expect sum to outperform dot by a clear margin. Do
> > there exist any 'tricks' to increase the performance of sum?
>
First of all, thanks for still replying. Well, I'm still a little unsure how
I should proceed with this discussion... I may have used poor wording and
created unnecessary mayhem with my original question (:. Trust me, I'm only
trying to discuss this with constructive criticism in mind.

Now, I'm not pretending to know what kind of person a 'typical' numpy user
is. But I'm assuming there are others besides me with roughly similar
questions in their (our) minds, who wish to use numpy in a more
'pythonic; all batteries included' way. Occasionally I (we) may ask really
stupid questions, but please bear with us.

That said, I'm still very confident that (from a user's point of view)
there's some real substance to the issue I raised.

> I see that others have answered already. The ufunc np.sum is not
> going to beat np.dot. You are racing the heavy machinery of NumPy (array
> iterators, type checks, bound checks, etc.) against the level-3 BLAS routine
> DGEMM, the most heavily optimized numerical kernel ever written.

Fair enough.
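
For what it's worth, here's a minimal timing sketch of the comparison this
thread started from, i.e. the ufunc reduction against a dot product with a
vector of ones (the array shape and repetition count are just made-up test
values, not a careful benchmark):

    import numpy as np
    from timeit import timeit

    a = np.random.rand(1000, 1000)   # made-up test array
    ones = np.ones(a.shape[0])

    # Column sums computed two ways: the ufunc reduction vs. a BLAS
    # matrix-vector product with a vector of ones.
    t_sum = timeit(lambda: a.sum(axis=0), number=100)
    t_dot = timeit(lambda: np.dot(ones, a), number=100)
    print(t_sum, t_dot)

Both calls return the same column sums; the snippet is only meant to make
the np.sum vs. np.dot comparison concrete.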

> Also
> beware that computation is much cheaper than memory access.

Sure, that's exactly where I expected the performance boost to emerge.

> Although
> DGEMM does more arithmetic, and is even O(N^3) in that respect, it is
> always faster except for very sparse arrays. If you need fast loops, you
> can always write your own Fortran or C, and even insert OpenMP pragmas.

That's a very important possibility, but surely not all numpy users can be
expected to master that ;-)

> But don't expect that to beat optimized high-level BLAS kernels by any
> margin. The first chapters of "Numerical Methods in Fortran 90" might be
> worth reading. They deal with several of these issues, including
> dimensional expansion, which is important for writing fast numerical
> code -- but not intuitively obvious. "I expect this to be faster because
> it does less work" is a fundamental misconception in numerical
> computing. Whatever causes less traffic on the memory bus (the real
> bottleneck) will almost always be faster, regardless of the amount of
> work done by the CPU.

And I'm totally aware of that; in fact it was exactly the intended logic of
my original question: "what if sum could follow the steps of dot; then,
since it executes fewer instructions, its running time should be bounded
above by that of dot". But as R. Kern gently pointed out already, it's not a
fruitful enough avenue to pursue. And I can live with that.
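
Just to make that misconception concrete for myself, here's a small, purely
illustrative check (the size is made up): summing every other element does
half the arithmetic, but the time is typically nowhere near halved, because
the strided read still has to pull in the same cache lines:

    import numpy as np
    from timeit import timeit

    a = np.random.rand(10**7)    # made-up test size
    half = a[::2]                # half the elements, but a strided view

    # Half the additions, yet on a typical machine the time is nowhere
    # near halved: both sums are limited by how fast the data can be
    # streamed from memory, and the strided read touches the same
    # cache lines as the contiguous one.
    print(timeit(lambda: a.sum(), number=50))
    print(timeit(lambda: half.sum(), number=50))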


Regards,
eat

> A good advice is to use high-level BLAS whenever
> you can. The only exception, as mentioned, is when matrices get very
> sparse.
>
> Sturla
>
>
> _______________________________________________
> NumPy-Discussion mailing list
> NumPy-Discussion at scipy.org
> http://mail.scipy.org/mailman/listinfo/numpy-discussion
>

