[Numpy-discussion] Unnecessarily bad performance of elementwise operators with Fortran-arrays

Fri Nov 9 01:00:13 EST 2007

On 08/11/2007, David Cournapeau <david at ar.media.kyoto-u.ac.jp> wrote:

> For copy and array creation, I understand this, but for element-wise
> operations (mean, min, and max), this is not enough to explain the
> difference, no ? For example, I can understand a 50 % or 100 % time
> increase for simple operations (by simple, I mean one element operation
> taking only a few CPU cycles), because of copies, but a 5 fold time
> increase seems too big, no (mayb a cache problem, though) ? Also, the
> fact that mean is slower than min/max for both cases (F vs C) seems a
> bit counterintuitive (maybe cache effects are involved somehow ?).

I have no doubt at all that cache effects are involved: for an int
array, each data element is four bytes, but typical CPUs need to load
64 bytes at a time into cache. If you read (or write) the rest of
those bytes in the next iterations through the loop, the (large) cost
of a memory read is amortized. If you jump to the next row of the
array, some large number of bytes away, those 64 bytes basically need
to be purged to make room for another 64 bytes, of which you'll use 4.
If you're reading from a FORTRAN-order array and writing to a C-order
one, there's no way around doing this on one end or another: you're
effectively doing a transpose, which is pretty much always expensive.

Is there any reason not to let ufuncs pick whichever order for
newly-allocated arrays they want? The natural choice would be the same
as the bigger input array, if it's a contiguous block of memory
(whether or not the contiguous flags are set). Failing that, the same
as the other input array (if any); failing that, C order's as good a
default as any. How difficult would this be to implement?

Anne