[Numpy-discussion] Unnecessarily bad performance of elementwise operators with Fortran-arrays

David Cournapeau cournape at gmail.com
Thu Nov 8 10:37:06 EST 2007


On Nov 8, 2007 10:13 PM, Hans Meine <meine at informatik.uni-hamburg.de> wrote:
> On Thursday, 08 November 2007 13:44:59, David Cournapeau wrote:
> > Hans Meine wrote:
> > > I wonder why simple elementwise operations like "a * 2" or "a + 1" are
> > > not performed in order of increasing memory addresses, so as to exploit
> > > CPU caches etc. - as it is now, their speed drops by a factor of around 3
> > > simply by transpose()ing.
> >
> > Because it is not trivial to do so in all cases, I guess. It is a
> > problem which comes up from time to time on the ML, but AFAIK nobody
> > has a fix for it. Fundamentally, for many element-wise operations, either
> > you have to implement the thing for every possible case, or you abstract
> > it through an iterator, which costs performance in some cases.
>
> OK, so far I have not looked at the NumPy internals, but I would expect
> something like the following:
>
> [elementwise operation on array 'a']
> 1) ii = [i for i, s in sorted(enumerate(a.strides), key=lambda t: -t[1])]
> 2) aprime = a.transpose(ii) # aprime will be "as C-contiguous as it gets"
> 3) bprime = perform_operation(aprime) # fast elementwise operation
> 4) jj = [ii.index(i) for i in range(bprime.ndim)] # inverse ii
> 5) b = bprime.transpose(jj) # let result have the same order as the input
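
A minimal runnable version of that sketch (the helper name is purely for
illustration) would be something like:

import numpy as np

def elementwise_in_memory_order(a, operation):
    # order the axes by decreasing stride, so the transposed view is
    # "as C-contiguous as it gets"
    ii = [i for i, s in sorted(enumerate(a.strides), key=lambda t: -t[1])]
    aprime = a.transpose(ii)
    bprime = operation(aprime)                      # fast element-wise step
    jj = [ii.index(i) for i in range(bprime.ndim)]  # inverse permutation
    return bprime.transpose(jj)                     # same axis order as the input

a = np.arange(12.).reshape(3, 4).transpose()        # Fortran-ordered view
b = elementwise_in_memory_order(a, lambda x: x * 2)
assert (b == a * 2).all()
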
The problem is not F vs C storage: for an element-wise operation, the
traversal order does not matter at all; you just apply the same function
(perform_operation) over and over to every element of the array, and the
result is the same whatever order you walk the memory in.
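
The slowdown Hans describes is easy to reproduce, by the way (the exact
ratio depends on the machine and the numpy version):

import numpy as np
from timeit import timeit

a = np.zeros((1000, 1000))      # C-ordered
t = a.transpose()               # same data, Fortran-ordered view

print(timeit(lambda: a * 2, number=100))
print(timeit(lambda: t * 2, number=100))   # noticeably slower on the numpy
                                           # versions discussed in this thread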

But what if you have segmented buffers, non-aligned arrays, etc.?
All of this has to be taken care of, and that means handling reference
counts and other things which are always delicate to get right... Or
you just keep the current situation, which lets Python handle it
(through PyObject_Call and a callable, at the C level).
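
From Python, the cases in question show up in the flags attribute, for
example:

import numpy as np

a = np.zeros((4, 6))
print(a.flags['C_CONTIGUOUS'], a.flags['ALIGNED'])        # True True: easy case

s = a[:, ::2]      # strided slice: no longer a single contiguous segment
print(s.flags['C_CONTIGUOUS'], s.flags['F_CONTIGUOUS'])   # False False

raw = np.zeros(4 * 6 * 8 + 1, dtype=np.uint8)
u = raw[1:].view(np.float64).reshape(4, 6)   # view starting at an odd byte offset
print(u.flags['ALIGNED'])                    # typically False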

I agree that in some cases it would be useful to have better
performance: in particular, having fast code paths for the common cases
(aligned, single-segment arrays) would be nice. One example is the
PyArray_Clip function (in multiarray.c), which was too slow for
my taste in my use cases: the general idea is to get to the point
where you have a single C buffer, at which point you can call a trivial
C function (fast_clip). As you can see, this is not trivial (~150
lines of C code): since the point is to go faster, you really want to
avoid copying things, which means you have to be pretty careful.
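
The shape of that code is roughly the following, sketched here in Python
rather than C (the function name and the in-place behaviour are only for
illustration; the real logic lives in PyArray_Clip and fast_clip):

import numpy as np

def clip_inplace_sketch(a, lo, hi):
    # "behaved" = aligned and a single memory segment (C- or F-contiguous)
    behaved = a.flags['ALIGNED'] and (a.flags['C_CONTIGUOUS'] or
                                      a.flags['F_CONTIGUOUS'])
    if behaved:
        # fast path: walk the one underlying buffer directly -- the Python
        # analogue of handing a bare C pointer to a trivial loop
        flat = a.reshape(-1, order='A')    # a view on the buffer, no copy
        np.minimum(flat, hi, out=flat)
        np.maximum(flat, lo, out=flat)
    else:
        # slow path: fall back to the generic, stride-aware machinery
        np.clip(a, lo, hi, out=a)
    return a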

Speed-wise, it is definitely worth it, though: I don't remember the
exact numbers, but in some cases (which are very common) it was
several times faster.
>
> Apart from the fact that this is more or less pseudo-code and that the last
> step brings a behaviour change (which is intended, but probably not very
> welcome), am I overlooking a problem?
>
> > >  Similarly (but even less logical), copy() and even the
> > > constructor are affected (yes, I understand that copy() creates
> > > contiguous arrays, but shouldn't it respect/retain the order
> > > nevertheless?):
> >
> > I don't see why it is illogical: when you do a copy, you don't preserve
> > memory layout, and so a simple memcpy of the whole buffer is not possible.
>
> As I wrote above, I don't think this is good.  A Fortran-order contiguous
> array is still contiguous, and not inferior in any way to a C-order array.
> So I actually expect copy() to return an array of unchanged order.
Maybe this behaviour was kept for compatibility with numarray? If you
look at the docstring, it says that copy may not return the same
order as its input. It kind of makes sense to me: C order is the
default for many numpy operations.
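
For what it is worth, the behaviour (and the way around it) is visible
from Python; in current numpy the order keyword of copy() is what
preserves the layout:

import numpy as np

a = np.asfortranarray(np.zeros((3, 4)))   # Fortran-ordered input
print(a.flags['F_CONTIGUOUS'])            # True

b = a.copy()                              # default is C order
print(b.flags['C_CONTIGUOUS'], b.flags['F_CONTIGUOUS'])   # True False

c = a.copy(order='A')                     # 'A' keeps the Fortran order of the input
print(c.flags['F_CONTIGUOUS'])            # True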

David


