[Numpy-discussion] Performance problems with strided arrays in NumPy

faltet at xot.carabos.com
Wed Apr 19 19:46:03 EDT 2006


On Wed, Apr 19, 2006 at 09:48:14PM +0000, faltet at xot.carabos.com wrote:
> On Tue, Apr 18, 2006 at 09:01:54PM -0600, Travis Oliphant wrote:
> > Apparently,  it is *much* faster to do
> > 
> > ((double *)dst)[0] = ((double *)src)[0]
> > 
> > when you have aligned data than it is to do
> > 
> > memmove(dst, src, sizeof(double))
> 
> Mmm.. very interesting.

A follow-up on this.  After analyzing the issue a bit more, it seems
that the problem with the memcpy() version was not the call itself,
but the parameter passed as the number of bytes to copy.  Because
this value is unknown at compile time, the compiler cannot generate
specialized code for the copy and always has to fetch the size from
memory (or cache).
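
To make this concrete, here is a sketch of the kind of strided-copy
inner loop involved (the function and variable names are mine, not
the actual NumPy source):

#include <string.h>

/* Generic strided copy: `size` is only known at run time, so the
   compiler has to emit a real memcpy() call (or a generic byte
   loop) and keep the size around on every iteration. */
static void copy_elems(char *dst, const char *src, size_t size,
                       long n, long dstride, long sstride)
{
    long i;
    for (i = 0; i < n; i++) {
        memcpy(dst, src, size);
        dst += dstride;
        src += sstride;
    }
}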

In the version of the code that you optimized, you avoided this
problem by telling the compiler (i.e. specifying at compile time)
the exact extent of the data copy, allowing it to generate optimal
code for the copy operation.  However, if you do a similar thing
but keep the call (using doubles here):

memcpy(tout, tin, 8);

instead of:

((Float64 *)tout)[0] = ((Float64 *)tin)[0];

and repeat the operation for the other types, then you can achieve
performance similar to that of the pointer version.
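
For instance, a Float64-specialized version of the same loop could
look like this (again a sketch, not the actual NumPy source; with
gcc, a memcpy() of a constant 8 bytes is inlined as a single 8-byte
load/store, just like the cast):

#include <string.h>

typedef double Float64;

/* Float64-specialized strided copy: the copy size is a compile-time
   constant, so gcc inlines the memcpy() and produces the same code
   as the pointer-cast assignment shown in the comment. */
static void copy_float64(char *tout, const char *tin,
                         long n, long dstride, long sstride)
{
    long i;
    for (i = 0; i < n; i++) {
        memcpy(tout, tin, 8);
        /* equivalently:
           ((Float64 *)tout)[0] = ((Float64 *)tin)[0]; */
        tout += dstride;
        tin  += sstride;
    }
}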

On the other hand, I see that you have disabled the optimization for
unaligned data through the use of a check.  Is there any reason for
doing that?  If I remove this check, I can achieve performance
similar to numarray's (a bit better, in fact).
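
For reference, the check is presumably something of this form (my
reconstruction, not the exact NumPy code):

#include <stddef.h>

/* Take the fast path only when both pointers sit on an 8-byte
   boundary; otherwise fall back to the byte-wise copy. */
static int aligned_ok(const void *tout, const void *tin)
{
    return ((size_t)tout % sizeof(double)) == 0 &&
           ((size_t)tin  % sizeof(double)) == 0;
}

Note that on x86 (like the Pentium 4 used below) unaligned 8-byte
accesses are legal and only moderately slower, which is presumably
why skipping the check pays off here; on strict-alignment
architectures the fast path would fault, so the check could not
simply be dropped unconditionally.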

I'm attaching a small benchmark script that compares the performance
of copying a 1D vector of 1 million elements in contiguous, strided
(2 and 10), and strided (2 and 10 again) & unaligned flavors (see
the sketch below for how the unaligned flavor is laid out).
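
The "unaligned" flavor can be produced by offsetting the data by one
byte into a raw buffer, so that no element starts on an 8-byte
boundary, which is presumably what the attached script does at the
Python level.  A minimal C sketch of the idea:

#include <stdlib.h>

/* Return storage for n doubles that is deliberately misaligned:
   allocate one spare byte and start at offset 1, so no element
   begins on an 8-byte boundary. */
static double *alloc_unaligned(size_t n)
{
    char *buf = malloc(n * sizeof(double) + 1);
    return (double *)(buf + 1);
}

The results for my machine (a Pentium 4 at 2 GHz) are: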

For the original numpy code (i.e. before Travis's optimization):

time for numpy contiguous --> 0.234
time for numarray contiguous --> 0.229
time for numpy strided (2) --> 1.605
time for numarray strided (2) --> 0.263
time for numpy strided (10) --> 1.72
time for numarray strided (10) --> 0.264
time for numpy strided (2) & unaligned --> 1.736
time for numarray strided (2) & unaligned --> 0.402
time for numpy strided (10) & unaligned --> 1.872
time for numarray strided (10) & unaligned --> 0.435

where you can see that, for 1e6 elements, the slowdown of the
original numpy is almost 7x (!).  Remember that in the previous
benchmarks sent here the slowdown was 3x, but there we were copying
10 times less data.

For the pointer-optimized code (i.e. the current SVN version):

time for numpy contiguous --> 0.238
time for numarray contiguous --> 0.232
time for numpy strided (2) --> 0.214
time for numarray strided (2) --> 0.264
time for numpy strided (10) --> 0.299
time for numarray strided (10) --> 0.262
time for numpy strided (2) & unaligned --> 1.736
time for numarray strided (2) & unaligned --> 0.401
time for numpy strided (10) & unaligned --> 1.874
time for numarray strided (10) & unaligned --> 0.433

Here you can see that your figures are very similar to numarray's,
except for unaligned data (where numpy is 4x slower).

For the pointer-optimized code with the unaligned-data check removed:

time for numpy contiguous --> 0.236
time for numarray contiguous --> 0.231
time for numpy strided (2) --> 0.213
time for numarray strided (2) --> 0.262
time for numpy strided (10) --> 0.297
time for numarray strided (10) --> 0.261
time for numpy strided (2) & unaligned --> 0.263
time for numarray strided (2) & unaligned --> 0.403
time for numpy strided (10) & unaligned --> 0.452
time for numarray strided (10) & unaligned --> 0.432

Hey! numpy is very similar to numarray in all cases, except for the
strided (2) & unaligned case, where numpy performs about 50% better.

Finally, and just to show the effect of providing memcpy() with the
size information at compile time, here are the figures for the numpy
code using memcpy() with this optimization on (and with the
alignment check disabled, of course!):

time for numpy contiguous --> 0.234
time for numarray contiguous --> 0.233
time for numpy strided (2) --> 0.223
time for numarray strided (2) --> 0.262
time for numpy strided (10) --> 0.285
time for numarray strided (10) --> 0.262
time for numpy strided (2) & unaligned --> 0.261
time for numarray strided (2) & unaligned --> 0.401
time for numpy strided (10) & unaligned --> 0.42
time for numarray strided (10) & unaligned --> 0.436

you can see that the figures are very similar to the previous case. So
Travis, you may want to use the pointer indirection approach or the
memcpy() one, whichever you prefer.

Well, I just wanted to point this out. Time for sleep!

Francesc
-------------- next part --------------
A non-text attachment was scrubbed...
Name: bench-copy.py
Type: text/x-python
Size: 2054 bytes
Desc: not available
URL: <http://mail.python.org/pipermail/numpy-discussion/attachments/20060419/87659d97/attachment.py>

