[Numpy-discussion] Performance problems with strided arrays in NumPy

faltet at xot.carabos.com
Thu Apr 20 09:42:04 EDT 2006


On Wed, Apr 19, 2006 at 09:39:01PM -0600, Travis Oliphant wrote:
>>On the other hand, I see that you have disabled the optimization for
>>unaligned data through the use of a check. Is there any reason for
>>doing that?  If I remove this check, I can achieve performance similar
>>to numarray's (a bit better, in fact).
>
>The only reason was to avoid pointer dereferencing on misaligned data 
>(dereferencing a misaligned pointer causes bus errors on Solaris).   
>But, if we can achieve it with a memmove, then there is no reason to 
>limit the code.

I see. Well, I've tried memmove instead of memcpy, and I can reproduce
the same slowdown that was seen previously, before your pointer
addressing optimisation went in. I'm afraid Sasha was right that
memmove's check against overwriting the destination is responsible for
this.
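
For reference, here is a minimal sketch (not the exact NumPy routine) of
what such a memmove-based inner loop looks like; the point is that memmove
has to allow for overlapping buffers on every single call, which dominates
the cost when the element size is only a few bytes:

#include <string.h>

/* Hypothetical strided element copy using memmove.  Safe even if the
   buffers overlap, but the overlap handling is paid N times. */
static void
strided_copy_memmove(char *dst, long outstrides,
                     char *src, long instrides,
                     long N, long elsize)
{
        long i;
        for (i = 0; i < N; i++) {
                memmove(dst, src, elsize);
                src += instrides;
                dst += outstrides;
        }
}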

Having said that, and although I must admit that I don't know in depth
all the situations under which the source of a copy may overlap the
destination, my guess is that for the typical element sizes (i.e. 1,
2, 4, 8 and 16 bytes) for which the optimization has been done, there
is no harm in using memcpy instead of memmove (admittedly, you may come
up with a counter-example, but I do hope you don't). In any case, the
memcpy approach is completely equivalent to the current pointer-based
optimization except that, hopefully, no pointer dereferencing is done
on unaligned data. So perhaps using memcpy may avoid the bus errors on
Solaris (under SPARC, I guess). It would be nice if anyone with access
to such a platform could confirm this point. I'm attaching a patch
against current SVN numpy that uses the memcpy approach. Feel free to
try it against the benchmarks (also attached).
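
To illustrate the alignment issue with a small stand-alone example (this
is not part of the patch, just a sketch): dereferencing a pointer that is
not naturally aligned can raise a bus error on SPARC/Solaris, whereas a
memcpy of a known small size is safe for any alignment and is typically
inlined by the compiler anyway:

#include <stdio.h>
#include <string.h>

int main(void)
{
        char buf[16] = {0};
        char *src = buf + 1;     /* deliberately misaligned for a double */
        double out;

        /* out = ((double *)src)[0];   <- may cause SIGBUS on SPARC */

        memcpy(&out, src, 8);          /* safe on any alignment */
        printf("%g\n", out);
        return 0;
}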

One last word: I've added a case for element size 1 in addition to the
existing ones, as this effectively improves the speed for 1-byte types.
Below are the timings without the 1-byte case optimisation:

time for numpy contiguous --> 0.03
time for numarray contiguous --> 0.062
time for numpy strided (2) --> 0.078
time for numarray strided (2) --> 0.064
time for numpy strided (10) --> 0.081
time for numarray strided (10) --> 0.07

I haven't added a separate unaligned variant for it because alignment
is meaningless for 1-byte types.

and here are the timings with the 1-byte case optimisation added:

time for numpy contiguous --> 0.03
time for numarray contiguous --> 0.062
time for numpy strided (2) --> 0.054
time for numarray strided (2) --> 0.065
time for numpy strided (10) --> 0.061
time for numarray strided (10) --> 0.07

You can notice a speed-up of between 30% and 45% over the previous case
(0.078 s --> 0.054 s and 0.081 s --> 0.061 s for the strided copies).

Cheers,
-------------- next part --------------
--- numpy/core/src/arrayobject.c        (revision 2381)
+++ numpy/core/src/arrayobject.c        (working copy)
@@ -628,28 +628,44 @@
         intp i, j;
         char *tout = dst;
         char *tin = src;
+       /* For typical datasizes, the memcpy call is much faster than memmove
+          and perfectly safe */
         switch(elsize) {
+        case 16:
+                for (i=0; i<N; i++) {
+                        memcpy(tout, tin, 16);
+                        tin = tin + instrides;
+                        tout = tout + outstrides;
+                }
+                return;
         case 8:
                 for (i=0; i<N; i++) {
-                        ((Float64 *)tout)[0] = ((Float64 *)tin)[0];
+                        memcpy(tout, tin, 8);
                         tin = tin + instrides;
                         tout = tout + outstrides;
                 }
                 return;
         case 4:
                 for (i=0; i<N; i++) {
-                        ((Int32 *)tout)[0] = ((Int32 *)tin)[0];
+                        memcpy(tout, tin, 4);
                         tin = tin + instrides;
                         tout = tout + outstrides;
                 }
                 return;
         case 2:
                 for (i=0; i<N; i++) {
-                        ((Int16 *)tout)[0] = ((Int16 *)tin)[0];
+                        memcpy(tout, tin, 2);
                         tin = tin + instrides;
                         tout = tout + outstrides;
                 }
                 return;
+        case 1:
+                for (i=0; i<N; i++) {
+                        memcpy(tout, tin, 1);
+                        tin = tin + instrides;
+                        tout = tout + outstrides;
+                }
+                return;
         default:
                 for (i=0; i<N; i++) {
                         for (j=0; j<elsize; j++) {
@@ -731,8 +747,7 @@
         }

         /* See if we can iterate over the largest dimension */
-        if (!swap && PyArray_ISALIGNED(dest) && PyArray_ISALIGNED(src) &&
-            (nd = dest->nd) == src->nd && (nd > 0) &&
+        if (!swap && (nd = dest->nd) == src->nd && (nd > 0) &&
             PyArray_CompareLists(dest->dimensions, src->dimensions, nd)) {
                 int maxaxis=0, maxdim=dest->dimensions[0];
                 int i;

-------------- next part --------------
A non-text attachment was scrubbed...
Name: bench-copy.py
Type: text/x-python
Size: 2053 bytes
Desc: not available
URL: <http://mail.python.org/pipermail/numpy-discussion/attachments/20060420/08ed1d4d/attachment-0002.py>
-------------- next part --------------
A non-text attachment was scrubbed...
Name: bench-copy1.py
Type: text/x-python
Size: 1168 bytes
Desc: not available
URL: <http://mail.python.org/pipermail/numpy-discussion/attachments/20060420/08ed1d4d/attachment-0003.py>

