[Numpy-discussion] Unnecessarily bad performance of elementwise operators with Fortran-arrays

Charles R Harris charlesr.harris at gmail.com
Fri Nov 9 23:36:34 EST 2007


On Nov 9, 2007 7:23 PM, David Cournapeau <cournape at gmail.com> wrote:

> On Nov 10, 2007 1:12 AM, Travis E. Oliphant <oliphant at enthought.com>
> wrote:
> > Hans Meine wrote:
> > > | static void
> > > | DOUBLE_add(char **args, intp *dimensions, intp *steps, void *func)
> > > | {
> > > |     register intp i;
> > > |     intp is1=steps[0],is2=steps[1],os=steps[2], n=dimensions[0];
> > > |     char *i1=args[0], *i2=args[1], *op=args[2];
> > > |     for(i=0; i<n; i++, i1+=is1, i2+=is2, op+=os) {
> > > |         *((double *)op)=*((double *)i1) + *((double *)i2);
> > > |     }
> > > | }
> > >
> > > If I understood David correctly, he suggested an unstrided variant of
> > > this inner loop, which simply uses pointer increments.
> > I'm not sure if he was saying that or not.  But, more generally, I'm all
> > for optimizations at this level.  But, they will have to be done
> > differently for different cases with this loop as the fall-back.
>
> I was not really clear, but yes, that was part of my argument: I don't
> think compilers can optimize the above well when there is no stride
> (more exactly, when the stride could be handled by simple array
> indexing).
>
> This would need some benchmarks, but I have always read that pointer
> arithmetic should be avoided when speed matters (e.g. bumping a char
> pointer by n * sizeof(*a) rather than indexing with a[n]), because it
> becomes much more difficult for the compiler to optimize. Generally, if
> you can get to a function which does the thing the "obvious" way, this
> is better. Of course, you have to separate the case where this is
> possible from the case where it is not.
> But such work would also be really helpful if/when we optimize some
> basic things with MMX/SSE and co, and I think the above is impossible
> to auto vectorize (gcc 4.3, not released yet, gives some really
> helpful analysis for that, and can tell you when it fails to
> auto-vectorize, and why).
>

The relative speed of pointers vs indexing depends on a lot of things:

1) the architecture (registers, instruction sets, implementation)
2) the compiler
3) the code (number of base addresses, etc.)

My subjective feel is that on newer Intel-type hardware and with a recent
compiler, indexing is generally faster. But it does depend. I wrote an
indexed version of the loop above, alongside the original pointer version:

typedef int intp;

void
DOUBLE_add_ptr(char **args, intp *dimensions, intp *steps, void *func)
{
    register intp i;
    intp is1=steps[0],is2=steps[1],os=steps[2], n=dimensions[0];
    char *i1=args[0], *i2=args[1], *op=args[2];
    for(i=0; i<n; i++, i1+=is1, i2+=is2, op+=os) {
        *((double *)op)=*((double *)i1) + *((double *)i2);
    }
}


void
DOUBLE_add_ind(char **args, intp *dimensions, intp *steps, void *func)
{
    const intp i1s=steps[0]/sizeof(double);
    const intp i2s=steps[1]/sizeof(double);
    const intp ios=steps[2]/sizeof(double);
    const intp n=dimensions[0];
    double * const p1 = (double*)args[0];
    double * const p2 = (double*)args[1];
    double * const po = (double*)args[2];
    intp i, io, i1, i2;

    for(i=0, io=0, i1=0, i2=0; i<n; ++i) {
        po[io] = p1[i1] + p2[i2];
        io += ios;
        i1 += i1s;
        i2 += i2s;
    }
}

Compiled with gcc -O2 -fno-strict-aliasing -march=prescott -m32 -S, the
inner loop of the pointer version looks like

.L4:
        fldl    (%edx)
        faddl   (%ecx)
        fstpl   (%eax)
        addl    $1, %ebx
        addl    -20(%ebp), %edx
        addl    -16(%ebp), %ecx
        addl    %edi, %eax
        cmpl    %esi, %ebx
        jne     .L4

While the indexed version looks like

.L11:
        movl    -24(%ebp), %esi
        fldl    (%esi,%ecx,8)
        movl    -20(%ebp), %esi
        faddl   (%esi,%edx,8)
        movl    -16(%ebp), %esi
        fstpl   (%esi,%eax,8)
        addl    -36(%ebp), %eax
        addl    -32(%ebp), %ecx
        addl    %edi, %edx
        addl    $1, %ebx
        cmpl    -28(%ebp), %ebx
        jne     .L11

Note that the indexed version is rather short of registers and has to load
all of the pointers and two of the indexes from the stack.

Chuck