[Numpy-discussion] Rookie problems - Why is C-code much faster?

Tue Feb 21 10:25:04 EST 2006

On Tue, 21 Feb 2006, Zachary Pincus wrote:

> Mads,
>
> The game with numpy, just as it is with Matlab or any other
> interpreted numeric environment, is to try push as much of the
> looping down into the C code as you can. This is because, as you now
> know, compiled C can loop much faster than interpreted python.
>
> A simple example for averaging 1000 (x,y,z) points:
>
> print data.shape
> (1000, 3)
> # bad: explicit for loop in python
> avg = numpy.zeros(3, numpy.float_)
> for i in data: avg += i
> avg /= 1000.0
>
> # good: implicit for loop in C
> avg = numpy.add.reduce(data, axis = 0)
> avg /= 1000.0
>
> In your case, instead of explicitly looping through each point, why
> not do the calculations in parallel, operating on entire vectors of
> points at one time? Then the looping is "pushed down" into compiled C
> code. Or if you're really lucky, it's pushed all the way down to the
> vector math units on your cpu if you have a good BLAS or whatever
> installed.
>
> Zach Pincus
>
> Program in Biomedical Informatics and Department of Biochemistry
> Stanford University School of Medicine
>
>
>
> > 2. Here is the main loop for finding all possible pair distances,
> >    which corresponds to a loop over the upper triangular part of a
> >    square matrix
> >
> >      # Loop over all particles
> >      for i in range(n-1):
> >         dx = x[i+1:] - x[i]
> >         dy = y[i+1:] - y[i]
> >
> >         dx -= box*rint(dx/box)
> >         dy -= box*rint(dy/box)
> >
> >         r2 = dx**2 + dy**2      # square of dist. between points
> >
> > where x and y contain the positions of the particles. A naive
> > implementation in C is
> >
> >
> >     // loop over all particles
> >     for (int i=0; i<n-1; i++) {
> >         for (int j=i+1; j<n; j++) {
> >             dx = x[j] - x[i];
> >             dy = y[j] - y[i];
> >
> >             dx -= box*rint(dx/box);
> >             dy -= box*rint(dy/box);
> >
> >             r2 = dx*dx + dy*dy;
> >         }
> >     }
> >
> > For n = 2500 particles, i.e. 3123750 particle pairs, the C loop is
> > app. 10 times faster than the Python/Numeric counterpart. This is of
> > course not satisfactory.
> >
> > Are there any things I am doing completely wrong here, basic
> > approaches completely misunderstood, misuses etc?
> >
> > Any suggestions, guidelines, hints are most welcome.
> >
> > Best regards,
> >
> > Mads Ipsen
> >
> >
> > +---------------------------------+-------------------------+
> > | Mads Ipsen                      |                         |
> > | Dept. of Chemistry              | phone:     +45-35320220 |
> > | H.C.Ørsted Institute            | fax:       +45-35320322 |
> > | Universitetsparken 5            |                         |
> > | DK-2100 Copenhagen Ø, Denmark   | mpi at osc.kiku.dk         |
> > +---------------------------------+-------------------------+
> >
> >
> > -------------------------------------------------------
> > This SF.net email is sponsored by: Splunk Inc. Do you grep through
> > log files
> > for problems?  Stop!  Download the new AJAX search engine that makes
> > searching your log files as easy as surfing the  web.  DOWNLOAD
> > SPLUNK!
> > http://sel.as-us.falkag.net/sel?cmd=lnk&kid3432&bid#0486&dat1642
> > _______________________________________________
> > Numpy-discussion mailing list
> > Nump

I agree completely with your comments. But, as you can, the innermost part
of the loop has been removed in the code, and replaced with numpy slices.
It's hard for me to see how to compress the outer loop as well, since it
determines the ranges for the inner loop. Unless there is some fancy slice
notation, that allows you to loop over a triangular part of a matrix, ie.

    x[i] = sum(A[i+1,i])

meaning x[i] = sum of elements in i'th row of A using only elements from
position i+1 up to n.

Of course, there is the possibility of hardcoding this in C and then make
it available as a Python module. But I don't want to do this before I am
sure there isn't a numpy way out this.

Let me know, if you have any suggestions.

// Mads