[Numpy-discussion] f2py performance question from a rookie

Mon Aug 16 18:01:29 EDT 2010

> After using time.clock, running f2py with the REPORT_ON_ARRAY_COPY enabled
> and
> passing arrays as np.asfortranarray(array) to the fortran routines I still
> get a
> slow performance on f2py. No copied arrays are reported.

That is not any better as np.asfortranarray will make a copy instead. Pass
the transpose to Fortran, and transpose the return value. Make sure you
neither make a copy before or after calling Fortran. A transpose does not
make a copy in NumPy.

You want C ordering in NumPy and Fortran ordering in Fortran. Which means
you probably want to call Fortran like this:

    foobar(x.T, y.T)

If you let f2py return y as a result, make sure it is C ordered after
transpose:

    y = foobar(x.T).T
    assert(y.flags['C_CONTIGUOUS'])

Even if you don't get a "copy and transpose" immediately (i.e. f2py does
not report a copy), it might happen silently later on if you get Fortran
ordered arrays returned into Python.

> Running on f2py with a 6499x6499 array takes about 1.2sec while the python
> for
> loop does the job slightly faster at 0.9 sec.

I suppose you are including the solve time here? In which case the
dominating factors are np.dot and np.linalg.solve. Fortran will not help a
lot then.

Well if you have access to performance libraries (ACML or MKL) you might
save a little by using Fortran DGEMM instead of np.dot and a carefully
selected Fortran LAPACK solver (that is, np.linalg.solve use Gaussian
substitution, i.e. LU, but there might be structure in your matrices to
exploit, I haven't looked too carefully.)

> subroutine fill_B(j, beta, lamda, mu, b)
> integer, intent(in) ::  j
> integer :: row, col
> real(kind = 8),  intent(in) :: beta(j), lamda(j), mu(j)
> real(kind = 8), intent(out) :: b(j,j)
>
> do row=2,j-1
>     do col=1,j
>         if (col == row-1) then
>             b(row,col) = beta(row) - lamda(row) + mu(row)
>         elseif (col == row) then
>             b(row,col) = 1 - 2*beta(row)
>         elseif (col == row+1) then
>             b(row, col) = beta(row) + lamda(row) - mu(row)
>         else
>             b(row, col) = 0
>         endif
>     enddo
> enddo
> b(1,1) = 1
> b(j,j) = 1
>
> end

That iterates over strided memory in Fortran. Memory access is slow,
computation is fast.

So there are two problems with your Fortran code:

1. You want b to be transposed in Fortran compared to Python.
2. Iterate in column major order in Fortran, not row major order.

Sturla