[SciPy-dev] scipy.fft module slow for complex inputs when linked to fftw3

Fri Aug 18 06:01:17 EDT 2006

David M. Cooke wrote:
> On Fri, Aug 18, 2006 at 05:32:14PM +0900, David Cournapeau wrote:
>   
>> Hi there,
>>
>>     I noticed recently that when using the fft module of scipy, it is 
>> much slower (5-10 folds) than numpy for complex inputs (only in the 1d 
>> case) when linking to fftw3. This problem is reported on the ticket #1 
>> for scipy : http://projects.scipy.org/scipy/scipy/ticket/1
>>
>> I am not sure, because the code is a bit difficult to read, but it looks 
>> like in the case of complex input + fftw3, the plan is always recomputed 
>> for each call to zfft (file:zfft.c), whereas in the real case or in the 
>> complexe case + fftw2, the function drfft(file:drfft.c), called from 
>> zrfft (file:zrfft.c) is calling a plan which is cached. I am trying to 
>> see how the caching is done, but I am not sure I will have the time to 
>> make it work for fftw3.
>>     
>
> Well, for fftw3 it uses FFTW_ESTIMATE for the plan. So it does a cheap
> estimate of what it needs.
Well, it depends what you mean by cheap. If compared to FFTW_MEASURE, 
yes. But compared to pre-computing the plan, and then doing multiple 
fftw, then it is not cheap. Computing the fftw is negligeable compared 
to computing the plan !

I paste a c file which shows the difference, and which emulates (the 
function test_noncached) the scipy.fft module (emulates as I understand 
it): it gives the computation for a a complex array of 1024 elements, 
iterated 1000 times. First, the number of cycles for all iteration, then 
per fft on average, and min of all iteration.

The only difference between cached and non cached is that the plan is 
computed again for each iteration in the non cached case (as done by 
scipy.fft now in the case of begin linked with fftw3):

output:

testing cached
         cycle is 69973016, 69973 per execution, min is 66416
testing non cached
         cycle is 946186540, 946186 per execution, min is 918036

Code: (cycle.h is available here: http://www.fftw.org/cycle.h, and is 
the code used for the estimation of best code in plans by fftw3)

#include <stddef.h>
#include <stdlib.h>

#include <fftw3.h>

#include "cycle.h"

#define MALLOC(size) fftw_malloc((size))
#define FREE(ptr) fftw_free((ptr))

typedef struct {
    double  total;
    double  min;
} result;

result test_cached(size_t size, size_t niter);
result test_noncached(size_t size, size_t niter);

int main(void)
{
    size_t  niter   = 1e3;
    size_t  size    = 1024;

    result  res;

    printf("testing cached\n");
    res = test_cached(size, niter);
    printf("\t cycle is %d, %d per execution, min is %d\n",
            (size_t)res.total, (size_t)res.total/niter, (size_t)res.min);

    printf("testing non cached\n");
    res = test_noncached(size, niter);
    printf("\t cycle is %d, %d per execution, min is %d\n",
            (size_t)res.total, (size_t)res.total/niter, (size_t)res.min);

    fftw_cleanup();
    return 0;
}

result test_cached(size_t size, size_t niter)
{
    fftw_complex   *in, *out;
    fftw_plan       plan;
    ticks           t0, t1;
    double          res, min, acc;
    size_t          i, j;
    result          r;

    in  = MALLOC(sizeof(*in)*size);
    out = MALLOC(sizeof(*out)*size);

    plan    = fftw_plan_dft_1d(size, in, out, 1, FFTW_ESTIMATE);

    acc = 0;
    min = 1e100;
    for(i = 0; i < niter; ++i) {
        for(j = 0; j < size; ++j) {
            in[j][0]  = (0.5 - (double)rand() / RAND_MAX);
            in[j][1]  = 0.1*j;
        }
        t0  = getticks();
        fftw_execute(plan);
        t1  = getticks();
        res = elapsed(t1, t0);
        if (res < min) {
            min = res;
        }
        acc += res;
    }
    r.total = acc;
    r.min   = min;

    fftw_destroy_plan(plan);
    FREE(in);
    FREE(out);

    return r;
}

result test_noncached(size_t size, size_t niter)
{
    fftw_complex   *in, *out;
    fftw_plan       plan;
    ticks           t0, t1;
    double          res, min, acc;
    size_t          i, j;
    result          r;

    in  = MALLOC(sizeof(*in)*size);
    out = MALLOC(sizeof(*out)*size);

    acc = 0;
    min = 1e100;
    for(i = 0; i < niter; ++i) {
        for(j = 0; j < size; ++j) {
            in[j][0]  = (0.5 - (double)rand() / RAND_MAX);
            in[j][1]  = 0.1*j;
        }
        t0      = getticks();
        plan    = fftw_plan_dft_1d(size, in, out, 1, FFTW_ESTIMATE);
        fftw_execute(plan);
        fftw_destroy_plan(plan);
        t1  = getticks();
        res = elapsed(t1, t0);
        if (res < min) {
            min = res;
        }
        acc += res;
    }
    r.total = acc;
    r.min   = min;

    FREE(in);
    FREE(out);

    return r;
}

Compile by:

gcc -c test.c -o test.o -W -Wall
gcc -lfftw3 -lm -o test test.o