[SciPy-dev] scipy.fft module slow for complex inputs when linked to fftw3
David Cournapeau
david at ar.media.kyoto-u.ac.jp
Sat Aug 19 05:40:35 EDT 2006
> I tried a dirty hack using the function fftw_execute_dft, which executes
> a given plan for different arrays, given they have the same properties.
> The problem is that because of the obfuscated way fftpack is coded right
> now, it is difficult to track what is going on; I have a small test
> program which calls zfft directly from the module _fftpack.so, running
> it under valgrind shows no problem... So there is something going on in
> fftpack I don't understand.
>
> The other obvious thing is to copy the content of the array into a
> cached buffer, computing the fft on it, and recopying the result. This
> is stupid, but it is better than the current situation I think. I
> implemented this solution, the speed is much better, and tests succeed.
> Do you know how I am supposed to build a patch (I am not familiar with
> patch...)
I build a patch anyway; I am not sure the format is OK, as I am not
familiar with
patch/diff options:
Before patch (last numpy and scipy SVN), scipy.fftpack.test(10) gives:
Fast Fourier Transform
=================================================
| real input | complex input
-------------------------------------------------
size | scipy | numpy | scipy | numpy
-------------------------------------------------
100 | 0.17 | 0.14 | 1.93 | 0.11 (secs for 7000 calls)
1000 | 0.12 | 0.16 | 1.12 | 0.14 (secs for 2000 calls)
256 | 0.28 | 0.28 | 3.09 | 0.22 (secs for 10000 calls)
512 | 0.37 | 0.48 | 3.82 | 0.38 (secs for 10000 calls)
1024 | 0.05 | 0.09 | 0.53 | 0.07 (secs for 1000 calls)
2048 | 0.10 | 0.16 | 0.88 | 0.15 (secs for 1000 calls)
4096 | 0.08 | 0.17 | 0.75 | 0.17 (secs for 500 calls)
8192 | 0.20 | 0.53 | 1.39 | 0.47 (secs for 500 calls)
....
Inverse Fast Fourier Transform
===============================================
| real input | complex input
-----------------------------------------------
size | scipy | numpy | scipy | numpy
-----------------------------------------------
100 | 0.16 | 0.29 | 2.29 | 0.28 (secs for 7000 calls)
1000 | 0.13 | 0.27 | 1.22 | 0.26 (secs for 2000 calls)
256 | 0.29 | 0.57 | 3.43 | 0.51 (secs for 10000 calls)
512 | 0.41 | 0.84 | 4.14 | 0.77 (secs for 10000 calls)
1024 | 0.06 | 0.14 | 0.62 | 0.13 (secs for 1000 calls)
2048 | 0.10 | 0.26 | 0.91 | 0.25 (secs for 1000 calls)
4096 | 0.11 | 0.31 | 0.81 | 0.26 (secs for 500 calls)
8192 | 0.23 | 0.78 | 1.53 | 0.72 (secs for 500 calls)
.......
After patch (last numpy and scipy SVN), scipy.fftpack.test(10) gives:
Fast Fourier Transform
=================================================
| real input | complex input
-------------------------------------------------
size | scipy | numpy | scipy | numpy
-------------------------------------------------
100 | 0.17 | 0.14 | 0.12 | 0.11 (secs for 7000 calls)
1000 | 0.12 | 0.16 | 0.10 | 0.14 (secs for 2000 calls)
256 | 0.28 | 0.28 | 0.20 | 0.20 (secs for 10000 calls)
512 | 0.36 | 0.47 | 0.27 | 0.36 (secs for 10000 calls)
1024 | 0.05 | 0.08 | 0.05 | 0.07 (secs for 1000 calls)
2048 | 0.09 | 0.16 | 0.08 | 0.14 (secs for 1000 calls)
4096 | 0.10 | 0.17 | 0.10 | 0.16 (secs for 500 calls)
8192 | 0.23 | 0.53 | 0.23 | 0.47 (secs for 500 calls)
....
Inverse Fast Fourier Transform
===============================================
| real input | complex input
-----------------------------------------------
size | scipy | numpy | scipy | numpy
-----------------------------------------------
100 | 0.17 | 0.29 | 0.14 | 0.26 (secs for 7000 calls)
1000 | 0.13 | 0.27 | 0.15 | 0.25 (secs for 2000 calls)
256 | 0.29 | 0.54 | 0.27 | 0.50 (secs for 10000 calls)
512 | 0.38 | 0.81 | 0.43 | 0.73 (secs for 10000 calls)
1024 | 0.06 | 0.14 | 0.08 | 0.13 (secs for 1000 calls)
2048 | 0.09 | 0.24 | 0.13 | 0.23 (secs for 1000 calls)
4096 | 0.09 | 0.24 | 0.13 | 0.23 (secs for 500 calls)
8192 | 0.22 | 0.75 | 0.37 | 0.73 (secs for 500 calls)
This makes things much faster, particularly for small sizes (which is
logical considering the main cause of slowness is the building of plans).
http://projects.scipy.org/scipy/scipy/attachment/ticket/1/fftw3slow.patch
Paste below:
--- scipy/Lib/fftpack/src/zfft.c 2006-08-18 20:45:50.000000000 +0900
+++ scipy-new/Lib/fftpack/src/zfft.c 2006-08-18 21:00:26.000000000 +0900
@@ -35,9 +35,21 @@
#endif
#if defined WITH_FFTW3
-/*
- *don't cache anything
- */
+GEN_CACHE(zfftw,(int n,int d)
+ ,int direction;
+ fftw_plan plan;
+ fftw_complex* ptr;
+ ,((caches_zfftw[i].n==n) &&
+ (caches_zfftw[i].direction==d))
+ ,caches_zfftw[id].direction = d;
+ caches_zfftw[id].ptr = fftw_malloc(sizeof(fftw_complex)*(n));
+ caches_zfftw[id].plan = fftw_plan_dft_1d(n, caches_zfftw[id].ptr,
+ caches_zfftw[id].ptr,
+ (d>0?FFTW_FORWARD:FFTW_BACKWARD),
+ FFTW_ESTIMATE);
+ ,fftw_destroy_plan(caches_zfftw[id].plan);
+ fftw_free(caches_zfftw[id].ptr);
+ ,10)
#elif defined WITH_FFTW
/**************** FFTW2 *****************************/
GEN_CACHE(zfftw,(int n,int d)
@@ -73,6 +85,7 @@
destroy_zdjbfft_caches();
#endif
#ifdef WITH_FFTW3
+ destroy_zfftw_caches();
#elif defined WITH_FFTW
destroy_zfftw_caches();
#else
@@ -118,6 +131,7 @@
if (f==0)
#endif
#ifdef WITH_FFTW3
+ plan = caches_zfftw[get_cache_id_zfftw(n,direction)].plan;
#elif defined WITH_FFTW
plan = caches_zfftw[get_cache_id_zfftw(n,direction)].plan;
#else
@@ -147,11 +161,10 @@
} else
#endif
#ifdef WITH_FFTW3
- plan = fftw_plan_dft_1d(n, (fftw_complex*)ptr, (fftw_complex*)ptr,
- (direction>0?FFTW_FORWARD:FFTW_BACKWARD),
- FFTW_ESTIMATE);
- fftw_execute(plan);
- fftw_destroy_plan(plan);
+ ptrm = caches_zfftw[get_cache_id_zfftw(n,direction)].ptr;
+ memcpy(ptrm, ptr, sizeof(double)*2*n);
+ fftw_execute(plan);
+ memcpy(ptr, ptrm, sizeof(double)*2*n);
#elif defined WITH_FFTW
fftw_one(plan,(fftw_complex*)ptr,NULL);
#else
@@ -181,11 +194,10 @@
} else
#endif
#ifdef WITH_FFTW3
- plan = fftw_plan_dft_1d(n, (fftw_complex*)ptr, (fftw_complex*)ptr,
- (direction>0?FFTW_FORWARD:FFTW_BACKWARD),
- FFTW_ESTIMATE);
- fftw_execute(plan);
- fftw_destroy_plan(plan);
+ ptrm = caches_zfftw[get_cache_id_zfftw(n,direction)].ptr;
+ memcpy(ptrm, ptr, sizeof(double)*2*n);
+ fftw_execute(plan);
+ memcpy(ptr, ptrm, sizeof(double)*2*n);
#elif defined WITH_FFTW
fftw_one(plan,(fftw_complex*)ptr,NULL);
#else
This is really a quick/dirty hack, as I don't really know the mechanism.
For example, I am not sure if there is no memory leak (I test the new
function zfft through valgrind, though, with no memory leak reported).
There should be a way to avoid the two full copies, too, which makes for
a significant proportion of the computing time, I think, but this would
require a rewrite of the module, I guess.
David
More information about the SciPy-Dev
mailing list