[SciPy-dev] scipy.fft module slow for complex inputs when linked to fftw3

Sat Aug 19 05:40:35 EDT 2006

> I tried a dirty hack using the function fftw_execute_dft, which executes 
> a given plan for different arrays, given they have the same properties. 
> The problem is that because of the obfuscated way fftpack is coded right 
> now, it is difficult to track what is going on; I have a small test 
> program which calls zfft directly from the module _fftpack.so, running 
> it under valgrind shows no problem... So there is something going on in 
> fftpack I don't understand.
>
> The other obvious thing is to copy the content of the array into a 
> cached buffer, computing the fft on it, and recopying the result. This 
> is stupid, but it is better than the current situation I think. I 
> implemented this solution, the speed is much better, and tests succeed. 
> Do you know how I am supposed to build a patch (I am not familiar with 
> patch...)
I build a patch anyway; I am not sure the format is OK, as I am not 
familiar with
patch/diff options:

Before patch (last numpy and scipy SVN),  scipy.fftpack.test(10) gives:

                 Fast Fourier Transform
=================================================
      |    real input     |   complex input
-------------------------------------------------
 size |  scipy  |  numpy  |  scipy  |  numpy
-------------------------------------------------
  100 |    0.17 |    0.14 |    1.93 |    0.11  (secs for 7000 calls)
 1000 |    0.12 |    0.16 |    1.12 |    0.14  (secs for 2000 calls)
  256 |    0.28 |    0.28 |    3.09 |    0.22  (secs for 10000 calls)
  512 |    0.37 |    0.48 |    3.82 |    0.38  (secs for 10000 calls)
 1024 |    0.05 |    0.09 |    0.53 |    0.07  (secs for 1000 calls)
 2048 |    0.10 |    0.16 |    0.88 |    0.15  (secs for 1000 calls)
 4096 |    0.08 |    0.17 |    0.75 |    0.17  (secs for 500 calls)
 8192 |    0.20 |    0.53 |    1.39 |    0.47  (secs for 500 calls)
....
       Inverse Fast Fourier Transform
===============================================
      |     real input    |    complex input
-----------------------------------------------
 size |  scipy  |  numpy  |  scipy  |  numpy
-----------------------------------------------
  100 |    0.16 |    0.29 |    2.29 |    0.28  (secs for 7000 calls)
 1000 |    0.13 |    0.27 |    1.22 |    0.26  (secs for 2000 calls)
  256 |    0.29 |    0.57 |    3.43 |    0.51  (secs for 10000 calls)
  512 |    0.41 |    0.84 |    4.14 |    0.77  (secs for 10000 calls)
 1024 |    0.06 |    0.14 |    0.62 |    0.13  (secs for 1000 calls)
 2048 |    0.10 |    0.26 |    0.91 |    0.25  (secs for 1000 calls)
 4096 |    0.11 |    0.31 |    0.81 |    0.26  (secs for 500 calls)
 8192 |    0.23 |    0.78 |    1.53 |    0.72  (secs for 500 calls)
.......

After patch (last numpy and scipy SVN),  scipy.fftpack.test(10) gives:

                 Fast Fourier Transform
=================================================
      |    real input     |   complex input
-------------------------------------------------
 size |  scipy  |  numpy  |  scipy  |  numpy
-------------------------------------------------
  100 |    0.17 |    0.14 |    0.12 |    0.11  (secs for 7000 calls)
 1000 |    0.12 |    0.16 |    0.10 |    0.14  (secs for 2000 calls)
  256 |    0.28 |    0.28 |    0.20 |    0.20  (secs for 10000 calls)
  512 |    0.36 |    0.47 |    0.27 |    0.36  (secs for 10000 calls)
 1024 |    0.05 |    0.08 |    0.05 |    0.07  (secs for 1000 calls)
 2048 |    0.09 |    0.16 |    0.08 |    0.14  (secs for 1000 calls)
 4096 |    0.10 |    0.17 |    0.10 |    0.16  (secs for 500 calls)
 8192 |    0.23 |    0.53 |    0.23 |    0.47  (secs for 500 calls)
....
       Inverse Fast Fourier Transform
===============================================
      |     real input    |    complex input
-----------------------------------------------
 size |  scipy  |  numpy  |  scipy  |  numpy
-----------------------------------------------
  100 |    0.17 |    0.29 |    0.14 |    0.26  (secs for 7000 calls)
 1000 |    0.13 |    0.27 |    0.15 |    0.25  (secs for 2000 calls)
  256 |    0.29 |    0.54 |    0.27 |    0.50  (secs for 10000 calls)
  512 |    0.38 |    0.81 |    0.43 |    0.73  (secs for 10000 calls)
 1024 |    0.06 |    0.14 |    0.08 |    0.13  (secs for 1000 calls)
 2048 |    0.09 |    0.24 |    0.13 |    0.23  (secs for 1000 calls)
 4096 |    0.09 |    0.24 |    0.13 |    0.23  (secs for 500 calls)
 8192 |    0.22 |    0.75 |    0.37 |    0.73  (secs for 500 calls)

This makes things much faster, particularly for small sizes (which is 
logical considering the main cause of slowness is the building of plans).

http://projects.scipy.org/scipy/scipy/attachment/ticket/1/fftw3slow.patch

Paste below:

--- scipy/Lib/fftpack/src/zfft.c    2006-08-18 20:45:50.000000000 +0900
+++ scipy-new/Lib/fftpack/src/zfft.c    2006-08-18 21:00:26.000000000 +0900
@@ -35,9 +35,21 @@
 #endif
 
 #if defined WITH_FFTW3
-/*
- *don't cache anything
- */
+GEN_CACHE(zfftw,(int n,int d)
+      ,int direction;
+       fftw_plan plan;
+       fftw_complex* ptr;
+      ,((caches_zfftw[i].n==n) &&
+        (caches_zfftw[i].direction==d))
+      ,caches_zfftw[id].direction = d;
+       caches_zfftw[id].ptr = fftw_malloc(sizeof(fftw_complex)*(n));
+       caches_zfftw[id].plan = fftw_plan_dft_1d(n, caches_zfftw[id].ptr,
+        caches_zfftw[id].ptr,
+        (d>0?FFTW_FORWARD:FFTW_BACKWARD),
+        FFTW_ESTIMATE);
+      ,fftw_destroy_plan(caches_zfftw[id].plan);
+       fftw_free(caches_zfftw[id].ptr);
+      ,10)
 #elif defined WITH_FFTW
 /**************** FFTW2 *****************************/
 GEN_CACHE(zfftw,(int n,int d)
@@ -73,6 +85,7 @@
   destroy_zdjbfft_caches();
 #endif
 #ifdef WITH_FFTW3
+  destroy_zfftw_caches();
 #elif defined WITH_FFTW
   destroy_zfftw_caches();
 #else
@@ -118,6 +131,7 @@
   if (f==0)
 #endif
 #ifdef WITH_FFTW3
+    plan = caches_zfftw[get_cache_id_zfftw(n,direction)].plan;
 #elif defined WITH_FFTW
     plan = caches_zfftw[get_cache_id_zfftw(n,direction)].plan;
 #else
@@ -147,11 +161,10 @@
       } else
 #endif
 #ifdef WITH_FFTW3
-    plan = fftw_plan_dft_1d(n, (fftw_complex*)ptr, (fftw_complex*)ptr,
-                (direction>0?FFTW_FORWARD:FFTW_BACKWARD),
-                FFTW_ESTIMATE);
-        fftw_execute(plan);
-    fftw_destroy_plan(plan);
+    ptrm    = caches_zfftw[get_cache_id_zfftw(n,direction)].ptr;
+    memcpy(ptrm, ptr, sizeof(double)*2*n);
+    fftw_execute(plan);
+    memcpy(ptr, ptrm, sizeof(double)*2*n);
 #elif defined WITH_FFTW
         fftw_one(plan,(fftw_complex*)ptr,NULL);
 #else
@@ -181,11 +194,10 @@
       } else
 #endif
 #ifdef WITH_FFTW3
-    plan = fftw_plan_dft_1d(n, (fftw_complex*)ptr, (fftw_complex*)ptr,
-                (direction>0?FFTW_FORWARD:FFTW_BACKWARD),
-                FFTW_ESTIMATE);
-        fftw_execute(plan);
-    fftw_destroy_plan(plan);
+    ptrm    = caches_zfftw[get_cache_id_zfftw(n,direction)].ptr;
+    memcpy(ptrm, ptr, sizeof(double)*2*n);
+    fftw_execute(plan);
+    memcpy(ptr, ptrm, sizeof(double)*2*n);
 #elif defined WITH_FFTW
         fftw_one(plan,(fftw_complex*)ptr,NULL);
 #else

This is really a quick/dirty hack, as I don't really know the mechanism. 
For example, I am not sure if there is no memory leak (I test the new 
function zfft through valgrind, though, with no memory leak reported). 
There should be a way to avoid the two full copies, too, which makes for 
a significant proportion of the computing time, I think, but this would 
require a rewrite of the module, I guess.

David