[SciPy-dev] FFTW performances in scipy and numpy

David Cournapeau david at ar.media.kyoto-u.ac.jp
Wed Aug 1 03:49:34 EDT 2007


Anne Archibald wrote:
> On 01/08/07, David Cournapeau <david at ar.media.kyoto-u.ac.jp> wrote:
>
>>     I am one of the contributors to numpy/scipy. Let me first say I am
>> *not* the main author of the fftw wrapping for scipy, that I am a
>> relative newcomer to scipy, and that I do not claim a deep understanding
>> of numpy arrays. But I have been thinking a bit about the problem, since
>> I am a big user of fft and have debugged some problems in the scipy code.
>
Ok, I prepared a small package to test several strategies:

http://www.ar.media.kyoto-u.ac.jp/members/david/archives/fftdev.tbz2

Running make test should build everything out of the box and run the tests 
(if you are on Linux and have gcc and fftw3, of course :) ). I have not 
even checked whether the computation is correct; I only tested for memory 
problems under valgrind.

Three strategies are available:
    - Check whether the given array is 16-byte aligned, and conditionally 
build plans using this information
    - Use FFTW_UNALIGNED, and do not care about alignment
    - The current strategy: copy into an aligned buffer

All three strategies use FFTW_MEASURE, which I had not done before, and 
which may explain the wrong things I said in my previous email.

There are four binaries, two of which use the first strategy: one 
(testsimd) uses standard malloc, and the other forces malloc to return 
16-byte-aligned memory.

The results on my Pentium 4, for a size of 1024 and 5000 iterations:


========================================
16 bytes aligned malloc + FFTW_MEASURE
========================================
size is 1024, iter is 5000
testing cached (estimate)
         cycle is 780359600.00, 156071.92 per execution, min is 56816.00
countaligned is 5000
========================================
not aligned malloc + FFTW_MEASURE
========================================
size is 1024, iter is 5000
testing cached (estimate)
         cycle is 948300378.00, 189660.08 per execution, min is 88216.00
countaligned is 0
========================================
not aligned malloc + FFTW_MEASURE | FFTW_UNALIGNED
========================================
size is 1024, iter is 5000
testing cached (estimate)
         cycle is 1110396876.00, 222079.38 per execution, min is 87788.00
countaligned is 0
========================================
not aligned malloc + FFTW_MEASURE + copy
========================================
size is 1024, iter is 5000
testing cached (estimate)
         cycle is 1644872686.00, 328974.54 per execution, min is 219936.00
countaligned is 0
TESTING Done

min is what matters: it is the minimum number of cycles over the 5000 
iterations.

On my Pentium M, I got totally different results, of course:

51000 cycles
55000 cycles
55000 cycles
150000 cycles

Also, on the Pentium M those numbers are almost always reproducible (they 
do not change much between runs). I don't know whether it is because of 
the Pentium 4 in my workstation, or because my home directory is on NFS, 
which may skew things in a way I don't understand, but on my workstation 
each run gives different results.

Basically, we should change the current scipy implementation now: the 
second strategy (FFTW_UNALIGNED) is easy to implement and, from the 
numbers above, roughly 2.5 times faster than the current copy-based one :)
>
> Not just libraries; with SSE2 and related instruction sets, it's quite
> possible that even ufuncs could be radically accelerated - it's
> reasonable to use SIMD (and cache control!) for even the simple case
> of adding two arrays into a third. No code yet exists in numpy to do
> so, but an aggressive optimizing compiler could do something with the
> code that is there. (Of course, this has observable numerical effects,
> so there would be the same problem as for gcc's -ffast-math flag.)
The precision problem really is specific to SSE and x86, right? But since 
Apple computers also use those now, I guess the problem is fairly 
pervasive :)
>
> Really large numpy arrays are already going to be SIMD-aligned (on
> Linux at least), because they are allocated on fresh pages. Small
> arrays are going to waste space if they're SIMD-aligned. So the
> default allocator is probably fine as it is, but it would be handy to
> have alignment as an additional property one could request from
> constructors and check from anywhere. I would hesitate to make it a
> flag, since one might well care about page alignment, 32-bit
> alignment, or whatever.
Are you sure about the page thing? A page is 4 KB, right? That would mean 
any double numpy array above 512 items is page-aligned... which is not 
what I observed when I tested. Since I screwed things up the last time I 
checked, though, I should test again.
>
> Really I suppose one can tell alignment pretty easily by looking at
> the pointer (does FFTW_ESTIMATE do this automatically?) 
This is trivial: just check whether the pointer address is a multiple of 
16 (one line of code in my benchmark, in zfft_fftw3.c).

David


