[SciPy-dev] FFTW performances in scipy and numpy

David Cournapeau david at ar.media.kyoto-u.ac.jp
Thu Aug 2 00:54:54 EDT 2007


Steven G. Johnson wrote:
> On Wed, 1 Aug 2007, David Cournapeau wrote:
>>   - making numpy data 16-byte aligned. This one is a bit tricky. I 
>> won't bother you with the details, but generally, numpy data may not 
>> even be "word aligned". Since some archs require some kind of 
>> alignment, there are mechanisms in the numpy API to get aligned 
>> buffers from unaligned buffers; I was thinking about an additional 
>> flag "SIMD alignment", since this could be quite useful for many 
>> optimized libraries using SIMD. But maybe this does not make sense; I 
>> have not yet thought about it enough to propose anything concrete to 
>> the main numpy developers.
>
> One possibility might be to allocate *all* arrays over a certain size 
> using posix_memalign, to get 16-byte alignment, instead of malloc or 
> whatever you are doing now, since if the array is big enough the 
> overhead of alignment is negligible. It depends on the importance of 
> FFTs, of course, although there may well be other things where 16-byte 
> alignment will make SIMD optimization easier.  I don't know anything 
> about Python/NumPy's memory management, however, so I don't know 
> whether this suggestion is feasible.
From my understanding of numpy (quite limited, I cannot emphasize this 
enough), the data buffer can easily be made 16-byte aligned, since it 
is always allocated using a macro which calls malloc, but that could be 
replaced by posix_memalign.

But then, there is another problem: many numpy arrays are more than 
just a data buffer, they are not always contiguous, etc. To interface 
with most C libraries, this means that a copy has to be made somewhere 
if the arrays are not contiguous or aligned; a whole infrastructure 
exists in numpy to do exactly this, but I do not know it really well 
yet. Whether replacing the current allocation with posix_memalign would 
enable the current infrastructure to always return 16-byte aligned 
buffers is unclear to me; I should discuss this with people more 
knowledgeable than me.
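For what it is worth, a minimal sketch of what such an allocator could look like, using posix_memalign (the name simd_malloc and the hard-coded 16-byte alignment are just illustrative here, not actual numpy API):

```c
#include <stdlib.h>

/* Hypothetical replacement for a plain-malloc allocation macro:
 * posix_memalign guarantees the requested alignment (16 bytes here,
 * enough for SSE loads/stores), at the cost of possibly wasting a
 * few bytes per buffer. */
static void *simd_malloc(size_t size)
{
    void *ptr = NULL;
    /* the alignment must be a power of two and a multiple of
     * sizeof(void *), which 16 is on both x86 and x86_64 */
    if (posix_memalign(&ptr, 16, size) != 0) {
        return NULL;
    }
    return ptr;
}
```

A buffer obtained this way is released with plain free(), and ((ptrdiff_t)ptr & 0xf) == 0 always holds for it.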
>
> On x86, malloc is guaranteed to be 8-byte aligned, so barring 
> conspiracy the probability of 16-byte alignment should be 50%.  On 
> x86_64, on the other hand, I've read that malloc is guaranteed to be 
> 16-byte aligned (for allocations > 16 bytes) by the ABI, so you should 
> have no worries in this case as long as Python uses malloc or similar.
See below for my findings about this problem.
>>   - I have tried FFTW_UNALIGNED + FFTW_ESTIMATE plans; unfortunately, 
>> I found that the performance was worse than using FFTW_MEASURE + 
>> copy (the copies are done into aligned buffers). I have since 
>> discovered that this may be due to the totally broken architecture of 
>> my main workstation (a Pentium 4): on my recent macbook (on linux, 
>> 32 bits, Core 2 Duo), using no copy with FFTW_UNALIGNED is much better.
>
> If you use FFTW_UNALIGNED, then essentially FFTW cannot use SIMD (with 
> some exceptions).  Whether a copy will be worth it for performance in 
> this case will depend upon the problem size.  For very large problems, 
> I'm guessing it's probably not worth it, since SIMD has less of an 
> impact there and cache has a bigger impact.
>
>>   - The above problem is fixable if we add a mechanism to choose 
>> plans (ESTIMATE vs MEASURE vs ...). I found that, for 1d cases at 
>> least, ESTIMATE vs MEASURE is what really counts performance-wise.
>
> One thing that might help is to build in wisdom at least for a few 
> small power-of-two sizes, created at build time.  e.g. wisdom for 
> sizes 128, 256, 512, 1024, 2048, 4096, and 8192 is < 2kB.
>
> (We also need to work on the estimator heuristics...this has proven to 
> be a hard problem in general, unfortunately.)
This is the way matlab works, right? If I understand correctly, wisdom 
is a way to compute plans "offline". So for example, if you compute 
plans with FFTW_MEASURE | FFTW_UNALIGNED for in-place transforms and a 
set of sizes, you can record them as wisdom and reload it, so that 
later calls with FFTW_MEASURE | FFTW_UNALIGNED will be fast?

Anyway, all this sounds like it should be solved by adding a better 
infrastructure to the current wrappers (a la matlab).


> Your claim that numpy arrays are almost never 16-byte aligned strikes 
> me as odd; if true, it means that NumPy is doing something terribly 
> weird in its allocation.
>
> I just did a quick test, and on i386 with glibc (Debian GNU/Linux), 
> malloc'ing 10000 arrays of double variables of random size, almost 
> exactly 50% are 16-byte aligned as expected.  And on x86_64, 100% are 
> 16-byte aligned (because of the abovementioned ABI requirement).
Well, then I must be doing something wrong in my tests (I attached the 
test program I use to check pointers returned by malloc). It always 
returns 1 for buffers allocated with fftw_malloc, but always 0 for 
buffers allocated with malloc on my machine, which is quite similar to 
yours (ubuntu, x86, glibc). Not sure what I am missing (surely something 
obvious).

regards,

David

/*
 * test.c: allocate buffers of random size, and count the ones which are
 * 16-byte aligned.
 */

/* drand48/srand48 are only declared by <stdlib.h> with this macro */
#define _XOPEN_SOURCE 500

#include <stdlib.h>
#include <stdio.h>
#include <string.h>
#include <stddef.h>
#include <math.h>
#include <time.h>

#ifdef FORCE_ALIGNED_MALLOC
    #include <fftw3.h>
    #define MALLOC(size) fftw_malloc((size))
    #define FREE(ptr) fftw_free((ptr))
#else
    #define MALLOC(size) malloc((size))
    #define FREE(ptr) free((ptr))
#endif

/* Return 1 if the pointer x is 16-byte aligned */
#define CHECK_SIMD_ALIGNMENT(x) \
      ( ((ptrdiff_t)(x) & 0xf) == 0)

const size_t MAXMEM  = 1 << 26;

int allocmem(size_t n);

int main(void)
{
    int i;
    int niter = 100000;
    size_t s;
    int acc = 0;
    double mean = 0, min = 1e100, max = 0;
    double r;

    /* not really random, but enough for here */
    srand48(time(0));
    for(i = 0; i < niter; ++i) {
        /* forcing the size to be in [1, MAXMEM] */
        r = drand48() * MAXMEM;
        s = floor(r);
        if (s < 1) {
            s = 1;
        } else if (s > MAXMEM) {
            s = MAXMEM;
        }
        /* computing mean, max and min allocated size for the summary */
        mean += s;
        if (max < s) {
            max = s;
        }
        if (min > s) {
            min = s;
        }
        /* allocating */
        acc += allocmem(s);
    }
    fprintf(stdout, "ratio %d / %d; average size is %f (m %f, M %f)\n",
            acc, niter, mean / niter, min, max);
    return 0;
}

/*
 * Alloc n bytes, and return 1 if the memory allocated with this size was
 * 16-byte aligned, 0 otherwise.
 */
int allocmem(size_t n)
{
    char* tmp;
    int isal = 0;

    tmp = MALLOC(n);
    if (tmp == NULL) {
        return 0;
    }
    isal = CHECK_SIMD_ALIGNMENT(tmp);
    /* Touch both ends of the buffer so that it is really allocated and
     * the compiler cannot optimize the allocation away */
    tmp[0] = 0; tmp[n-1] = 0;
    FREE(tmp);

    return isal;
}


