[SciPy-dev] FFTW performances in scipy and numpy
David Cournapeau
david at ar.media.kyoto-u.ac.jp
Thu Aug 2 00:54:54 EDT 2007
Steven G. Johnson wrote:
> On Wed, 1 Aug 2007, David Cournapeau wrote:
>> - making numpy data 16 bytes aligned. This one is a bit tricky. I
>> won't bother you with the details, but generally, numpy data may not
>> be even "word aligned". Since some archs require some kind of
>> alignement, there are some mechanisms to get aligned buffers from
>> unaligned buffers in numpy API; I was thinking about an additional
>> flag "SIMD alignement", since this could be quite useful for many
>> optimized libraries using SIMD. But maybe this does not make sense, I
>> have not yet thought enough about it to propose anything concrete to
>> the main numpy developers.
>
> One possibility might be to allocate *all* arrays over a certain size
> using posix_memalign, to get 16-byte alignnment, instead of malloc or
> whatever you are doing now, since if the array is big enough the
> overhead of alignment is negligible. It depends on the importance of
> FFTs, of course, although there may well be other things where 16-byte
> alignment will make SIMD optimization easier. I don't know anything
> about Python/NumPy's memory management, however, so I don't know
> whether this suggestion is feasible.
from my understanding of numpy (quite limited, I cannot emphasize this
enough), the data buffer can easily be made 16 bytes aligned, since they
are always allocated using a macro which is calling malloc, but could
replaced by posix_memalign.
But then, there is another problem: many numpy arrays are just than one
data buffer, are not always contiguous, etc... To interface with most C
libraries, this means that a copy has to be made somewhat if the arrays
are not contiguous or aligned; a whole infrastructure exists in numpy to
do exactly this, but I do not know it really well yet. Whether replacing
the current allocation with posix_memalign would enable the current
infrastructure to always return 16 bytes aligned buffers is unclear to
me, I should discuss this with people more knowledgeable than me.
>
> On x86, malloc is guaranteed to be 8-byte aligned, so barring
> conspiracy the probability of 16-byte alignment should be 50%. On
> x86_64, on the other hand, I've read that malloc is guaranteed to be
> 16-byte aligned (for allocations > 16 bytes) by the ABI, so you should
> have no worries in this case as long as Python uses malloc or similar.
Se below for my findings about this problem.
>> - I have tried FFTW_UNALIGNED + FFTW_ESTIMATE plans; unfortunately,
>> I found that the performances were worse than using FFTW_MEASURE +
>> copy (the copies are done into aligned buffers). I have since
>> discover that this may be due to the totally broken architecture of
>> my main workstation (a Pentium four): on my recent macbook (On linux,
>> 32 bits, CoreDuo2), using no copy with FFTW_UNALIGNED is much better.
>
> If you use FFTW_UNALIGNED, then essentially FFTW cannot use SIMD (with
> some exceptions). Whether a copy will be worth it for performance in
> this case will depend upon the problem size. For very large problems,
> I'm guessing it's probably not worth it, since SIMD has less of an
> impact there and cache has a bigger impact.
>
>> - The above problem is fixable if we add a mechanisme to choose
>> plans (ESTIMATE vs MEASURE vs ... I found that for 1d cases at least,
>> ESTIMATE vs MEASURE is what really count performance wise).
>
> One thing that might help is to build in wisdom at least for a few
> small power-of-two sizes, created at build time. e.g. wisdom for
> sizes 128, 256, 512, 1024, 2048, 4096, and 8192 is < 2kB.
>
> (We also need to work on the estimator heuristics...this has proven to
> be a hard problem in general, unfortunately.)
This is the way matlab works, right ? If I understand correctly, wisdoms
are a way to compute plans "offline". So for example, if you compute
plans with FFTW_MEASURE | FFTW_UNALIGNED, for inplace transforms and a
set of sizes, you can record it in a wisdom, and reload it such as later
calls with FFTW_MEASURE | FFTW_UNALIGNED will be fast ?
Anyway, all this sounds like it should be solved by adding a better
infrastructure the current wrappers (ala matlab).
> Your claim that numpy arrays are almost never 16-byte aligned strikes
> me as odd; if true, it means that NumPy is doing something terribly
> weird in its allocation.
>
> I just did a quick test, and on i386 with glibc (Debian GNU/Linux),
> malloc'ing 10000 arrays of double variables of random size, almost
> exactly 50% are 16-byte aligned as expected. And on x86_64, 100% are
> 16-byte aligned (because of the abovementioned ABI requirement).
Well, then I must be doing something wrong in my tests (I attached the
test program I use to test pointers returned by malloc). This always
return 1 for buffers allocated with fftw_malloc, but always 0 for
buffers allocated with malloc on my machine, which is quite similar to
yours (ubuntu, x86, glibc). Not sure what I am missing (surely something
obvious).
regards,
David
/*
* test.c: allocate buffers of random size, and count the ones which are
16 bytes aligned
*/
#include <stdlib.h>
#include <stdio.h>
#include <string.h>
#include <stddef.h>
#include <math.h>
#include <time.h>
#ifdef FORCE_ALIGNED_MALLOC
#include <fftw3.h>
#define MALLOC(size) fftw_malloc((size))
#define FREE(ptr) fftw_free((ptr))
#else
#define MALLOC(size) malloc((size))
#define FREE(ptr) free((ptr))
#endif
/* Return 1 if the pointer x is 16 bytes aligned */
#define CHECK_SIMD_ALIGNMENT(x) \
( ((ptrdiff_t)(x) & 0xf) == 0)
const size_t MAXMEM = 1 << 26;
int allocmem(size_t n);
int main(void)
{
int i;
int niter = 100000;
size_t s;
int acc = 0;
double mean = 0, min = 1e100, max = 0;
double r;
/* not really random, but enough for here */
srand48(time(0));
for(i = 0; i < niter; ++i) {
/* forcing the size to be in [1, MAXMEM] */
r = drand48() * MAXMEM;
s = floor(r);
if (s < 1) {
s = 1;
} else if (s > MAXMEM) {
s = MAXMEM;
}
/* computing max and min allocated size for summary */
mean += r;
if (max < s) {
max = s;
}
if (min > s) {
min = s;
}
/* allocating */
acc += allocmem(s);
}
fprintf(stdout, "ratio %d / %d; average size is %f (m %f, M %f)\n",
acc, niter, mean / niter, min, max);
return 0;
}
/*
* Alloc n bytes, and return 0 if the memory allocated with this size was 16
* bytes aligned.
*/
int allocmem(size_t n)
{
char* tmp;
int isal = 0;
tmp = MALLOC(n);
isal = CHECK_SIMD_ALIGNMENT(tmp);
/* Not sure this is useful, but this should guarantee that tmp is really
* allocated, and that the compiler is not getting too smart */
tmp[0] = 0; tmp[n-1] = 0;
FREE(tmp);
return isal;
}
More information about the SciPy-Dev
mailing list