[SciPy-User] Single precision FFT insufficiently accurate.

Mon Jun 28 10:01:43 EDT 2010

Den 28.06.2010 08:05, skrev Anne Archibald:
> The performance numbers in the blog post I linked to show that padding
> to the next larger power of two is substantially slower than padding
> to a number with a more complex factorization, though less padding and
> more complexity slows it back down a little.

That's something FFTW exploits to be fast. It tries various 
factorizations (and paddings?) and "learns" the fastest. Two other 
factors that play a role here is data alignment and cache use. It's not 
just the flop count that matters.

At least we should use always buffers aligned to 16 byte boundaries (or 
a product of 16), so the compiler can be allowed generate simd code 
(MMX, SSE/SSE2, altivec). We can tell that to the C compiler using 
__declspec(align(16)) on arrays, possibly with C99 restrict qualifier as 
well.

The differences between these matter a lot to the C compiler:

__declspec(align(16)) const double *restrict array; // free to vectorize 
at will

double *array; // aliased? aligned? who can tell?

Aligning buffers in NumPy is easy:

def aligned_buffer(shape, boundary=4096, dtype=np.float64, order='C'):
     N,d = np.prod(shape), np.dtype(dtype)
     tmp = np.empty(N * d.itemsize + boundary, dtype=np.uint8)
     address = tmp.__array_interface__['data'][0]
     offset = (boundary - address % boundary) % boundary
     return tmp[offset:offset+N]\
             .view(dtype=d)\
             .reshape(shape, order=order)

I don't think this is too important though. I usually begin with the 
smallest possible FFT size, and increment the padding by one until I get 
a product of 2, 3 or 5. I have yet to see a case where this has not been 
sufficient for my work. Being a little bit practical is affordable too. :-)

Sturla
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://mail.scipy.org/pipermail/scipy-user/attachments/20100628/ed2be0a3/attachment.html>