[SciPy-User] Single precision FFT insufficiently accurate.
Sturla Molden
sturla at molden.no
Mon Jun 28 10:01:43 EDT 2010
Den 28.06.2010 08:05, skrev Anne Archibald:
> The performance numbers in the blog post I linked to show that padding
> to the next larger power of two is substantially slower than padding
> to a number with a more complex factorization, though less padding and
> more complexity slows it back down a little.
That's something FFTW exploits to be fast. It tries various
factorizations (and paddings?) and "learns" the fastest. Two other
factors that play a role here is data alignment and cache use. It's not
just the flop count that matters.
At least we should use always buffers aligned to 16 byte boundaries (or
a product of 16), so the compiler can be allowed generate simd code
(MMX, SSE/SSE2, altivec). We can tell that to the C compiler using
__declspec(align(16)) on arrays, possibly with C99 restrict qualifier as
well.
The differences between these matter a lot to the C compiler:
__declspec(align(16)) const double *restrict array; // free to vectorize
at will
double *array; // aliased? aligned? who can tell?
Aligning buffers in NumPy is easy:
def aligned_buffer(shape, boundary=4096, dtype=np.float64, order='C'):
N,d = np.prod(shape), np.dtype(dtype)
tmp = np.empty(N * d.itemsize + boundary, dtype=np.uint8)
address = tmp.__array_interface__['data'][0]
offset = (boundary - address % boundary) % boundary
return tmp[offset:offset+N]\
.view(dtype=d)\
.reshape(shape, order=order)
I don't think this is too important though. I usually begin with the
smallest possible FFT size, and increment the padding by one until I get
a product of 2, 3 or 5. I have yet to see a case where this has not been
sufficient for my work. Being a little bit practical is affordable too. :-)
Sturla
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://mail.scipy.org/pipermail/scipy-user/attachments/20100628/ed2be0a3/attachment.html>
More information about the SciPy-User
mailing list