[Numpy-discussion] adding fused multiply and add to numpy

Wed Jan 8 17:09:58 EST 2014

On Wed, Jan 8, 2014 at 2:39 PM, Julian Taylor <jtaylor.debian at googlemail.com
> wrote:

> Hi,
> Since AMDs bulldozer and Intels Haswell x86 cpus now also support the
> fused-multiply-and-add operation in hardware.
>
> http://en.wikipedia.org/wiki/Multiply–accumulate_operation
>
> This operation is interesting for numpy for two reasons:
>  - Only one rounding is involved in two arithmetic operations, this is
> good reducing rounding errors.
>  - Two operations are done in one memory pass, so it improves the
> performance if ones operations are bound by the memory bandwidth which
> is very common in numpy.
>
> I have done some experiments using a custom ufunc:
> https://github.com/juliantaylor/npufuncs
>
> See the README.md on how to try it out. It requires a recent GCC
> compiler, at least 4.7 I think.
>
> It contains SSE2, AVX, FMA3 (AVX2), FMA4 and software emulation
> variants. Edit the file to select which one to use. Note if the cpu does
> not support the instruction it will just crash.
> Only the latter three are real FMA operations, the SSE2 and AVX variants
> just perform two regular operations in one loop.
>
> My current machine only supports SSE2, so here are the timings for it:
>
> In [25]: a = np.arange(500000.)
> In [26]: b = np.arange(500000.)
> In [27]: c = np.arange(500000.)
>
> In [28]: %timeit npfma.fma(a, b, c)
> 100 loops, best of 3: 2.49 ms per loop
>
> In [30]: def pure_numpy_fma(a,b,c):
>    ....:     return a * b + c
>
> In [31]: %timeit pure_numpy_fma(a, b, c)
> 100 loops, best of 3: 7.36 ms per loop
>
>
> In [32]: def pure_numpy_fma2(a,b,c):
>    ....:     tmp = a *b
>    ....:     tmp += c
>    ....:     return tmp
>
> In [33]: %timeit pure_numpy_fma2(a, b, c)
> 100 loops, best of 3: 3.47 ms per loop
>
>
> As you can see even without real hardware support it is about 30% faster
> than inplace unblocked numpy due better use of memory bandwidth. Its
> even more than two times faster than unoptimized numpy.
>
> If you have a machine capable of fma instructions give it a spin to see
> if you get similar or better results. Please verify the assembly
> (objdump -d fma-<suffix>.o) to check if the compiler properly used the
> machine fma.
>
> An issue is software emulation of real fma. This can be enabled in the
> test ufunc with npfma.set_type("libc").
> This is unfortunately incredibly slow about a factor 300 on my machine
> without hardware fma.
> This means we either have a function that is fast on some platforms and
> slow on others but always gives the same result or we have a fast
> function that gives better results on some platforms.
> Given that we are not worth that what numpy currently provides I favor
> the latter.
>
> Any opinions on whether this should go into numpy or maybe stay a third
> party ufunc?
>

Multiply and add is a standard function that I think would be good to have
in numpy. Not only does it save on memory accesses, it saves on temporary
arrays.

Another function that could be useful is a |a|**2 function, abs2 perhaps.

Chuck
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://mail.python.org/pipermail/numpy-discussion/attachments/20140108/bbf84266/attachment.html>