[Numpy-discussion] NEP 38 - Universal SIMD intrinsics

Ralf Gommers ralf.gommers at gmail.com
Thu Feb 13 12:27:13 EST 2020


On Wed, Feb 12, 2020 at 1:37 PM Devulapalli, Raghuveer <
raghuveer.devulapalli at intel.com> wrote:

> >> I hope there will not be a demand to use many non-universal intrinsics
> in ufuncs; we will need to work this out on a case-by-case basis in each
> ufunc. Does that sound reasonable? Are there intrinsics you have already
> used that have no parallel on other platforms?
>
> I think that is reasonable. It's hard to anticipate the future need and
> benefit of specialized intrinsics but I tried to make a list of some of the
> specialized intrinsics that are currently in use in NumPy that I don’t
> believe exist on other platforms (most of these actually don’t exist on
> AVX2 either). I am not an expert in ARM or VSX architecture, so please
> correct me if I am wrong.
>
> a. _mm512_mask_i32gather_ps
> b. _mm512_mask_i32scatter_ps/_mm512_mask_i32scatter_pd
> c. _mm512_maskz_loadu_pd/_mm512_maskz_loadu_ps
> d. _mm512_getexp_ps
> e. _mm512_getmant_ps
> f. _mm512_scalef_ps
> g. _mm512_permutex2var_ps, _mm512_permutex2var_pd
> h. _mm512_maskz_div_ps, _mm512_maskz_div_pd
> i. _mm512_permute_ps/_mm512_permute_pd
> j. _mm512_sqrt_ps/pd (I could be wrong on this one, but from the little
> Google search I did, it seems like the Power ISA doesn't have a vectorized
> sqrt instruction)
>
> Software implementations of these instructions are definitely possible, but
> some of them are not trivial to implement and are surely not going to be
> one-line macros either. I am also unsure of what implications this has on
> performance, but we will hopefully find out once we convert these to
> universal intrinsics and then benchmark.
>

For these it seems we don't want software implementations of the
universal intrinsics: if there's no equivalent on PPC/ARM and the
additional AVX instructions offer enough value (performance gain given the
additional code complexity), then we should simply use the AVX
instructions directly.

Ralf


> Raghuveer
>
> -----Original Message-----
> From: NumPy-Discussion <numpy-discussion-bounces+raghuveer.devulapalli=
> intel.com at python.org> On Behalf Of Matti Picus
> Sent: Tuesday, February 11, 2020 11:19 PM
> To: numpy-discussion at python.org
> Subject: Re: [Numpy-discussion] NEP 38 - Universal SIMD intrinsics
>
> On 11/2/20 8:02 pm, Devulapalli, Raghuveer wrote:
> >
> > On top of that the performance implications aren’t clear. Software
> > implementations of hardware instructions might perform worse and might
> > not even produce the same result.
> >
>
> The proposal for universal intrinsics does not enable replacing an
> intrinsic on one platform with a software emulation on another: the
> intrinsics are meant to be compile-time defines that overlay the universal
> intrinsic with a platform-specific one. In order to use a new intrinsic, it
> must have parallel intrinsics on the other platforms, or it cannot be used
> there: "NPY_CPU_HAVE(FEATURE_NAME)" will always return false, so the
> compiler will not even build a loop for that platform. I will try to
> clarify that intention in the NEP.
>
>
> I hope there will not be a demand to use many non-universal intrinsics in
> ufuncs; we will need to work this out on a case-by-case basis in each
> ufunc. Does that sound reasonable? Are there intrinsics you have already
> used that have no parallel on other platforms?
>
>
> Matti
>
> _______________________________________________
> NumPy-Discussion mailing list
> NumPy-Discussion at python.org
> https://mail.python.org/mailman/listinfo/numpy-discussion
>

