[Numpy-discussion] NEP 38 - Universal SIMD intrinsics

Ralf Gommers ralf.gommers at gmail.com
Tue Feb 11 16:33:19 EST 2020


On Tue, Feb 11, 2020 at 12:03 PM Devulapalli, Raghuveer <
raghuveer.devulapalli at intel.com> wrote:

> >> I think this doesn't quite answer the question. If I understand
> correctly, it's about a single instruction (e.g. one needs "VEXP2PD" and
> it's missing from the supported AVX512 instructions in master). I think
> the answer is yes, it needs to be added for other architectures as well.
>
>
>
> That adds a lot of overhead to write SIMD based optimizations which can
> discourage contributors.
>

Keep in mind that a new universal intrinsics instruction is just a set of
defines, which is far less work than writing a ufunc that uses that
instruction. We can also ping a platform expert when it's not obvious what
the corresponding arch-specific instruction is. That's a bit of a
chicken-and-egg problem; once we get going, we will hopefully attract more
interested people who can help each other out.
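To make the "set of defines" concrete, here is a rough sketch of the idea. The npyv_* names below are illustrative, loosely modeled on NumPy's naming; the real headers differ, and the scalar fallback is simplified:

```c
#include <assert.h>
#include <stddef.h>

/* Sketch: a "universal" intrinsic is a generic name that each platform
   maps, via defines, to its native instruction. Hypothetical names. */
#if defined(__AVX2__)
    #include <immintrin.h>
    typedef __m256 npyv_f32;
    #define npyv_load_f32(p)     _mm256_loadu_ps(p)
    #define npyv_add_f32(a, b)   _mm256_add_ps(a, b)
    #define npyv_store_f32(p, v) _mm256_storeu_ps(p, v)
    #define NPYV_NLANES_F32 8
#elif defined(__ARM_NEON)
    #include <arm_neon.h>
    typedef float32x4_t npyv_f32;
    #define npyv_load_f32(p)     vld1q_f32(p)
    #define npyv_add_f32(a, b)   vaddq_f32(a, b)
    #define npyv_store_f32(p, v) vst1q_f32(p, v)
    #define NPYV_NLANES_F32 4
#else
    /* Scalar fallback: a "vector" of one lane. */
    typedef float npyv_f32;
    #define npyv_load_f32(p)     (*(p))
    #define npyv_add_f32(a, b)   ((a) + (b))
    #define npyv_store_f32(p, v) (*(p) = (v))
    #define NPYV_NLANES_F32 1
#endif

/* One loop body, written once against the universal names, compiles
   on every platform the defines above cover. */
static void add_arrays(const float *a, const float *b,
                       float *out, size_t n)
{
    size_t i = 0;
    for (; i + NPYV_NLANES_F32 <= n; i += NPYV_NLANES_F32) {
        npyv_f32 va = npyv_load_f32(a + i);
        npyv_f32 vb = npyv_load_f32(b + i);
        npyv_store_f32(out + i, npyv_add_f32(va, vb));
    }
    for (; i < n; i++)  /* scalar tail for leftover elements */
        out[i] = a[i] + b[i];
}
```

The point is that supporting a new platform mostly means filling in one more block of defines, not rewriting the loops that use them.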


> It’s also an unreasonable expectation that a developer be familiar with
> SIMD of all the architectures. On top of that the performance implications
> aren’t clear. Software implementations of hardware instructions might
> perform worse and might not even produce the same result.
>

I think you are worrying about writing ufuncs here, not about adding an
instruction. If the same result is not produced, CI should fail; when it
does and the cause isn't easy to figure out, we can make that platform
fall back to the generic non-SIMD version of the ufunc.

Cheers,
Ralf



>
>
> *From:* NumPy-Discussion <numpy-discussion-bounces+raghuveer.devulapalli=
> intel.com at python.org> *On Behalf Of *Ralf Gommers
> *Sent:* Monday, February 10, 2020 9:17 PM
> *To:* Discussion of Numerical Python <numpy-discussion at python.org>
> *Subject:* Re: [Numpy-discussion] NEP 38 - Universal SIMD intrinsics
>
>
>
>
>
>
>
> On Tue, Feb 4, 2020 at 2:00 PM Hameer Abbasi <einstein.edison at gmail.com>
> wrote:
>
> —snip—
>
>
>
> > 1) Once NumPy adds the framework and initial set of Universal Intrinsic,
> if contributors want to leverage a new architecture specific SIMD
> instruction, will they be expected to add software implementation of this
> instruction for all other architectures too?
>
>
>
> In my opinion, if the instructions are lower, then yes. For example, one
> cannot add AVX-512 without also adding, for example, AVX-256 and AVX-128
> and SSE*.  However, I would not expect one person or team to be an expert
> in all assemblies, so intrinsics for one architecture can be developed
> independently of another.
>
>
>
> I think this doesn't quite answer the question. If I understand correctly,
> it's about a single instruction (e.g. one needs "VEXP2PD" and it's
> missing from the supported AVX512 instructions in master). I think the
> answer is yes, it needs to be added for other architectures as well.
> Otherwise, if universal intrinsics are added ad-hoc and there's no
> guarantee that a universal instruction is available for all main supported
> platforms, then over time there won't be much that's "universal" about the
> framework.
>
>
>
> This is a different question though from adding a new ufunc
> implementation. I would expect accelerating ufuncs via intrinsics that are
> already supported to be much more common than having to add new intrinsics.
> Does that sound right?
>
>
>
>
> > 2) On whom does the burden lie to ensure that new implementations are
> benchmarked and shows benefits on every architecture? What happens if
> optimizing an Ufunc leads to improving performance on one architecture and
> worsens performance on another?
>
>
>
> This is slightly hard to provide a recipe for. I suspect it may take a
> while before this becomes an issue, since we don't have much SIMD code to
> begin with. So adding new code with benchmarks will likely show
> improvements on all architectures (we should ensure benchmarks can be run
> via CI, otherwise it's too onerous). And if not and it's not easily
> fixable, the problematic platform could be skipped so performance there is
> unchanged.
>
>
>
> Only once there's existing universal intrinsics and then they're tweaked
> will we have to be much more careful I'd think.
>
>
>
> Cheers,
>
> Ralf
>
>
>
>
>
>
>
> I would look at this from a maintainability point of view. If we are
> increasing the code size by 20% for a certain ufunc, there must be a
> demonstrable 20% increase in performance on any CPU. That is to say,
> micro-optimisation will be unwelcome, and code readability will be
> preferable. Usually we ask the submitter of the PR to test the PR on a
> machine they have on hand, and I would be inclined to keep this trend of
> self-reporting. Of course, if someone else came along and reported a
> performance regression of, say, 10%, then we would have increased code
> size by 20% for only a net 5% gain in performance, and the PR would have
> to be reverted.
>
>
>
> —snip—
>
> _______________________________________________
> NumPy-Discussion mailing list
> NumPy-Discussion at python.org
> https://mail.python.org/mailman/listinfo/numpy-discussion
>
