[Numpy-discussion] Experimental `like=` attribute for array creation functions

Peter Andreas Entschev peter at entschev.com
Mon Aug 17 15:55:30 EDT 2020


As discussed, I've opened a PR
(https://github.com/numpy/numpy/pull/17093) attempting to clarify some
of the writing and to follow the NEP template. As suggested in the
template, please find below the top part of NEP 35 (up to and
including the Backward Compatibility section). Please feel free to
comment and suggest improvements or point out what may still be
unclear; I would personally prefer comments directly on the PR if
possible.

===========================================================
NEP 35 — Array Creation Dispatching With __array_function__
===========================================================

:Author: Peter Andreas Entschev <pentschev at nvidia.com>
:Status: Draft
:Type: Standards Track
:Created: 2019-10-15
:Updated: 2020-08-17
:Resolution:

Abstract
--------

We propose the introduction of a new keyword argument, ``like=``, to all array
creation functions. This argument permits the creation of an array based on
a non-NumPy reference array passed via that argument, resulting in an array
defined by the downstream library that implements both that array type and
the ``__array_function__`` protocol. With this we address one of that
protocol's shortcomings, as described in NEP 18 [1]_.

Motivation and Scope
--------------------

Many libraries implement the NumPy API, such as Dask for graph computing, CuPy
for GPGPU computing, and xarray for N-D labeled arrays. All of these libraries
have something else in common: they have also adopted the
``__array_function__`` protocol. The protocol defines a mechanism that allows a
user to use the NumPy API directly as a dispatcher based on the input array
type. In essence, dispatching means users are able to pass a downstream array,
such as a Dask array, directly to one of NumPy's compute functions, and NumPy
will automatically recognize that and send the work back to Dask's
implementation of that function, which defines the return value. For
example:

.. code:: python

    import numpy as np
    import dask.array

    x = dask.array.arange(5)    # Creates a Dask array
    np.sum(x)                   # Dispatches to Dask, returns a Dask array

Note above how we called Dask's implementation of ``sum`` via the NumPy
namespace by calling ``np.sum``; the same would apply if we had a CuPy
array or any other array from a library that adopts ``__array_function__``.
This allows writing code that is agnostic to the implementing library: users
can write their code once and still use different array implementations
according to their needs.

Unfortunately, ``__array_function__`` has limitations, one of them being array
creation functions. In the example above, NumPy was able to call Dask's
implementation because the input array was a Dask array. The same is not true
for array creation functions: in the example, the input of ``arange`` is simply
the integer ``5``, which provides no information about the array type that
should result. This is where a reference array passed via the ``like=``
argument proposed here can help, as it provides NumPy with the information
required to create the expected type of array.
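
For illustration, a minimal sketch of the intended behavior (assuming the
downstream library, here Dask, dispatches ``np.arange`` via
``__array_function__`` once ``like=`` is available):

.. code:: python

    import numpy as np
    import dask.array

    x = dask.array.arange(5)

    np.arange(5)            # No type information available: returns np.ndarray
    np.arange(5, like=x)    # Intended to dispatch to Dask and return a Dask array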

The proposed ``like=`` keyword is solely intended to identify the downstream
library to dispatch to, and the object is used only as a reference, meaning
that no modifications, copies, or processing will be performed on that object.

We expect this functionality to be mostly useful to library developers,
allowing them to create new arrays for internal usage based on arrays passed
in by the user, and preventing the unnecessary creation of NumPy arrays that
would ultimately require an additional conversion into a downstream array type.

Support for Python 2.7 was dropped in NumPy 1.17, so we make use of the
keyword-only argument syntax described in PEP 3102 [2]_ to implement
``like=``, preventing it from being passed by position.
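
As a hypothetical signature sketch (the ``asarray`` signature shown here is
only for illustration, not the final API), ``like`` follows the bare ``*``
marker, making it keyword-only:

.. code:: python

    def asarray(a, dtype=None, order=None, *, like=None):
        # `like`, when not None, is used only to dispatch via
        # __array_function__ to the library defining its type.
        ...

    # asarray(data, like=ref)        -- accepted: passed by keyword
    # asarray(data, None, None, ref) -- TypeError: like= cannot be positional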

.. _neps.like-kwarg.usage-and-impact:

Usage and Impact
----------------

To understand the intended use for ``like=``, and before we move to more complex
cases, consider the following illustrative example consisting only of NumPy and
CuPy arrays:

.. code:: python

    import numpy as np
    import cupy

    def my_pad(arr, padding):
        padding = np.array(padding, like=arr)
        return np.concatenate((padding, arr, padding))

    my_pad(np.arange(5), [-1, -1])    # Returns np.ndarray
    my_pad(cupy.arange(5), [-1, -1])  # Returns cupy.core.core.ndarray

Note in the ``my_pad`` function above how ``arr`` is used as a reference to
dictate what array type the padding should have, before the arrays are
concatenated to produce the result. On the other hand, if ``like=`` weren't
used, the NumPy case would still work, but CuPy doesn't allow this kind of
automatic conversion and would ultimately raise a
``TypeError: Only cupy arrays can be concatenated`` exception.
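
For comparison, a minimal sketch of the failure mode just described
(``my_pad_no_like`` is a hypothetical name used here only for illustration):

.. code:: python

    def my_pad_no_like(arr, padding):
        padding = np.array(padding)   # Always creates a NumPy array
        return np.concatenate((padding, arr, padding))

    my_pad_no_like(np.arange(5), [-1, -1])    # Works: everything is NumPy
    my_pad_no_like(cupy.arange(5), [-1, -1])  # TypeError: Only cupy arrays can be concatenated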

Now we should look at how a library like Dask could benefit from ``like=``.
Before that, it's important to understand a bit about Dask basics and how it
ensures correctness with ``__array_function__``. Note that Dask can compute
different sorts of objects, such as dataframes, bags, and arrays; here we will
focus strictly on arrays, which are the objects we can use
``__array_function__`` with.

Dask uses a graph computing model, meaning it breaks a large problem down into
many smaller problems and merges their results to reach the final result. To
break the problem down, Dask also breaks arrays into smaller arrays, which it
calls "chunks". A Dask array can thus consist of one or more chunks, and they
may be of different types. However, in the context of
``__array_function__``, Dask only allows chunks of the same type; for example,
a Dask array can be formed of several NumPy arrays or several CuPy arrays, but
not a mix of both.
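
As a brief illustration of chunking (a sketch using the standard Dask array
API):

.. code:: python

    import dask.array as da

    x = da.arange(10, chunks=5)   # A single Dask array made of two chunks of 5 elements
    x.chunks                      # ((5, 5),)
    x.compute()                   # array([0, 1, 2, 3, 4, 5, 6, 7, 8, 9]) -- chunk results merged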

To avoid mismatched types during compute, Dask keeps an attribute ``_meta`` as
part of its array throughout computation. This attribute is used both to
predict the output type at graph creation time and to create any intermediary
arrays that are necessary within some function's computation. Going back to our
previous example, we can use the ``_meta`` information to identify what kind of
array we should use for padding, as seen below:

.. code:: python

    import numpy as np
    import cupy
    import dask.array as da
    from dask.array.utils import meta_from_array

    def my_pad(arr, padding):
        padding = np.array(padding, like=meta_from_array(arr))
        return np.concatenate((padding, arr, padding))

    # Returns dask.array<concatenate, shape=(9,), dtype=int64,
    # chunksize=(5,), chunktype=numpy.ndarray>
    my_pad(da.arange(5), [-1, -1])

    # Returns dask.array<concatenate, shape=(9,), dtype=int64,
    # chunksize=(5,), chunktype=cupy.ndarray>
    my_pad(da.from_array(cupy.arange(5)), [-1, -1])

Note how ``chunktype`` in the return value above changes from
``numpy.ndarray`` in the first ``my_pad`` call to ``cupy.ndarray`` in the
second.

To enable proper identification of the array type we use Dask's utility
function ``meta_from_array``, which was introduced as part of the work to
support ``__array_function__`` and allows Dask to handle ``_meta``
appropriately. That function is primarily targeted at the library's internal
usage, ensuring chunks are created with correct types. Without the ``like=``
argument, it would be impossible to ensure ``my_pad`` creates a padding array
whose type matches that of the input array, which would cause a ``TypeError``
exception to be raised by CuPy, just as discussed above for the CuPy-only case.

Backward Compatibility
----------------------

This proposal does not raise any backward compatibility issues within NumPy,
given that it only introduces a new keyword argument to existing array creation
functions with a default ``None`` value, thus not changing current behavior.
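
As a brief sketch of why existing code is unaffected (assuming the default of
``None`` simply disables the new dispatching behavior):

.. code:: python

    np.ones(3)              # np.ndarray, exactly as before
    np.ones(3, like=None)   # Default value; expected to behave identically, no dispatching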

On Sun, Aug 16, 2020 at 1:41 PM Ralf Gommers <ralf.gommers at gmail.com> wrote:
>
>
>
> On Fri, Aug 14, 2020 at 12:23 PM Peter Andreas Entschev <peter at entschev.com> wrote:
>>
>> Hi all,
>>
>> This thread has IMO drifted very far from its original purpose, due to that I decided to start a new thread specifically for the general NEP procedure discussed, please check your mail for "NEP Procedure Discussion" subject.
>
>
> Thanks Peter. For future reference: better to just edit the thread subject, but not start over completely - people want to reply to previous content. I will copy over comments I'd like to reply to to the other thread by hand now.
>
>>
>> On the topic of this thread, I'll try to rewrite NEP-35 to make it more accessible and ping back here once I have a PR for that.
>
>
> Thanks!
>
> Cheers,
> Ralf
>
>> Is there anything else that's pressing here? If there is and I missed/forgot about it, please let me know.
>>
>> Best,
>> Peter
>>
>> On Fri, Aug 14, 2020 at 5:00 AM Juan Nunez-Iglesias <jni at fastmail.com> wrote:
>>>
>>> Hello everyone again!
>>>
>>> A few clarifications about my proposal of external peer review:
>>>
>>> - Yes, all this work is public and announced on the mailing list. However, I don’t think there’s a single person in this discussion or even this whole ecosystem that does not have a more immediately-pressing and also virtually infinite to-do list, so it’s unreasonable to expect that generally they would do more than glance at the stuff in the mailing list. In the peer review analogy, the mailing list is like the arXiv or Biorxiv stream — yep, anyone can see the stuff on there and comment, but most people just don’t have the time or attention to grab onto that. The only reason I stopped to comment here is Sebastian’s “Imma merge, YOLO!”, which had me raising my eyebrows real high. Especially for something that would expand the NumPy API!
>>>
>>> - So, my proposal is that there needs to be an *editor* of NEPs who takes responsibility, once they are themselves satisfied with the NEP, for seeking out external reviewers and pinging them individually and asking them if they would be ok to review.
>>>
>>> - A good friend who does screenwriting once told me, “don’t use all your proofreaders at once”. You want to get feedback, improve things, then feedback from a *totally independent* new person who can see the document with fresh eyes.
>>>
>>> Obviously, all of the above slows things down. But “alone we go fast, together we go far”. The point of a NEP is to document critical decisions for the long term health of the project. If the documentation is insufficient, it defeats the whole purpose. Might as well just implement stuff and skip the whole NEP process. (Side note: Stephan, I for one would definitely appreciate an update to existing NEPs if there’s obvious ways they can be improved!)
>>>
>>> I do think that NEP templates should be strict, and I don’t think that is incompatible with plain, jargon-free text. The NEP template and guidelines should specify that, and that the motivation should be understandable by a casual NumPy user — the kind described by Ilhan, for whom bare NumPy actually meets all their needs. Maybe they’ve also used PyTorch but they’ve never really had cause to mix them or write a program that worked with both kinds of arrays.
>>>
>>> Ditto for backwards compatibility — everyone should be clear when their existing code is going to be broken. Actually NEP18 broke so much of my code, but its Backward compatibility section basically says all good! https://numpy.org/neps/nep-0018-array-function-protocol.html#backward-compatibility
>>>
>>> Anywho, as always, none of this is criticism to work done — I thank you all, and am eternally grateful for all the hard work everyone is doing to keep the ecosystem from fragmenting. I’m just hoping that this discussion can improve the process going forward!
>>>
>>> And, yes, apologies to Peter, I know from repeated personal experience how frustrating it can be to have last-minute drive-by objections after months of consensus building! But I think in the end every time that happened the end result was better — I hope the same is true here! And yes, I’ll reiterate Ralf’s point: my concerns are about the NEP process itself rather than this one. I’ll summarise my proposal:
>>>
>>> - strict NEP template. NEPs with missing sections will not be accepted.
>>> - sections Abstract, Motivation, and Backwards Compatibility should be understandable at a high level by casual users with ~zero background on the topic
>>> - enforce the above with at least two independent rounds of coordinated peer review.
>>>
>>> Thank you,
>>>
>>> Juan.
>>>
>>> On 14 Aug 2020, at 5:29 am, Stephan Hoyer <shoyer at gmail.com> wrote:
>>>
>>> On Thu, Aug 13, 2020 at 5:22 AM Ralf Gommers <ralf.gommers at gmail.com> wrote:
>>>>
>>>> Thanks for raising these concerns Ilhan and Juan, and for answering Peter. Let me give my perspective as well.
>>>>
>>>> To start with, this is not specifically about Peter's NEP and PR. NEP 35 simply follows the pattern set by previous PRs, and given its tight scope is less difficult to understand than other NEPs on such technical topics. Peter has done a lot of things right, and is close to the finish line.
>>>>
>>>>
>>>> On Thu, Aug 13, 2020 at 12:02 PM Peter Andreas Entschev <peter at entschev.com> wrote:
>>>>>
>>>>>
>>>>> > I think, arriving to an agreement would be much faster if there is an executive summary of who this is intended for and what the regular usage is. Because with no offense, all I see is "dispatch", "_array_function_" and a lot of technical details of which I am absolutely ignorant.
>>>>>
>>>>> This is what I intended to do in the Usage Guidance [2] section. Could
>>>>> you elaborate on what more information you'd want to see there? Or is
>>>>> it just a matter of reorganizing the NEP a bit to try and summarize
>>>>> such things right at the top?
>>>>
>>>>
>>>> We adapted the NEP template [6] several times last year to try and improve this. And specified in there as well that NEP content set to the mailing list should only contain the sections: Abstract, Motivation and Scope, Usage and Impact, and Backwards compatibility. This to ensure we fully understand the "why" and "what" before the "how". Unfortunately that template and procedure hasn't been exercised much yet, only in NEP 38 [7] and partially in NEP 41 [8].
>>>>
>>>> If we have long-time maintainers of SciPy (Ilhan and myself), scikit-image (Juan) and CuPy (Leo, on the PR review) all saying they don't understand the goals, relevance, target audience, or how they're supposed to use a new feature, that indicates that the people doing the writing and having the discussion are doing something wrong at a very fundamental level.
>>>>
>>>> At this point I'm pretty disappointed in and tired of how we write and discuss NEPs on technical topics like dispatching, dtypes and the like. People literally refuse to write down concrete motivations, goals and non-goals, code that's problematic now and will be better/working post-NEP and usage examples before launching into extensive discussion of the gory details of the internals. I'm not sure what to do about it. Completely separate API and behavior proposals from implementation proposals? Make separate "API" and "internals" teams with the likes of Juan, Ilhan and Leo on the API team which then needs to approve every API change in new NEPs? Offer to co-write NEPs if someone is willing but doesn't understand how to go about it? Keep the current structure/process but veto further approvals until NEP authors get it right?
>>>
>>>
>>> I think the NEP template is great, and we should try to be more diligent about following it!
>>>
>>> My own NEP 37 (__array_module__) is probably a good example of poor presentation due to not following the template structure. It goes pretty deep into low-level motivation and some implementation details before usage examples.
>>>
>>> Speaking just for myself, I would have appreciated a friendly nudge to use the template. Certainly I think it would be fine to require using the template for newly submitted NEPs. I did not remember about it when I started drafting NEP 37, and it definitely would have helped. I may still try to do a revision at some point to use the template structure.
>>>
>>>>
>>>> I want to make an exception for merging the current NEP, for which the plan is to merge it as experimental to try in downstream PRs and get more experience. That does mean that master will be in an unreleasable state by the way, which is unusual and it'd be nice to get Chuck's explicit OK for that. But after that, I think we need a change here. I would like to hear what everyone thinks is the shape that change should take - any of my above suggestions, or something else?
>>>>
>>>>
>>>>>
>>>>> > Finally as a minor point, I know we are mostly (ex-)academics but this necessity of formal language on NEPs is self-imposed (probably PEPs are to blame) and not quite helping. It can be a bit more descriptive in my external opinion.
>>>>>
>>>>> TBH, I don't really know how to solve that point, so if you have any
>>>>> specific suggestions, that's certainly welcome. I understand the
>>>>> frustration for a reader trying to understand all the details, with
>>>>> many being only described in NEP-18 [3], but we also strive to avoid
>>>>> rewriting things that are written elsewhere, which would also
>>>>> overburden those who are aware of what's being discussed.
>>>>>
>>>>>
>>>>> > I also share Ilhan’s concern (and I mentioned this in a previous NEP discussion) that NEPs are getting pretty inaccessible. In a sense these are difficult topics and readers should be expected to have *some* familiarity with the topics being discussed, but perhaps more effort should be put into the context/motivation/background of a NEP before accepting it. One way to ensure this might be to require a final proofreading step by someone who has not been involved at all in the discussions, like peer review does for papers.
>>>>
>>>>
>>>> Some variant of this proposal would be my preference.
>>>>
>>>> Cheers,
>>>> Ralf
>>>>
>>>>>
>>>>> [1] https://github.com/numpy/numpy/issues/14441#issuecomment-529969572
>>>>> [2] https://numpy.org/neps/nep-0035-array-creation-dispatch-with-array-function.html#usage-guidance
>>>>> [3] https://numpy.org/neps/nep-0018-array-function-protocol.html
>>>>> [4] https://numpy.org/neps/nep-0000.html#nep-workflow
>>>>> [5] https://mail.python.org/pipermail/numpy-discussion/2019-October/080176.html
>>>>
>>>>
>>>> [6] https://github.com/numpy/numpy/blob/master/doc/neps/nep-template.rst
>>>> [7] https://github.com/numpy/numpy/blob/master/doc/neps/nep-0038-SIMD-optimizations.rst
>>>> [8] https://github.com/numpy/numpy/blob/master/doc/neps/nep-0041-improved-dtype-support.rst
>>>>
>>>>
>>>>>
>>>>>
>>>>>
>>>>> On Thu, Aug 13, 2020 at 3:44 AM Juan Nunez-Iglesias <jni at fastmail.com> wrote:
>>>>> >
>>>>> > I’ve generally been on the “let the NumPy devs worry about it” side of things, but I do agree with Ilhan that `like=` is confusing and `typeof=` would be a much more appropriate name for that parameter.
>>>>> >
>>>>> > I do think library writers are NumPy users and so I wouldn’t really make that distinction, though. Users writing their own analysis code could very well be interested in writing code using numpy functions that will transparently work when the input is a CuPy array or whatever.
>>>>> >
>>>>> > I also share Ilhan’s concern (and I mentioned this in a previous NEP discussion) that NEPs are getting pretty inaccessible. In a sense these are difficult topics and readers should be expected to have *some* familiarity with the topics being discussed, but perhaps more effort should be put into the context/motivation/background of a NEP before accepting it. One way to ensure this might be to require a final proofreading step by someone who has not been involved at all in the discussions, like peer review does for papers.
>>>>> >
>>>>> > Food for thought.
>>>>> >
>>>>> > Juan.
>>>>> >
>>>>> > On 13 Aug 2020, at 9:24 am, Ilhan Polat <ilhanpolat at gmail.com> wrote:
>>>>> >
>>>>> > For what is worth, as a potential consumer in SciPy, it really doesn't say anything (both in NEP and the PR) about how the regular users of NumPy will benefit from this. If only and only 3rd parties are going to benefit from it, I am not sure adding a new keyword to an already confusing function is the right thing to do.
>>>>> >
>>>>> > Let me clarify,
>>>>> >
>>>>> > - This is already a very (I mean extremely very) easy keyword name to confuse with ones_like, zeros_like and by its nature any other interpretation. It is not signalling anything about the functionality that is being discussed. I would seriously consider reserving such obvious names for really obvious tasks. Because you would also expect the shape and ndim would be mimicked by the "like"d argument but it turns out it is acting more like "typeof=" and not "like=" at all. Because if we follow the semantics it reads as "make your argument asarray like the other thing" but it is actually doing, "make your argument an array with the other thing's type" which might not be an array after all.
>>>>> >
>>>>> > - Again, if this is meant for downstream libraries (because that's what I got out of the PR discussion, cupy, dask, and JAX were the only examples I could read) then hiding it in another function and writing with capital letters "this is not meant for numpy users" would be a much more convenient way to separate the target audience and regular users. numpy.astypedarray([[some data], [...]], type_of=x) or whatever else it may be would be quite clean and to the point with no ambiguous keywords.
>>>>> >
>>>>> > I think, arriving to an agreement would be much faster if there is an executive summary of who this is intended for and what the regular usage is. Because with no offense, all I see is "dispatch", "_array_function_" and a lot of technical details of which I am absolutely ignorant.
>>>>> >
>>>>> > Finally as a minor point, I know we are mostly (ex-)academics but this necessity of formal language on NEPs is self-imposed (probably PEPs are to blame) and not quite helping. It can be a bit more descriptive in my external opinion.
>>>>


More information about the NumPy-Discussion mailing list