[Numpy-discussion] Proposal to accept NEP-18, __array_function__ protocol

Mon Aug 20 11:26:21 EDT 2018

Ping to finish up this discussion so we can come to a conclusion. I'm in
favor of the NEP, as I don't see it as orthogonal to Nathaniel's concerns.
However, we might want to be selective as to which functions we expose via
the `__array_function__` method.

On Wed, Aug 15, 2018 at 10:45 AM, Stephan Hoyer <shoyer at gmail.com> wrote:

> Nathaniel,
>
> Thanks for raising these thoughtful concerns. Your independent review of
> this proposal is greatly appreciated!
>
> See my responses inline below:
>
> On Mon, Aug 13, 2018 at 2:44 AM Nathaniel Smith <njs at pobox.com> wrote:
>
>> The other approach would be to incrementally add clean, well-defined
>> dunder methods like __array_ufunc__, __array_concatenate__, etc. This
>> way we end up putting some thought into each interface, making sure
>> that it's something we can support, protecting downstream libraries
>> from unnecessary complexity (e.g. they can implement
>> __array_concatenate__ instead of hstack, vstack, row_stack,
>> column_stack, ...), or avoiding adding new APIs entirely (e.g., by
>> converting existing functions into ufuncs so __array_ufunc__ starts
>> automagically working). And in the end we get a clean list of dunder
>> methods that new array container implementations have to define. It's
>> plausible to imagine a generic test suite for array containers. (I
>> suspect that every library that tries to implement __array_function__
>> will end up with accidental behavioral differences, just because the
>> numpy API is so vast and contains so many corner cases.) So the
>> clean-well-defined-dunders approach has lots of upsides. The big
>> downside is that this is a much longer road to go down.
>>
>
> RE: accidental differences in behavior:
>
> I actually think that the __array_function__ approach is *less* prone to
> accidental differences in behavior, because we require implementing every
> function directly (or it raises an error).
>
> This avoids a classic subclassing problem that has plagued NumPy for
> years, where overriding the behavior of method A causes apparently
> unrelated method B to break, because it relied on method A internally. In
> NumPy, this constrained our implementation of np.median(), because it
> needed to call np.mean() in order for subclasses implementing units to work
> properly.
>
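To make the subclassing problem above concrete, here is a minimal pure-Python sketch. The `Array` and `TruncatingArray` classes are hypothetical and are not NumPy code; they only illustrate how overriding one method can silently change another that relies on it internally.

```python
class Array:
    """Toy container whose median() is implemented in terms of mean()."""

    def __init__(self, values):
        self.values = list(values)

    def mean(self):
        return sum(self.values) / len(self.values)

    def median(self):
        ordered = sorted(self.values)
        n = len(ordered)
        # One middle element for odd n, two for even n.
        middle = ordered[(n - 1) // 2 : n // 2 + 1]
        # Internal reliance on mean(): any override leaks into median().
        return type(self)(middle).mean()


class TruncatingArray(Array):
    """Hypothetical subclass that only meant to change mean()."""

    def mean(self):
        return int(super().mean())


print(Array([1, 2, 3, 4]).median())            # 2.5
print(TruncatingArray([1, 2, 3, 4]).median())  # 2
```

The subclass never touched `median()`, yet its result changed. Requiring each function to be implemented explicitly (or to raise an error) is meant to rule out this kind of hidden coupling.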
> There will certainly be accidental differences in behavior for third-party
> code that *uses* NumPy, but this is basically inevitable for any proposal
> to allow NumPy's public API to be overloaded. It's also avoided by
> default by third-party libraries that follow the current best practice of
> casting all input arrays with np.asarray().
>
> --------------
>
> RE: a hypothetical simplified interface:
>
> The need to implement everything you want to use in NumPy's public API
> could certainly be onerous, but on the other hand there is a long list of
> projects that have already done this today -- and these are the projects
> that most need __array_function__.
>
> I'm sure there are cases where simplification would be warranted, but in
> particular I don't think __array_concatenate__ has significant advantages
> over simply implementing __array_function__ for np.concatenate. It's a
> slightly different way of spelling, but it basically does the same thing.
> The level of complexity to implement hstack, vstack, row_stack and
> column_stack in terms of np.concatenate is pretty minimal.
> __array_function__ implementors could easily copy and paste code from NumPy
> or use a third-party helpers library (like NDArrayOperatorsMixin) that
> provides such implementations.
>
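As a rough illustration of how little code the stacking helpers need once a concatenate primitive exists, here is a toy sketch on nested Python lists. The `concatenate` and `atleast_2d` functions below are simplified stand-ins written for this example (1-D and 2-D inputs only); they are not NumPy's implementations.

```python
def atleast_2d(a):
    """Promote a flat list to a single-row 2-D nested list."""
    return a if a and isinstance(a[0], list) else [list(a)]


def concatenate(arrays, axis=0):
    """Toy stand-in for np.concatenate, restricted to 2-D nested lists."""
    if axis == 0:
        return [row[:] for a in arrays for row in a]
    # axis == 1: join the matching rows of every input side by side.
    return [sum(rows, []) for rows in zip(*arrays)]


def vstack(arrays):
    """Stack row-wise, promoting 1-D inputs to single rows first."""
    return concatenate([atleast_2d(a) for a in arrays], axis=0)


def hstack(arrays):
    """Stack column-wise; 1-D inputs are simply joined end to end."""
    if all(not isinstance(a[0], list) for a in arrays):
        return [x for a in arrays for x in a]
    return concatenate(arrays, axis=1)


print(vstack([[1, 2], [3, 4]]))          # [[1, 2], [3, 4]]
print(hstack([[[1], [2]], [[3], [4]]]))  # [[1, 3], [2, 4]]
print(hstack([[1, 2], [3]]))             # [1, 2, 3]
```

Each helper is a thin wrapper that picks an axis and normalizes dimensions, which is the sense in which the extra work beyond `concatenate` is small.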
> I also have other concerns about the "simplified API" approach beyond the
> difficulty of figuring it out, but those are already mentioned in the NEP:
> http://www.numpy.org/neps/nep-0018-array-function-protocol.html#implementations-in-terms-of-a-limited-core-api
>
>> But... this is wishful thinking. No matter what the NEP says, I simply
>> don't believe that we'll actually go break dask, sparse arrays,
>> xarray, and sklearn in a numpy point release. Or any numpy release.
>> Nor should we. If we're serious about keeping this experimental – and
>> I think that's an excellent idea for now! – then IMO we need to do
>> something more to avoid getting trapped by backwards compatibility.
>>
>
> I agree, but to be clear, development for dask, sparse and xarray (and
> even broadly supported machine learning libraries like TensorFlow) still
> happens at a much faster pace than is currently the case for "core"
> projects in the SciPy stack like NumPy. It would not be a big deal to
> encounter breaking changes in a "major" NumPy release (i.e., 1.X ->
> 1.(X+1)).
>
> (Side note: sklearn doesn't directly implement any array types, so I don't
> think it would make use of __array_function__ in any way, except possibly
> to implement overloadable functions.)
>

Here is Travis Oliphant's talk at PyBay
<https://speakerdeck.com/teoliphant/ml-in-python>, where he talks about the
proliferation of arrays and interfaces in the ML/AI ecosystem among other
things. I think that we should definitely try to get NumPy out there as an
option in the near future.

>
>> My suggestion: at numpy import time, check for an envvar, like say
>> NUMPY_EXPERIMENTAL_ARRAY_FUNCTION=1. If it's not set, then all the
>> __array_function__ dispatches turn into no-ops. This lets interested
>> downstream libraries and users try this out, but makes sure that we
>> won't have a hundred thousand end users depending on it without
>> realizing.
>
>
>
>> - makes it easy for end-users to check how much overhead this adds (by
>> running their code with it enabled vs disabled)
>> - if/when we decide to commit to supporting it for real, we just
>> remove the envvar.
>>
>
> I'm slightly concerned that the cost of reading an environment variable
> with os.environ could exaggerate the performance cost of
> __array_function__. It takes about 1 microsecond to read an environment
> variable on my laptop, which is comparable to the full overhead of
> __array_function__. So we may want to switch to an explicit Python API
> instead, e.g., np.enable_experimental_array_function().
>
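The two opt-in mechanisms discussed above can be sketched side by side. This is only an illustration under stated assumptions: the variable name `NUMPY_EXPERIMENTAL_ARRAY_FUNCTION` and the function name `enable_experimental_array_function` come from the proposals in this thread, while `_ENABLED` and `dispatch` are hypothetical names invented here. None of this is an existing NumPy API.

```python
import os

# Read the environment variable once, at "import time", so that each
# dispatch pays only for a module-level boolean check instead of an
# os.environ lookup on every call.
_ENABLED = os.environ.get("NUMPY_EXPERIMENTAL_ARRAY_FUNCTION", "0") == "1"


def enable_experimental_array_function(enabled=True):
    """Explicit Python-level switch (hypothetical alternative to the envvar)."""
    global _ENABLED
    _ENABLED = bool(enabled)


def dispatch(func, overload, *args, **kwargs):
    """Use the overload only while the experimental flag is on."""
    if _ENABLED and overload is not None:
        return overload(*args, **kwargs)
    # With the flag off, dispatch is a no-op and the default runs.
    return func(*args, **kwargs)


enable_experimental_array_function(False)
print(dispatch(len, lambda seq: -1, [1, 2]))  # 2

enable_experimental_array_function(True)
print(dispatch(len, lambda seq: -1, [1, 2]))  # -1
```

Either way the per-call cost is a single flag check; the difference is only in how the flag gets set.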
> My bigger concern is when/how we decide to graduate __array_function__
> from requiring an explicit opt-in. We don't need to make a final decision
> now, but it would be good to be clear about what specifically we are waiting
> for.
>
> I see three types of likely scenarios for changing __array_function__:
> 1. We decide that overloading the NumPy namespace in general is a bad
> idea, based on either performance or predictability consequences for
> third-party libraries. In this case, I imagine we would probably keep
> __array_function__, but revert to a separate namespace for explicitly
> overloaded functions, e.g., numpy.api.
> 2. We want to keep __array_function__, but need a breaking change to the
> interface (and we're really attached to keeping the name
> __array_function__).
> 3. We decide that specific functions should use a different interface
> (e.g., switch from __array_function__ to __array_ufunc__).
>
> (1) and (2) are the sort of major concerns that in my mind would warrant
> hiding a feature behind an experimental flag. For the most part, I expect
> (1) could be resolved relatively quickly by running benchmark suites after
> we have a working version of __array_function__. To be honest, I don't see
> either of these rollback scenarios as terribly likely, but the downside
> risk is large enough that we may want to protect ourselves for a major
> release or two (6-12 months).
>
> (3) will be a much longer process, likely to stretch out over years at the
> current pace of NumPy development. I don't think we'll want to keep an
> opt-in flag for this long of a period. Rather, we may want to accept a
> shorter deprecation cycle than usual. In most cases, I suspect we could
> incrementally switch to new overloads while preserving the
> __array_function__ overload for a release or two.
>
>> I don't really understand the 'types' frozenset. The NEP says "it will
>> be used by most __array_function__ methods, which otherwise would need
>> to extract this information themselves"... but they still need to
>> extract the information themselves, because they still have to examine
>> each object and figure out what type it is. And, simply creating a
>> frozenset costs ~0.2 µs on my laptop, which is overhead that we can't
>> possibly optimize later...
>>
>
> The most flexible alternative would be to just say that we provide a
> fixed-length iterable, and return a tuple object. (In my microbenchmarks,
> it's faster to make a tuple than a list or set.) In an early draft of the
> NEP, I proposed exactly this, but the speed difference seemed really
> marginal to me.
>
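For anyone who wants to repeat the container-construction microbenchmark on their own machine, a small `timeit` sketch follows. The three types in `types` are arbitrary stand-ins for array types, and each timing includes the overhead of one lambda call, so only the relative ordering is meaningful. Absolute numbers will vary by machine and Python version.

```python
import timeit

# Arbitrary stand-ins for the argument types collected during dispatch.
types = [int, float, str]


def time_container(factory, number=200_000):
    """Average seconds per call to build one container from `types`."""
    return timeit.timeit(lambda: factory(types), number=number) / number


results = {
    name: time_container(factory)
    for name, factory in [
        ("tuple", tuple),
        ("list", list),
        ("frozenset", frozenset),
    ]
}

for name, seconds in results.items():
    print(f"{name:>9}: {seconds * 1e6:.3f} µs per call")
```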
> I included 'types' in the interface because I really do think it's
> something that almost all __array_function__ implementations should use.
> It preserves a nice separation of concerns between dispatching logic
> and implementations for a new type. At least as long as __array_function__
> is experimental, I don't think we should be encouraging people to write
> functions that could return NotImplemented directly and to rely entirely on
> the NumPy interface.
>
> Many but not all implementations will need to look at argument types. This
> is only really essential for cases where mixed operations between NumPy
> arrays and another type are allowed. If you only implement the NumPy
> interface for MyArray objects, then in the usual Python style you wouldn't
> need isinstance checks.
>
> It's also important from an ecosystem perspective. If we don't make it
> easy to get type information, my guess is that many __array_function__
> authors wouldn't bother to return NotImplemented for unexpected types,
> which means that __array_function__ will break in weird ways when used with
> objects from unrelated libraries.
>
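A minimal sketch of how an implementation might use `types` to return NotImplemented for unexpected types is shown below. The method signature follows the NEP; everything else (`MyArray`, `OtherArray`, `HANDLED_FUNCTIONS`, `implements`, and the placeholder `concatenate`) is hypothetical and written only for this example. To stay independent of any particular NumPy version, the protocol method is called directly instead of going through NumPy's dispatcher.

```python
HANDLED_FUNCTIONS = {}


def implements(public_func):
    """Register an implementation for a public function (hypothetical helper)."""
    def decorator(impl):
        HANDLED_FUNCTIONS[public_func] = impl
        return impl
    return decorator


def concatenate(arrays):
    """Placeholder for a public function such as np.concatenate."""
    raise NotImplementedError("used only as a dispatch key in this sketch")


class MyArray:
    def __init__(self, data):
        self.data = list(data)

    def __array_function__(self, func, types, args, kwargs):
        if func not in HANDLED_FUNCTIONS:
            return NotImplemented
        # Defer if any argument type is one we do not know how to handle.
        if not all(issubclass(t, MyArray) for t in types):
            return NotImplemented
        return HANDLED_FUNCTIONS[func](*args, **kwargs)


@implements(concatenate)
def _concatenate(arrays):
    return MyArray(x for a in arrays for x in a.data)


class OtherArray:
    """An unrelated array type from some other library."""


a, b = MyArray([1, 2]), MyArray([3])
same = a.__array_function__(concatenate, frozenset([MyArray]), ([a, b],), {})
mixed = a.__array_function__(
    concatenate, frozenset([MyArray, OtherArray]), ([a, OtherArray()],), {}
)
print(same.data)  # [1, 2, 3]
print(mixed)      # NotImplemented
```

The type check is a single line over `types`, and the implementation itself never has to inspect its arguments, which is the separation between dispatching logic and implementation described above.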
>

Chuck