[Numpy-discussion] Future of ufuncs

Nathaniel Smith njs at pobox.com
Mon May 29 18:34:47 EDT 2017


On Mon, May 29, 2017 at 1:51 PM, Charles R Harris
<charlesr.harris at gmail.com> wrote:
>
>
> On Mon, May 29, 2017 at 12:32 PM, Marten van Kerkwijk
> <m.h.vankerkwijk at gmail.com> wrote:
>>
>> Hi Chuck,
>>
>> Like Sebastian, I wonder a little about what level you are talking
>> about. Presumably, it is the actual implementation of the ufunc? I.e.,
>> this is not about the upper logic that decides which `__array_ufunc__`
>> to call, etc.
>>
>> If so, I agree with you that it would seem to make most sense to move
>> the implementation to `multiarray`; the current structure certainly is
>> a major hurdle to understanding how things work!
>>
>> Indeed, in terms of my earlier suggestion to have much of a ufunc's
>> work happen in `ndarray.__array_ufunc__`, one could see the type
>> resolution and iteration happening there. If the inner loops were
>> exposed, anyone working with buffers could then use the ufuncs by
>> defining their own __array_ufunc__.
>
>
> The idea of separating ufuncs from ndarray was put forward many years
> ago, maybe five or six. What I seek here is a record that we have given
> up on that ambition, so that we do not need to take it into account in
> the future. In particular, we can feel free to couple ufuncs even more
> tightly with ndarray.

I think we do want to separate ufuncs from ndarray semantically: it
should be possible to use ufuncs on sparse arrays, dask arrays, and
so on.
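
To be concrete, the semantic separation already has a mechanism: any
object can define __array_ufunc__ and have the ufunc machinery
dispatch to it. A toy sketch (LoggedArray here is made up purely for
illustration):

    import numpy as np

    class LoggedArray:
        """Toy duck array: wraps an ndarray and intercepts ufunc calls."""

        def __init__(self, data):
            self.data = np.asarray(data)

        def __array_ufunc__(self, ufunc, method, *inputs, **kwargs):
            # Unwrap LoggedArray inputs, run the underlying ufunc, rewrap.
            unwrapped = [x.data if isinstance(x, LoggedArray) else x
                         for x in inputs]
            result = getattr(ufunc, method)(*unwrapped, **kwargs)
            print("intercepted %s.%s" % (ufunc.__name__, method))
            return LoggedArray(result)

    a = LoggedArray([1.0, 2.0, 3.0])
    b = np.add(a, 1)  # prints: intercepted add.__call__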

But I don't think that altering ufuncs to work directly on
buffer/memoryview objects, or shipping them as a separate package from
the rest of numpy, is a useful step towards this goal.

Right now, handling buffers/memoryviews is easy: one can trivially
convert between them and ndarray without making any copies. I don't
know of any interesting problems that are blocked because ufuncs work
on ndarrays instead of buffer/memoryview objects. The interesting
problems are where there's a fundamentally different storage strategy
involved, like sparse/dask/... arrays.
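
For example, the round trip is zero-copy in both directions:

    import numpy as np

    buf = memoryview(bytearray(4 * 8))     # 4 float64s worth of raw memory

    # buffer -> ndarray: a view on buf's memory, not a copy
    arr = np.frombuffer(buf, dtype=np.float64)
    arr[:] = 1.5                           # writes through to the bytearray

    # ndarray -> buffer: also a view
    view = memoryview(arr)
    print(view.format, view.shape)         # d (4,)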

And similarly, I don't see what problems are solved by splitting them
out for building or distribution.

OTOH, trying to accomplish either of these things definitely has a
cost in terms of churn, complexity, doubled release-management
workload, etc. Even the current split between the multiarray and
umath modules causes problems all the time. Mostly they're boring
problems, like little utility functions that are needed in both
places but are awkward to share, or the complicated machinery needed
to let the two modules interact properly (set_numeric_ops and all
that); none of this is adding any value.
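
For anyone who hasn't run into it: set_numeric_ops is how umath's
ufuncs get installed into ndarray's operator slots at import time. A
quick sketch of the coupling (IIRC, calling it with no arguments just
returns the current operator table):

    import numpy as np

    # ndarray itself carries no arithmetic; numpy's import machinery
    # uses set_numeric_ops to wire umath's ufuncs into its operator
    # slots. With no arguments it returns the current table unchanged.
    ops = np.set_numeric_ops()
    print(ops['add'] is np.add)            # True: a + b dispatches to np.add
    print(ops['multiply'] is np.multiply)  # True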

Plus, there's a major problem: buffers/memoryviews have no way to
represent all the dtypes we currently support (e.g. datetime64), and
no way to add new ones. The only way to fix this would be to write a
PEP, shepherd patches through python-dev, wait for the next Python
feature release, and then drop support for all older Python releases.
None of this is going to happen soon; we should probably plan on the
assumption that it will never happen. So I don't see how this could
work at all.
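
Concretely, NumPy can't even export a datetime64 array through the
buffer protocol today, because PEP 3118 has no format code for it
(the exact error message may vary by version):

    import numpy as np

    arr = np.array(['2017-05-29'], dtype='datetime64[D]')
    try:
        memoryview(arr)  # no PEP 3118 format code for datetime64
    except (ValueError, BufferError) as err:
        # something like: cannot include dtype 'M' in a buffer
        print(err)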

So my vote is for merging the multiarray and umath code bases
together, and then taking advantage of the resulting flexibility to
refactor the internals to provide cleanly separated interfaces at the
API level.

-n

-- 
Nathaniel J. Smith -- https://vorpus.org

