[Numpy-discussion] Allow __getitem__ to support custom objects

Sebastian Berg sebastian at sipsolutions.net
Fri Oct 30 11:12:51 EDT 2020


On Thu, 2020-10-29 at 23:58 -0600, Aaron Meurer wrote:
> On Thu, Oct 29, 2020 at 6:09 PM Sebastian Berg
> <sebastian at sipsolutions.net> wrote:
> > On Tue, 2020-10-27 at 17:15 -0600, Aaron Meurer wrote:
> > > For ndindex (https://quansight.github.io/ndindex/), the biggest
> > > issue
> > > with the API is that to use an ndindex object to actually index
> > > an
> > > array, you have to use a[idx.raw] instead of a[idx]. This is
> > > because
> > > for NumPy arrays, you cannot allow custom objects to be indices.
> > > The
> > > exception is objects that define __index__, but this only works
> > > for
> > > integer indices. If __index__ returns anything other than an
> > > integer,
> > > you get an IndexError. This is annoying because it's easy to
> > > forget
> > > to
> > > do this when working with the ndindex API, and the error message
> > > from
> > > NumPy isn't informative about what went wrong unless you know to
> > > expect it.
> > > 
> > > I'd like to propose an API that would allow custom objects to
> > > define
> > > how they should be converted to a standard NumPy index, similar
> > > to
> > > __index__ but that supports all index types. I think there are
> > > two
> > > options here:
> > > 
> > > - Allow __index__ to return any index type, not just integers.
> > > This
> > > is
> > > the simplest because it reuses an existing API, and __index__ is
> > > the
> > > best possible name for this API. However, I'm not sure, but this
> > > may
> > > actually conflict with the text of PEP 357
> > > (https://www.python.org/dev/peps/pep-0357/). Also, some other
> > > APIs
> > > use
> > > __index__ to check if something is an indexable integer, which
> > > wouldn't accept generic index. For example, elements of a slice
> > > can
> > > be
> > > any object that defines __index__.
> > > 
> > 
> > Index converts to an integer (safely).  There is an assumption that
> > the integer is good for indexing, but I think the name shouldn't be
> > taken to mean it is specific to indexing (even if that was the main
> > motivation).
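
To illustrate the "converts to an integer (safely)" point, here is a minimal sketch using only the standard library. `operator.index` is the stdlib entry point for `__index__`; the `Meters` class is a made-up example and is not taken from NumPy or ndindex.

```python
import operator


class Meters:
    """Hypothetical integer-like type, used only for illustration."""

    def __init__(self, value):
        self._value = value

    def __index__(self):
        # Must return an exact int; anything else raises TypeError.
        return self._value


items = ["a", "b", "c", "d"]

as_int = operator.index(Meters(2))     # lossless conversion to int
picked = items[Meters(2)]              # sequences call __index__ themselves
sliced = items[Meters(1):Meters(3)]    # slice bounds may also define __index__

try:
    operator.index(2.0)                # floats are rejected, unlike int(2.0)
    float_rejected = False
except TypeError:
    float_rejected = True

try:
    operator.index(Meters((0, 1)))     # __index__ returning a tuple is an error
    tuple_rejected = False
except TypeError:
    tuple_rejected = True
```

The last case is the restriction discussed in the thread: an object cannot use `__index__` to stand in for a tuple or slice index.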
> > 
> > 
> > > - Add a new __numpy_index__ API that works like
> > > 
> > > def __numpy_index__(self):
> > >     return <tuple, integer, slice, newaxis, ellipsis, or integer
> > > or
> > > boolean array>
> > > 
> > > In NumPy, __getitem__ and __setitem__ on ndarray would first
> > > check if
> > > the input index type is one of the known types as it currently
> > > does,
> > > then it would try __index__, and if neither of those succeeds, it
> > > would
> > > call __numpy_index__(index) and use that.
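
The lookup order described above could be sketched roughly as follows. This is a toy under stated assumptions: `ToyArray` is a 1-D stand-in that treats only `int` and `slice` as "known" index types, `EveryOther` is a made-up index object, and `__numpy_index__` is the hook proposed in this thread, not an existing NumPy API.

```python
import operator


class ToyArray:
    """1-D stand-in for ndarray; only int and slice are 'known' index types."""

    def __init__(self, data):
        self.data = list(data)

    def __getitem__(self, index):
        # 1. Known index types are handled first, as they are today.
        if isinstance(index, (int, slice)):
            return self.data[index]
        # 2. Then try the existing integer protocol.
        try:
            return self.data[operator.index(index)]
        except TypeError:
            pass
        # 3. Only then fall back to the proposed hook (hypothetical).
        hook = getattr(type(index), "__numpy_index__", None)
        if hook is None:
            raise IndexError(f"unsupported index type: {type(index).__name__}")
        return self[hook(index)]


class EveryOther:
    """Hypothetical index object that stands for the slice ::2."""

    def __numpy_index__(self):
        return slice(None, None, 2)


arr = ToyArray(range(6))
result = arr[EveryOther()]
```

Because the hook is consulted last, indices that are already supported never reach step 3 in this sketch.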
> > 
> > Do you anticipate just:
> > 
> >     arr[index]
> > 
> > or also:
> > 
> >     arr[index1, index2]
> 
> I think both should work. If the second one doesn't work it would be
> surprising.
> 
> > Would you expect pandas or array-like objects to support this as
> > well?
> 
> Yes, it would probably be best for array-like to also work with the
> same API.
> 
> I don't know much about Pandas. It seems like it already allows a lot
> of indexing stuff. Do Series/Dataframe already have such an API?

I do not think so, but indexing in pandas often works differently. So I
was curious whether y

> 
> > If we only do `arr[index]` might subclassing tuple be sufficient?
> 
> I guess that technically works, except now your objects have to act
> like a tuple, even if they represent something like a slice (Python
> does not allow subclassing slice). For ndindex I've tried to make a
> distinction between objects as representing indices and the built-in
> objects that happen to be used to represent those indices by default.
> So an ndindex.Tuple explicitly doesn't work like a tuple, an
> ndindex.Integer doesn't work like an int, and so on. That way there
> is
> a clear distinction between ndindex operations and operations on the
> built-in types.
> 
> > Do
> > you have any thought on how this might play out with a potential
> > `arr.oindex[...]`?
> 
> I think oindex[idx] would call the same API on idx. I'm not sure if
> it
> matters that it's oindex, since that's at a higher level.

It is at a higher level, but it seemed to me that `ndindex` largely
plays at that level.  For example, you have a method to implement index
chaining:

    arr[idx1][idx2] == arr[idx1.as_subindex(idx2)]

(or similar). But this will not work:

    arr.oindex[idx1].oindex[idx2] != arr.oindex[idx1.as_subindex(idx2)]

Also the "result" shape, or even questions like `.isempty()`, will give
different answers when used as an `.oindex[...]`.
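
A minimal pure-Python sketch of why the result shape depends on the indexing mode. It assumes the outer ("orthogonal") semantics proposed for `oindex` in NEP 21 and uses nested lists as a stand-in for a 2-D array, so neither NumPy nor ndindex is called here.

```python
grid = [
    [0, 1, 2],
    [10, 11, 12],
    [20, 21, 22],
]
rows, cols = [0, 2], [0, 1]

# Current NumPy behaviour for arr[rows, cols]: the index arrays are paired
# up element by element, giving a 1-D result of length 2.
pointwise = [grid[r][c] for r, c in zip(rows, cols)]

# Outer behaviour as proposed for arr.oindex[rows, cols]: every row is
# combined with every column, giving a 2x2 result.
outer = [[grid[r][c] for c in cols] for r in rows]
```

Here `pointwise` is `[0, 21]` while `outer` is `[[0, 1], [20, 21]]`, so the same pair of index lists yields different shapes depending on how it is applied.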

This is why I thought that `arr[idx1, idx2]` is possibly a very
different case from `arr[idx]`, at least for the current NumPy indexing
logic (it would be better with `arr.oindex[]`).
The difference doesn't matter in your proposal, but I had the
impression that the `arr[idx1, idx2]` form might be rare/unused and
that form would not be able to carry information such as whether this
is supposed to be an "oindex".

Maybe it helps to look back at `.oindex` to explain this. A possible
solution to subclass handling if we add `arr.oindex` is to make it so
that:

    myarr.oindex[indx]

could call:

    myarr.__getitem__(indx_object)

where `indx_object` knows that this was an oindex.  The main reason
is the expectation that many subclasses may implement `__getitem__`,
but probably just do:

    def __getitem__(self, indx):
        new_data = self.data[indx]
        # Do something with new_data.

Now for `ndindex` it would seem to make a lot of sense to have an
OIndex object, etc. for the same reason.
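
One way this could look, as a rough sketch only: every name below (`OIndex`, `_OIndexAccessor`, `ToyArray`, `Wrapper`) is hypothetical, and the toy container is a plain list wrapper rather than NumPy's implementation.

```python
class OIndex:
    """Hypothetical wrapper marking an index as 'outer' (oindex) style."""

    def __init__(self, raw):
        self.raw = raw


class _OIndexAccessor:
    """Returned by `.oindex`; forwards to the owner's __getitem__."""

    def __init__(self, owner):
        self._owner = owner

    def __getitem__(self, indx):
        # Route through the normal __getitem__ so overrides still run.
        return self._owner.__getitem__(OIndex(indx))


class ToyArray:
    """1-D stand-in for an array."""

    def __init__(self, data):
        self.data = list(data)

    @property
    def oindex(self):
        return _OIndexAccessor(self)

    def __getitem__(self, indx):
        if isinstance(indx, OIndex):
            # The index object itself says which mode was requested.
            return [self.data[i] for i in indx.raw]
        return self.data[indx]


class Wrapper:
    """Container whose __getitem__ follows the pattern shown above."""

    def __init__(self, data):
        self.data = ToyArray(data)
        self.calls = 0

    @property
    def oindex(self):
        return _OIndexAccessor(self)

    def __getitem__(self, indx):
        self.calls += 1               # the override sees oindex requests too
        new_data = self.data[indx]    # index object is passed through unchanged
        return new_data


w = Wrapper([10, 20, 30, 40])
picked = w.oindex[[0, 3]]
```

In this sketch the wrapper never needs to know about `oindex` explicitly: it forwards whatever index object it receives, and the inner container interprets it.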

Of course how we implement `.oindex` can be pretty separate from this.

> 
> > Adding either to NumPy is probably fairly straightforward,
> > although I would prefer either not to slow down every single
> > indexing operation for an extremely niche use-case (which is
> > likely possible) or to have timings showing that the slowdown is
> > insignificant.
> 
> I'm not sure it would. The current cases would all be tried first.
> The
> only time the new protocol would be used is when the index type isn't
> one of the currently allowed types, which currently raises
> IndexError.
> 
> > What might help me is understanding `ndindex` itself better,
> > since this seems like asking to add a protocol that may very well
> > be used by only this one project?
> 
> That's fair. Maybe the more general API would make more sense then? I
> think it would need more thinking out, but it would allow a lot more
> use-cases.
> 

A general API might make sense, but I am edgy about reversing the roles
of who performs the indexing. For one thing that probably would break
subclassing and overriding of `__getitem__`?


Cheers,

Sebastian



> Aaron Meurer
> 
> > > Note: there is a more general way that NumPy arrays could allow
> > > __getitem__ to be defined on custom objects, which I am NOT
> > > proposing.
> > > Instead of an API that returns one of the current predefined
> > > index
> > > types (tuple, integer, slice, newaxis, ellipsis, or integer or
> > > boolean
> > > array), there could instead be an API that takes the array as
> > > input
> > > and returns another array (or view) as an output. This would
> > > allow an
> > > object to define itself as an index in arbitrary ways, even if
> > > such
> > > an
> > > index would not actually be possible via traditional indexing.
> > > There
> > > are definitely some interesting ideas that could be done with
> > > this,
> > > but this idea would be much more complicated, and isn't something
> > > that
> > > I need. Unless the community feels that a more general API like
> > > this
> > > would be preferred, I would suggest deferring something like it
> > > to a
> > > later discussion.
> > > 
> > > What would be the best way to go about getting something like
> > > this
> > > implemented? Is it simple enough that we can just work out the
> > > details
> > > here and on a pull request, or should I write a NEP?
> > 
> > A short NEP may make sense, at least if this is supposed to be a
> > generic protocol for general array-likes, which I guess it would
> > have
> > to be ready for.
> > 
> > Cheers,
> > 
> > Sebastian
> > 
> > 
> > > Aaron Meurer
> > > _______________________________________________
> > > NumPy-Discussion mailing list
> > > NumPy-Discussion at python.org
> > > https://mail.python.org/mailman/listinfo/numpy-discussion
> > > 
> > 
> 
