[Numpy-discussion] Indexing NEP draft

Sun Nov 29 13:56:05 EST 2015

Hey,

small update on this. The NEP draft has not changed much, but you can
now try the full power of the proposed new indexing types [1]:

  * arr.oindex[...]  # orthogonal/outer indexing
  * arr.vindex[...]  # vectorized (like fancy, but different ;))
  * arr.lindex[...]  # legacy/fancy indexing

with my pull request at https://github.com/numpy/numpy/pull/6075

You can try it locally by cloning the numpy github repository and then
running from the source dir:

git fetch upstream pull/6075/head:pr-6075 && git checkout pr-6075;
python runtests.py --ipython
# Inside ipython:
import warnings; warnings.simplefilter("always")

The examples from the NEP at should all run fine, you can find the NEP
draft at:
https://github.com/numpy/numpy/pull/6256/files?short_path=01e4dd9#diff-01e4dd9d2ecf994b24e5883f98f789e6

I would be most happy about any comments or suggestions!

- Sebastian

[1] Modulo possible bugs, there is not test suit yet....

On Mi, 2015-11-11 at 11:02 +0100, Sebastian Berg wrote:
> Hi all,
> 
> at scipy discussing with Nathaniel and others, we thought that maybe we
> can push for orthogonal type indexing into numpy. Now with the new
> version out and some other discussions done, I thought it is time to
> pick it up :).
> 
> The basic ideas are twofold. First make indexing easier and less
> confusing for starters (and advanced users also), and second improve
> interoperability with projects such as xray for whom orthogonal/outer
> type indexing makes more sense.
> 
> I have started working on:
> 
> 1. A preliminary draft of an NEP you can view at
> https://github.com/numpy/numpy/pull/6256/files?short_path=01e4dd9#diff-01e4dd9d2ecf994b24e5883f98f789e6
> or at the end of this mail.
> 
> 2. A preliminary implementation of `oindex` attribute with
> orthogonal/outer style indexing in
> https://github.com/numpy/numpy/pull/6075 which you can try out by
> cloning numpy and then running from the source dir:
> 
> git fetch upstream pull/6075/head:pr-6075 && git checkout pr-6075;
> python runtests.py --ipython
> 
> This will fetch my PR, switch to the branch and open an interactive
> ipython shell where you will be able to do arr.oindex[].
> 
> 
> Note that I consider the NEP quite preliminary in many parts, and it may
> still be very confusing unless you are well versed with current advanced
> indexing. There are some longer examples comparing the different styles
> and another "example" which tries to show a "use case" example going
> from simpler to more complex indexing operations.
> Any comments are very welcome, and if it is "I don't understand a
> word" :). I know it is probably too short and, at least without
> examples, not easy to understand.
> 
> Best,
> 
> Sebastian
> 
> 
> ==================================================================================
> The current NEP draft:
> 
> 
> ==========================================================
> Implementing intuitive and full featured advanced indexing
> ==========================================================
> 
> :Author: Sebastian Berg
> :Date: 2015-08-27
> :Status: draft
> 
> 
> Executive summary
> =================
> 
> Advanced indexing with multiple array indices is typically confusing to
> both new, and in many cases even old, users of NumPy. To avoid this
> problem
> and allow for more and clearer features, we propose to:
> 
> 1. Introduce ``arr.oindex[indices]`` which allows advanced indices, but
>    uses outer indexing logic.
> 2. Introduce ``arr.vindex[indices]`` which use the current
>    "vectorized"/broadcasted logic but with two differences from
>    fancy indexing:
>        
>    1. Boolean indices always use the outer indexing logic.
>       (Multi dimensional booleans should be allowed).
>    2. The integer index result dimensions are always the first axes
>       of the result array. No transpose is done, even for a single
>       integer array index.
> 
> 3. Vanilla indexing on the array will only give warnings and eventually
>    errors either:
>      
>    * when there is ambiguity between legacy fancy and outer indexing
>      (note that ``arr[[1, 2], :, 0]`` is such a case, an integer
>      can be the "second" integer index array),
>    * when any integer index array is present (possibly additional for
>      more then one boolean index array).
> 
> These constraints are sufficient for making indexing generally
> consistent
> with expectations and providing a less surprising learning curve with
> ``oindex``.
> 
> Note that all things mentioned here apply both for assignment as well as
> subscription.
> 
> Understanding these details is *not* easy. The `Examples` section gives
> code
> examples. And the hopefully easier `Motivational Example` provides some
> motivational use-cases for the general ideas and is likely a good start
> for
> anyone not intimately familiar with advanced indexing.
> 
> 
> Motivation
> ==========
> 
> Old style advanced indexing with multiple array (boolean or integer)
> indices,
> also called "fancy indexing", tends to be very confusing for new users.
> While fancy (or legacy) indexing is useful in many cases one would
> naively
> assume that the result of multiple 1-d ranges is analogous to multiple
> slices along each dimension (also called "outer indexing").
> 
> However, legacy fancy indexing with multiple arrays broadcasts these
> arrays
> into a single index over multiple dimensions. There are three main
> points
> of confusion when multiple array indices are involved:
> 
> 1. Most new users will usually expect outer indexing (consistent with
>    slicing). This is also the most common way of handling this in other
>    packages or languages.
> 2. The axes introduced by the array indices are at the front, unless
>    all array indices are consecutive, in which case one can deduce where
>    the user "expects" them to be:
> 
>    * `arr[:, [0, 1], :, [0, 1]]` will have the first dimension shaped 2.
>    * `arr[:, [0, 1], [0, 1]]` will have the second dimension shaped 2.
> 
> 3. When a boolean array index is mixed with another boolean or integer
>    array, the result is very hard to understand (the boolean array is
>    converted to integer array indices and then broadcast), and hardly
>    useful.
>    There is no well defined broadcast for booleans, so that boolean
>    indices are logically always "``outer``" type indices.
> 
> 
> Proposed rules
> ==============
> 
> From the three problems noted above some expectations for NumPy can
> be deduced:
> 
> 1. There should be a prominent outer/orthogonal indexing method such as
>    ``arr.oindex[indices]``.
> 2. Considering how confusing fancy indexing can be, it should only
>    occur explicitly (e.g. ``arr.vindex[indices]``)
> 3. A new ``arr.vindex[indices]`` method, would not be tied to the
>    confusing transpose rules of fancy indexing (which is for example
>    needed for the simple case of a single advanced index). Thus, it
>    no transposing should be done. The axes of the advanced indices are
>    always inserted at the front, even for a single index.
> 4. Boolean indexing is conceptionally outer indexing. A broadcasting
>    together with other advanced indices in the manner of legacy
>    "fancy indexing" is generally not helpful or well defined.
>    A user who wishes the "``nonzero``" plus broadcast behaviour can thus
>    be expected to do this manually.
>    Using this rule, a single boolean index can index into multiple
>    dimensions at once.
> 5. An ``arr.lindex`` or ``arr.findex`` should likely be implemented to
> allow
>    legacy fancy indexing indefinetly. This also gives a simple way to
>    update fancy indexing code making deprecations to vanilla indexing
>    easier.
> 6. Vanilla indexing ``arr[...]`` could return an error for ambiguous
> cases.
>    For the beginning, this probably means cases where ``arr[ind]`` and
>    ``arr.oindex[ind]`` return different results gives deprecation
> warnings.
>    However, the exact rules for this (especially the final behaviour)
> are not
>    quite clear in cases such as ``arr[0, :, index_arr]``.
> 
> All other rules for indexing are identical. 
> 
> 
> Open Questions
> ==============
> 
> 1. Especially for the new indexing attributes ``oindex`` and ``vindex``,
>    a case could be made to not implicitly add an ``Ellipsis`` index if
> necessary.
>    This helps finding bugs since a too high dimensional array can be
> caught.
>    (I am in favor for this, but doubt we should think about this for
> vanilla
>    indexing.)
> 
> 2. The names ``oindex`` and ``vindex`` are just suggestions at the time
> of
>    writing this, another name NumPy has used for something like
> ``oindex``
>    is ``np.ix_``. See also below.
> 
> 3. It would be possible to limit the use of boolean indices in
> ``vindex``,
>    assuming that they are rare and to some degree special.
>    (This would make implementation simpler, but I do not see a big
> reason.)
> 
> 4. ``oindex`` and ``vindex`` could always return copies, even when no
> array
>    operation occurs. One argument for using the same rules is that this
> way
>    ``oindex`` can be used as a general index replacement.
>    (There is likely no big reason for this, however, there is one
> reason:
>    ``arr.vindex[array_scalar, ...]`` can occur, where ``arr_scalar``
>    should be a 0-D array. Copying always "fixes" the possible
> inconsistency.)
> 
> 5. The final state to morph indexing in is not fixed in this PEP. It is
> for
>    example possible that `arr[index]`` will be equivalent to
> ``arr.oindex``
>    at some point in the future. Since such a change will take years, it
>    seems unnecessary to make specific decisions now.
> 
> 6. Proposed changes to vanilla indexing could be postponed indefinetly
> or
>    not taken in order to not break or force fixing of existing code
> bases.
> 
> 7. Possible the ``vindex`` combination with boolean indexing could be
>    rethought or not allowed at all for simplicity.
> 
> 
> Necessary changes to NumPy
> ==========================
> 
> Implement ``arr.oindex`` and ``arr.vindex`` objects to allow these
> indexing
> operations and create warnings (and eventually deprecate) ambiguous
> direct
> indexing operations on arrays.
> 
> 
> Alternative Names
> =================
> 
> Possible names suggested (more suggestions will be added).
> 
> ==============  ======== =======
> **Orthogonal**  oindex   oix
> **Vectorized**  vindex   fix
> **Legacy**      l/findex 
> ==============  ======== =======
> 
> 
> Examples
> ========
> 
> Since the various kinds of indexing is hard to grasp in many cases,
> these
> examples hopefully give some more insights. Note that they are all in
> terms
> of shape. All original dimensions start with 5, advanced indexing
> inserts less long dimensions. (Note that ``...`` or ``Ellipsis`` mostly
> inserts as many slices as needed to index the full array). These
> examples
> may be hard to grasp without working knowledge of advanced indexing as
> of
> NumPy 1.9.
> 
> Example array::
> 
>     >>> arr = np.ones((5, 6, 7, 8))
> 
> 
> Legacy fancy indexing
> ---------------------
> 
> Single index is transposed (this is the same for all indexing types)::
> 
>     >>> arr[[0], ...].shape
>     (1, 6, 7, 8)
>     >>> arr[:, [0], ...].shape
>     (5, 1, 7, 8)
> 
> 
> Multiple indices are transposed *if* consecutive::
> 
>     >>> arr[:, [0], [0], :].shape  # future error
>     (5, 1, 7)
>     >>> arr[:, [0], :, [0]].shape  # future error
>     (1, 5, 6)
> 
> 
> It is important to note that a scalar *is* integer array index in this
> sense
> (and gets broadcasted with the other advanced index)::
> 
>     >>> arr[:, [0], 0, :].shape  # future error (scalar is "fancy")
>     (5, 1, 7)
>     >>> arr[:, [0], :, 0].shape  # future error (scalar is "fancy")
>     (1, 5, 6)
> 
> 
> Single boolean index can act on multiple dimensions (especially the
> whole
> array). It has to match (as of 1.10. a deprecation warning) the
> dimensions.
> The boolean index is otherwise identical to (multiple consecutive)
> integer
> array indices::
> 
>     >>> # Create boolean index with one True value for the last two
> dimensions:
>     >>> bindx = np.zeros((7, 8), dtype=np.bool_)
>     >>> bindx[[0, 0]] = True
>     >>> arr[:, 0, bindx].shape
>     (5, 1)
>     >>> arr[0, :, bindx].shape
>     (1, 6)
> 
> 
> The combination with anything that is not a scalar is confusing, e.g.::
> 
>     >>> arr[[0], :, bindx].shape  # bindx result broadcasts with [0]
>     (1, 6)
>     >>> arr[:, [0, 1], bindx]  # IndexError
> 
> 
> Outer indexing
> --------------
> 
> Multiple indices are "orthogonal" and their result axes are inserted 
> at the same place (they are not broadcasted)::
> 
>     >>> arr.oindex[:, [0], [0, 1], :].shape
>     (5, 1, 2, 8)
>     >>> arr.oindex[:, [0], :, [0, 1]].shape
>     (5, 1, 7, 2)
>     >>> arr.oindex[:, [0], 0, :].shape
>     (5, 1, 8)
>     >>> arr.oindex[:, [0], :, 0].shape
>     (5, 1, 7)
> 
> 
> Boolean indices results are always inserted where the index is::
> 
>     >>> # Create boolean index with one True value for the last two
> dimensions:
>     >>> bindx = np.zeros((7, 8), dtype=np.bool_)
>     >>> bindx[[0, 0]] = True
>     >>> arr.oindex[:, 0, bindx].shape
>     (5, 1)
>     >>> arr.oindex[0, :, bindx].shape
>     (6, 1)
> 
> 
> Nothing changed in the presence of other advanced indices since::
> 
>     >>> arr.oindex[[0], :, bindx].shape
>     (1, 6, 1)
>     >>> arr.oindex[:, [0, 1], bindx]
>     (5, 2, 1)
> 
> 
> Vectorized/inner indexing
> -------------------------
> 
> Multiple indices are broadcasted and iterated as one like fancy
> indexing,
> but the new axes area always inserted at the front::
> 
>     >>> arr.vindex[:, [0], [0, 1], :].shape
>     (2, 5, 8)
>     >>> arr.vindex[:, [0], :, [0, 1]].shape
>     (2, 5, 7)
>     >>> arr.vindex[:, [0], 0, :].shape
>     (1, 5, 8)
>     >>> arr.vindex[:, [0], :, 0].shape
>     (1, 5, 7)
> 
> 
> Boolean indices results are always inserted where the index is, exactly
> as in ``oindex`` given how specific they are to the axes they operate
> on::
> 
>     >>> # Create boolean index with one True value for the last two
> dimensions:
>     >>> bindx = np.zeros((7, 8), dtype=np.bool_)
>     >>> bindx[[0, 0]] = True
>     >>> arr.vindex[:, 0, bindx].shape
>     (5, 1)
>     >>> arr.vindex[0, :, bindx].shape
>     (6, 1)
> 
> 
> But other advanced indices are again transposed to the front::
> 
>     >>> arr.vindex[[0], :, bindx].shape
>     (1, 6, 1)
>     >>> arr.vindex[:, [0, 1], bindx]
>     (2, 5, 1)
> 
> 
> Related Questions
> =================
> 
> There exist a further indexing or indexing like method. That is the
> inverse of a command such as ``np.argmin(arr, axis=axis)``, to pick
> the specific elements *along* an axis given an array of (at least
> typically) the same size.
> 
> Doing such a thing with the indexing notation is not quite straight
> forward
> since the axis on which to pick elements has to be supplied. One
> plausible
> solution would be to create a function (calling it pick here for
> simplicity)::
> 
>      np.pick(arr, index_arr, axis=axis)
> 
> where ``index_arr`` has to be the same shape as ``arr`` except along
> ``axis``.
> One could imagine that this can be useful together with other indexing
> types,
> but such a function may be sufficient and extra information needed seems
> easier
> to pass using a function convention. Another option would be to allow an
> argument
> such as ``compress_axes=None`` (just to have some name) which maps the
> axes from
> the index array to the new array with ``None`` signaling a new axis.
> Also keepdims could be added as a simple default. (Note that the use of
> axis is not
> compatible to ``np.take`` for an ``index_arr`` which is not zero or one
> dimensional.)
> 
> Another solution is to provide functions or features to the
> ``arg*``functions
> to map this to the equivalent ``vindex`` indexing operation.
> 
> 
> Motivational Example
> ====================
> 
> Imagine having a data acquisition software storing ``D`` channels and
> ``N`` datapoints along the time. She stores this into an ``(N, D)``
> shaped
> array. During data analysis, we needs to fetch a pool of channels, for
> example
> to calculate a mean over them.
> 
> This data can be faked using::
> 
>     >>> arr = np.random.random((100, 10))
> 
> Now one may remember indexing with an integer array and find the correct
> code::
> 
>     >>> group = arr[:, [2, 5]]
>     >>> mean_value = arr.mean()
> 
> However, assume that there were some specific time points (first
> dimension
> of the data) that need to be specially considered. These time points are
> already known and given by::
> 
>     >>> interesting_times = np.array([1, 5, 8, 10], dtype=np.intp)
> 
> Now to fetch them, we may try to modify the previous code::
> 
>     >>> group_at_it = arr[interesting_times, [2, 5]]
>     IndexError: Ambiguous index, use `.oindex` or `.vindex`
> 
> An error such as this will point to read up the indexing documentation.
> This should make it clear, that ``oindex`` behaves more like slicing.
> So, out of the different methods it is the obvious choice
> (for now, this is a shape mismatch, but that could possibly also mention
> ``oindex``)::
> 
>     >>> group_at_it = arr.oindex[interesting_times, [2, 5]]
> 
> Now of course one could also have used ``vindex``, but it is much less
> obvious how to achieve the right thing!::
> 
>     >>> reshaped_times = interesting_times[:, np.newaxis]
>     >>> group_at_it = arr.vindex[reshaped_times, [2, 5]]
> 
> 
> One may find, that for example our data is corrupt in some places.
> So, we need to replace these values by zero (or anything else) for these
> times. The first column may for example give the necessary information,
> so that changing the values becomes easy remembering boolean indexing::
> 
>     >>> bad_data = arr[0] > 0.5
>     >>> arr[bad_data, :] = 0
> 
> Again, however, the columns may need to be handled more individually
> (but in
> groups), and the ``oindex`` attribute works well::
> 
>     >>> arr.oindex[bad_data, [2, 5]] = 0
> 
> Note that it would be very hard to do this using legacy fancy indexing.
> The only way would be to create an integer array first::
> 
>     >>> bad_data_indx = np.nonzero(bad_data)[0]
>     >>> bad_data_indx_reshaped = bad_data_indx[:, np.newaxis]
>     >>> arr[bad_data_indx_reshaped, [2, 5]]
> 
> In any case we can use only ``oindex`` to do all of this without getting
> into any trouble or confused by the whole complexity of advanced
> indexing.
> 
> But, some new features are added to the data acquisition. Different
> sensors
> have to be used depending on the times. Let us assume we already have
> created an array of indices::
> 
>     >>> correct_sensors = np.random.randint(10, size=(100, 2))
> 
> Which lists for each time the two correct sensors in an ``(N, 2)``
> array.
> 
> A first try to achieve this may be ``arr[:, correct_sensors]`` and this
> does
> not work. It should be clear quickly that slicing cannot achieve the
> desired
> thing. But hopefully users will remember that there is ``vindex`` as a
> more
> powerful and flexible approach to advanced indexing.
> One may, if trying ``vindex`` randomly, be confused about::
> 
>     >>> new_arr = arr.vindex[:, correct_sensors]
> 
> which is neither the same, nor the correct result (see transposing
> rules)!
> This is because slicing works still the same in ``vindex``. However,
> reading
> the documentation and examples, one can hopefully quickly find the
> desired
> solution::
> 
>     >>> rows = np.arange(len(arr))
>     >>> rows = rows[:, np.newaxis]  # make shape fit with
> correct_sensors
>     >>> new_arr = arr.vindex[rows, correct_sensors]
>     
> At this point we have left the straight forward world of ``oindex`` but
> can
> do random picking of any element from the array. Note that in the last
> example
> a method such as mentioned in the ``Related Questions`` section could be
> more
> straight forward. But this approach is even more flexible, since
> ``rows``
> does not have to be a simple ``arange``, but could be
> ``intersting_times``::
> 
>     >>> correct_sensors_at_it = correct_sensors[interesting_times, :]
>     >>> interesting_times_reshaped = interesting_times[:, np.newaxis]
>     >>> new_arr_it = arr[interesting_times_reshaped,
> correct_sensors_at_it]
> 
> Truly complex situation would arise now if you would for example pool
> ``L``
> experiments into an array shaped ``(L, N, D)``. But for ``oindex`` this
> should
> not result into surprises. ``vindex``, being more powerful, will quite
> certainly create some confusion in this case but also cover pretty much
> all
> eventualities.
> 
> _______________________________________________
> NumPy-Discussion mailing list
> NumPy-Discussion at scipy.org
> https://mail.scipy.org/mailman/listinfo/numpy-discussion

-------------- next part --------------
A non-text attachment was scrubbed...
Name: signature.asc
Type: application/pgp-signature
Size: 819 bytes
Desc: This is a digitally signed message part
URL: <http://mail.python.org/pipermail/numpy-discussion/attachments/20151129/77933284/attachment.sig>