[Numpy-discussion] What is up with raw boolean indices (like a[False])?

Thu Aug 20 19:08:39 EDT 2020

On Thu, Aug 20, 2020 at 4:38 PM Sebastian Berg
<sebastian at sipsolutions.net> wrote:
>
> On Thu, 2020-08-20 at 16:00 -0600, Aaron Meurer wrote:
> > Just to be clear, what exactly do you think should be deprecated?
> > Boolean scalar indices in general, or just boolean scalars combined
> > with other arrays, or something else?
>
> My angle is that we should allow only:
>
> * Any number of integer array indices (ideally only explicitly
>   with `arr.vindex[]`, but we do not have that luxury right now.)
>
> * A single boolean index (array or scalar is identical)
>
> but no mix of the above (including multiple boolean indices).
>
> Because I think they are at least one level more confusing than
> multiple advanced indices.
>
> I admit, I forgot that the broadcasting logic is fine in this case:
>
>    arr = np.zeros((2, 3))
>    arr[[True], np.array(3)]
>
> where the advanced index is also a scalar index. In that case the
> result is straightforward, since broadcasting does not affect
> `np.array(3)`.
>
>
> I am happy to be wrong about that assessment, but I think your opinion
> on it could likely push us towards just doing a Deprecation.
> The only use case for "multiple boolean indices" that I could think of
> was this:
>
>     arr = np.diag([1, 2, 3, 4])  # 2-d square array
>     indx = arr.diagonal() > 2  # mask for each row and column
>     masked_diagonal = arr[indx, indx]
>     print(repr(masked_diagonal))
>     # array([3, 4])
>
> and my guess is that the reaction to that code is: "Wait, what?!"
>
> That code might seem reasonable, but it only works if you have the
> exact same number of `True` values in the two indices.
> And if you have the exact same number but two different arrays, then I
> fail to reason about the result without doing the `nonzero` step, which
> I think indicates that there just is no logical concept for it.
>
>
> So, I think we may be better off forcing the few power users who may
> have found a use for this type of nugget to use `np.nonzero()` or find
> another solution.
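
For anyone following along, the equivalence described above can be
checked directly. This is only a minimal sketch, not code from NumPy
itself; `rows`, `cols` and `other` are illustrative names.

```python
import numpy as np

arr = np.diag([1, 2, 3, 4])      # 2-d square array
indx = arr.diagonal() > 2        # array([False, False,  True,  True])

# Each boolean index is replaced by its nonzero() positions, and the
# resulting integer arrays are paired up element by element.
(rows,) = np.nonzero(indx)
(cols,) = np.nonzero(indx)
paired = arr[rows, cols]
assert np.array_equal(arr[indx, indx], paired)

# With a different number of True values the integer arrays no longer
# broadcast against each other, so the index is rejected.
other = np.array([True, True, True, False])   # three True values
try:
    arr[indx, other]
    mismatch_raises = False
except IndexError:
    mismatch_raises = True
```

Writing the `np.nonzero()` step out by hand makes the element-wise
pairing visible, which is the point of the suggestion above.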

Well I'm cautious because despite implementing the logic for all this,
I'm a bit divorced from most use-cases. So I don't have a great
feeling for what is currently being used. For example, is it possible
to have a situation where you build a mask out of an expression, like
a[x > 0] or whatever, where the mask expression could be any number of
dimensions depending on the input values? And if so, does the current
logic for scalar booleans do the right thing when the number of
dimensions happens to be 0?
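
One way to probe that question is to run the same mask expression on
inputs of different dimensionality. The sketch below is only an
illustration; `positive_values` is a hypothetical helper, not an
existing NumPy function.

```python
import numpy as np

def positive_values(x):
    """Select the positive entries of x, whatever its dimensionality."""
    x = np.asarray(x)
    return x[x > 0]   # the mask has the same number of dimensions as x

one_d = positive_values([-1.0, 2.0, 3.0])   # 1-d input, 1-d mask
zero_d_hit = positive_values(5.0)           # 0-d input, scalar True mask
zero_d_miss = positive_values(-5.0)         # 0-d input, scalar False mask
```

In each case the result is 1-d with one entry per selected element, so
a lone 0-d mask appears to fit the pattern of masks with more
dimensions.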

Mixing nonscalar boolean and integer arrays seems fine, as far as the
logic is concerned. I'm not really sure if it makes sense
semantically. I'll have to think about it more. The thing that has the
most odd corner cases in the indexing logic is boolean scalars. It
would be nice if you could treat them uniformly with the same logic as
other boolean arrays, but they have special cases everywhere. This is
in contrast with integer scalars which perfectly match the logic of
integer arrays with the shape == (). Maybe I'm just not looking at it
from the right angle. I don't know.
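
The contrast can be seen in a few lines. This is a small sketch on an
arbitrary example array, not a statement about how the indexing code
is organised internally.

```python
import numpy as np

a = np.arange(6).reshape(2, 3)

# An integer scalar and a shape-() integer array select the same thing:
# both consume one axis of `a`.
int_scalar = a[1]
int_0d = a[np.array(1)]
assert int_scalar.shape == int_0d.shape == (3,)
assert np.array_equal(int_scalar, int_0d)

# A boolean scalar consumes no axis at all. It adds a new leading axis
# of length 1 (True) or length 0 (False).
true_shape = a[True].shape
false_shape = a[False].shape
```

A nonscalar boolean array always consumes as many axes as it has
dimensions, so the scalar case is the only one that has to add an axis
rather than remove one.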

In ndindex, I've left the "arrays separated by slices, ellipses, or
newaxes" case unimplemented. Travis Oliphant told me he thinks it was
a mistake and it would be better to not allow it. I've also left
boolean scalars mixed with other arrays unimplemented because I don't
want to waste more time trying to figure out what is going on in the
example I posted earlier (though what you wrote helps). I have
nonscalar boolean arrays mixed with integer arrays working just fine,
and the logic isn't really any different than it would be if I only
supported them separately.
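
As an illustration of that last case (nonscalar boolean array mixed
with an integer array), here is a minimal sketch in plain NumPy. It
does not use ndindex, and the array and index values are made up for
the example.

```python
import numpy as np

a = np.arange(12).reshape(3, 4)
mask = np.array([True, False, True])   # selects rows 0 and 2
cols = np.array([0, 3])                # integer index for axis 1

# The boolean array is replaced by its nonzero() positions and then
# broadcast against the integer array like any other integer index.
(rows,) = np.nonzero(mask)
mixed = a[mask, cols]
assert np.array_equal(mixed, a[rows, cols])
```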

Aaron Meurer

>
> - Sebastian
>
>
> >
> > Aaron Meurer
> >
> > On Thu, Aug 20, 2020 at 3:56 PM Sebastian Berg
> > <sebastian at sipsolutions.net> wrote:
> > > On Thu, 2020-08-20 at 16:50 -0500, Sebastian Berg wrote:
> > > > On Thu, 2020-08-20 at 12:21 -0600, Aaron Meurer wrote:
> > > > > > You're right. I was confusing the broadcasting logic for
> > > > > > boolean arrays.
> > > > >
> > > > > However, I did find this example
> > > > >
> > > > > > >>> np.arange(10).reshape((2, 5))[np.array([[0, 0, 0, 0, 0]], dtype=np.int64), False]
> > > > > > Traceback (most recent call last):
> > > > > >   File "<stdin>", line 1, in <module>
> > > > > > IndexError: shape mismatch: indexing arrays could not be broadcast together with shapes (1,5) (0,)
> > > > > >
> > > > > > That certainly seems to imply there is some broadcasting being
> > > > > > done.
> > > >
> > > > > Yes, it broadcasts the array after converting it with `nonzero`,
> > > > > i.e. it's much the same as:
> > > >
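> > > > >    # nonzero() on a 0-d input was deprecated in NumPy 1.17,
> > > > >    # so promote the scalar to 1-d first
> > > > >    indices = [[0, 0, 0, 0, 0]], *np.nonzero(np.atleast_1d(False))
> > > > >    indices = np.broadcast_arrays(*indices)  # fails: (1, 5) vs (0,)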
> > > >
> > > > > will give the same result (see also `np.ix_` which converts
> > > > > booleans as well for this reason, to give you outer indexing).
> > > > > I was half way through a mock-up/pseudo code, but thought you
> > > > > likely wasn't sure it was ending up clear. It sounds like things
> > > > > are probably falling into place for you (if they are not, let me
> > > > > know what might help you):
> > >
> > > > Sorry, editing error up there. In short, I hope those steps make
> > > > sense to you. Note that the broadcasting is basically part of a
> > > > later "integer only" indexing step, and the `nonzero` part is
> > > > pre-processing.
> > >
> > > > > 1. Convert all boolean indices into a series of integer indices
> > > > >    using `np.nonzero(index)`
> > > > >
> > > > > 2. For True/False scalars, that doesn't work as is. `nonzero`
> > > > >    gave us an index array (which is good, we obviously want
> > > > >    one), but we need to index into `boolean_index.ndim == 0`
> > > > >    dimensions!
> > > > >    So that won't work, the approach using `nonzero` cannot
> > > > >    generalize here, although boolean indices generalize
> > > > >    perfectly.
> > > > >
> > > > >    The solution to the dilemma is simple: If we have to index one
> > > > >    dimension, but should be indexing zero, then we simply add
> > > > >    that dimension to the original array (or at least pretend
> > > > >    there was an additional dimension).
> > > > >
> > > > > 3. Do normal indexing with the result *including broadcasting*,
> > > > >    we forget it was converted.
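
The three steps above can be sketched in Python roughly as follows.
This is an illustration only, not NumPy's actual implementation:
`emulate_boolean_indexing` is a hypothetical helper, and it assumes
the index is a tuple of boolean/integer scalars and arrays (no slices,
Ellipsis or newaxis) and does no validation of mask shapes.

```python
import numpy as np

def emulate_boolean_indexing(arr, index):
    """Index `arr` using only integer arrays, following the three steps.

    Hypothetical helper for illustration: handles only tuples made of
    boolean/integer scalars and arrays.
    """
    arr = np.asarray(arr)
    int_indices = []
    axis = 0  # position in `arr` that the next index applies to
    for idx in index:
        idx = np.asarray(idx)
        if idx.dtype == bool and idx.ndim == 0:
            # Step 2: a boolean scalar indexes zero dimensions, so give
            # it a length-1 axis of its own to index into.
            arr = np.expand_dims(arr, axis)
            idx = idx.reshape(1)
        if idx.dtype == bool:
            # Step 1: one integer index array per dimension of the mask.
            positions = np.nonzero(idx)
            int_indices.extend(positions)
            axis += len(positions)
        else:
            int_indices.append(idx)
            axis += 1
    # Step 3: ordinary integer-array indexing, broadcasting included.
    return arr[tuple(int_indices)]

a = np.arange(10).reshape(2, 5)
assert emulate_boolean_indexing(a, (True,)).shape == a[True].shape
assert emulate_boolean_indexing(a, (False,)).shape == a[False].shape
```

Under these assumptions, passing `(np.array([[0, 0, 0, 0, 0]]), False)`
to the helper ends up broadcasting shapes (1, 5) and (0,), the same
mismatch as in the traceback quoted earlier.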
> > > >
> > > > > The other way to solve it would be to always reshape the original
> > > > > array to combine all axes being indexed by a single boolean index
> > > > > into one axis and then index it using `np.flatnonzero`.  (But that
> > > > > would get a different result if you try to broadcast!)
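
A minimal sketch of that alternative, for the simple case where one
boolean index covers every axis of the array (the example values are
arbitrary):

```python
import numpy as np

a = np.arange(12).reshape(3, 4)
mask = a % 5 == 0          # 2-d boolean index covering both axes of `a`

# Combine the axes covered by the mask into one axis, then index that
# axis with the flat positions of the True entries.
flat = a.reshape(-1)
via_flat = flat[np.flatnonzero(mask)]
assert np.array_equal(via_flat, a[mask])
```

If the mask covered only the leading axes, the reshape would keep the
trailing axes and merge only the masked ones.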
> > > >
> > > >
> > > > > In any case, I am not sure I would bother with making sense of
> > > > > this, except for sport!
> > > > > It's pretty much nonsense and I think the time understanding it is
> > > > > probably better spent deprecating it.  The only reason I did not
> > > > > deprecate it before is that I tried to be minimal in the changes
> > > > > when I rewrote advanced indexing (and generalized boolean scalars
> > > > > correctly) long ago.  That was likely the right start/choice at
> > > > > the time, since there were much bigger fish to catch, but I do
> > > > > not think anything is holding us back now.
> > > >
> > > > Cheers,
> > > >
> > > > Sebastian
> > > >
> > > >
> > > > > Aaron Meurer
> > > > >
> > > > > On Wed, Aug 19, 2020 at 6:55 PM Sebastian Berg
> > > > > <sebastian at sipsolutions.net> wrote:
> > > > > > On Wed, 2020-08-19 at 18:07 -0600, Aaron Meurer wrote:
> > > > > > > > > > 3. If you have multiple advanced indexing you get annoying
> > > > > > > > > >    broadcasting of all of these. That is *always* confusing
> > > > > > > > > >    for boolean indices. 0-D should not be too special there...
> > > > > > >
> > > > > > > > OK, now that I am learning more about advanced indexing, this
> > > > > > > > statement is confusing to me. It seems that scalar boolean
> > > > > > > > indices do not broadcast. For example:
> > > > > >
> > > > > > > Well, broadcasting means you broadcast the *nonzero result*
> > > > > > > unless I am very confused... There is a reason I dismissed it.
> > > > > > > We could (and arguably should) just deprecate it.  And I have
> > > > > > > doubts anyone would even notice.
> > > > > >
> > > > > > > > >>> np.arange(2)[False, np.array([True, False])]
> > > > > > > > array([], dtype=int64)
> > > > > > > > >>> np.arange(2)[tuple(np.broadcast_arrays(False, np.array([True, False])))]
> > > > > > > > Traceback (most recent call last):
> > > > > > > >   File "<stdin>", line 1, in <module>
> > > > > > > > IndexError: too many indices for array: array is 1-dimensional, but 2 were indexed
> > > > > > >
> > > > > > > > And indeed, the docs even say, as you noted, "the nonzero
> > > > > > > > equivalence for Boolean arrays does not hold for zero
> > > > > > > > dimensional boolean arrays," which I guess also applies to
> > > > > > > > the broadcasting.
> > > > > >
> > > > > > > I actually think that probably also holds. Nonzero just behaves
> > > > > > > weirdly for 0-D arrays (because it returns a tuple).
> > > > > > > But since broadcasting the nonzero result is so weird, and
> > > > > > > since 0-D booleans require some additional logic and don't
> > > > > > > generalize 100% (code wise), I won't rule out there are
> > > > > > > differences.
> > > > > >
> > > > > > > > From what I can tell, the logic is that all integer and
> > > > > > > > boolean arrays
> > > > > >
> > > > > > > Did you try that? Because as I said above, IIRC broadcasting
> > > > > > > the boolean array without first calling `nonzero` isn't really
> > > > > > > what's going on. And I don't know how it could be what's going
> > > > > > > on, since adding dimensions to a boolean index would have much
> > > > > > > more implications?
> > > > > >
> > > > > > - Sebastian
> > > > > >
> > > > > >
> > > > > > > > (and scalar ints) are broadcast together, *except* for boolean
> > > > > > > > scalars. Then the first boolean scalar is replaced with
> > > > > > > > and(all boolean scalars) and the rest are removed from the
> > > > > > > > index. Then that index adds a length 1 axis if it is True and
> > > > > > > > 0 if it is False.
> > > > > > >
> > > > > > > > So they don't broadcast, but rather "fake broadcast". I still
> > > > > > > > contend that it would be much more useful, if True were a
> > > > > > > > synonym for newaxis and False worked like newaxis but instead
> > > > > > > > added a length 0 axis. Alternately, True and False scalars
> > > > > > > > should behave exactly like all other boolean arrays with no
> > > > > > > > exceptions (i.e., work like np.nonzero(), broadcast, etc.).
> > > > > > > > This would be less useful, but more consistent.
> > > > > > >
> > > > > > > Aaron Meurer
> > > > > > > _______________________________________________
> > > > > > > NumPy-Discussion mailing list
> > > > > > > NumPy-Discussion at python.org
> > > > > > > https://mail.python.org/mailman/listinfo/numpy-discussion
> > > > > > >
> > > > > >
> > > > > > _______________________________________________
> > > > > > NumPy-Discussion mailing list
> > > > > > NumPy-Discussion at python.org
> > > > > > https://mail.python.org/mailman/listinfo/numpy-discussion
> > > > > _______________________________________________
> > > > > NumPy-Discussion mailing list
> > > > > NumPy-Discussion at python.org
> > > > > https://mail.python.org/mailman/listinfo/numpy-discussion
> > > > >
> > > >
> > > > _______________________________________________
> > > > NumPy-Discussion mailing list
> > > > NumPy-Discussion at python.org
> > > > https://mail.python.org/mailman/listinfo/numpy-discussion
> > >
> > > _______________________________________________
> > > NumPy-Discussion mailing list
> > > NumPy-Discussion at python.org
> > > https://mail.python.org/mailman/listinfo/numpy-discussion
> > _______________________________________________
> > NumPy-Discussion mailing list
> > NumPy-Discussion at python.org
> > https://mail.python.org/mailman/listinfo/numpy-discussion
> >
>
> _______________________________________________
> NumPy-Discussion mailing list
> NumPy-Discussion at python.org
> https://mail.python.org/mailman/listinfo/numpy-discussion