[Numpy-discussion] use for missing (ignored) data?

Nathaniel Smith njs at pobox.com
Wed Mar 7 18:41:01 EST 2012


On Wed, Mar 7, 2012 at 8:05 PM, Neal Becker <ndbecker2 at gmail.com> wrote:
> I'm wondering what is the use for the ignored data feature?
>
> I can use:
>
> A[valid_A_indexes] = whatever
>
> to process only the 'non-ignored' portions of A.  So at least some simple cases
> of ignored data are already supported without introducing a new type.
>
> OTOH:
>
> w = A[valid_A_indexes]
>
> will copy A's data, and subsequent use of
>
> w[:] = something
>
> will not update A.
>
> Is this the reason for wanting the ignored data feature?

Hi Neal,

There are a few reasons that I know of why people want more support
from numpy for ignored data/masks, specifically (as opposed to missing
data or other related concepts):

1) If you're often working on some subset of your data, then it's
convenient to set the mask once and have it stay in effect for further
operations. Anything you can accomplish this way can also be
accomplished by keeping an explicit mask array and using it for
indexing "by hand", but in some situations it may be more convenient
not to.

2) Operating on subsets of an array without making a copy. As
Benjamin pointed out, indexing with a mask makes a copy. This is slow,
and what's worse, people who work with large data sets (e.g., big fMRI
volumes) may not have enough memory to afford such a copy. This
problem can be solved by using the new where= argument to ufuncs
(which skips the copy). (But then see (1) -- passing where= to a bunch
of functions takes more typing than just setting it once and leaving
it.)

3) Suppose there's a 3rd-party function that takes an array --
borrowing Charles's example, say it's draw_points(arr). Now you want to
apply it to just a subset of your data, and want to avoid a copy. It
would be nice if the original author had made it draw_points(arr,
mask), but they didn't. Well, if you have masking "built in" to your
array type, then maybe you can call this as draw_points(masked_arr)
and it will Just Work. I.e., maybe people who aren't thinking about
masking will sometimes write code that accidentally works with masking
anyway. I'm not sure how much I'd trust this, but I guess it's nice
when it happens. And if it does work, then implementing the show/hide
point functionality will be easier. (And if it doesn't work, and
masking is built into numpy.ndarray, then maybe you can use this to
argue with the original author that this is a bug, not just a missing
feature. Again, I'm not sure if this is a good thing on net: one could
argue that people shouldn't be forced to think about masking every
time they write any function, just in case it becomes relevant later.
But certainly it'd be useful sometimes.)
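
To make (1) and (2) concrete, here is a rough sketch. The array
contents and the names A, B, M, and valid are made up for
illustration. It assumes a numpy recent enough to have where= on
ufuncs, and it uses the existing numpy.ma module as a stand-in for
"set the mask once". That is only one possible way to get that
behaviour, not necessarily what a built-in ignored-data feature would
look like.

```python
import numpy as np

# Illustrative data: 'valid' marks the entries we do NOT want to ignore.
A = np.arange(6, dtype=float)            # [0, 1, 2, 3, 4, 5]
valid = np.array([True, False, True, True, False, True])

# Boolean-mask indexing returns a copy, so writing to w leaves A alone.
w = A[valid]
w[:] = -1.0
copy_is_independent = not np.shares_memory(w, A)

# (2) where= lets a ufunc touch only the selected elements, writing the
# result straight into an existing array instead of copying the subset.
# Elements where 'valid' is False keep whatever was already in 'out'.
B = np.arange(6, dtype=float)
np.multiply(B, 10, out=B, where=valid)

# (1) numpy.ma keeps the mask attached to the array, so it stays in
# effect for later operations. Note numpy.ma's convention is inverted:
# mask=True means "ignore this element".
M = np.ma.masked_array(np.arange(6, dtype=float), mask=~valid)
total = M.sum()      # sums only the unmasked entries
mean = M.mean()      # same mask, no need to pass it again
```

With where=, the mask has to be passed to every call; with the masked
array it travels along with the data, which is the convenience
trade-off described in (1).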

There may be other motivations that I'm not aware of, of course.

-- Nathaniel


