[Numpy-discussion] Concepts for masked/missing data

Sat Jun 25 12:26:25 EDT 2011

Hi,

On Sat, Jun 25, 2011 at 5:05 PM, Nathaniel Smith <njs at pobox.com> wrote:
> So obviously there's a lot of interest in this question, but I'm
> losing track of all the different issues that've been raised in the
> 150-post thread of doom. I think I'll find this easier if we start by
> putting aside the questions about implementation and such and focus
> for now on the *conceptual model* that we want. Maybe I'm not the only
> one?
>
> So as far as I can tell, there are three different ways of thinking
> about masked/missing data that people have been using in the other
> thread:
>
> 1) Missingness is part of the data. Some data is missing, some isn't,
> this might change through computation on the data (just like some data
> might change from a 3 to a 6 when we apply some transformation, NA |
> True could be True, instead of NA), but we can't just "decide" that
> some data is no longer missing. It makes no sense to ask what value is
> "really" there underneath the missingness. And it's critical that we
> keep track of this through all operations, because otherwise we may
> silently give incorrect answers -- exactly like it's critical that we
> keep track of the difference between 3 and 6.

So far I see the difference between 1) and 2) being that under 1) you
cannot unmask.  So, if you didn't even know you could unmask data, then
it would not matter that 1) was being implemented by masks?
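
A rough sketch of that distinction, using only what exists in NumPy
today. The assumptions here are mine: numpy.ma stands in for the mask
view of 2), and NaN stands in for the proposed NA of 1). Neither is the
API under discussion.

```python
import numpy as np

values = np.array([1.0, 2.0, 3.0])

# Mask view: the 2.0 is hidden, but it is still stored underneath.
view = np.ma.masked_array(values, mask=[False, True, False])
sum_hidden = view.sum()      # reduction skips the masked element
view.mask = False            # unmask everything
sum_revealed = view.sum()    # the hidden value takes part again

# Missing-data view: NaN is a stand-in for NA (an assumption).
# The original 2.0 is overwritten, so there is nothing to unmask.
missing = values.copy()
missing[1] = np.nan
sum_missing = np.sum(missing)  # missingness propagates

print(sum_hidden, sum_revealed, sum_missing)
```

If the unmask step were never exposed, the first half would look to the
user like a missing-data array that happens to skip values in sum.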

> 2) All the data exists, at least in some sense, but we don't always
> want to look at all of it. We lay a mask over our data to view and
> manipulate only parts of it at a time. We might want to use different
> masks at different times, mutate the mask as we go, etc. The most
> important thing is to provide convenient ways to do complex
> manipulations -- preserve masks through indexing operations, overlay
> the mask from one array on top of another array, etc. When it comes to
> other sorts of operations then we'd rather just silently skip the
> masked values -- we know there are values that are masked, that's the
> whole point, to work with the unmasked subset of the data, so if sum
> returned NA then that would just be a stupid hassle.

To clarify, you're proposing that:

a = np.sum(np.array([np.NA, np.NA]))

1) -> np.NA
2) -> 0.0

?
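
For comparison, the two outcomes can be approximated with functions
that already exist. This is only a sketch: np.NA is a proposal, so NaN
is used as a stand-in, with np.sum playing the role of 1) and np.nansum
the role of 2). The all-NaN result of np.nansum assumes NumPy 1.9 or
later.

```python
import numpy as np

# NaN is a stand-in for the proposed np.NA (an assumption).
data = np.array([np.nan, np.nan])

# Interpretation 1): missingness propagates through the reduction.
propagating = np.sum(data)

# Interpretation 2): missing entries are skipped, so summing nothing
# gives the additive identity.
skipping = np.nansum(data)

print(propagating, skipping)
```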

> But that's just my opinion. I'm wondering if we can get any consensus
> on which of these we actually *want* (or maybe we want some fourth
> option!), and *then* we can try to figure out the best way to get
> there? Pretty much any implementation strategy we've talked about
> could work for any of these, but hard to decide between them if we
> don't even know what we're trying to do...

I agree it's good to separate the API from the implementation. I
think the implementation is also important because I care about memory
and possibly speed. But that is a separate problem from the API...

Cheers,

Matthew