[Numpy-discussion] Concepts for masked/missing data

Sat Jun 25 12:17:22 EDT 2011

On Sat, Jun 25, 2011 at 10:05 AM, Nathaniel Smith <njs at pobox.com> wrote:

> So obviously there's a lot of interest in this question, but I'm
> losing track of all the different issues that've been raised in the
> 150-post thread of doom. I think I'll find this easier if we start by
> putting aside the questions about implementation and such and focus
> for now on the *conceptual model* that we want. Maybe I'm not the only
> one?
>
> So as far as I can tell, there are three different ways of thinking
> about masked/missing data that people have been using in the other
> thread:
>
> 1) Missingness is part of the data. Some data is missing, some isn't,
> this might change through computation on the data (just like some data
> might change from a 3 to a 6 when we apply some transformation, NA |
> True could be True, instead of NA), but we can't just "decide" that
> some data is no longer missing. It makes no sense to ask what value is
> "really" there underneath the missingness. And it's critical that we
> keep track of this through all operations, because otherwise we may
> silently give incorrect answers -- exactly like it's critical that we
> keep track of the difference between 3 and 6.
>
> 2) All the data exists, at least in some sense, but we don't always
> want to look at all of it. We lay a mask over our data to view and
> manipulate only parts of it at a time. We might want to use different
> masks at different times, mutate the mask as we go, etc. The most
> important thing is to provide convenient ways to do complex
> manipulations -- preserve masks through indexing operations, overlay
> the mask from one array on top of another array, etc. When it comes to
> other sorts of operations then we'd rather just silently skip the
> masked values -- we know there are values that are masked, that's the
> whole point, to work with the unmasked subset of the data, so if sum
> returned NA then that would just be a stupid hassle.
>
> 3) The "all things to all people" approach: implement every feature
> implied by either (1) or (2), and switch back and forth between these
> conceptual frameworks whenever necessary to make sense of the
> resulting code.
>
> The advantage of deciding up front what our model is is that it makes
> a lot of other questions easier. E.g., someone asked in the other
> thread whether, after setting an array element to NA, it would be
> possible to get back the original value. If we follow (1), the answer
> is obviously "no", if we follow (2), the answer is obviously "yes",
> and if we follow (3), the answer is obviously "yes, probably, well,
> maybe you better check the docs?".
>
> My personal opinions on these are:
> (1): This is a real problem I face, and there isn't any good solution
> now. Support for this in numpy would be awesome.
> (2): This feels more like a convenience feature to me; we already have
> lots of ways to work with subsets of data. I probably wouldn't bother
> using it, but that's fine -- I don't use np.matrix either, but some
> people like it.
> (3): Well, it's a bit of a mess, but I guess it might be better than
> nothing?
>
> But that's just my opinion. I'm wondering if we can get any consensus
> on which of these we actually *want* (or maybe we want some fourth
> option!), and *then* we can try to figure out the best way to get
> there? Pretty much any implementation strategy we've talked about
> could work for any of these, but it's hard to decide between them if we
> don't even know what we're trying to do...
>
>
I go for 3 ;) And I think that is where we are heading. By default, masked
array operations look like 1), but by taking views one can get 2). I think
the crucial aspect here is the use of views, which both saves on storage and
fits with the current numpy concept of views.
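
To make the difference between 1) and 2) concrete, here is a minimal
pure-Python sketch. It deliberately avoids any numpy API, since that is
exactly what is still under discussion. The names NA and MaskedView are
hypothetical, and NaN is only a rough stand-in for a real NA value.

```python
import math

# Assumption: NaN approximates NA here; a real NA type could behave differently.
NA = math.nan

# Model 1): missingness is part of the data and propagates through operations.
data_with_na = [1.0, NA, 3.0]
total_1 = sum(data_with_na)  # NaN: the missing element makes the sum missing


# Model 2): all the data exists; a mask is laid over shared storage.
class MaskedView:
    """Hypothetical helper: a mask over storage that is shared, not copied."""

    def __init__(self, storage, mask):
        self.storage = storage   # reference to the same underlying list
        self.mask = list(mask)   # True means "hidden in this view"

    def sum(self):
        # Hidden elements are silently skipped rather than poisoning the result.
        return sum(x for x, hidden in zip(self.storage, self.mask) if not hidden)


storage = [1.0, 2.0, 3.0]
view_a = MaskedView(storage, [False, True, False])
view_b = MaskedView(storage, [True, False, False])

total_a = view_a.sum()      # skips storage[1]
total_b = view_b.sum()      # skips storage[0]

view_a.mask[1] = False      # unmasking brings the original value back
recovered = view_a.sum()
```

In this sketch both views share one storage list, so a second mask costs
only the mask itself, and unmasking recovers the original value. Under 1)
there is nothing underneath the NA to recover.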

Chuck