[Numpy-discussion] Concepts for masked/missing data

Sat Jun 25 12:48:05 EDT 2011

On Sat, Jun 25, 2011 at 10:26 AM, Matthew Brett <matthew.brett at gmail.com> wrote:

> Hi,
>
> On Sat, Jun 25, 2011 at 5:05 PM, Nathaniel Smith <njs at pobox.com> wrote:
> > So obviously there's a lot of interest in this question, but I'm
> > losing track of all the different issues that have been raised in the
> > 150-post thread of doom. I think I'll find this easier if we start by
> > putting aside the questions about implementation and such and focus
> > for now on the *conceptual model* that we want. Maybe I'm not the only
> > one?
> >
> > So as far as I can tell, there are three different ways of thinking
> > about masked/missing data that people have been using in the other
> > thread:
> >
> > 1) Missingness is part of the data. Some data is missing, some isn't,
> > this might change through computation on the data (just like some data
> > might change from a 3 to a 6 when we apply some transformation, NA |
> > True could be True, instead of NA), but we can't just "decide" that
> > some data is no longer missing. It makes no sense to ask what value is
> > "really" there underneath the missingness. And it's critical that we
> > keep track of this through all operations, because otherwise we may
> > silently give incorrect answers -- exactly like it's critical that we
> > keep track of the difference between 3 and 6.
>
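
To make option 1 concrete, here is a minimal Python sketch of how NA could propagate through operations. The `NA` sentinel and the helpers `na_or` and `na_add` are hypothetical names made up for illustration, not an existing or proposed NumPy API. It assumes NA means "unknown value", so `NA | True` is True but `NA | False` stays NA.

```python
class _NAType:
    """Hypothetical singleton standing in for a missing value."""

    def __repr__(self):
        return "NA"


NA = _NAType()  # assumption: one shared sentinel, compared with `is`


def na_or(a, b):
    """Logical OR on plain bools or NA, treating NA as 'unknown'."""
    # If either side is known to be True, the answer is True no matter
    # what the missing value would have been.
    if a is True or b is True:
        return True
    # Otherwise an unknown operand leaves the result unknown.
    if a is NA or b is NA:
        return NA
    return False


def na_add(a, b):
    """Addition where any missing operand makes the result missing."""
    if a is NA or b is NA:
        return NA
    return a + b


print(na_or(NA, True))   # the known True decides the result
print(na_or(NA, False))  # still unknown
print(na_add(NA, 3))     # missingness propagates through arithmetic
```

The point of the sketch is that there is no operation here that "unmasks" an NA; a result only becomes known when the known operands alone determine it.
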
> So far I see the difference between 1) and 2) being that you cannot
> unmask.  So, if you didn't even know you could unmask data, then it
> would not matter that 1) was being implemented by masks?
>
> > 2) All the data exists, at least in some sense, but we don't always
> > want to look at all of it. We lay a mask over our data to view and
> > manipulate only parts of it at a time. We might want to use different
> > masks at different times, mutate the mask as we go, etc. The most
> > important thing is to provide convenient ways to do complex
> > manipulations -- preserve masks through indexing operations, overlay
> > the mask from one array on top of another array, etc. When it comes to
> > other sorts of operations then we'd rather just silently skip the
> > masked values -- we know there are values that are masked, that's the
> > whole point, to work with the unmasked subset of the data, so if sum
> > returned NA then that would just be a stupid hassle.
>
> To clarify, you're proposing for:
>
> a = np.sum(np.array([np.NA, np.NA]))
>
> 1) -> np.NA
> 2) -> 0.0
>
> ?
>
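
The two answers in the question above can be sketched in plain Python as two different reductions. This is only an illustration under stated assumptions: `NA`, `sum_propagate` and `sum_skip` are hypothetical names, and `np.NA` itself is a proposal from this thread, not something this sketch relies on.

```python
NA = object()  # hypothetical sentinel for a missing/masked element


def sum_propagate(values):
    """Model 1: any missing element makes the whole sum missing."""
    total = 0.0
    for v in values:
        if v is NA:
            return NA
        total += v
    return total


def sum_skip(values):
    """Model 2: masked elements are silently ignored."""
    # With every element masked, this is the empty sum, i.e. 0.0.
    return sum((v for v in values if v is not NA), 0.0)


data = [NA, NA]
print(sum_propagate(data) is NA)  # model 1 reports missingness
print(sum_skip(data))             # model 2 returns the empty sum
```

Note that the two functions agree whenever nothing is missing, so the choice only matters for how missingness is reported.
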
> > But that's just my opinion. I'm wondering if we can get any consensus
> > on which of these we actually *want* (or maybe we want some fourth
> > option!), and *then* we can try to figure out the best way to get
> > there? Pretty much any implementation strategy we've talked about
> > could work for any of these, but hard to decide between them if we
> > don't even know what we're trying to do...
>
> I agree it's good to separate the API from the implementation.  I
> think the implementation is also important because I care about memory
> and possibly speed.  But, that is a separate problem from the API...
>
>
In a larger sense, we are seeking to add metadata to array elements and have
ufuncs that use that metadata together with the element values to compute
results. Off topic a bit, but it reminds me of the Burroughs 6600 that I
once used. The word size on that machine was 48 bits, so it could
accommodate both 6- and 8-bit characters, and 3 bits of metadata were
appended to mark the type. So there was a machine with 51-bit words ;) IIRC,
Knuth was involved in the design and helped with the OS, which was written
in ALGOL...

Chuck