[Numpy-discussion] missing data discussion round 2

Tue Jun 28 19:28:38 EDT 2011

On Tue, Jun 28, 2011 at 10:06 AM, Nathaniel Smith <njs at pobox.com> wrote:

> On Mon, Jun 27, 2011 at 2:03 PM, Mark Wiebe <mwwiebe at gmail.com> wrote:
> > On Mon, Jun 27, 2011 at 12:18 PM, Matthew Brett <matthew.brett at gmail.com>
> > wrote:
> >> You won't get complaints, you'll just lose a group of users, who will,
> >> I suspect, stick to NaNs, unsatisfactory as they are.
> >
> > This blade cuts both ways: we'd lose a group of users if we don't support
> > masking semantics, too.
>
> The problem is, that's inevitable. One might think that trying to find
> a compromise solution that picks a few key aspects of each approach
> would be a good way to make everyone happy, but in my experience, it
> mostly leads to systems that are a muddled mess and that make everyone
> unhappy. You're much better off saying screw it, these goals are in
> scope and those ones aren't, and we're going to build something
> consistent and powerful instead of focusing on how long the feature
> list is. That's also the problem with focusing too much on a list of
> use cases: you might capture everything on any single list, but there
> are actually an infinite variety of use cases that will arise in the
> future. If you can generalize beyond the use cases to find some simple
> and consistent mental model, and implement that, then that'll work for
> all those future use cases too. But sometimes that requires deciding
> what *not* to implement.
>
> Just my opinion, but it's fairly hard won.
>

I don't think the solution I'm proposing is any of muddled, a mess, or a
compromise. There are still rough edges to be worked out, but that's the
nature of the design process.

> Anyway, it's pretty clear that in this particular case, there are two
> distinct features that different people want: the missing data
> feature, and the masked array feature.

I don't believe these are different. It seems to me that, in nearly all
their use cases, people wanting masked arrays want missing data without
touching their data. If people have use cases that can't be handled with
this approach, it would be nice to have specific examples that the current
NEP fails to address.

> The more I think about it, the
> less I see how they can be combined into one dessert topping + floor
> wax solution. Here are three particular points where they seem to
> contradict each other:
>
> Missing data: We think memory usage is critical. The ideal solution
> has zero overhead. If we can't get that, then at the very least we
> want the overhead to be 1 bit/item instead of 1 byte/item.
> Masked arrays: We say, it's critical to have good ways to manipulate
> the masking array, share it between multiple arrays, and so forth. And
> numpy already has great support for all those things! So obviously the
> masking array should be exposed as a standard ndarray.
>
> Missing data: Once you've assigned NA to a value, you should *not* be
> able to get at what was stored there before.
> Masked arrays: You must be able to unmask a value and recover what was
> stored there before.
>

My current proposal provides both of these at the same time. It does not
allow you to unmask a value without also setting the value stored in its
element memory.
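
To make that rule concrete, here is a minimal pure-Python sketch of the
semantics, not the NEP's implementation. The names `NA`, `MaskedValues`,
and `_exposed` are hypothetical and exist only for this illustration.

```python
class _NAType:
    """Hypothetical sentinel standing in for np.NA."""

    def __repr__(self):
        return "NA"


NA = _NAType()


class MaskedValues:
    """Toy 1-D container with a hidden mask (hypothetical, not the NEP API)."""

    def __init__(self, values):
        self._data = list(values)
        self._exposed = [True] * len(self._data)

    def __getitem__(self, i):
        # A masked element always reads back as NA.
        return self._data[i] if self._exposed[i] else NA

    def __setitem__(self, i, value):
        if value is NA:
            # Mask the element; nothing in this interface can read the
            # old payload any more.
            self._exposed[i] = False
        else:
            # The only way to unmask is to store a new value, so the old
            # payload is overwritten at the same moment.
            self._data[i] = value
            self._exposed[i] = True


a = MaskedValues([1.0, 2.0, 3.0])
a[1] = NA     # a[1] now reads as NA
a[1] = 5.0    # unmasking and overwriting happen together
```

Because the mask is never exposed on its own, there is no operation that
brings the original 2.0 back.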

> (You might think, what difference does it make if you *can* unmask an
> item? Us missing data folks could just ignore this feature. But:
> whatever we end up implementing is something that I will have to
> explain over and over to different people, most of them not
> particularly sophisticated programmers. And there's just no sensible
> way to explain this idea that if you store some particular value, then
> it replaces the old value, but if you store NA, then the old value is
> still there. They will get confused, and then store it away as another
> example of how computers are arbitrary and confusing and they're just
> too dumb to understand them, and I *hate* doing that to people. Plus
> the more that happens, the more they end up digging themselves into
> some hole by trying things at random, and then I have to dig them out
> again. So the point is, we can go either way, but in both ways there
> *is* a cost, and we have to decide.)
>
> Missing data: It's critical that NAs propagate through reduction
> operations by default, though there should also be some way to turn
> this off.
> Masked arrays: Masked values should be silently ignored by reduction
> operations, and having to remember to pass a special flag to turn on
> this behavior on every single ufunc call would be a huge pain.
>

This isn't a difference between "missing data" and "masked arrays"; you're
describing the two ways of thinking about missing data in the NEP.
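
As a rough illustration of those two behaviours, here is a pure-Python
sketch that uses NaN as a stand-in for a missing value. The function
`total` is hypothetical; its `skipmissing` flag only borrows the name used
elsewhere in this thread and is not an existing NumPy parameter.

```python
import math


def total(values, skipmissing=False):
    """Sum floats, treating NaN as the missing-value marker (toy example)."""
    result = 0.0
    for v in values:
        if math.isnan(v):
            if skipmissing:
                continue          # "ignore" semantics: drop the item
            return math.nan       # "propagate" semantics: result is missing
        result += v
    return result


data = [1.0, math.nan, 3.0]
propagated = total(data)                  # missing value poisons the sum
skipped = total(data, skipmissing=True)   # missing value is left out
```

In NumPy today the same contrast shows up between np.sum and np.nansum on
an array containing NaN, and numpy.ma reductions skip masked elements.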

> (Masked array advocates: please correct me if I'm misrepresenting you
> anywhere above!)
>
> > That said, Travis favors doing both, so there's a good chance there will
> > be time for it.
>
> One issue with the current draft is that I don't see any addressing of
> how masking-missing and bit-pattern-missing interact:
>  a = np.zeros(10, dtype="NA[f8]")
>  a.flags.hasmask = True
>  a[5] = np.NA   # Now what?
>

Yes, this is a rough edge that needs to be worked out. Probably the array
mask should be more primal. Another question is which missing data
mechanism the result should use when you add an array with a mask to an
array with an "NA[]" dtype.

> If you're going to implement both things anyway, and you need to
> figure out how they interact anyway, then why not split them up into
> two totally separate features?
>

Because they're both approaches for dealing with missing data.

-Mark

> Here's my proposal:
> 1) Add a purely dtype-based support for missing data:
> 1.A) Add some flags/metadata to the dtype structure to let it describe
> what a missing value looks like for an element of its type. Something
> like, an example NA value plus a function that can be called to
> identify NAs when they occur in arrays. (Notice that this interface is
> general enough to handle both the bit-stealing approach and the
> maybe() approach.)
> 1.B) Add an np.NA object, and teach the various coercion loops to use
> the above fields in the dtype structure to handle it.
> 1.C) Teach the various reduction loops that if a particular flag is
> set in the dtype, then they also should check for NAs and handle them
> appropriately. (If this flag is not set, then it means that this
> dtype's ufunc loops are already NA aware and the generic machinery is
> not needed unless skipmissing=True is given. This is useful for
> user-defined dtypes, and probably also a nice optimization for floats
> using NaN.)
> 1.D) Finally, as a convenience, add some standard NA-aware dtypes.
> Personally, I wouldn't bother with the complicated string-based
> mini-language described in the current NEP; just define some standard
> NA-enabled dtype objects in the numpy namespace or provide a function
> that takes a dtype + a NA bit-pattern and spits out an NA-enabled
> dtype or whatever.
>
> 2) Add better masked array support.
> 2.A) Masked arrays are simply arrays with an extra attribute
> '.visible', which is an arbitrary numpy array that is broadcastable to
> the same shape as the masked array. There's no magic here -- if you
> say a.visible = b.visible, then they now share a visibility array,
> according to the ordinary rules of Python assignment. (Well, there
> needs to be some check for shape compatibility, but that's not much
> magic.)
> 2.B) To minimize confusion with the missing value support, the way you
> mask/unmask items is through expressions like 'a.visible[10] = False';
> there is no magic np.masked object. (There are a few options for what
> happens when you try to use scalar indexing explicitly to extract an
> invisible value -- you could return the actual value from behind the
> mask, or throw an error, or return a scalar masked array whose
> .visible attribute was a scalar array containing False. I don't know
> what the people who actually use this stuff would prefer :-).)
> 2.C) Indexing and shape-changing operations on the masked array are
> automatically applied to the .visible array as well. (Attempting to
> call .resize() on an array which is being used as the .visible
> attribute of some other array is an error.)
> 2.D) Ufuncs on masked arrays always ignore invisible items. We can
> probably share some code here between the handling of skipmissing=True
> for NA-enabled dtypes and invisible items in masked arrays, but that's
> purely an implementation detail.
>
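
A toy sketch of how 2.A and 2.D above might behave, using plain Python
lists instead of ndarrays. `VisibleArray` is a hypothetical class written
only for this illustration; the shape check is reduced to a length
comparison and broadcasting is left out.

```python
class VisibleArray:
    """Toy 1-D stand-in for an array with a '.visible' attribute."""

    def __init__(self, data):
        self.data = list(data)
        self._visible = [True] * len(self.data)

    @property
    def visible(self):
        return self._visible

    @visible.setter
    def visible(self, flags):
        # The only "magic": a shape-compatibility check.
        if len(flags) != len(self.data):
            raise ValueError("visibility array has the wrong length")
        self._visible = flags  # keep a reference, do not copy

    def sum(self):
        # Reductions ignore invisible items, as in 2.D.
        return sum(x for x, seen in zip(self.data, self._visible) if seen)


a = VisibleArray([1, 2, 3, 4])
b = VisibleArray([10, 20, 30, 40])
b.visible = a.visible      # both arrays now share one visibility list
a.visible[2] = False       # hide item 2 through a ...
print(a.sum(), b.sum())    # ... and it is hidden in b as well
```

The sharing falls out of ordinary Python assignment, which is the point of
2.A: hiding an item through one array hides it in every array that holds
the same visibility object.
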
> This approach to masked arrays requires that the ufunc machinery have
> some special knowledge of what a masked array is, so masked arrays
> would have to become part of the core. I'm not sure whether or not
> they should be part of the np.ndarray base class or remain as a
> subclass, though. There's an argument that they're more of a
> convenience feature like np.matrix, and code which interfaces between
> ndarrays and C becomes more complicated if it has to be prepared to
> handle visibility. (Note that in contrast, ndarrays can already
> contain arbitrary user-defined dtypes, so the missing value support
> proposed here doesn't add any new issues to C interfacing.) So maybe
> it'd be better to leave it as a core supported subclass? Could go
> either way.
>
> -- Nathaniel