[Numpy-discussion] Missing data again

Tue Mar 6 15:25:09 EST 2012

On Sat, Mar 3, 2012 at 8:30 PM, Travis Oliphant <travis at continuum.io> wrote:
> Hi all,

Hi Travis,

Thanks for bringing this back up.

Have you looked at the summary from the last thread?
  https://github.com/njsmith/numpy/wiki/NA-discussion-status
The goal was to try and at least work out what points we all *could*
agree on, to have some common footing for further discussion. I won't
copy the whole thing here, but I'd summarize the state as:
  -- There are clearly two fairly different conceptual
models/use cases in play here. For one of them (R-style "missing data"
cases) it's pretty clear what the desired semantics would be. For the
other (temporary "ignored values") there's still substantive
disagreement.
  -- We *haven't* yet established what we want numpy to actually support.

IMHO the critical next step is this latter one -- maybe we want to
fully support both use cases. Maybe it's really only one of them
that's worth trying to support in the numpy core right now. Maybe it's
just one of them, but it's worth doing so thoroughly that it should
have multiple implementations. Or whatever.

I fear that if we don't talk about these big picture questions and
just wade directly back into round-and-round arguments about API
details then we'll never get anywhere.

[...]
> Because it is slated to go into release 1.7, we need to re-visit the masked array discussion again.    The NEP process is the appropriate one and I'm glad we are taking that route for these discussions.   My goal is to get consensus in order for code to get into NumPy (regardless of who writes the code).    It may be that we don't come to a consensus (reasonable and intelligent people can disagree on things --- look at the coming election...).   We can represent different parts of what is fortunately a very large user-base of NumPy users.
>
> First of all, I want to be clear that I think there is much great work that has been done in the current missing data code.  There are some nice features in the where clause of the ufunc and the machinery for the iterator that allows re-using ufunc loops that are not re-written to check for missing data.   I'm sure there are other things as well that I'm not quite aware of yet.    However, I don't think the API presented to the numpy user presently is the correct one for NumPy 1.X.
>
> A few particulars:
>
>        * the reduction operations need to default to "skipna" --- this is the most common use case which has been re-inforced again to me today by a new user to Python who is using masked arrays presently

This is one of the points where the two conceptual models disagree
(see also Skipper's point down-thread). If you have "missing data",
then propagation has to be the default -- the sum of 1, 2, and
I-DON'T-KNOW-MAYBE-7-MAYBE-12 is not 3. If you have some data there
but you've asked numpy to temporarily ignore it, then, well, duh, of
course it should ignore it.
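
To make the difference concrete, here is a rough sketch. It uses NaN and
numpy.ma purely as stand-ins for the two models; neither is the API under
discussion in the NEP, and the variable names are just for illustration.

```python
import numpy as np

# "Missing data" model: the third value is unknown, so NaN stands in for it.
values = np.array([1.0, 2.0, np.nan])

# Propagating reduction: an unknown input gives an unknown result.
propagated = np.sum(values)

# Skipping reduction: the unknown value is silently dropped.
skipped = np.nansum(values)

# "Ignored values" model: the value 7.0 exists, but we asked to ignore it.
masked = np.ma.masked_array([1.0, 2.0, 7.0], mask=[False, False, True])
ignored = masked.sum()

print(propagated, skipped, ignored)
```

Same three numbers, same reduction, and two defensible answers, depending
on which model you have in mind.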

>        * the mask needs to be visible to the user if they use that approach to missing data (people should be able to get a hold of the mask and work with it in Python)

This is also a point where the two conceptual models disagree.

Actually this is one of the original arguments we made against the NEP
design -- that if you want missing data, then having a mask at all is
counterproductive, and if you are ignoring data, then of course it
should be easy to manipulate the ignore mask. The rationale for the
current design is to compromise between these two approaches -- there
is a mask, but it's hidden behind a curtain. Mostly. (This may be a
compromise in the Solomonic sense.)
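
For the "ignored values" case, this is roughly what an easily manipulated
mask looks like. The sketch below uses the existing numpy.ma module as an
illustration only; it is not the NEP design, where the mask is not exposed
this directly.

```python
import numpy as np

data = np.ma.masked_array([1.0, 2.0, 7.0], mask=[False, False, True])

# The mask is an ordinary boolean array that the user can inspect.
mask_before = np.ma.getmaskarray(data).copy()
sum_while_ignored = data.sum()

# The ignored value was never destroyed; it is still in the underlying data.
hidden_value = data.data[2]

# Clearing the mask brings the value back into play.
data.mask = [False, False, False]
sum_after_unmask = data.sum()

print(mask_before, sum_while_ignored, hidden_value, sum_after_unmask)
```

With true missing data there is nothing behind the mask to recover, which
is why exposing it helps one model and not the other.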

>        * bit-pattern approaches to missing data (at least for float64 and int32) need to be implemented.
>
>        * there should be some way when using "masks" (even if it's hidden from most users) for missing data to separate the low-level ufunc operation from the operation on the masks...

I don't understand what this means.

> I have heard from several users that they will *not use the missing data* in NumPy as currently implemented, and I can now see why.    For better or for worse, my approach to software is generally very user-driven and very pragmatic.  On the other hand, I'm also a mathematician and appreciate the cognitive compression that can come out of well-formed structure.    None-the-less, I'm an *applied* mathematician and am ultimately motivated by applications.
>
> I will get a hold of the NEP and spend some time with it to discuss some of this in that document.   This will take several weeks (as PyCon is next week and I have a tutorial I'm giving there).    For now, I do not think 1.7 can be released unless the masked array is labeled *experimental*.

In project management terms, I see three options:
1) Put a big warning label on the functionality and leave it for now
("If this option is given, np.asarray returns a masked array. NOTE: IN
THE NEXT RELEASE, IT MAY INSTEAD RETURN A BAG OF RABID, HUNGRY
WEASELS. NO GUARANTEES.")
2) Move the code back out of mainline and into a branch until
there's consensus.
3) Hold up the release until this is all sorted.

I come from the project-management school that says you should always
have a releasable mainline, keep unready code in branches, and never
hold up the release for features, so (2) seems obvious to me. But I
seem to be very much in the minority on that[1], so oh well :-). I
don't have any objection to (1), personally. (3) seems like a bad
idea. Just my 2 pence.

-- Nathaniel

[1] See replies here:
http://thread.gmane.org/gmane.comp.python.numeric.general/46460/focus=46546