[Numpy-discussion] NA masks in the next numpy release?

Fri Oct 28 14:37:38 EDT 2011

Hi,

On Fri, Oct 28, 2011 at 11:16 AM, Benjamin Root <ben.root at ou.edu> wrote:
> On Fri, Oct 28, 2011 at 12:39 PM, Matthew Brett <matthew.brett at gmail.com>
> wrote:
>>
>> Hi,
>>
>> On Thu, Oct 27, 2011 at 10:56 PM, Benjamin Root <ben.root at ou.edu> wrote:
>> >
>> >
>> > On Thursday, October 27, 2011, Charles R Harris
>> > <charlesr.harris at gmail.com>
>> > wrote:
>> >>
>> >>
>> >> On Thu, Oct 27, 2011 at 7:16 PM, Travis Oliphant
>> >> <oliphant at enthought.com>
>> >> wrote:
>> >>>
>> >>> That is a pretty good explanation.   I find myself convinced by
>> >>> Matthew's
>> >>> arguments.    I think that being able to separate ABSENT from IGNORED
>> >>> is a
>> >>> good idea.   I also like being able to control SKIP and PROPAGATE (but
>> >>> I
>> >>> think the current implementation allows this already).
>> >>>
>> >>> What is the counter-argument to this proposal?
>> >>>
>> >>
>> >> What exactly do you find convincing? The current masks propagate by
>> >> default:
>> >>
>> >> In [1]: a = ones(5, maskna=1)
>> >>
>> >> In [2]: a[2] = NA
>> >>
>> >> In [3]: a
>> >> Out[3]: array([ 1.,  1.,  NA,  1.,  1.])
>> >>
>> >> In [4]: a + 1
>> >> Out[4]: array([ 2.,  2.,  NA,  2.,  2.])
>> >>
>> >> In [5]: a[2] = 10
>> >>
>> >> In [6]: a
>> >> Out[6]: array([  1.,   1.,  10.,   1.,   1.], maskna=True)
>> >>
>> >>
>> >> I don't see an essential difference between the implementation using
>> >> masks
>> >> and one using bit patterns, the mask when attached to the original
>> >> array
>> >> just adds a bit pattern by extending all the types by one byte, an
>> >> approach
>> >> that easily extends to all existing and future types, which is why Mark
>> >> went
>> >> that way for the first implementation given the time available. The
>> >> masks
>> >> are hidden because folks wanted something that behaved more like R and
>> >> also
>> >> because of the desire to combine the missing, ignore, and later
>> >> possibly bit
>> >> patterns in a unified manner. Note that the pseudo assignment was also
>> >> meant
>> >> to look like R. Adding true bit patterns to numpy isn't trivial and I
>> >> believe Mark was thinking of parametrized types for that.
>> >>
>> >> The main problems I see with masks are unified storage and possibly
>> >> memory
>> >> use. The rest is just behavior and desired API and that can be adjusted
>> >> within the current implementation. There is nothing essentially masky
>> >> about
>> >> masks.
>> >>
>> >> Chuck
>> >>
>> >>
>> >
>> > I think Chuck sums it up quite nicely.  The implementation detail about
>> > using mask versus bit patterns can still be discussed and addressed.
>> > Personally, I just don't see how parameterized dtypes would be easier to
>> > use
>> > than the pseudo assignment.
>> >
>> > The elegance of Mark's solution was to consider the treatment of missing
>> > data in a unified manner.  This puts missing data in a more prominent
>> > spot
>> > for extension builders, which should greatly improve support throughout
>> > the
>> > ecosystem.
>>
>> Are extension builders then required to use the numpy C API to get
>> their data?  Speaking as an extension builder, I would rather you gave
>> me the mask and the bitpattern information and let me do that myself.
>>
>
> Forgive me, I wasn't clear.  What I am speaking of is more about a typical
> human failing.  If a programmer for a module never encounters masked arrays,
> then when they code up a function to operate on numpy data, it is quite
> likely that they would never take it into consideration.  Notice the
> prolific use of "np.asarray()" even within the numpy codebase, which
> destroys masked arrays.

Hmm - that sounds like it could cause some surprises.
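
A small illustration of that kind of surprise, using the existing
numpy.ma masked arrays rather than the new NA mask code (so this only
sketches the general failure mode, not the behaviour of either NEP):

```python
import numpy as np

# A masked array whose middle element is masked out.
masked = np.ma.masked_array([1.0, 2.0, 3.0], mask=[False, True, False])

# np.asarray converts to a base ndarray, so the mask is silently dropped.
plain = np.asarray(masked)

# np.asanyarray passes ndarray subclasses through, so the mask survives.
kept = np.asanyarray(masked)

print(type(plain).__name__, plain.sum())  # the masked 2.0 is included
print(type(kept).__name__, kept.sum())    # the masked 2.0 is skipped
```

The two sums differ because the first one includes the value that was
supposed to be masked.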

So, what you were saying was just that it was good that masked arrays
were now closer to the core?   That's reasonable, but I don't think
it's relevant to the current discussion.  I think we all agree it is
nice to have masked arrays in the core.

> However, by making missing data support more integral into the core of
> numpy, then it is far more likely that a programmer would take it into
> consideration when designing their algorithm, or at least explicitly
> document that their module does not support missing data.  Both NEPs do
> this by making missing data front-and-center.  However, my belief is that
> Mark's approach is easier to comprehend and is cleaner.  Cleaner features
> mean that they are more likely to be used.

The main motivation for the alterNEP was our strong feeling that
separating ABSENT and IGNORE was easier to comprehend and cleaner.  I
think it would be hard to argue that the alterNEP idea is not more
explicit.

>> > By letting there be a single missing data framework (instead of
>> > two) all that users need to figure out is when they want nan-like
>> > behavior
>> > (propagate) or to be more like masks (skip).  Numpy takes care of the
>> > rest.
>> >  There is a reason why I like using masked arrays because I don't have
>> > to
>> > use nansum in my library functions to guard against the possibility of
>> > receiving nans.  Duck-typing is a good thing.
>> >
>> > My argument against separating IGNORE and PROPAGATE is that it becomes
>> > too
>> > tempting to want to mix these in an array, but the desired behavior
>> > would
>> > likely become ambiguous.
>> >
>> > There is one other problem that I just thought of that I don't think has
>> > been outlined in either NEP.  What if I perform an operation between an
>> > array set up with propagate NAs and an array with skip NAs?
>>
>> These are explicitly covered in the alterNEP:
>>
>> https://gist.github.com/1056379/
>>
>
> Sort of.  You speak of reduction operations for a single array with a mix of
> NA and IGNOREs.  I guess in that case, it wouldn't make a difference for
> element-wise operations between two arrays (plus adding the NAs propagate
> harder rule).  Although, what if skipna=True?  I guess I would feel better
> seeing explicit examples for different combinations of settings (plus, how
> would one set those for math operators?).  In this case, I have a problem
> with this mixed situation.  I would think that IGNORE + NA = IGNORE, because
> if you are skipping it, then it is skipped, regardless of the other side of
> the operator.  (precedent: a masked array summed against an array of NANs).

I'm using IGNORED as a type of value.  What you do to that value
depends on what you said to do to that value.  You might want to SKIP
that type of value, or PROPAGATE.

If you said to 'skip' IGNORED but 'propagate' ABSENT, then IGNORED +
ABSENT == ABSENT.   I think it isn't ambiguous, but I'm happy to be
corrected.
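
A minimal sketch of that rule in plain Python, written as a reduction
over a mix of values.  Everything here is hypothetical: the Missing
class, the reduce_sum function and the policy dict are illustration
only and are not the API of numpy or of either NEP.  The precedence of
ABSENT over IGNORED when both propagate is an assumption, following
the "NAs propagate harder" rule mentioned above.

```python
class Missing:
    """Hypothetical marker for one kind of missing value."""

    def __init__(self, name):
        self.name = name

    def __repr__(self):
        return self.name


ABSENT = Missing("ABSENT")
IGNORED = Missing("IGNORED")


def reduce_sum(values, policy):
    """Sum values, treating each missing kind as the policy says.

    policy maps ABSENT and IGNORED to "skip" or "propagate".
    """
    propagated = set()
    total = 0.0
    for v in values:
        if isinstance(v, Missing):
            if policy[v] == "propagate":
                propagated.add(v)
            # A skipped value contributes nothing to the sum.
            continue
        total += v
    # Assumed precedence: ABSENT wins over IGNORED if both propagate.
    if ABSENT in propagated:
        return ABSENT
    if IGNORED in propagated:
        return IGNORED
    return total


policy = {IGNORED: "skip", ABSENT: "propagate"}
print(reduce_sum([IGNORED, ABSENT], policy))    # ABSENT
print(reduce_sum([1.0, IGNORED, 2.0], policy))  # 3.0
```

The point is only that the result follows from the kind of value plus
the instruction given for that kind, so the mixed case has a definite
answer.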

Best,

Matthew