[Numpy-discussion] NA masks in the next numpy release?

Fri Oct 28 13:39:07 EDT 2011

Hi,

On Thu, Oct 27, 2011 at 10:56 PM, Benjamin Root <ben.root at ou.edu> wrote:
>
>
> On Thursday, October 27, 2011, Charles R Harris <charlesr.harris at gmail.com>
> wrote:
>>
>>
>> On Thu, Oct 27, 2011 at 7:16 PM, Travis Oliphant <oliphant at enthought.com>
>> wrote:
>>>
>>> That is a pretty good explanation. I find myself convinced by Matthew's
>>> arguments. I think that being able to separate ABSENT from IGNORED is a
>>> good idea. I also like being able to control SKIP and PROPAGATE (but I
>>> think the current implementation allows this already).
>>>
>>> What is the counter-argument to this proposal?
>>>
>>
>> What exactly do you find convincing? The current masks propagate by
>> default:
>>
>> In [1]: a = ones(5, maskna=1)
>>
>> In [2]: a[2] = NA
>>
>> In [3]: a
>> Out[3]: array([ 1.,  1.,  NA,  1.,  1.])
>>
>> In [4]: a + 1
>> Out[4]: array([ 2.,  2.,  NA,  2.,  2.])
>>
>> In [5]: a[2] = 10
>>
>> In [6]: a
>> Out[6]: array([  1.,   1.,  10.,   1.,   1.], maskna=True)
>>
>>
>> I don't see an essential difference between the implementation using masks
>> and one using bit patterns. The mask, when attached to the original array,
>> just adds a bit pattern by extending all the types by one byte, an approach
>> that easily extends to all existing and future types, which is why Mark went
>> that way for the first implementation given the time available. The masks
>> are hidden because folks wanted something that behaved more like R and also
>> because of the desire to combine the missing, ignore, and later possibly bit
>> patterns in a unified manner. Note that the pseudo assignment was also meant
>> to look like R. Adding true bit patterns to numpy isn't trivial and I
>> believe Mark was thinking of parametrized types for that.
>>
>> The main problems I see with masks are unified storage and possibly memory
>> use. The rest is just behavior and desired API and that can be adjusted
>> within the current implementation. There is nothing essentially masky about
>> masks.
>>
>> Chuck
>>
>>
>
> I think Chuck sums it up quite nicely.  The implementation detail about
> using masks versus bit patterns can still be discussed and addressed.
> Personally, I just don't see how parameterized dtypes would be easier to use
> than the pseudo assignment.
>
> The elegance of Mark's solution was to consider the treatment of missing
> data in a unified manner.  This puts missing data in a more prominent spot
> for extension builders, which should greatly improve support throughout the
> ecosystem.

Are extension builders then required to use the numpy C API to get
their data?  Speaking as an extension builder, I would rather you gave
me the mask and the bitpattern information and let me do that myself.
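
As a rough illustration of that preference, here is a minimal sketch in
plain Python. It is not the numpy API: the MaskedValues container and the
sum_skip / sum_propagate helpers are hypothetical names made up for this
example. The only point is that once the values and the mask are both
visible, the consumer can choose skip or propagate behavior itself.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class MaskedValues:
    """Hypothetical container: raw values plus a parallel validity mask."""
    values: List[float]
    valid: List[bool]  # True means the value is present, False means missing


def sum_skip(arr: MaskedValues) -> float:
    """Mask-like behavior: ignore missing entries and sum the rest."""
    return sum(v for v, ok in zip(arr.values, arr.valid) if ok)


def sum_propagate(arr: MaskedValues) -> Optional[float]:
    """NaN-like behavior: any missing entry makes the result missing.

    None stands in for a missing result in this sketch.
    """
    if not all(arr.valid):
        return None
    return sum(arr.values)


# Five values with the third one marked missing; the stored 10.0 is still
# there underneath, it is just not counted while the mask hides it.
a = MaskedValues([1.0, 1.0, 10.0, 1.0, 1.0], [True, True, False, True, True])

print(sum_skip(a))       # 4.0
print(sum_propagate(a))  # None
```

With the mask exposed like this, the skip versus propagate choice lives in
the consuming function rather than in the array.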

> By letting there be a single missing data framework (instead of
> two) all that users need to figure out is when they want nan-like behavior
> (propagate) or to be more like masks (skip).  Numpy takes care of the rest.
> I like using masked arrays because I don't have to use nansum in my
> library functions to guard against the possibility of receiving nans.
> Duck-typing is a good thing.
>
> My argument against separating IGNORE and PROPAGATE is that it becomes too
> tempting to want to mix these in an array, but the desired behavior would
> likely become ambiguous.
>
> There is one other problem that I just thought of that I don't think has
> been outlined in either NEP.  What if I perform an operation between an
> array set up with propagate NAs and an array with skip NAs?

These cases are explicitly covered in the alterNEP:

https://gist.github.com/1056379/

Best,

Matthew