[Numpy-discussion] NA masks in the next numpy release?
Matthew Brett
matthew.brett at gmail.com
Fri Oct 28 13:39:07 EDT 2011
Hi,
On Thu, Oct 27, 2011 at 10:56 PM, Benjamin Root <ben.root at ou.edu> wrote:
>
>
> On Thursday, October 27, 2011, Charles R Harris <charlesr.harris at gmail.com>
> wrote:
>>
>>
>> On Thu, Oct 27, 2011 at 7:16 PM, Travis Oliphant <oliphant at enthought.com>
>> wrote:
>>>
>>> That is a pretty good explanation. I find myself convinced by Matthew's
>>> arguments. I think that being able to separate ABSENT from IGNORED is a
>>> good idea. I also like being able to control SKIP and PROPAGATE (but I
>>> think the current implementation allows this already).
>>>
>>> What is the counter-argument to this proposal?
>>>
>>
>> What exactly do you find convincing? The current masks propagate by
>> default:
>>
>> In [1]: a = ones(5, maskna=1)
>>
>> In [2]: a[2] = NA
>>
>> In [3]: a
>> Out[3]: array([ 1., 1., NA, 1., 1.])
>>
>> In [4]: a + 1
>> Out[4]: array([ 2., 2., NA, 2., 2.])
>>
>> In [5]: a[2] = 10
>>
>> In [5]: a
>> Out[5]: array([ 1., 1., 10., 1., 1.], maskna=True)
>>
>>
>> I don't see an essential difference between the implementation using masks
>> and one using bit patterns, the mask when attached to the original array
>> just adds a bit pattern by extending all the types by one byte, an approach
>> that easily extends to all existing and future types, which is why Mark went
>> that way for the first implementation given the time available. The masks
>> are hidden because folks wanted something that behaved more like R and also
>> because of the desire to combine the missing, ignore, and later possibly bit
>> patterns in a unified manner. Note that the pseudo assignment was also meant
>> to look like R. Adding true bit patterns to numpy isn't trivial and I
>> believe Mark was thinking of parametrized types for that.
>>
>> The main problems I see with masks are unified storage and possibly memory
>> use. The rest is just behavior and desired API and that can be adjusted
>> within the current implementation. There is nothing essentially masky about
>> masks.
>>
>> Chuck
>>
>>
>
> I think Chuck sums it up quite nicely. The implementation detail about
> using mask versus bit patterns can still be discussed and addressed.
> Personally, I just don't see how parameterized dtypes would be easier to use
> than the pseudo assignment.
>
> The elegance of Mark's solution was to consider the treatment of missing
> data in a unified manner. This puts missing data in a more prominent spot
> for extension builders, which should greatly improve support throughout the
> ecosystem.

Are extension builders then required to use the numpy C API to get
their data? Speaking as an extension builder, I would rather you gave
me the mask and the bitpattern information and let me do that myself.

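As a rough Python-level illustration of what "give me the mask and let me
do that myself" could look like, here is a sketch that uses the existing
numpy.ma module as a stand-in. It is not the maskna API from the branch
under discussion, and the hand-written loop is only a placeholder for
whatever an extension would really do with the two buffers.

```python
import numpy as np

# Assumption: numpy.ma stands in for the NA-mask implementation discussed
# in this thread; this is NOT the maskna API.
a = np.ma.masked_array([1.0, 2.0, 3.0, 4.0],
                       mask=[False, False, True, False])

data = np.ma.getdata(a)        # underlying values, masked slot included
mask = np.ma.getmaskarray(a)   # boolean array, True where the value is masked

# With both buffers in hand, the caller decides how to treat masked slots.
# Here they are simply skipped.
total = sum(float(x) for x, m in zip(data, mask) if not m)
print(total)
```

The point is only that the data and the mask arrive as two plain arrays,
so the skip-or-propagate decision stays with the caller.
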
> By letting there be a single missing data framework (instead of
> two) all that users need to figure out is when they want nan-like behavior
> (propagate) or to be more like masks (skip). Numpy takes care of the rest.
> There is a reason why I like using masked arrays because I don't have to
> use nansum in my library functions to guard against the possibility of
> receiving nans. Duck-typing is a good thing.
>
> My argument against separating IGNORE and PROPAGATE is that it becomes too
> tempting to want to mix these in an array, but the desired behavior would
> likely become ambiguous.
>
> There is one other problem that I just thought of that I don't think has
> been outlined in either NEP. What if I perform an operation between an
> array set up with propagate NAs and an array with skip NAs?

These are explicitly covered in the alterNEP:
https://gist.github.com/1056379/

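For reference, the two behaviours named in this thread (propagate and
skip) can be sketched with functions that already exist in numpy. The
sketch uses NaN and numpy.ma as stand-ins. It does not use the NA-mask
API from either NEP, and it says nothing about how the alterNEP resolves
the mixed case asked about above.

```python
import numpy as np

values = np.array([1.0, 2.0, np.nan, 4.0])

# Propagate: the missing value poisons the reduction.
propagated = np.sum(values)

# Skip: the missing value is left out of the reduction.
skipped = np.nansum(values)

# Masked arrays also skip, without the caller switching functions.
masked = np.ma.masked_invalid(values)
masked_sum = masked.sum()

print(propagated, skipped, masked_sum)
```
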
Best,
Matthew