[Numpy-discussion] NA masks in the next numpy release?

Wes McKinney wesmckinn at gmail.com
Mon Oct 24 13:12:15 EDT 2011


On Mon, Oct 24, 2011 at 10:54 AM, Charles R Harris
<charlesr.harris at gmail.com> wrote:
>
>
> On Mon, Oct 24, 2011 at 8:40 AM, Charles R Harris
> <charlesr.harris at gmail.com> wrote:
>>
>>
>> On Sun, Oct 23, 2011 at 11:23 PM, Wes McKinney <wesmckinn at gmail.com>
>> wrote:
>>>
>>> On Sun, Oct 23, 2011 at 8:07 PM, Eric Firing <efiring at hawaii.edu> wrote:
>>> > On 10/23/2011 12:34 PM, Nathaniel Smith wrote:
>>> >
>>> >> like. And in this case I do think we can come up with an API that will
>>> >> make everyone happy, but that Mark's current API probably can't be
>>> >> incrementally evolved to become that API.)
>>> >>
>>> >
>>> > No one could object to coming up with an API that makes everyone happy,
>>> > provided that it actually gets coded up, tested, and is found to be
>>> > fast
>>> > and maintainable.  When you say the API probably can't be evolved, do
>>> > you mean that the underlying implementation also has to be redone?  And
>>> > if so, who will do it, and when?
>>> >
>>> > Eric
>>>
>>> I personally am a bit apprehensive: I'm worried about the masked
>>> array abstraction "leaking" through to users of pandas, something
>>> I simply will not accept (it's why I decided against using numpy.ma
>>> early on, that plus its performance problems). Basically, if an
>>> understanding of masked arrays is a prerequisite for using pandas,
>>> the whole thing is DOA to me, as it undermines the usability
>>> arguments I've been making for switching to Python (from R) for
>>> data analysis and statistical computing.
>>
>> The missing data functionality looks far more like R than numpy.ma.
>>
>
> For instance
>
> In [8]: a = arange(5, maskna=1)
>
> In [9]: a[2] = np.NA
>
> In [10]: a.mean()
> Out[10]: NA(dtype='float64')
>
> In [11]: a.mean(skipna=1)
> Out[11]: 2.0
>
> In [12]: a = arange(5)
>
> In [13]: b = a.view(maskna=1)
>
> In [14]: a.mean()
> Out[14]: 2.0
>
> In [15]: b[2] = np.NA
>
> In [16]: b.mean()
> Out[16]: NA(dtype='float64')
>
> In [17]: b.mean(skipna=1)
> Out[17]: 2.0
>
> Chuck
>

I don't really agree with you.

Some sample R code:

> arr <- rnorm(10)
> arr[5:8] <- NA
> arr
 [1]  0.6451460 -1.1285552  0.6869828  0.4018868         NA         NA
 [7]         NA         NA  0.3322803 -1.9201257
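
For comparison, here is roughly what that R snippet looks like against
the NA-mask API from the session quoted above. This is only a sketch:
it targets the development branch with NA-mask support, and the slice
assignment of np.NA is my reading of the NEP rather than something
shown in this thread.

import numpy as np

# Analogue of: arr <- rnorm(10); arr[5:8] <- NA
arr = np.random.randn(10)
arr = arr.view(maskna=True)    # have to opt in to the NA mask explicitly
arr[4:8] = np.NA               # R's 5:8 is 1-based and inclusive (assumed to work per the NEP)
print(arr)                     # masked positions display as NA
print(arr.mean())              # NA propagates, like mean() in R
print(arr.mean(skipna=True))   # like mean(arr, na.rm=TRUE) in R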

In your examples you had to pass maskna=True, so I suppose my only
recourse would be to make sure that every array inside a DataFrame,
for example, has maskna=True set. I'll have to look in more detail and
see whether that's feasible and desirable. There's a memory cost to
pay, but you can't get the functionality for free. I may just end up
sticking with NaN, as it has worked pretty well over the last few
years; it's an impure solution, but one with reasonably good
performance characteristics in the places that matter.
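
As a rough sketch of that NaN route (not pandas' actual internals, just
the general pattern, using only stock NumPy calls):

import numpy as np

# NaN as the missing-value marker: no special array flag needed,
# but it only works for float dtypes.
arr = np.random.randn(10)
arr[4:8] = np.nan

print(arr.mean())                    # nan: NaN propagates like NA
valid = ~np.isnan(arr)               # boolean mask of observed values
print(arr[valid].mean())             # skip the missing values by hand
print(np.nansum(arr) / valid.sum())  # same result via nansum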


