[Numpy-discussion] using NaN, INT_MIN etc in ndarray instead of a masked array

Tue Apr 18 07:06:22 EDT 2006

On 4/18/06, Travis Oliphant <oliphant.travis at ieee.org> wrote:
> Michael Sorich wrote:
> ...
> > Is it possible to implement masked values using these special bit
> > patterns in the ndarray instead of using a separate MA class? If so
> > has there been any thought as to whether this may be the better
> > option. I think it would be preferable if the ability to handle masked
> > data was available in the standard array class (ndarray), as this
> > would increase the likelihood that functions built for numeric arrays
> > will handle masked values well. It seems that ndarray already has
> > decent support for nans (isnan() returns the equivalent of a boolean
> > mask array), indicating that such an approach may be acceptable. How
> > difficult is it to generalise the concept to other data types (int,
> > string, bool)?
> >
> I don't think the approach can be generalized at all.   It would only
> work with floating-point values and therefore is not particularly exciting.
>
Not true. R supports "NA" for all its types except raw bytes.
For example:

> x<-logical(5)
> x
[1] FALSE FALSE FALSE FALSE FALSE
> x[1:2]=NA
> !x
[1]   NA   NA TRUE TRUE TRUE
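
For comparison, here is a rough sketch of the same operation using the
separate-mask approach. It assumes the masked array module is importable
as numpy.ma, as in current NumPy releases; the variable names are only
illustrative.

```python
import numpy as np
import numpy.ma as ma

# Five False values, with the first two marked as missing (like x[1:2] = NA in R).
x = ma.array(np.zeros(5, dtype=bool), mask=[True, True, False, False, False])

# Logical negation keeps the mask: masked entries stay masked,
# the remaining entries flip from False to True.
y = ma.logical_not(x)

print(y)
```

The missing entries survive the negation here as well, but only because the
data lives in a separate class with its own versions of the math functions.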

> I think ultimately, making masked arrays a C-based sub-class is where
> masked array should go.  For now the Python-based class is a good
> environment for developing the ideas behind how to preserve masked
> arrays through other functions if it is possible.
>
I've voiced my opposition to subclassing before.  Here I believe it is
more appropriate to have an add-on module that installs alternative
math functions. Having two classes in the same application that are
subtly different in the corner cases is already a problem with
ma.array vs. ndarray; adding a third class will only make things
worse.

> It seems that masked arrays must do things quite differently than other
> arrays on certain applications, and I'm not altogether clear on how to
> support them in all the NumPy code.  Because masked arrays are not used
> by everybody who uses NumPy arrays, it should be a separate sub-class.
>
As far as I understand, people who don't use MA don't deal with
missing values. For this category of users there will be no visible
effect no matter how missing values are treated as long as in the
absence of missing values, normal rules apply. Yes, many functions
must treat missing values differently, but the same is true for NaNs.
NumPy allows floating point arrays to have NaNs, but there is no real
support beyond what happens to work at the OS level.

For example:

>>> sort([5,nan,3,2])
array([ 5.        ,         nan,  2.        ,  3.        ])
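
A minimal sketch of what explicit handling could look like: use the
boolean mask from isnan() to set the missing entries aside before
sorting. This is only an illustration of the idea, not a proposal for
how sort() itself should behave.

```python
import numpy as np

a = np.array([5.0, np.nan, 3.0, 2.0])

# isnan() gives the equivalent of a boolean mask of missing values.
missing = np.isnan(a)

# Sort only the entries that are actually present.
sorted_valid = np.sort(a[~missing])

print(sorted_valid)
```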

Also, what is the justification for

>>> int_(nan)
0
?
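
One alternative, sketched under the assumption that the smallest
representable integer is reserved as the missing-value marker (the
convention R uses for integer NA): convert explicitly and map NaN to
that sentinel instead of silently producing 0. The names INT_NA and
to_int_with_na are hypothetical, not part of NumPy.

```python
import numpy as np

# Hypothetical sentinel: the smallest int32, reserved to mean "missing".
INT_NA = np.iinfo(np.int32).min

def to_int_with_na(values):
    """Convert floats to int32, mapping NaN to INT_NA (illustrative helper)."""
    values = np.asarray(values, dtype=float)
    missing = np.isnan(values)
    # Start with every slot marked missing, then fill in the valid entries.
    out = np.full(values.shape, INT_NA, dtype=np.int32)
    out[~missing] = values[~missing].astype(np.int32)
    return out

result = to_int_with_na([5.0, np.nan, 3.0])
print(result)
```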

> Ultimately, I hope we will get the basic array object into Python (what
> Tim was calling the super array) before 2.6

As far as I understand, that object will not come with arithmetic
rules or math functions.  Therefore, I don't see how this is relevant
to the present discussion.