[Numpy-discussion] feedback request: proposal to add masks to the core ndarray

Fri Jun 24 20:24:23 EDT 2011

On Fri, Jun 24, 2011 at 6:11 PM, Wes McKinney <wesmckinn at gmail.com> wrote:

> On Fri, Jun 24, 2011 at 8:02 PM, Charles R Harris
> <charlesr.harris at gmail.com> wrote:
> >
> >
> > On Fri, Jun 24, 2011 at 5:22 PM, Wes McKinney <wesmckinn at gmail.com>
> > wrote:
> >>
> >> On Fri, Jun 24, 2011 at 7:10 PM, Charles R Harris
> >> <charlesr.harris at gmail.com> wrote:
> >> >
> >> >
> >> > On Fri, Jun 24, 2011 at 4:21 PM, Matthew Brett
> >> > <matthew.brett at gmail.com> wrote:
> >> >>
> >> >> Hi,
> >> >>
> >> >> On Fri, Jun 24, 2011 at 10:09 PM, Benjamin Root <ben.root at ou.edu>
> >> >> wrote:
> >> >> ...
> >> >> > Again, there are pros and cons either way, and I see them as very
> >> >> > orthogonal and complementary.
> >> >>
> >> >> That may be true, but I imagine only one of them will be implemented.
> >> >>
> >> >> @Mark - I don't have a clear idea whether you consider the nafloat64
> >> >> option to be still in play as the first thing to be implemented
> >> >> (before array.mask). If it is, what kind of thing would persuade you
> >> >> either way?
> >> >>
> >> >
> >> > Mark can speak for himself, but I think things are tending towards
> >> > masks. They have the advantage of one implementation for all data
> >> > types, current and future, and they are more flexible since the masked
> >> > data can be actual valid data that you just choose to ignore for
> >> > experimental reasons.
> >> >
> >> > What might be helpful is a routine to import/export R files, but that
> >> > shouldn't be too difficult to implement.
> >> >
> >> > Chuck
> >> >
> >> >
> >> > _______________________________________________
> >> > NumPy-Discussion mailing list
> >> > NumPy-Discussion at scipy.org
> >> > http://mail.scipy.org/mailman/listinfo/numpy-discussion
> >> >
> >> >
> >>
> >> Perhaps we should make a wiki page someplace summarizing pros and cons
> >> of the various implementation approaches? I worry very seriously about
> >> adding API functions relating to masks rather than having special NA
> >> values which propagate in algorithms. The question is: will Joe Blow,
> >> former R user, have to understand what the mask is and how to work with
> >> it? If the answer is yes, we have a problem. If it can be completely
> >> hidden as an implementation detail, that's great. In R, NAs are just
> >> sort of inherent: they propagate, and you deal with them when you have to
> >> via the na.rm flag in functions or is.na.
> >>
> >
> > Well, I think both of those can be pretty transparent. Could you
> > illustrate some typical R usage, to wit:
> >
> > 1) setting a value to na
> > 2) checking a value for na
> >
> > Other things are problematic, like checking for integer overflow. For
> > safety that would be desirable, for speed not. I think that is a separate
> > question, however. In any case, if we do check such things we should be
> > able to set the corresponding mask value in the loop, and I suppose that
> > is the sort of thing you want.
> >
> > Chuck
> >
>
> I think anyone making decisions about this needs to have a pretty good
> understanding of what R does. So here are some examples, but you guys
> really need to spend some time with R if you have not already.
>
> arr <- rnorm(20)
> arr
>  [1]  1.341960278  0.757033314 -0.910468762 -0.475811935 -0.007973053
>  [6]  1.618201117 -0.965747088  0.386811224  0.229158237  0.987050613
> [11]  1.293453170 -2.432399045 -0.247593481 -0.639769586 -0.464996583
> [16]  0.720181047  0.846607030  0.486173088 -0.911247626  0.370326788
> arr[5:10] = NA
> arr
>  [1]  1.3419603  0.7570333 -0.9104688 -0.4758119         NA         NA
>  [7]         NA         NA         NA         NA  1.2934532 -2.4323990
> [13] -0.2475935 -0.6397696 -0.4649966  0.7201810  0.8466070  0.4861731
> [19] -0.9112476  0.3703268
> is.na(arr)
>  [1] FALSE FALSE FALSE FALSE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE FALSE FALSE
> [13] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
> mean(arr)
> [1] NA
> mean(arr, na.rm=T)
> [1] -0.01903945
>
> arr + rnorm(20)
>  [1]  2.081580297  0.505050028 -0.696287035 -1.280323279           NA
>  [6]           NA           NA           NA           NA           NA
> [11]  2.166078369 -1.445271291  0.764894624  0.795890929  0.549621207
> [16]  0.005215596 -0.170001426  0.712335355 -0.919671745 -0.617099818
>
> and obviously this is OK too:
>
> arr <- rep('wes', 10)
> arr[5:7] <- NA
> is.na(arr)
>  [1] FALSE FALSE FALSE FALSE  TRUE  TRUE  TRUE FALSE FALSE FALSE
>
> note, NA gets excluded from categorical variables (factors):
> as.factor(arr)
>  [1] wes  wes  wes  wes  <NA> <NA> <NA> wes  wes  wes
> Levels: wes
>
> e.g. groupby with NA:
>
> tapply(rnorm(10), arr, mean)
>       wes
> -0.5271853
>
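
For comparison, here is a rough sketch of how the R behaviour quoted above
might map onto a values-plus-mask representation. It uses plain Python lists
rather than any existing or proposed NumPy API: the class name MaskedVector
and its methods (set_na, is_na, mean, add) are made up for illustration, and
None stands in for R's NA in the return value of mean.

```python
import math


class MaskedVector:
    """Toy values-plus-mask container (hypothetical, not a NumPy API)."""

    def __init__(self, values, mask=None):
        self.values = [float(v) for v in values]
        # mask[i] is True when element i is treated as missing (NA).
        self.mask = list(mask) if mask is not None else [False] * len(self.values)

    def set_na(self, start, stop):
        """Mark elements start..stop-1 (0-based) as NA; the stored data is untouched."""
        for i in range(start, stop):
            self.mask[i] = True

    def is_na(self):
        """Roughly R's is.na(): one boolean per element."""
        return list(self.mask)

    def mean(self, na_rm=False):
        """Roughly R's mean(x, na.rm=...): NA propagates unless na_rm is True."""
        if not na_rm and any(self.mask):
            return None  # stands in for NA
        kept = [v for v, m in zip(self.values, self.mask) if not m]
        # An empty selection also returns None in this sketch.
        return math.fsum(kept) / len(kept) if kept else None

    def add(self, other):
        """Elementwise sum; the result is NA wherever either input is NA."""
        values = [a + b for a, b in zip(self.values, other.values)]
        mask = [ma or mb for ma, mb in zip(self.mask, other.mask)]
        return MaskedVector(values, mask)


arr = MaskedVector([1.0, 2.0, 3.0, 4.0, 5.0, 6.0])
arr.set_na(2, 4)                       # R: arr[3:4] = NA
print(arr.is_na())                     # R: is.na(arr)
print(arr.mean())                      # R: mean(arr)
print(arr.mean(na_rm=True))            # R: mean(arr, na.rm=T)
print(arr.add(MaskedVector([1.0] * 6)).is_na())
```

The point of the sketch is only that the user-facing calls need not mention
the mask at all; whether that can hold across a real API is the open question.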

I think those are all doable. The main concerns I have at the moment are:

1) Whether to track things like integer overflow, yes or no.
2) Memory. I suppose masks could be packed into bits if it came to that.
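
On point 2, a minimal sketch of what bit packing might look like, assuming
one bit per element stored in a bytearray. The class BitMask and its methods
are hypothetical and written in plain Python; nothing here is an existing
NumPy interface.

```python
class BitMask:
    """Boolean mask stored one bit per element (hypothetical helper)."""

    def __init__(self, length):
        self.length = length
        # ceil(length / 8) bytes, all bits clear (nothing masked).
        self.bits = bytearray((length + 7) // 8)

    def _locate(self, i):
        if not 0 <= i < self.length:
            raise IndexError("mask index out of range")
        return divmod(i, 8)  # (byte index, bit index within that byte)

    def set(self, i, masked=True):
        byte, bit = self._locate(i)
        if masked:
            self.bits[byte] |= 1 << bit
        else:
            self.bits[byte] &= ~(1 << bit) & 0xFF

    def get(self, i):
        byte, bit = self._locate(i)
        return bool((self.bits[byte] >> bit) & 1)


mask = BitMask(20)
for i in range(4, 10):      # same span as arr[5:10] = NA in the R example
    mask.set(i)
print(len(mask.bits))       # bytes used for 20 elements
```

Here 20 mask entries fit in 3 bytes instead of 20 at one byte per element;
the cost is a shift and a bitwise AND on every lookup.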

Chuck