[Numpy-discussion] missing data discussion round 2

Gary Strangman strang at nmr.mgh.harvard.edu
Thu Jun 30 12:04:12 EDT 2011


>       Clearly there are some overlaps between what masked arrays are
>       trying to achieve and what R's NA mechanisms are trying to achieve.
>        Are they really similar enough that they should function using
>       the same API?
> 
> Yes.
>
>       And if so, won't that be confusing?
> 
> No, I don't believe so, any more than NA's in R, NaN's, or Inf's are already
> confusing.

As one who's been silently following (most of) this thread, and a heavy R 
and numpy user, perhaps I should chime in briefly here with a use case. I 
more-or-less always work with partially masked data, like Matthew, but not 
numpy masked arrays because the memory overhead is prohibitive. And, sad 
to say, my experiments don't always go perfectly. I therefore have arrays 
in which there is /both/ (1) data that is simply missing (np.NA?)--it 
never had a value and never will--as well as simultaneously (2) data that 
is temporarily masked (np.IGNORE? np.MASKED?) where I want to 
mask/unmask different portions for different purposes/analyses. I consider 
these two separate, completely independent issues and I unfortunately 
currently have to kluge a lot to handle this.

Concretely, consider a list of 100,000 observations (rows), with 12 
measures per observation-row (a 100,000 x 12 array). Every now and then, 
sprinkled throughout this array, I have missing values (someone didn't 
answer a question, or a computer failed to record a response, or 
whatever). For some analyses I want to mask the whole row (e.g., 
complete-case analysis), leaving me with array entries that should be 
tagged with all 4 possible labels:

1) not masked, not missing
2) masked, not missing
3) not masked, missing
4) masked, missing

Obviously #4 is "overkill" ... but only until I want to unmask that row. 
At that point, I need to be sure that missing values remain missing when 
unmasked. Can a single API really handle this?
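
The four labels above can be sketched with two independent boolean arrays kept alongside the data. This is only an illustration of the use case using plain numpy, not an existing or proposed numpy API: the names `missing`, `masked`, `mask_incomplete_rows` and `label_counts` are hypothetical, NaN is used as a stand-in for a "never had a value" marker, and the array is smaller than the 100,000 x 12 example to keep it quick.

```python
import numpy as np

rng = np.random.default_rng(0)
n_obs, n_measures = 1000, 12          # stand-in for the 100,000 x 12 array
data = rng.normal(size=(n_obs, n_measures))

# (1) Permanently missing: never had a value and never will.
# Assumption: tracked in its own boolean array, with NaN stored in the data.
missing = np.zeros(data.shape, dtype=bool)
missing[3, 5] = True
missing[10, 0] = True
data[missing] = np.nan

# (2) Temporarily masked: toggled on and off per analysis.
masked = np.zeros(data.shape, dtype=bool)


def mask_incomplete_rows(missing):
    """Complete-case analysis: mask every entry of any row with a missing value."""
    incomplete = missing.any(axis=1)
    return np.broadcast_to(incomplete[:, None], missing.shape).copy()


def label_counts(masked, missing):
    """Count entries carrying each of the four possible labels."""
    return {
        "not masked, not missing": int((~masked & ~missing).sum()),
        "masked, not missing": int((masked & ~missing).sum()),
        "not masked, missing": int((~masked & missing).sum()),
        "masked, missing": int((masked & missing).sum()),
    }


# Mask whole rows for a complete-case analysis.
masked = mask_incomplete_rows(missing)
counts_masked = label_counts(masked, missing)

# Unmask everything; the missing entries must stay missing.
masked[:] = False
counts_unmasked = label_counts(masked, missing)
still_missing = bool(np.isnan(data[missing]).all())
```

Because the two arrays are independent, clearing `masked` never touches `missing`, which is the property asked for in case #4. Whether a single API can offer the same guarantee is the open question.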

-best
Gary



