[Numpy-discussion] NA/Missing Data Conference Call Summary

Bruce Southey bsouthey at gmail.com
Wed Jul 6 17:29:00 EDT 2011


On 07/06/2011 03:37 PM, Pierre GM wrote:
> On Jul 6, 2011, at 10:11 PM, Bruce Southey wrote:
>
>> On 07/06/2011 02:38 PM, Christopher Jordan-Squire wrote:
>>>
>>> On Wed, Jul 6, 2011 at 11:38 AM, Christopher Barker <Chris.Barker at noaa.gov> wrote:
>>> Christopher Jordan-Squire wrote:
>>>> If we follow those rules for IGNORE for all computations, we sometimes
>>>> get some weird output. For example:
>>>> [ [1, 2], [3, 4] ] * [ IGNORE, 7] = [ 15, 31 ]. (Where * is matrix
>>>> multiply and not * with broadcasting.) Or should that sort of operation
>>>> throw an error?
>>> That should throw an error -- matrix computation is heavily influenced
>>> by the shape and size of matrices, so I think IGNORES really don't make
>>> sense there.
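
For concreteness, here is a minimal Python sketch of one rule that
reproduces the example's output: a binary operation involving IGNORE
returns the other operand unchanged, so the IGNOREd entry's partner
still contributes to the dot-product sum. Modeling IGNORE as None and
the helper names (ignore_mul, matvec) are purely illustrative -- this
is a reconstruction, not either NEP's actual implementation.

    IGNORE = None

    def ignore_mul(x, y):
        # a multiply involving IGNORE is skipped: return the other operand
        if x is IGNORE:
            return y
        if y is IGNORE:
            return x
        return x * y

    def matvec(A, v):
        # ordinary matrix-vector product, using the IGNORE-aware multiply
        return [sum(ignore_mul(a, b) for a, b in zip(row, v)) for row in A]

    print(matvec([[1, 2], [3, 4]], [IGNORE, 7]))  # [15, 31]

This reproduces [15, 31]: the matrix's first column passes through the
multiplication untouched, which is exactly the weird output being
objected to above.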
>>>
>>>
>>>
>>> If the IGNOREs don't make sense in basic numpy computations, then I'm kinda confused about why they'd be included at the numpy core level.
>>>
>>>
>>> Nathaniel Smith wrote:
>>>> It's exactly this transparency that worries Matthew and me -- we feel
>>>> that the alterNEP preserves it, and the NEP attempts to erase it. In
>>>> the NEP, there are two totally different underlying data structures,
>>>> but this difference is blurred at the Python level. The idea is that
>>>> you shouldn't have to think about which you have, but if you work with
>>>> C/Fortran, then of course you do have to be constantly aware of the
>>>> underlying implementation anyway.
>>> I don't think this bothers me -- I think it's analogous to things in
>>> numpy like Fortran order and non-contiguous arrays -- you can ignore all
>>> that when working in pure python when performance isn't critical, but
>>> you need a deeper understanding if you want to work with the data in C
>>> or Fortran or to tune performance in python.
>>>
>>> So as long as there is an API to query and control how things work, I
>>> like that it's hidden from simple python code.
>>>
>>> -Chris
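
As a concrete illustration of the precedent Chris is pointing to:
memory layout is already invisible to casual Python code, but
queryable through the existing flags API when you do need to care
(e.g., before handing data to C or Fortran). A quick sketch:

    import numpy as np

    a = np.arange(6).reshape(2, 3)   # C-ordered by default
    f = np.asfortranarray(a)         # same values, Fortran layout

    print(a.flags['C_CONTIGUOUS'], a.flags['F_CONTIGUOUS'])  # True False
    print(f.flags['C_CONTIGUOUS'], f.flags['F_CONTIGUOUS'])  # False True
    print(np.array_equal(a, f))      # True -- identical at the Python level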
>>>
>>>
>>>
>>> I'm similarly not too concerned about it. Performance seems finicky when you're dealing with missing data, since a lot of arrays will likely have to be copied over to other arrays containing only complete data before being handed over to BLAS. My primary concern is that the np.NA stuff 'just works', especially since I've never run into use cases in statistics where the difference between IGNORE and NA mattered.
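
A sketch of the copy Christopher alludes to, with np.nan standing in
for NA (an assumption for illustration; the NEP's NA is a separate
mechanism): incomplete rows are compacted into a fresh, fully observed
array before any BLAS call.

    import numpy as np

    X = np.array([[1.0,    2.0],
                  [np.nan, 4.0],
                  [5.0,    6.0]])
    # keep only fully observed rows; boolean indexing makes the copy
    complete = X[~np.isnan(X).any(axis=1)]
    print(np.dot(complete, complete.T))  # safe to hand to BLAS now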
>>>
>>>
>> Exactly!
>> I have not been able to think of a real example where that difference matters, since the calculations are only on the 'valid' (i.e., non-missing and non-masked) values.
> In practice, they could be treated the same way (i.e., skipped). However, they are conceptually different, and one may wish to keep that distinction around (between NAs you never had and IGNOREs you just dropped temporarily).
>
I have yet to see these as *conceptually different* in any of the 
arguments given.
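
In computation, at least, they behave identically. A minimal sketch
using the existing numpy.ma machinery as a stand-in (an analogy only;
the NEP's NA is not implemented via numpy.ma): the mask records *that*
a value is missing, not *why*, and reductions simply skip it either way.

    import numpy.ma as ma

    x = ma.array([1.0, 2.0, 3.0, 4.0], mask=[False, True, False, True])
    print(x.mean())        # 2.0 -- computed over the unmasked values only
    print(x.compressed())  # [1. 3.] -- what downstream code actually sees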

Supporting separate NAs or IGNOREs, or any number of missing-value
codes, just requires us to avoid 'unmasking' those codes in the array
since, I presume, as with masked arrays, you need some placeholder values.
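
To make the placeholder point concrete, a sketch with numpy.ma (again
only as an analogy for whatever placeholder scheme a core
implementation would use): the data buffer keeps a stand-in value
behind every masked element, so clearing the mask quietly promotes it
to real data.

    import numpy.ma as ma

    x = ma.array([1.0, 99.0, 3.0], mask=[False, True, False])
    print(x.mean())     # 2.0 -- the 99.0 placeholder is ignored
    x.mask = ma.nomask  # 'unmasking' resurrects the placeholder as data
    print(x.mean())     # 34.33... -- the stand-in now pollutes the result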

Bruce