[Numpy-discussion] HPC missing data - was: NA/Missing Data Conference Call Summary

Wed Jul 6 14:49:48 EDT 2011

On 07/06/2011 04:47 PM, Matthew Brett wrote:
> Hi,
>
> On Wed, Jul 6, 2011 at 2:12 PM, Dag Sverre Seljebotn
> <d.s.seljebotn at astro.uio.no>  wrote:
>> I just commented on the "prevent direct API access to the masking array"
>> part -- I'm hoping direct access by external code to the underlying
>> implementation details will be allowed, at some point.
>>
>> What I'm saying is that Mark's proposal is more flexible. Say for the
>> sake of the argument that I have two codes I need to interface with:
>>
>>   - Library A is written in Fortran and uses a separate (explicit) mask
>> array for NA
>>
>>   - Library B runs on a GPU and uses a bit pattern for NA
>>
>> Mark's proposal then comes closer to allowing me to wrap both codes
>> using NumPy, since it supports both implementation mechanisms. Sure, it
>> would need a separate NEP down the road to extend it, but it goes in the
>> right direction for this to happen.
>
> I'm sorry - honestly - maybe it's because I've just had lunch, but I
> think I am not understanding something.   When you say "Mark's
> proposal is more flexible" - more flexible than what?  I think we
> agree that:
>
> * NA bitpatterns are good to have
> * masks are good to have
>
> and the discussion is about:
>
> * should it be possible to distinguish between bitpatterns (NAs) and
> masks (IGNORE).

I guess I just don't agree with these definitions. There's (NA, IGNORE), 
and there's (bitpatterns, masks); these are in principle orthogonal. It 
is possible (and perhaps reasonable) to hard-wire them the way you say
-- that may be more obvious, user-friendly, etc., but it is not more 
flexible.
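
To make the orthogonality concrete, here is a rough sketch in plain
NumPy. It is not taken from either proposal: the helper names `to_mask`
and `total` are made up for illustration, NaN stands in for an NA bit
pattern, and "NA" is assumed to mean "propagates through a reduction"
while "IGNORE" is assumed to mean "skipped by a reduction".

```python
import numpy as np


def to_mask(values):
    """Turn a bit-pattern representation (NaN sentinel) into (data, mask).

    Hypothetical helper: True in the mask marks a missing element.
    """
    mask = np.isnan(values)
    return np.where(mask, 0.0, values), mask


def total(data, mask, semantics):
    """Sum under the assumed NA (propagate) or IGNORE (skip) semantics."""
    if semantics == "NA":
        return float("nan") if mask.any() else float(data.sum())
    if semantics == "IGNORE":
        return float(data[~mask].sum())
    raise ValueError(f"unknown semantics: {semantics!r}")


# Representation 1: the missing value lives in the data as a bit pattern.
bitpattern = np.array([1.0, np.nan, 3.0])

# Representation 2: a separate, explicit mask next to ordinary data.
data = np.array([1.0, 99.0, 3.0])
mask = np.array([False, True, False])

representations = {
    "bitpattern": to_mask(bitpattern),
    "mask": (data, mask),
}

# All four (representation, semantics) combinations are meaningful.
results = {
    (name, semantics): total(d, m, semantics)
    for name, (d, m) in representations.items()
    for semantics in ("NA", "IGNORE")
}
```

Under these assumptions both representations give NaN for "NA" and 4.0
for "IGNORE", so the choice of storage does not by itself fix the
semantics.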

Both Mark and Chuck have explicitly supported having many different NA 
types down the road (thread: "An NA compromise idea -- many-NA"). So the 
main difference to me seems to be that you want to hard-wire the NA type 
and the representation in a specific configuration.

I may be missing something though.

> Are you saying that making it not-possible to distinguish - at the
> numpy level, is more flexible?

I'm OK with the "common" ways of accessing data not distinguishing, as
long as there's some power-user way around it. Just like strides -- you
index a strided array just like a contiguous array, but you can peek
inside the implementation if you want.
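
The strides analogy can be shown with NumPy as it exists today. This is
only a small illustration of the analogy, not a proposal for how a mask
API should look; the variable names are arbitrary.

```python
import numpy as np

base = np.arange(12, dtype=np.int64)

contiguous = base[:6].copy()  # owns its own contiguous buffer
strided = base[::2]           # view onto every second element of base

# The "common" access path is identical for both arrays.
third_contiguous = contiguous[2]
third_strided = strided[2]

# The power-user path exposes the underlying layout.
contiguous_strides = contiguous.strides        # byte step between elements
strided_strides = strided.strides              # twice the item size
strided_is_contiguous = strided.flags["C_CONTIGUOUS"]
strided_shares_base = strided.base is base     # view, not a copy
```

Ordinary indexing never needs to know which of the two layouts it is
dealing with, yet `strides`, `flags` and `base` remain available to code
that has to hand the buffer to Fortran or a GPU.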

Dag Sverre