[Numpy-discussion] feedback request: proposal to add masks to the core ndarray

Fri Jun 24 11:14:59 EDT 2011

On Fri, Jun 24, 2011 at 10:07, Laurent Gautier <lgautier at gmail.com> wrote:
> On 2011-06-24 16:43, Robert Kern <robert.kern at gmail.com> wrote:
>>
>> On Fri, Jun 24, 2011 at 09:33, Charles R Harris
>> <charlesr.harris at gmail.com> wrote:
>>>
>>> On Fri, Jun 24, 2011 at 8:06 AM, Robert Kern <robert.kern at gmail.com>
>>> wrote:
>>>>
>>>> The alternative proposal would be to add a few new dtypes that are
>>>> NA-aware. E.g. an nafloat64 would reserve a particular NaN value
>>>> (there are lots of different NaN bit patterns, we'd just reserve one)
>>>> that would represent NA. An naint32 would probably reserve the most
>>>> negative int32 value (like R does). Using the NA-aware dtypes signals
>>>> that you are using NA values; there is no need for an additional flag.
>>>
>>>
>>> Definitely better names than r-int32. Going this way has the advantage of
>>> reducing the friction between R and numpy, and since R has pretty much
>>> become the standard software for statistics that is an important
>>> consideration.
>>
>> I would definitely steal their choices of NA value for naint32 and
>> nafloat64. I have reservations about their string NA value (i.e. 'NA')
>> as anyone doing business in North America and other continents may
>> have issues with that....
>
> Maybe there is not so much need for reservations about the string NA, when
> making the distinction between:
> a- the internal representation of a "missing string" (what is stored in
> memory, and that C-level code would need to be aware of)
> b- the 'external' representation of a missing string (in Python, what would
> be returned by repr() )
> c- what is assumed to be a missing string value when reading from a file.
>
> a/ is not 'NA', c/ should be a parameter in the relevant functions, b/ can
> be configured as a module-level, class-level, or instance-level variable.

In R, a/ happens to be 'NA', unfortunately. :-/
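
The numeric NA values are a different story. As far as I recall, R uses
INT_MIN for integers and a NaN whose low 32-bit word is 1954 for doubles.
Here is a rough sketch of that scheme in plain Python, under those
assumptions; the helper names are made up for illustration and are not a
proposed numpy API:

```python
import math
import struct

# Assumed R conventions (from memory, not checked against the R sources):
# NA_integer_ is INT_MIN; NA_real_ is a NaN whose low 32-bit word is 1954.
NA_INT32 = -2**31
NA_PAYLOAD = 1954
NA_FLOAT64_BITS = 0x7FF0000000000000 | NA_PAYLOAD


def float_to_bits(x):
    """Reinterpret a Python float as its 64-bit IEEE 754 pattern."""
    return struct.unpack("<Q", struct.pack("<d", x))[0]


def bits_to_float(bits):
    """Reinterpret a 64-bit pattern as a Python float."""
    return struct.unpack("<d", struct.pack("<Q", bits))[0]


NA_FLOAT64 = bits_to_float(NA_FLOAT64_BITS)


def is_na_float64(x):
    """True only for the reserved NaN, not for ordinary NaNs.

    Only the low word is compared, since hardware may flip the
    quiet/signaling bit in the high word.
    """
    return math.isnan(x) and (float_to_bits(x) & 0xFFFFFFFF) == NA_PAYLOAD


def is_na_int32(i):
    """True for the reserved most-negative int32 value."""
    return i == NA_INT32
```

The point is that NA_FLOAT64 is still a NaN as far as the FPU is
concerned, while an ordinary float("nan") is not mistaken for NA.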

I'm not really sure how they handle datasets that use valid 'NA'
values. Presumably, their input routines allow one to convert such
values to something else so that R can keep using 'NA'==NA internally.
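
For what it's worth, the a/ b/ c/ split Laurent describes could look
something like the following. This is only a sketch under my own
assumptions: _MissingString, read_column and na_strings are hypothetical
names, not anything that exists in numpy or R.

```python
class _MissingString:
    """a/ Internal marker for a missing string; deliberately not the text 'NA'."""

    # b/ External representation, configurable at class level.
    display = "NA"

    def __repr__(self):
        return self.display


NA = _MissingString()


def read_column(text, na_strings=("NA",)):
    """c/ Which tokens in the input count as missing is a reader parameter."""
    return [NA if token in na_strings else token for token in text.split()]
```

With this, read_column("NA US -", na_strings=("-",)) keeps 'NA' as an
ordinary string (North America, say) and maps only '-' to the missing
marker, while the default treats the token 'NA' as missing.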

-- 
Robert Kern

"I have come to believe that the whole world is an enigma, a harmless
enigma that is made terrible by our own mad attempt to interpret it as
though it had an underlying truth."
  -- Umberto Eco