[Numpy-discussion] NA masks in the next numpy release?

Fri Oct 28 13:49:31 EDT 2011

Hi,

On Fri, Oct 28, 2011 at 9:21 AM, Chris.Barker <Chris.Barker at noaa.gov> wrote:
> On 10/27/11 7:51 PM, Travis Oliphant wrote:
>> As I mentioned, I find the ability to separate an ABSENT idea from an
>> IGNORED idea convincing. In other words, I think distinguishing between
>> masks and bit-patterns is not just an implementation detail, but
>> provides a useful concept for multiple use-cases.
>
> Exactly -- while one can implement ABSENT with a mask, one cannot
> implement IGNORE with a bit-pattern. So it is not an implementation detail.
>
> I also think bit-patterns are a bit of a dead end:
>
> - there is only a standard for one data type family: i.e. NaN for IEEE
> float types
>
> - So we would be coming up with our own standard (or adopting an
> existing one, but I don't think there is one widely supported) for other
> types. This means:
>   1) a lot of work to do

Most negative integer for ints / largest integer for uints / not
allowed for bool?
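
A minimal sketch of that convention, only to make the suggestion concrete.
The helper name `na_sentinel` is hypothetical and not part of numpy; it
assumes the most negative value for signed ints, the largest value for
unsigned ints, NaN for floats, and no NA value at all for bool.

```python
import numpy as np

def na_sentinel(dtype):
    """Return a hypothetical NA bit-pattern for `dtype` (illustration only)."""
    dtype = np.dtype(dtype)
    if dtype.kind == "i":   # signed ints: reserve the most negative value
        return np.iinfo(dtype).min
    if dtype.kind == "u":   # unsigned ints: reserve the largest value
        return np.iinfo(dtype).max
    if dtype.kind == "f":   # floats: NaN is the existing IEEE convention
        return dtype.type(np.nan)
    # bool (and anything else) has no spare value to reserve
    raise TypeError(f"no NA bit-pattern for dtype {dtype}")

a = np.array([3, na_sentinel(np.int8), 7], dtype=np.int8)
is_na = a == na_sentinel(np.int8)   # elementwise test against the sentinel
```

Note that the equality test only works for the integer cases; NaN would
need `np.isnan` instead, since NaN never compares equal to itself.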

>   2) a binary format incompatible with other code, compilers, etc. This
> is a BIG deal -- a major strength of numpy is that it serves as a
> wrapper for a data block that is compatible with C, Fortran or whatever
> code -- special bit patterns would make this a lot harder.

Extension code is going to get harder.   At the moment, as far as I
understand it, our extension code can receive a masked array and
(without an explicit check from us) ignore the mask and process all
the values.  Then you're in the unfortunate situation of caring what's
under the mask.

Bit-patterns would, I imagine, be safer in that respect: they would
be new dtypes, so extension code would by default reject them as
unknown.
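
To illustrate the mask-ignoring problem, here is a small sketch that uses
the existing `numpy.ma` masked arrays as a stand-in. That is an assumption
for the sake of the example: the NA-mask design under discussion is a
different mechanism and may behave differently. `extension_sum` is a
hypothetical placeholder for extension code that coerces its input to a
plain ndarray and never checks for a mask.

```python
import numpy as np

def extension_sum(obj):
    # Stand-in for extension code: coerce to a plain ndarray and sum,
    # without ever checking whether the input carried a mask.
    return float(np.asarray(obj).sum())

m = np.ma.masked_array([1.0, 2.0, 100.0], mask=[False, False, True])

masked_total = float(m.sum())     # respects the mask
leaked_total = extension_sum(m)   # includes the value under the mask
```

The two totals differ because the value under the mask is silently
included in the second one, and nothing raises an error to say so.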

> We also talked about the fact that an 8-bit mask provides the ability to
> carry other information in the mask -- not just "missing" or "ignored",
> but a handful of other possible reasons for masking. I think that has a
> lot of possibilities.
>
> On 10/28/11 2:11 AM, Stéfan van der Walt wrote:
>> Another data point:  I've been spending some time on scikits-image
>> recently, and although masked values would be highly useful in that
>> context, the cost of doubling memory use (for uint8 images, e.g.) is
>> too high.
>
>> 2) that we make a concerted effort to implement the bitmask mode of
>> operation as soon as possible.
>
> I wonder if that might be handled as a scikits-image extension, rather
> than core numpy?

I think Stéfan and Nathaniel and Gary Strangman and others are saying
we don't want to pay the price of a large memory hike for masking.  I
suspect that Nathaniel is right, and that a large majority of those of
us who want 'missing data' functionality also want what we've called
ABSENT missing values, and care about memory.
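
As a rough illustration of the memory cost, the sketch below compares a
data array with a separate one-byte-per-element mask. The helper
`mask_overhead` and the 512x512 shape are made up for the example, and the
one-byte mask element is an assumption about how the mask is stored.

```python
import numpy as np

def mask_overhead(dtype, shape=(512, 512)):
    """Mask size as a fraction of data size, for a one-byte-per-element mask."""
    data = np.zeros(shape, dtype=dtype)
    mask = np.zeros(shape, dtype=np.bool_)   # one byte per element
    return mask.nbytes / data.nbytes

uint8_overhead = mask_overhead(np.uint8)      # mask is as large as the data
float64_overhead = mask_overhead(np.float64)  # mask is 1/8 of the data
```

Under that assumption the mask doubles the footprint of a uint8 image but
adds only an eighth to a float64 array, which is why small dtypes are the
painful case.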

See you,

Matthew