[Numpy-discussion] NA masks in the next numpy release?

Lluís xscript at gmx.net
Fri Oct 28 15:15:01 EDT 2011


I haven't actually tested the code, but AFAIK the following is a short overview
with examples of how the two orthogonal feature axes (ABSENT/IGNORED and
PROPAGATE/SKIP) are related and how it is all supposed to work.

I have never talked to Mark or anybody else on this list (that is, outside of
this list), so I may well be mistaken. Sorry if there are any inaccuracies or
if you are already aware of what I'm describing here.

So please tell me if this has helped clarify why I (and I hope others) think the
implementation mechanism is independent of the semantics.


Lluis



ABSENT vs IGNORED
=================

Travis Oliphant writes:
> As I mentioned.   I find the ability to separate an ABSENT idea from an IGNORED idea convincing.    In other words, I think distinguishing between masks
> and bit-patterns is not just an implementation detail, but provides a useful concept for multiple use-cases.

I think it's an implementation detail as long as you have two clear ways of
separating them.

Summarizing: let's forget for a moment that "mask" has a meaning in English:
             - "maskna" corresponds to ABSENT
             - "ownmaskna" corresponds to IGNORED

The problem here is that of the two implementation mechanisms (masks and
bitpatterns), only the first can provide both semantics.


Let's start with an array that already supports NAs:

In [1]: a = np.array([1, 2, 3], maskna = True)



ABSENT (destructive NA assignment)
----------------------------------

Once you assign NA, even if you're using NA masks, the value seems to be lost
forever (i.e., the assignment is destructive regardless of the value):

In [2]: b = a.view()
In [3]: c = a.view(maskna = True)
In [4]: b[0] = np.NA
In [5]: a
Out[5]: array([NA, 2, 3])
In [6]: b
Out[6]: array([NA, 2, 3])
In [7]: c
Out[7]: array([NA, 2, 3])


This is the default behaviour, and is probably what the regular user expects,
given what they have learned from previous uses of the "view" method.

Note that here "maskna" acts as an idempotent operation. Once an array has the
"maskna" property, all its views will transitively (and destructively) use it.

Also note that an array copy will make a copy of both "regular" data and NA
values, as expected.



IGNORED (non-destructive NA assignment)
---------------------------------------

But you can also have non-destructive NA assignments, although *only* if you
explicitly (and thus purposefully) ask for it -> ownmaskna

In [8]: b = a.view(ownmaskna = True)
In [9]: b[1] = np.NA
In [10]: a
Out[10]: array([NA, 2, 3])
In [11]: b
Out[11]: array([NA, NA, 3])
In [12]: a[2] = np.NA
In [13]: a
Out[13]: array([NA, 2, NA])
In [14]: b
Out[14]: array([NA, NA, 3])


The mask is a copy:

In [15]: a[0] = 1
In [16]: a
Out[16]: array([1, 2, 3], maskna = True)
In [17]: b
Out[17]: array([NA, NA, 3])


But the data itself is not (i.e., assignments of non-NA values are *always*
destructive, but I think this is out of the scope of this discussion):

In [17]: a[0] = -10
In [18]: a[2] = -30
In [19]: a
Out[19]: array([-10, 2, -30], maskna = True)
In [20]: b
Out[20]: array([NA, NA, -30])
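
To make the difference between a shared mask and an owned mask concrete without
relying on the (untested) NA API above, here is a minimal sketch of a toy
model. Everything in it (the "ToyNAArray" class, "ownmask", "set_na",
"set_value") is made up for illustration and is not part of NumPy; it only
assumes that an NA-enabled array is a data buffer plus a boolean mask.

```python
import numpy as np


class ToyNAArray:
    """Toy model (not the NumPy API): a data buffer paired with a boolean NA mask."""

    def __init__(self, data, mask=None):
        self.data = data
        self.mask = np.zeros(data.shape, dtype=bool) if mask is None else mask

    def view(self, ownmask=False):
        # The data buffer is always shared; the mask is shared unless ownmask is set.
        mask = self.mask.copy() if ownmask else self.mask
        return ToyNAArray(self.data, mask)

    def set_na(self, index):
        # Assigning NA only flips the mask bit; the stored value is untouched.
        self.mask[index] = True

    def set_value(self, index, value):
        # Assigning a regular value writes to the shared data and unmasks the element.
        self.data[index] = value
        self.mask[index] = False

    def tolist(self):
        return ["NA" if m else v
                for v, m in zip(self.data.tolist(), self.mask.tolist())]


a = ToyNAArray(np.array([1, 2, 3]))
b = a.view()               # shares a's mask (ABSENT)
b.set_na(0)                # visible through a as well
c = a.view(ownmask=True)   # private copy of the mask (IGNORED)
c.set_na(1)                # visible only through c
a.set_value(2, -30)        # regular values always go to the shared data

print(a.tolist())  # ['NA', 2, -30]
print(b.tolist())  # ['NA', 2, -30]
print(c.tolist())  # ['NA', 'NA', -30]
```

In this model the ABSENT/IGNORED distinction comes down to whether a view
aliases the mask or copies it, which is why I consider it a property of how the
view is requested rather than of the storage mechanism.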



The dark corner
---------------

The only potential misunderstanding can be the creation of a NA-masked array
from a "regular" array.

This is precisely why I put this case at the end, as it seems to break the
intuition some people have about assignment being always destructive (unless you
explicitly ask for IGNORED, which is not the case):

In [21]: a = np.array([1, 2, 3])
In [22]: b = a.view(maskna = True)
In [23]: b[0] = np.NA
In [24]: a
Out[24]: array([1, 2, 3])
In [25]: b
Out[25]: array([NA, 2, 3])


This is in fact a corner case, and there is no obvious (and efficient!) way to
handle it. As "a" is just a "regular" array, and has no support for any type of
NA values (neither masks nor bit-patterns), assignments to any of its views
cannot, in any case, be destructive.

Note that the above holds because the current design forbids the on-the-fly
conversion from "regular" to "NA-enabled" arrays.
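
As a rough analogy only, the existing numpy.ma module shows the same effect
when a masked view is taken of a regular array. This is not the NA-mask API
discussed here; it just illustrates that a mask attached to a view cannot reach
back into an array that has no mask of its own.

```python
import numpy as np

a = np.array([1, 2, 3])          # regular array, no mask of its own
b = a.view(np.ma.MaskedArray)    # masked view over the same data buffer
b[0] = np.ma.masked              # masks element 0 in the view only

print(a)  # [1 2 3]
print(b)  # [-- 2 3]
```

The data buffer is still shared, so only the "missing" marker is local to the
view.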


In fact, I forgot to mention that, when reading the docs in [1], I thought that
a slight change could make it all feel more consistent: the view of a regular
array could have NA values only if "ownmaskna" is used (IGNORED/non-destructive
NA assignments), and using "maskna" as in entry number 22 would give an error.

[1] http://docs.scipy.org/doc/numpy/reference/arrays.maskna.html#creating-na-masked-views



PROPAGATE vs SKIP
=================

I've also read some comments regarding this. Maybe I didn't explain myself
correctly in previous mails, or maybe I just misunderstood other people's mails
(which might not be about this at all).


PROPAGATE
---------

All ufuncs in ndarray propagate NA values.

Note that ABSENT (destructive NA-assignment) is also a default, so we could say
that the default is R-like behaviour (AFAIK).


SKIP
----

You have a different array type (let's call it skip_array), where all ufuncs do
*not* propagate NA values.
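
For a feel of the two behaviours with functions that exist in NumPy today, NaN
can stand in for NA. This is only an analogy (NaN is a floating-point value,
not the NA discussed here): np.sum propagates it, while np.nansum skips it.

```python
import numpy as np

values = np.array([1.0, np.nan, 3.0])  # NaN plays the role of NA here

propagated = np.sum(values)   # PROPAGATE: the missing value poisons the result
skipped = np.nansum(values)   # SKIP: the missing value is left out of the sum

print(propagated)  # nan
print(skipped)     # 4.0
```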


Middle-ground
-------------

For the sake of code maintainability (and the specific needs one might have on a
per-ufunc basis), in fact you only have one type of ndarray that supports both
PROPAGATE and SKIP with the very same NA values.

This can be controlled on a per-ufunc basis through the "skipna" argument that
is present on all ufuncs, so that ndarray defaults to "skipna = False" and
skip_array defaults to "skipna = True".

The latter is done by simply defining an ndarray subclass that provides a ufunc
wrapper like this (fake code):

        class skip_array(np.ndarray):
            ...
            def __ufunc_wrap__(self, ufunc, *args, **kwargs):
                # Default to SKIP; an explicit skipna argument still wins.
                kwargs.setdefault("skipna", True)
                return ufunc(*args, **kwargs)

There are other ways of doing it, but IMHO how it can be done doesn't matter
right now.
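
One such way, sketched under loud assumptions: the "__array_ufunc__" protocol
that was added to NumPy after this discussion lets an ndarray subclass
intercept ufunc calls. There is no "skipna" argument on real ufuncs, so the
sketch below uses NaN as a stand-in for NA and emulates skipping by replacing
NaN with the ufunc's identity before a reduction. The "SkipArray" name is
hypothetical, and "out" arguments are not handled.

```python
import numpy as np


class SkipArray(np.ndarray):
    """Illustrative subclass: ufunc reductions skip NaN (standing in for NA)."""

    def __array_ufunc__(self, ufunc, method, *inputs, **kwargs):
        # Drop the subclass so the inner ufunc call does not re-enter this hook.
        plain = [x.view(np.ndarray) if isinstance(x, SkipArray) else x
                 for x in inputs]
        if method == "reduce" and ufunc.identity is not None:
            # Replace NaN by the ufunc's identity (0 for add, 1 for multiply),
            # which has the same effect as skipping those elements.
            plain[0] = np.where(np.isnan(plain[0]), ufunc.identity, plain[0])
        return getattr(ufunc, method)(*plain, **kwargs)


data = np.array([1.0, np.nan, 3.0])
skipping = data.view(SkipArray)   # same values, SKIP semantics for reductions

print(np.add.reduce(data))      # nan  (plain ndarray: PROPAGATE)
print(np.add.reduce(skipping))  # 4.0  (subclass: SKIP)
```

Both arrays hold exactly the same values; only the default behaviour of the
reduction differs, which is the point of the middle ground described above.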

-- 
 "And it's much the same thing with knowledge, for whenever you learn
 something new, the whole world becomes that much richer."
 -- The Princess of Pure Reason, as told by Norton Juster in The Phantom
 Tollbooth


