[Numpy-discussion] NA/Missing Data Conference Call Summary

Tue Jul 5 19:46:27 EDT 2011

Here's a short-ish summary of the topics discussed in the conference call
this afternoon. WARNING: I try to give examples for everything discussed to
make it as concrete as possible. However, most of the examples were not
explicitly discussed during the conference. I apologize in advance if I
mischaracterize anyone's arguments, and please jump in to correct me if I
did.

Participants: Travis Oliphant, Mark Wiebe, Matthew Brett, Nathaniel Smith,
Pierre GM, Ben Root, Chuck Harris, Wes McKinney, Chris Jordan-Squire

First, areas of broad agreement:
*There should be more functionality for missing data
*There should be dtypes which support missing data ('parameterized dtypes'
in the current NEP)
*A 'where' semantic should be added to ufuncs
*The same data should be able to have different sets of missing elements in
different views
*It should be easy for non-expert numpy users
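
To make the 'where' item concrete: numpy releases made after this call accept
a `where=` keyword on ufuncs. The sketch below only illustrates the idea; it
was not shown on the call, and the array values are made up.

```python
import numpy as np

a = np.array([1.0, 2.0, 3.0, 4.0])
b = np.array([10.0, 20.0, 30.0, 40.0])
use = np.array([True, False, True, False])  # elements the ufunc should touch

# Entries where `use` is False keep whatever `out` already held (zeros here).
out = np.zeros_like(a)
np.add(a, b, out=out, where=use)
print(out)
```

Passing an explicit `out` matters: without it, the skipped entries of the
result are left uninitialized.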

Since Mark is only around Austin until early August, there's
also broad agreement that we need to get something done quickly. However,
the numpy community (and Travis in particular) is balancing this against
the risk of shipping a sub-optimal solution which can't be taken back.

BIT PATTERN & MASK IMPLEMENTATIONS FOR NA
------------------------------------------------------------------------------------------

The current NEP proposes both mask and bit pattern implementations for
missing data. I use the terms bit pattern and parameterized dtype
interchangeably, since the parameterized dtype will use a bit pattern for
its implementation. The two implementations will support the same
functionality with respect to NA, and the implementation details will be
largely invisible to the user. Their differences are in the 'extra' features
each supports.

Two common questions were:
1. Why make two implementations of missing data: one with masks and the
other with parameterized dtypes?
2. Why does the implementation using masks have higher priority?

The answers are:
1.  The mask implementation is more general and easier to implement and
maintain.  The bit pattern implementation saves memory, makes
interoperability easier, and makes ABI (Application Binary Interface)
compatibility easier. Since each has different strengths, the argument is
that both should be implemented.
2. The implementation for the parameterized dtypes will rely on the
implementation using a mask.

NA VS. IGNORE
---------------------------------

A lot of discussion centered on IGNORE vs. NA types. We take IGNORE in the
aNEP sense and NA in the NEP sense. With NA, there is a clear notion of how NA
propagates through all basic numpy operations.  (e.g., 3+NA=NA and log(NA) =
NA, while NA | True = True.) IGNORE is separate from NA, with different
interpretations depending on the use case.
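
The propagation rules quoted above can be mimicked with a toy object. This is
only a sketch of the semantics, not the NEP's implementation; `NAType`, `NA`
and `na_log` are hypothetical names made up for the illustration.

```python
import math

class NAType:
    """Toy stand-in for a propagating NA value (hypothetical, for illustration)."""

    def __repr__(self):
        return "NA"

    # Arithmetic with an unknown value is itself unknown.
    def __add__(self, other):
        return self
    __radd__ = __add__

    # NA | True is True whatever NA hides; NA | False stays unknown.
    def __or__(self, other):
        return True if other is True else self
    __ror__ = __or__

NA = NAType()

def na_log(x):
    """log() that propagates NA instead of raising."""
    return NA if x is NA else math.log(x)

print(3 + NA, na_log(NA), NA | True)
```

The logical-or case is the interesting one: the result is known whenever it
does not depend on the missing value.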

IGNORE could mean:
1. Data that is being temporarily ignored. e.g., a possible outlier that is
temporarily being removed from consideration.
2. Data that cannot exist. e.g., a matrix representing a grid of water
depths for a lake. Since the lake isn't square, some entries will represent
land, and so depth will be a meaningless concept for those entries.
3. Using IGNORE to signal a jagged array. e.g., [ [1, 2, IGNORE], [IGNORE,
3, 4] ] should behave exactly the same as [ [1 , 2] , [3 , 4] ]. Though this
leaves open how [1, 2, IGNORE] + [3 , 4] should behave.

Because of these different uses of IGNORE, it doesn't have as clear a
theoretical interpretation as NA. (For instance, what is IGNORE+3, IGNORE*3,
or IGNORE | True?)

But several of the discussants thought the use cases for IGNORE were very
compelling. Specifically, they wanted to be able to use IGNORE's and NA's
simultaneously while still being able to differentiate between them: for
example, designating some data as IGNORE while still being able to
determine which data was NA but not IGNORE. The current NEP does not allow
for this directly, although in some cases it can be done indirectly via
views (by taking a view of the original data, expanding the set of values
which are considered NA in the view, and then comparing with the original
data to see whether the NA is in the original or not). Since both are
possible in this sense, Mark's NEP makes it so IGNORE is allowed but isn't
the default.
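
A rough sketch of the view-based workaround, using plain boolean arrays to
stand in for the masks. The NEP's actual view API is not shown here; all names
and values below are assumptions for illustration.

```python
import numpy as np

data = np.array([1.5, 2.5, 3.5, 4.5])

# Original array: element 1 is NA.
na_mask = np.array([False, True, False, False])

# The "view" shares `data` but hides one more element (the IGNORE-like use).
view_mask = na_mask | np.array([False, False, True, False])

# Missing in the view but present in the original: temporarily ignored.
ignored_only = view_mask & ~na_mask
# Missing in both: NA in the original data.
truly_na = view_mask & na_mask

print(ignored_only, truly_na)
```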

Another important point from the current NEP is that not being able to
access values considered missing, even if the implementation of missingness
is via a mask, is a feature and not a bug. It is a feature because if the
data is missing then, conceptually, neither the user nor any function the
user calls should be able to obtain that data. This is precisely why the
indirect route, via views of the original data, is required to access data
that a different view says is missing.

The current NEP treats all NA's the same. The reasoning is that, regardless
of where the NA originated, the functions the numpy array is fed into will
either ignore all NA's or propagate them (i.e. not ignore them). The choice
between these two behaviors is made when calling a ufunc, by setting the
skipna ufunc parameter to True or False. Since the NA's are treated the
same, their source is irrelevant. This could be argued against, though, if
there are compelling cases where IGNORE and NA should be treated differently.
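
The two behaviors can be sketched with a small helper. Only the `skipna` name
comes from the discussion; `masked_sum`, its signature, and the use of NaN as
a stand-in for an NA result are assumptions made for this illustration.

```python
import numpy as np

def masked_sum(data, missing, skipna=False):
    """Sum `data`; `missing` is a boolean array, True where the value is NA."""
    if skipna:
        # Ignore NA entries and sum whatever is left.
        return data[~missing].sum()
    # Propagate: any NA input makes the result NA (NaN stands in for NA here).
    return np.nan if missing.any() else data.sum()

data = np.array([1.0, 2.0, 4.0])
missing = np.array([False, True, False])

print(masked_sum(data, missing, skipna=True))
print(masked_sum(data, missing, skipna=False))
```

Note that the helper never asks why an element is missing, which is the point
of the paragraph above.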

A possible solution to the above desires for an IGNORE notion of missingness
is to allow for multiple types of missing values. For example, the mask
underlying the missing data could have an integer type, with different values
meaning different kinds of missingness, e.g. 0 is present, 1 is NA, 2 is
IGNORE. However, this
was only discussed briefly at the end of the conference call, and should be
discussed further.
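
As a sketch of what an integer-valued mask might look like (this was not
worked out on the call; the constant names and the choice of uint8 are
assumptions):

```python
import numpy as np

PRESENT, NA, IGNORE = 0, 1, 2   # codes from the example above

data = np.array([7.0, 8.0, 9.0, 10.0])
state = np.array([PRESENT, NA, IGNORE, PRESENT], dtype=np.uint8)

is_missing = state != PRESENT    # what code that only knows "missing" sees
is_na = state == NA              # missing because the value is unknown
is_ignored = state == IGNORE     # missing because it is being skipped

visible_total = data[state == PRESENT].sum()
print(is_missing, is_na, is_ignored, visible_total)
```

Code that does not care about the distinction can collapse the codes to a
boolean mask, while code that does care can still tell NA from IGNORE.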

HOW DOES THIS RELATE TO THE CURRENT MASKED ARRAY?
----------------------------------------------------------------------------------------------------

Everyone seems to agree they'd love it if this could encompass all current
use cases of the numpy.ma arrays, so numpy.ma arrays could be deprecated.
(However they wouldn't be eliminated for several years, even in the most
optimistic scenarios.)

IMPLEMENTATION DETAILS
-----------------------------------------------------

*Under the hood, the parameterized dtypes will use buffered masks when
performing operations. This can be a source of confusion when discussing
their behavior, since there is no true mask, hence no extra memory, but a
mask is created on the fly.
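
One way to picture the "mask created on the fly" is the sketch below. It
assumes an int32 array in which the minimum integer is reserved as the NA bit
pattern; the sentinel choice, the function name and the buffer size are all
assumptions for illustration, not details from the NEP.

```python
import numpy as np

NA_BITS = np.iinfo(np.int32).min   # assumed bit pattern reserved to mean NA

def buffered_na_sum(values, bufsize=4):
    """Sum non-NA entries, building a temporary mask one buffer at a time."""
    total = 0
    for start in range(0, len(values), bufsize):
        chunk = values[start:start + bufsize]
        mask = chunk != NA_BITS          # short-lived mask, True where valid
        total += int(chunk[mask].sum())
    return total

values = np.array([1, 2, NA_BITS, 4, 5, NA_BITS, 7], dtype=np.int32)
print(buffered_na_sum(values))
```

No mask the size of the whole array is ever stored; only a buffer-sized one
exists at any moment, which is where the memory saving comes from.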

*The iterator will be given a new 'masked' mode, triggered by a flag, which
will use or ignore data based on a boolean array.

*Shared masks currently won't be allowed, but Pierre GM suggests that's just
as well, since they easily lead to buggy code.

I hope this summary roughly captures what was said. Please chime in with
additional comments/corrections.

-Chris Jordan-Squire