[Numpy-discussion] feedback request: proposal to add masks to the core ndarray

Mark Wiebe mwwiebe at gmail.com
Wed Jun 29 13:53:09 EDT 2011


On Tue, Jun 28, 2011 at 7:34 AM, Lluís <xscript at gmx.net> wrote:

> Mark Wiebe writes:
> > The design that's forming is a combination of:
>
> > * Solve the missing data problem
> > * My ideas of what a good solution looks like:
> >    * applies to all NumPy dtypes in a fully general way
> >    * high-performance, low overhead where possible
> >    * makes the C-level implementation of NumPy nicer to work with, not
> >      harder
> >    * easy to use from Python for unskilled programmers
> >    * easy to use more powerful functionality from Python for skilled
> >      programmers
> >    * satisfies all or most of the needs of the many users of arrays with
> >      a "missing data" aspect to them
>
> I would add here an efficient mechanism to reinterpret existing data with
> different missing information (no copies of the backing array).
>
> Although I'm not sure whether this requires first-class citizenship or
> not.
>

I'm calling this idea "masking semantics" generally.
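
To make the "reinterpret without copying" requirement concrete, here is a
minimal pure-Python sketch. It is not the NEP's API: MaskedView is a
hypothetical helper invented for illustration, and the convention that True
in the mask means "missing" is an assumption of this example. It only shows
two masks layered over one shared buffer, with masked values left intact.

```python
from array import array


class MaskedView:
    """Hypothetical helper: pairs a shared data buffer with its own mask."""

    def __init__(self, data, mask):
        if len(data) != len(mask):
            raise ValueError("data and mask must have the same length")
        self.data = data          # shared reference, never copied
        self.mask = list(mask)    # assumption: True means "missing"

    def valid(self):
        """Return the values that this view does not mask out."""
        return [x for x, m in zip(self.data, self.mask) if not m]

    def mean(self):
        """Mean of the unmasked values, or None if everything is masked."""
        vals = self.valid()
        return sum(vals) / len(vals) if vals else None


data = array("d", [1.0, 2.0, 3.0, 4.0])

# Two interpretations of the same buffer with different missing information.
view_a = MaskedView(data, [False, True, False, False])
view_b = MaskedView(data, [False, False, False, True])

mean_a = view_a.mean()   # averages 1.0, 3.0 and 4.0
mean_b = view_b.mean()   # averages 1.0, 2.0 and 3.0

# Masking is non-destructive: the hidden value is still in the buffer.
hidden_value = view_a.data[1]
```

Both views hold a reference to the same array object, so changing which
elements count as missing costs only a new mask, not a copy of the data.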

> > * All the feedback I'm getting from discussions on the list
> [...]
> > I've updated a section "Parameterized Data Type With NA Signal Values"
> > in the NEP with an idea for how an NA bit pattern approach could
> > coexist and work together with the mask-based approach. I think I've
> > solved some of the generality and implementation obstacles; it would
> > be great to get some feedback on that.
>
> Some (obvious) thoughts about it:
>
> * Trivial to store, as the missing property is encoded in the value
>  itself.
> * Third-party (non-Python) code needs some interface to interpret these
>  without having to know the implementation details (although the
>  interface is rather trivial).
> * Data marked as missing loses its original value.
> * Reinterpreting the same data (memory buffer) with different missing
>  information requires either memory copies or separate mask arrays (see
>  above)
>
> So, while it (data types with NA signal values) has advantages in
> simpler interaction with 3rd party code and in long-term storage,
> masks will still be needed.
>
> I think that deciding on the value of NA signal values boils down to
> this question: should 3rd party code be able to interpret missing data
> information stored in the separate mask array?
>

I'm tossing around some variations of ideas using the iterator to provide a
buffered mask-based interface that works uniformly with both masked arrays
and NA dtypes. This way 3rd party C code only needs to implement one missing
data mechanism to fully support both of NumPy's missing data mechanisms.
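
A rough sketch of what such a uniform interface might look like, in pure
Python rather than C. Everything here is an assumption made for
illustration: iter_masked_chunks and masked_sum are hypothetical names, NaN
stands in for an NA signal value, and True in a mask means "missing". The
point is only that the consumer sees (values, mask) chunks regardless of how
the missing information is stored.

```python
import math


def iter_masked_chunks(values, mask=None, chunk=4):
    """Yield (values, mask) chunks from either kind of source.

    If `mask` is given, it is used directly (True means missing).
    If `mask` is None, the source is treated as NA-signal data and the
    mask is derived on the fly, with NaN standing in for the NA value.
    """
    for start in range(0, len(values), chunk):
        stop = min(start + chunk, len(values))
        vals = list(values[start:stop])
        if mask is None:
            chunk_mask = [math.isnan(v) for v in vals]
        else:
            chunk_mask = [bool(m) for m in mask[start:stop]]
        yield vals, chunk_mask


def masked_sum(chunks):
    """Consumer that only knows about (values, mask) chunks."""
    total = 0.0
    for vals, chunk_mask in chunks:
        total += sum(v for v, missing in zip(vals, chunk_mask) if not missing)
    return total


# The same logical data stored two ways.
na_style = [1.0, float("nan"), 3.0, 4.0, 5.0]
mask_style_values = [1.0, 2.0, 3.0, 4.0, 5.0]
mask_style_mask = [False, True, False, False, False]

sum_from_na = masked_sum(iter_masked_chunks(na_style, chunk=2))
sum_from_mask = masked_sum(
    iter_masked_chunks(mask_style_values, mask_style_mask, chunk=2)
)
```

masked_sum never inspects the values to decide what is missing, so it works
unchanged with both storage schemes.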

-Mark


> If the answer is no, then 3rd party code should be given a copy of the
> data where the masked array is merged with the ndarray data buffer
> (assuming the original ndarray had a masked array before passing it to
> the 3rd party code). As by definition (?) the ndarray with a mask must
> retain the original data, the result of the 3rd party code must be
> translated back into an ndarray + mask.
>
> If the answer is yes, then I think the NA signal values just add
> unnecessary complexity, as the 3rd party code will already need to use
> some numpy-specific API to handle missing data through the ndarray
> buffer + mask buffer. This reminds me that if 3rd party code were to use
> the new iterator interface, the interface could be adapted so that it
> returns only the non-missing parts. For the sake of performance, this
> could be optional, so that the default behaviour is to just iterate
> through non-missing data but an option can be used to iterate over all
> data, and leave missing data handling up to the 3rd party code.
>
>
> My 2 cents,
>   Lluis
>
> --
>  "And it's much the same thing with knowledge, for whenever you learn
>  something new, the whole world becomes that much richer."
>  -- The Princess of Pure Reason, as told by Norton Juster in The Phantom
>  Tollbooth