[Numpy-discussion] feedback request: proposal to add masks to the core ndarray

Wes McKinney wesmckinn at gmail.com
Fri Jun 24 13:06:57 EDT 2011


On Fri, Jun 24, 2011 at 12:33 PM, Mark Wiebe <mwwiebe at gmail.com> wrote:
> On Thu, Jun 23, 2011 at 8:32 PM, Nathaniel Smith <njs at pobox.com> wrote:
>>
>> On Thu, Jun 23, 2011 at 5:21 PM, Mark Wiebe <mwwiebe at gmail.com> wrote:
>> > On Thu, Jun 23, 2011 at 7:00 PM, Nathaniel Smith <njs at pobox.com> wrote:
>> >> It should also be possible to accomplish a general solution at the
>> >> dtype level. We could have a 'dtype factory' used like:
>> >>  np.zeros(10, dtype=np.maybe(float))
>> >> where np.maybe(x) returns a new dtype whose storage size is x.itemsize
>> >> + 1, where the extra byte is used to store missingness information.
>> >> (There might be some annoying alignment issues to deal with.) Then for
>> >> each ufunc we define a handler for the maybe dtype (or add a
>> >> special-case to the ufunc dispatch machinery) that checks the
>> >> missingness value and then dispatches to the ordinary ufunc handler
>> >> for the wrapped dtype.
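
For concreteness, the proposed 'maybe' dtype can be roughly emulated
today with a structured dtype that pairs each value with a one-byte
flag. This is only a sketch (np.maybe does not exist, and ufunc
dispatch would still need the special-casing described above):

    import numpy as np

    def maybe(base):
        # stand-in for the proposed np.maybe(): wrap the base dtype
        # together with one extra byte recording missingness
        return np.dtype([('value', base), ('isna', np.uint8)])

    a = np.zeros(10, dtype=maybe(np.float64))
    a['value'][3] = 1.5
    a['isna'][7] = 1    # mark element 7 as missing
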
>> >
>> > The 'dtype factory' idea builds on the way I've structured datetime as a
>> > parameterized type, but the thing that kills it for me is the alignment
>> > problems of 'x.itemsize + 1'. Having the mask in a separate memory block
>> > is
>> > a lot better than having to store 16 bytes for an 8-byte int to preserve
>> > the
>> > alignment.
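
The alignment cost is easy to see with the structured-dtype emulation
above (illustrative only):

    import numpy as np

    packed  = np.dtype([('value', np.int64), ('isna', np.uint8)])
    aligned = np.dtype([('value', np.int64), ('isna', np.uint8)],
                       align=True)
    print(packed.itemsize)   # 9
    print(aligned.itemsize)  # 16: padding doubles an 8-byte int's storage
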
>>
>> Hmm. I'm not convinced that this is the best approach either, but let
>> me play devil's advocate.
>>
>> The disadvantage of this approach is that masked arrays would
>> effectively have a 100% memory overhead over regular arrays, as
>> opposed to the "shadow mask" approach where the memory overhead is
>> 12.5%--100% depending on the size of objects being stored. Probably
>> the most common case is arrays of doubles, in which case it's 100%
>> versus 12.5%. So that sucks.
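
As a rough back-of-the-envelope check of those numbers for float64
data (assuming one mask byte per element for the shadow mask, and an
aligned 16-byte element for the 'maybe' dtype, as above):

    data_bytes = 8.0                       # one float64
    print(1 / data_bytes)                  # 0.125 -> 12.5% shadow-mask overhead
    print((16 - data_bytes) / data_bytes)  # 1.0   -> 100% maybe-dtype overhead
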
>>
>> But on the other hand, we gain:
>>  -- simpler implementation: no need to be checking and tracking the
>> mask buffer everywhere. The needed infrastructure is already built in.
>
> I don't believe this is true. The dtype mechanism would need a lot of work
> to build that needed infrastructure first. The analysis I've done so far
> indicates the masked approach will give a simpler/cleaner implementation.
>
>>
>>  -- simpler conceptually: we already have the dtype concept, it's
>> very powerful and we use it for all sorts of things; using it here too
>> plays to our strengths. We already know what a numpy scalar is and how
>> it works. Everyone already understands how assigning a value to an
>> element of an array works, how it interacts with broadcasting, etc.,
>> etc., and in this model, that's all a missing value is -- just another
>> value.
>
> From Python, this aspect of things would be virtually identical between the
> two mechanisms. The dtype approach would require more coding and overhead
> where you have to create copies of your data to convert it into the
> parameterized "NA[int32]" dtype, versus with the masked approach where you
> say x.flags.hasmask = True or something like that without copying the data.
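
Neither spelling exists in NumPy today; the usage difference being
described would look roughly like this (both candidate APIs are
hypothetical and shown only as comments):

    import numpy as np

    x = np.arange(5, dtype=np.int32)

    # dtype approach: converting copies the data into the wider dtype
    # y = x.astype(np.maybe(np.int32))   # 'NA[int32]' (hypothetical)

    # masked approach: flip a flag on the existing array, no data copy
    # x.flags.hasmask = True             # proposed spelling, not real
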
>
>>
>>  -- it composes better with existing functionality: for example,
>> someone mentioned the distinction between a missing field inside a
>> record versus a missing record. In this model, that would just be the
>> difference between dtype([("x", maybe(float))]) and maybe(dtype([("x",
>> float)])).
>
> Indeed, the difference between an "NA[:x:f4, :y:f4]" versus ":x:NA[f4],
> :y:NA[f4]" can't be expressed the way I've designed the mask functionality.
> (Note, this struct dtype string representation isn't actually supported in
> NumPy.)
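
With the structured-dtype emulation from earlier, the two layouts
being contrasted would look something like this (sketch only):

    import numpy as np

    na_u1 = np.uint8  # per-element missingness flag

    # NA at the field level: each field carries its own flag
    field_na = np.dtype([('x', [('value', np.float32), ('isna', na_u1)]),
                         ('y', [('value', np.float32), ('isna', na_u1)])])

    # NA at the record level: one flag for the whole record
    record_na = np.dtype([('xy', [('x', np.float32), ('y', np.float32)]),
                          ('isna', na_u1)])
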
>>
>> Optimization is important and all, but so is simplicity and
>> robustness. That's why we're using Python in the first place :-).
>>
>> If we think that the memory overhead for floating point types is too
>> high, it would be easy to add a special case where maybe(float) used a
>> distinguished NaN instead of a separate boolean. The extra complexity
>> would be isolated to the 'maybe' dtype's inner loop functions, and
>> transparent to the Python level. (Implementing a similar optimization
>> for the masking approach would be really nasty.) This would change the
>> overhead comparison to 0% versus 12.5% in favor of the dtype approach.
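
At the Python level the NaN-sentinel trick for floats is already
familiar; the point above is that a 'maybe(float)' dtype could hide it
inside its inner loops, e.g.:

    import numpy as np

    x = np.array([1.0, np.nan, 3.0])   # NaN doubles as the missing marker
    missing = np.isnan(x)              # mask recovered with no extra storage
    print(x[~missing].sum())           # 4.0
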
>
> Yeah, there would be no such optimization for the masked approach. If
> someone really wants this, they are not precluded from also implementing
> their own "nafloat" dtype which operates independently of the masking
> mechanism.
> -Mark
>
>>
>> -- Nathaniel

I don't have enough time to engage in this discussion as much as I'd
like, but I'll give my input.

I've spent a very large amount of time in pandas trying to craft a
sensible and performant missing-data-handling solution, given existing
tools, which does not get in the user's way and which also works for
non-floating-point data. The result works but isn't 100% satisfactory,
and I went by the Zen of Python in that "practicality beats purity".
If anyone's interested, pore through the pandas unit tests for lots of
exceptional cases and examples.

About 38 months ago, when I started writing the library now called
pandas, I examined numpy.ma and friends and decided that:

a) the performance overhead for floating point data was not acceptable
b) numpy.ma does too much for the needs of financial applications,
say, or in mimicking R's NA functionality (part of why perf suffers)
c) masked arrays are difficult (imho) for non-expert users to use
effectively. In my experience, it gets in your way, and subclassing is
largely to blame for this (along with the mask field and the myriad
mask-related functions). It's very "complete" from a technical purity
/ computer science-y standpoint, but practicality is traded off (re: Zen
of Python). In R, many functions have a flag to handle NAs like na.rm,
and there is the is.na function, along with a few other NA-handling
functions, and that's it (a rough NumPy analogue is sketched below). I
wasn't willing to teach my colleagues /
users of pandas how to use masked arrays (or scikits.timeseries,
because of numpy.ma reliance) for this reason. I believe that this has
overall been the right decision.
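
A rough NumPy analogue of that R idiom, using NaN as the sentinel
(which is essentially what pandas does for floating point data):

    import numpy as np

    x = np.array([1.0, np.nan, 3.0])
    np.isnan(x)                # roughly is.na(x)
    x[~np.isnan(x)].mean()     # roughly mean(x, na.rm=TRUE)
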

So whatever solution you come up with, you need to dogfood it if
possible with users who are only at a beginning-to-intermediate level
of NumPy or Python expertise. Does it get in the way? Does it require
constant tinkering with masks (if there is a boolean mask versus a
special NA value)? Is it intuitive? Hopefully I can take whatever result
comes of this development effort and change pandas to be implemented
on top of it without changing the existing API / behavior in any
significant ways. If I cannot, I will be (very, very) sad.

(I don't mean to be overly critical of numpy.ma-- I just care about
solving problems and making the tools as easy-to-use and intuitive as
possible.)

- Wes


