[Numpy-discussion] missing data discussion round 2

Mon Jun 27 18:03:44 EDT 2011

On Mon, Jun 27, 2011 at 5:01 PM, Mark Wiebe <mwwiebe at gmail.com> wrote:
> On Mon, Jun 27, 2011 at 2:59 PM, <josef.pktd at gmail.com> wrote:
>>
>> On Mon, Jun 27, 2011 at 2:24 PM, eat <e.antero.tammi at gmail.com> wrote:
>> >
>> >
>> > On Mon, Jun 27, 2011 at 8:53 PM, Mark Wiebe <mwwiebe at gmail.com> wrote:
>> >>
>> >> On Mon, Jun 27, 2011 at 12:44 PM, eat <e.antero.tammi at gmail.com> wrote:
>> >>>
>> >>> Hi,
>> >>>
>> >>> On Mon, Jun 27, 2011 at 6:55 PM, Mark Wiebe <mwwiebe at gmail.com> wrote:
>> >>>>
>> >>>> First I'd like to thank everyone for all the feedback you're providing.
>> >>>> Clearly this is an important topic to many people, and the discussion
>> >>>> has helped clarify the ideas for me. I've renamed and updated the NEP,
>> >>>> then placed it into the master NumPy repository so it has a more
>> >>>> permanent home here:
>> >>>>
>> >>>> https://github.com/numpy/numpy/blob/master/doc/neps/missing-data.rst
>> >>>>
>> >>>> In the NEP, I've tried to address everything that was raised in the
>> >>>> original thread and in Nathaniel's followup 'Concepts' thread. To deal
>> >>>> with the issue of whether a mask is True or False for a missing value,
>> >>>> I've removed the 'mask' attribute entirely, except for ufunc-like
>> >>>> functions np.ismissing and np.isavail which return the two styles of
>> >>>> masks. Here's a high-level summary of how I'm thinking of the topic,
>> >>>> and what I will implement:
>> >>>>
>> >>>> Missing Data Abstraction
>> >>>>
>> >>>> There appear to be two useful ways to think about missing data that
>> >>>> are worth supporting.
>> >>>>
>> >>>> 1) Unknown yet existing data
>> >>>> 2) Data that doesn't exist
>> >>>>
>> >>>> In 1), an NA value causes outputs to become NA except in a small number
>> >>>> of exceptions such as boolean logic, and in 2), operations treat the
>> >>>> data as if there were a smaller array without the NA values.
>> >>>>
>> >>>> Temporarily Ignoring Data
>> >>>>
>> >>>> In some cases, it is useful to flag data as NA temporarily, possibly in
>> >>>> several different ways, for particular calculations or testing out
>> >>>> different ways of throwing away outliers. This is independent of the
>> >>>> missing data abstraction, still requiring a choice of 1) or 2) above.
>> >>>>
>> >>>> Implementation Techniques
>> >>>>
>> >>>> There are two mechanisms generally used to implement missing data
>> >>>> abstractions:
>> >>>>
>> >>>> 1) An NA bit pattern
>> >>>> 2) A mask
>> >>>>
>> >>>> I've described a design in the NEP which can include both techniques
>> >>>> using the same interface. The mask approach is strictly more general
>> >>>> than the NA bit pattern approach, except for a few things like the idea
>> >>>> of supporting the dtype 'NA[f8,InfNan]' which you can read about in the
>> >>>> NEP.
>> >>>>
>> >>>> My intention is to implement the mask-based design, and possibly also
>> >>>> implement the NA bit pattern design, but if anything gets cut it will
>> >>>> be the NA bit patterns.
>> >>>>
>> >>>> Thanks again for all your input so far, and thanks in advance for your
>> >>>> suggestions for improving this new revision of the NEP.
>> >>>
>> >>> A very impressive NEP indeed.
>> >
>> > Hi,
>> >>>
>> >>> However, how would corner cases, like
>> >>>
>> >>> >>> a = np.array([np.NA, np.NA], dtype='f8', masked=True)
>> >>> >>> np.mean(a, skipna=True)
>> >>
>> >> This should be equivalent to removing all the NA values, then calling
>> >> mean, like this:
>> >> >>> b = np.array([], dtype='f8')
>> >> >>> np.mean(b)
>> >>
>> >>
>> >> /home/mwiebe/virtualenvs/dev/lib/python2.7/site-packages/numpy/core/fromnumeric.py:2374:
>> >> RuntimeWarning: invalid value encountered in double_scalars
>> >>   return mean(axis, dtype, out)
>> >> nan
>> >>>
>> >>> >>> np.mean(a)
>> >>
>> >> This would return NA, since NA values are sitting in positions that
>> >> would affect the output result.
>> >
>> > OK.
>> >>
>> >>
>> >>>
>> >>> be handled?
>> >>> My concern here is that there always seem to be corner cases like
>> >>> these which can only be handled with specific context knowledge. Thus
>> >>> producing 100% generic code to handle 'missing data' is not doable.
>> >>
>> >> Working out the corner cases for the functions that are already in
>> >> numpy seems tractable to me. How to support missing data, or whether
>> >> to support it at all, is something the author of each new function
>> >> will have to consider once missing data support is in NumPy, but I
>> >> don't think we can do more than provide the mechanisms for people to
>> >> use.
>> >
>> > Sure. I'll go along with this and wait until I have something tangible
>> > that outperforms the 'traditional' NaN handling.
>> > - eat
>>
>> Just a question about how things would work with the new model.
>> How can you implement the "use" keyword from R's cov (or cor) with
>> minimal data copying?
>>
>> I think the basic masked array version would (or does) just assign 0
>> to the missing values, calculate the covariance or correlation, and
>> then correct with the correct count.
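
A minimal sketch of that zero-fill idea, assuming NaN stands in for the
missing values (the proposed NA does not exist yet) and that variables are
in columns. The name pairwise_cov is hypothetical, not an existing NumPy
function; the result corresponds roughly to R's use="pairwise.complete.obs":

```python
import numpy as np

def pairwise_cov(x):
    """Covariance of the columns of x, using for each pair of columns
    only the rows where both are present (NaN marks a missing value)."""
    x = np.asarray(x, dtype=float)
    valid = ~np.isnan(x)
    m = valid.astype(float)         # 1.0 where present, 0.0 where missing
    x0 = np.where(valid, x, 0.0)    # assign 0 to the missing values
    n = np.dot(m.T, m)              # n[i, j]: rows where i and j are both present
    s = np.dot(x0.T, m)             # s[i, j]: sum of column i over those rows
    sxy = np.dot(x0.T, x0)          # sum of products over those rows
    # Correct with the pairwise counts. Pairs with fewer than two common
    # rows do not give a meaningful value.
    with np.errstate(divide="ignore", invalid="ignore"):
        return (sxy - s * s.T / n) / (n - 1.0)
```

This still allocates a zero-filled copy and a float mask, so it only avoids
copying per pair of columns, not copying altogether.
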
>>
>> ------------
>> cov(x, y = NULL, use = "everything",
>>    method = c("pearson", "kendall", "spearman"))
>>
>> cor(x, y = NULL, use = "everything",
>>     method = c("pearson", "kendall", "spearman"))
>>
>> cov2cor(V)
>>
>> Arguments
>> x      a numeric vector, matrix or data frame.
>> y      NULL (default) or a vector, matrix or data frame with compatible
>>        dimensions to x. The default is equivalent to y = x (but more
>>        efficient).
>> na.rm  logical. Should missing values be removed?
>> use    an optional character string giving a method for computing
>>        covariances in the presence of missing values. This must be (an
>>        abbreviation of) one of the strings "everything", "all.obs",
>>        "complete.obs", "na.or.complete", or "pairwise.complete.obs".
>> ------------
>>
>> I'm especially interested in the complete.obs case (drop any row that
>> contains an NA).
>
> I think this is mainly a matter of extending NumPy's equivalent cov function
> with a parameter like this. Implemented in C, I'm sure it could be done with
> minimal copying. I'm not exactly sure how it would have to look implemented
> in Python. Perhaps someone could try it once I have a basic prototype ready
> for testing.

This is just a typical example. Going to C doesn't help: whoever is
rewriting scipy.stats.mstats or is writing similar statistical code
will need to do this all the time.
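
For concreteness, a minimal sketch of the complete.obs case with today's
tools, assuming NaN marks the missing values and variables are in columns.
The name cov_complete_obs is hypothetical, not an existing NumPy or SciPy
function:

```python
import numpy as np

def cov_complete_obs(x):
    """Covariance of the columns of x after dropping every row that
    contains at least one missing value (NaN marks a missing value)."""
    x = np.asarray(x, dtype=float)
    keep = ~np.isnan(x).any(axis=1)   # True for rows with no missing entries
    # Boolean indexing copies the retained rows, so this is not the
    # minimal-copy version asked about above.
    return np.cov(x[keep], rowvar=False)
```

The row mask is cheap to build; the copy made by x[keep] is the part a
masked or NA-aware cov would ideally avoid.
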

Josef

> -Mark
>
>>
>> Josef
>>
>> >>
>> >> -Mark
>> >>
>> >>>
>> >>> Thanks,
>> >>> - eat
>> >>>>
>> >>>> -Mark
>> >>>> _______________________________________________
>> >>>> NumPy-Discussion mailing list
>> >>>> NumPy-Discussion at scipy.org
>> >>>> http://mail.scipy.org/mailman/listinfo/numpy-discussion
>> >>>>