[Numpy-discussion] NA/Missing Data Conference Call Summary

Wed Jul 6 16:22:26 EDT 2011

On Wed, Jul 6, 2011 at 1:08 PM, <josef.pktd at gmail.com> wrote:

> On Wed, Jul 6, 2011 at 3:38 PM, Christopher Jordan-Squire
> <cjordan1 at uw.edu> wrote:
> >
> >
> > On Wed, Jul 6, 2011 at 11:38 AM, Christopher Barker
> > <Chris.Barker at noaa.gov> wrote:
> >>
> >> Christopher Jordan-Squire wrote:
> >> > If we follow those rules for IGNORE for all computations, we sometimes
> >> > get some weird output. For example:
> >> > [ [1, 2], [3, 4] ] * [ IGNORE, 7] = [ 15, 31 ]. (Where * is matrix
> >> > multiply and not * with broadcasting.) Or should that sort of
> >> > operation throw an error?
> >>
> >> That should throw an error -- matrix computation is heavily influenced
> >> by the shape and size of matrices, so I think IGNORES really don't make
> >> sense there.
> >>
> >>
> >
> > If the IGNORES don't make sense in basic numpy computations then I'm
> > kinda confused why they'd be included at the numpy core level.
> >
> >>
> >> Nathaniel Smith wrote:
> >> > It's exactly this transparency that worries Matthew and me -- we feel
> >> > that the alterNEP preserves it, and the NEP attempts to erase it. In
> >> > the NEP, there are two totally different underlying data structures,
> >> > but this difference is blurred at the Python level. The idea is that
> >> > you shouldn't have to think about which you have, but if you work with
> >> > C/Fortran, then of course you do have to be constantly aware of the
> >> > underlying implementation anyway.
> >>
> >> I don't think this bothers me -- I think it's analogous to things in
> >> numpy like Fortran order and non-contiguous arrays -- you can ignore all
> >> that when working in pure python when performance isn't critical, but
> >> you need a deeper understanding if you want to work with the data in C
> >> or Fortran or to tune performance in python.
> >>
> >> So as long as there is an API to query and control how things work, I
> >> like that it's hidden from simple python code.
> >>
> >> -Chris
> >>
> >>
> >
> > I'm similarly not too concerned about it. Performance seems finicky when
> > you're dealing with missing data, since a lot of arrays will likely have
> > to be copied over to other arrays containing only complete data before
> > being handed over to BLAS.
>
> Unless you know the neutral value for the computation or you just want
> to do a forward_fill in time series, and you have to ask the user not
> to give you an immutable array with NAs if they don't want extra
> copies.
>
> Josef
>
>
Mean value replacement, or more generally single scalar value replacement,
is generally not a good idea. It biases your standard error estimates
downward if you use mean replacement, and it biases both the point
estimates and the standard errors if you use anything other than mean
replacement. The bias gets worse with more missing data, so it's worst in
precisely the cases where you'd most want to fill in the data. (Though I
admit I'm not too familiar with time series, so maybe this doesn't apply.
But it's true as a general principle in statistics.) I'm not sure why we'd
want to make this use case easier.
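
To put rough numbers on the standard-error point, here is a minimal sketch
using NumPy. It is only an illustration under stated assumptions: the data
are simulated, values go missing completely at random, and the sample size,
distribution, seed, and the helper name standard_error are arbitrary choices
for illustration rather than anything proposed in the NEP.

```python
import numpy as np

# Fixed seed, chosen arbitrarily so the run is reproducible.
rng = np.random.default_rng(0)

n = 10_000
x = rng.normal(loc=5.0, scale=2.0, size=n)  # simulated "complete" data


def standard_error(a):
    """Naive standard error of the mean: sample std / sqrt(sample size)."""
    return a.std(ddof=1) / np.sqrt(a.size)


# Maps missing fraction -> (SE from observed values only, SE after mean fill).
results = {}
for frac_missing in (0.1, 0.3, 0.5):
    # Assumption: values are missing completely at random.
    missing = rng.random(n) < frac_missing
    observed = x[~missing]

    filled = x.copy()
    filled[missing] = observed.mean()  # mean replacement

    results[frac_missing] = (standard_error(observed), standard_error(filled))

for frac, (se_obs, se_filled) in results.items():
    print(f"missing={frac:.0%}  SE(observed only)={se_obs:.4f}  "
          f"SE(mean-filled)={se_filled:.4f}")
```

In this setup the mean-filled array keeps the same sum of squared deviations
as the observed values but is treated as if it had the full sample size, so
its naive standard error comes out smaller, and the gap widens as the
missing fraction grows.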

-Chris Jordan-Squire

> > My primary concern is that the np.NA stuff 'just works'. Especially
> > since I've never run into use cases in statistics where the difference
> > between IGNORE and NA mattered.
> >
> >
> >>
> >>
> >> --
> >> Christopher Barker, Ph.D.
> >> Oceanographer
> >>
> >> Emergency Response Division
> >> NOAA/NOS/OR&R            (206) 526-6959   voice
> >> 7600 Sand Point Way NE   (206) 526-6329   fax
> >> Seattle, WA  98115       (206) 526-6317   main reception
> >>
> >> Chris.Barker at noaa.gov
> >> _______________________________________________
> >> NumPy-Discussion mailing list
> >> NumPy-Discussion at scipy.org
> >> http://mail.scipy.org/mailman/listinfo/numpy-discussion