[Numpy-discussion] NA/Missing Data Conference Call Summary

Skipper Seabold jsseabold at gmail.com
Wed Jul 6 21:43:25 EDT 2011


On Wed, Jul 6, 2011 at 7:14 PM, Christopher Jordan-Squire
<cjordan1 at uw.edu> wrote:
> On Wed, Jul 6, 2011 at 3:47 PM, <josef.pktd at gmail.com> wrote:
>> On Wed, Jul 6, 2011 at 4:38 PM,  <josef.pktd at gmail.com> wrote:
>> > On Wed, Jul 6, 2011 at 4:22 PM, Christopher Jordan-Squire
<snip>
>> >> Mean value replacement, or more generally single scalar value
>> >> replacement, is generally not a good idea. It biases downward your
>> >> standard error estimates if you use mean replacement, and it will bias
>> >> both if you use anything other than mean replacement. The bias gets
>> >> worse with more missing data. So it's worst in precisely the cases
>> >> where you'd want to fill in the data the most. (Though I admit I'm not
>> >> too familiar with time series, so maybe this doesn't apply. But it's
>> >> true as a general principle in statistics.) I'm not sure why we'd want
>> >> to make this use case easier.
>>
>> Another qualification on this (I cannot help it).
>> I think this only applies if you use a prefabricated no-missing-values
>> algorithm. If I write it myself, I can do the proper correction for
>> the reduced number of observations. (similar to the case when we
>> ignore correlated information and use statistics based on uncorrelated
>> observations which also overestimate the amount of information we have
>> available.)
>>
>
> Can you do that sort of technique with longitudinal (panel) data? I'm
> honestly curious because I haven't looked into such corrections before. I
> haven't been able to find a reference after a few quick Google searches. I
> don't suppose you know one off the top of your head?
> And you're right about the last measurement carried forward. I was just
> thinking about filling in all missing values with the same value.
> -Chris Jordan-Squire
> PS--Thanks for mentioning the statsmodels discussion. I'd been keeping track
> of that on a different email account, and I hadn't realized it wasn't
> forwarding those messages correctly.
>

Maybe a bit OT, but I've seen people doing imputation using Bayesian
MCMC or multiple imputation for missing values in panel data. Google
'data augmentation' or 'multiple imputation'. I haven't looked much
into the details yet, but it's definitely not mean replacement.
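
To make the mean replacement point from the quoted text concrete, here is a
minimal sketch. The sample values and variable names are made up for
illustration, and the "corrected" estimate is just one simple reading of
Josef's remark about adjusting for the reduced number of observations, not a
recommended imputation method.

```python
import numpy as np

# Hypothetical sample; np.nan marks the missing observations.
x = np.array([2.0, np.nan, 4.0, 6.0, np.nan, 8.0, np.nan, 10.0])

observed = x[~np.isnan(x)]
n_obs = observed.size    # number of values actually observed
n_total = x.size         # number of slots, including missing ones

# Mean replacement: fill every missing slot with the observed mean.
filled = np.where(np.isnan(x), observed.mean(), x)

# Naive standard error of the mean: treats the filled values as real data.
se_naive = filled.std(ddof=1) / np.sqrt(n_total)

# The filled values add nothing to the sum of squared deviations, so a
# hand-written estimator can divide by the observed count instead.
ss = np.sum((filled - filled.mean()) ** 2)
se_corrected = np.sqrt(ss / (n_obs - 1)) / np.sqrt(n_obs)

# Standard error computed from the observed values alone, for comparison.
se_observed = observed.std(ddof=1) / np.sqrt(n_obs)

print(se_naive, se_corrected, se_observed)
```

The naive figure comes out smaller than the other two, and the gap widens as
the share of missing values grows, which is the downward bias described
above.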

FWIW (I haven't been following the discussion closely), there is a
distinction in statistics between ignorable and nonignorable missing
data, but I can't think of a situation where I would need this at the
computational level rather than relying on a (numerically comparable)
missing data type(s) à la SAS/Stata. I've also found the odd examples
of IGNORE without a clear answer to be scary.
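
As a rough analogy only, the difference between a missing value that
propagates through a computation and one that is skipped can be sketched with
features NumPy already has: NaN and numpy.ma. This is an assumption-laden
illustration and says nothing about how the proposed NA or IGNORE semantics
would actually behave.

```python
import numpy as np

# Hypothetical data with one missing entry encoded as NaN.
x = np.array([1.0, np.nan, 3.0])

# NaN propagates: any reduction that touches it is itself NaN.
total_propagating = np.sum(x)

# A masked array skips the masked entry in reductions instead.
masked = np.ma.masked_invalid(x)
total_skipping = masked.sum()

# np.nansum is the function-level equivalent of skipping.
total_nansum = np.nansum(x)

print(total_propagating, total_skipping, total_nansum)
```

With propagation the caller is forced to decide what to do about the gap;
with skipping the result silently depends on fewer values, which is where
the unclear cases tend to come from.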

Skipper


