[SciPy-dev] Statistics toolbox and nans

A.J. Rossini rossini at blindglobe.net
Fri Nov 1 09:15:03 EST 2002


>>>>> "travis" == Travis Oliphant <oliphant.travis at ieee.org> writes:

    travis> Hello developers.
    travis> What should we do about nans and the stats toolbox?  Stats is one
    travis> package where people may use nans to represent missing values.

Yech.  This is a hard issue, but NaN isn't the solution.  I believe
someone who used to follow this list, John Barnard, had written
some tools implementing imputation-based statistical procedures for
handling missing values in Python, but I don't know the state of
that work.

(The issue being that one may want to distinguish between types of
missing values, which a single NaN marker does not let you do.)


    travis> There are two options that I see. 

    travis> 1) MATLAB option

    travis> MATLAB defines 6 new functions nanmean, nanmedian, nansum, nanmin,
    travis> nanmax, and nanstd that ignore nans properly.  These can be used in
    travis> place of the normal functions which don't use nans properly.  Perhaps
    travis> they did this as an afterthought.

    travis> Note, this is an easy option and is (as of now) implemented in the CVS
    travis> scipy.
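
For illustration, a minimal sketch of the "separate nan-aware functions"
option.  It assumes a present-day NumPy, which provides nanmean, nansum,
nanmin, nanmax, nanstd and nanmedian; it is not the CVS scipy code
referred to above.

```python
import numpy as np

# One missing observation, encoded as NaN.
x = np.array([1.0, 2.0, np.nan, 4.0])

# The ordinary reduction propagates the NaN.
plain_mean = np.mean(x)

# The nan-aware variants skip NaN entries before reducing.
nan_aware_mean = np.nanmean(x)   # mean of 1, 2 and 4
nan_aware_sum = np.nansum(x)     # sum of 1, 2 and 4
nan_aware_max = np.nanmax(x)
```

Note that this only ever drops the missing entries; any other policy
needs a different function.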


Would it be possible instead to take the R approach, which is to have
a missing data handler?  I.e. mean(x, missing=missing.drop()), where
the default is to drop, and the other options might be "replace with
mean", "replace with random sample", "user defined function", "barf
because we shouldn't compute with missings", etc.
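
A rough sketch of what such a handler interface could look like in
Python.  Every name here (is_missing, drop, replace_with_mean,
replace_with_random_sample, fail, and the "missing" keyword) is
hypothetical and made up for illustration; none of it is an existing
SciPy or R API.

```python
import math
import random


def is_missing(v):
    """Treat None and float NaN as missing (an assumption of this sketch)."""
    return v is None or (isinstance(v, float) and math.isnan(v))


def drop(values):
    """Default handler: discard missing entries."""
    return [v for v in values if not is_missing(v)]


def replace_with_mean(values):
    """Replace each missing entry with the mean of the observed ones."""
    observed = drop(values)
    if not observed:
        raise ValueError("no observed values to impute from")
    m = sum(observed) / len(observed)
    return [m if is_missing(v) else v for v in values]


def replace_with_random_sample(values, rng=None):
    """Replace each missing entry with a random draw from the observed ones."""
    rng = rng or random.Random()
    observed = drop(values)
    if not observed:
        raise ValueError("no observed values to impute from")
    return [rng.choice(observed) if is_missing(v) else v for v in values]


def fail(values):
    """Refuse to compute when anything is missing."""
    if any(is_missing(v) for v in values):
        raise ValueError("missing values present; refusing to compute")
    return list(values)


def mean(values, missing=drop):
    """Mean of values after applying the chosen missing-data handler."""
    cleaned = missing(values)
    if not cleaned:
        raise ValueError("no values left after handling missing data")
    return sum(cleaned) / len(cleaned)


data = [1.0, float("nan"), 3.0, None]
default_result = mean(data)                              # drops the two missings
imputed_result = mean(data, missing=replace_with_mean)   # fills them with 2.0
# A user-defined handler is just another callable:
zero_filled = mean(data, missing=lambda vs: [0.0 if is_missing(v) else v for v in vs])
```

The point of the design is that the statistic itself stays the same
and only the handler changes, so new policies do not require new
nan-prefixed copies of every function.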

Missing data handling is hard and, to be done right, needs to be
flexible.

best,
-tony

-- 
A.J. Rossini				Rsrch. Asst. Prof. of Biostatistics
U. of Washington Biostatistics		rossini at u.washington.edu	
FHCRC/SCHARP/HIV Vaccine Trials Net	rossini at scharp.org
-------------- http://software.biostat.washington.edu/ ----------------
FHCRC: M: 206-667-7025 (fax=4812)|Voicemail is pretty sketchy/use Email
UW:   Th: 206-543-1044 (fax=3286)|Change last 4 digits of phone to FAX
(my tuesday/wednesday/friday locations are completely unpredictable.)





