[SciPy-user] stats review: std/var and samplestd/samplevar

Sun Apr 2 17:39:20 EDT 2006

Hi folks -

It appears to me that the scipy.stats implementations for calculating
sample variances and population variances (and hence standard
deviations too) are somehow reversed.

Specifically, the variance of an entire population is calculated with  
a denominator of the population size N. The variance of a sample from  
a population is either estimated using a denominator of the sample  
size n (to obtain a biased estimate) or n-1 (to obtain an unbiased
estimate). Note that saying "sample variance" does not imply the use
of the n-1 estimator, as there are cases in which the biased
estimator may legitimately be used. (*)
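
To make the two denominators concrete, here is a minimal sketch in plain Python. The helper `variance` and its `ddof` argument are hypothetical names chosen for illustration; they are not the scipy.stats functions discussed in this message. The standard-library `statistics` module is used only as a cross-check.

```python
import statistics

# Small made-up data set, used only to illustrate the arithmetic.
data = [2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0]


def variance(values, ddof=0):
    """Sum of squared deviations from the mean, divided by (n - ddof).

    ddof=0 gives the N-denominated (population / biased sample) variance,
    ddof=1 gives the (n-1)-denominated (unbiased sample) variance.
    Hypothetical helper for illustration only.
    """
    n = len(values)
    mean = sum(values) / n
    return sum((x - mean) ** 2 for x in values) / (n - ddof)


var_n = variance(data, ddof=0)          # denominator n
var_n_minus_1 = variance(data, ddof=1)  # denominator n - 1

# Cross-check against the standard library, which offers both conventions.
assert abs(var_n - statistics.pvariance(data)) < 1e-12
assert abs(var_n_minus_1 - statistics.variance(data)) < 1e-12

print(var_n, var_n_minus_1)
```

The two results differ by exactly the factor n/(n-1), so the gap shrinks as the sample grows.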

see e.g.:
http://en.wikipedia.org/wiki/Variance
http://en.wikipedia.org/wiki/Standard_deviation

However, scipy.stats.std and scipy.stats.var use N-1, while
scipy.stats.samplestd and scipy.stats.samplevar use N. This is
clearly incorrect naming any way you slice it.

I would propose to have:
(1) scipy.stats.var and scipy.stats.std -- use N as the denominator

(2) scipy.stats.samplevar and scipy.stats.samplestd -- at least use
n-1 as the denominator. Better would be to deprecate / remove them
because, as noted above, "sample variance" is ambiguous.

(3) scipy.stats.var_unbiased -- use n-1 as denominator. As per the  
note below, there is no general unbiased estimator of the standard  
deviation, and so there should be no scipy.stats.std_unbiased  
function. (See the Wikipedia entry and also
http://www.itl.nist.gov/div898/handbook/pmc/section3/pmc32.htm )

I feel vaguely that the N-1 estimator is always problematic, because  
if you have a small enough sample that it makes a difference, you've  
got bigger problems than using N or N-1. Not that these problems are  
insurmountable, but you've got to have some statistical savvy to deal  
properly with them. As such, I think that the default functions (var  
and std) should just return the population statistics. But reasonable  
people may disagree.

Zach Pincus

Program in Biomedical Informatics and Department of Biochemistry
Stanford University School of Medicine

(*) E.g.: While it is possible to estimate the variance in an  
unbiased manner, estimating the standard deviation of a population  
from a sample without bias is actually impossible without assumptions  
about the population. (There is a complex correction factor for  
samples from normal populations discussed on the NIST page.)
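
A rough simulation of this point, assuming normally distributed data: the square root of the (N-1)-denominated variance still underestimates the true standard deviation on average. The correction used below is the standard c4 factor for normal samples; that it is the same factor the NIST page discusses is an assumption here, and the function name `c4` is just a label for this sketch.

```python
import math
import random
import statistics


def c4(n):
    """Expected value of s / sigma for a normal sample of size n,
    where s is the square root of the (n-1)-denominated variance."""
    return math.sqrt(2.0 / (n - 1)) * math.gamma(n / 2.0) / math.gamma((n - 1) / 2.0)


random.seed(0)   # fixed seed so the run is repeatable
sigma = 1.0      # true population standard deviation (assumed)
n = 5            # small sample size, where the bias is visible
trials = 20000

# statistics.stdev uses the n-1 denominator.
s_values = [
    statistics.stdev([random.gauss(0.0, sigma) for _ in range(n)])
    for _ in range(trials)
]
mean_s = sum(s_values) / trials

print("average s:          ", mean_s)            # falls short of sigma
print("c4(n) * sigma:      ", c4(n) * sigma)     # what theory predicts for normal data
print("average s / c4(n):  ", mean_s / c4(n))    # corrected, close to sigma
```

The correction only works because the population is known to be normal; for other populations the factor would be different, which is the sense in which no general unbiased estimator exists.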

Moreover, though the (N-1)-denominated estimator of the variance is  
unbiased, the estimator itself has a greater variance around the true  
value than the N-denominated estimator. As such, using the unbiased  
estimator can sap statistical power from some tests. This is why  
sometimes one might use the N-denominated estimator for the sample  
variance.
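
The trade-off can be checked with a small simulation, again assuming normal data and a known true variance of 1. This is only a sketch; `mse` is a hypothetical helper name, and the comparison of mean squared errors is specific to the assumptions made here.

```python
import random
import statistics

random.seed(0)   # fixed seed so the run is repeatable
true_var = 1.0   # true population variance (assumed)
n = 5
trials = 20000

est_n = []           # N-denominated estimates
est_n_minus_1 = []   # (N-1)-denominated estimates
for _ in range(trials):
    sample = [random.gauss(0.0, 1.0) for _ in range(n)]
    est_n.append(statistics.pvariance(sample))         # denominator n
    est_n_minus_1.append(statistics.variance(sample))  # denominator n - 1


def mse(estimates, truth):
    """Mean squared error of a list of estimates around the true value."""
    return sum((e - truth) ** 2 for e in estimates) / len(estimates)


# Average value of each estimator: the n-1 version centres on the truth,
# the n version falls short of it.
print("mean (n):    ", sum(est_n) / trials)
print("mean (n-1):  ", sum(est_n_minus_1) / trials)

# Spread of each estimator across repeated samples.
print("spread (n):  ", statistics.pvariance(est_n))
print("spread (n-1):", statistics.pvariance(est_n_minus_1))

# Mean squared error around the true variance.
print("MSE (n):     ", mse(est_n, true_var))
print("MSE (n-1):   ", mse(est_n_minus_1, true_var))
```

Since the N-denominated estimate is just the (N-1)-denominated one scaled by (n-1)/n, its spread is smaller by exactly ((n-1)/n)^2 for any data; whether that outweighs the bias depends on the population and the sample size.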