[SciPy-user] stats review: std/var and samplestd/samplevar

Mon Apr 3 09:24:43 EDT 2006

Hi again folks,

> I think the original poster meant (N-1) some of the time when they
> said (1-N).

Yeah, sorry.

The take-home message is that scipy.stats uses "sample variance" to
mean "a variance with N in the denominator", whereas the rest of the
world uses "sample variance" to mean "an estimator of the population
variance with either N-1 or N in the denominator". Likewise,
scipy.stats uses "variance" to mean "the unbiased estimator of the
population variance (N-1 in the denominator)", which is not in general
what "variance" means.

In both cases, these usages are not clear, and in the "sample" case,  
it is directly contrary to established usage.
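
To make the two denominators concrete, here is a minimal sketch. It uses Python's standard-library statistics module rather than scipy.stats itself (that substitution is my assumption, purely for illustration); pvariance and variance are that module's names, not scipy's.

```python
import statistics

# Small made-up data set, chosen so the arithmetic is easy to check by hand.
data = [2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0]
n = len(data)

# Sum of squared deviations from the mean, divided by N.
var_n = statistics.pvariance(data)

# The same sum, divided by N-1 (unbiased estimator of the population variance).
var_n_minus_1 = statistics.variance(data)

print("N denominator:  ", var_n)
print("N-1 denominator:", var_n_minus_1)
```

The two results differ only by the factor (N-1)/N, which is why the naming matters more than the arithmetic.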

> why not simply have scipy.stats.var (and std) with an option for
> whether you want N or N-1?

How do people feel about this? The folks on the numpy list have  
relatively strong feelings that when functions have a boolean flag  
such as you're proposing, then that means that they really should be  
two functions. I'm not really sure how strongly I feel about that.

Would it be OK to have scipy.stats.var have a boolean
'unbiased_estimator' or 'UnbiasedEstimator' flag?
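
For discussion, such a flag might look roughly like the sketch below. This is a hypothetical pure-Python stand-in, not the scipy.stats implementation; the default of True is my assumption, chosen only to match what scipy.stats.var is described as doing above.

```python
def var(values, unbiased_estimator=True):
    """Hypothetical sketch of a variance function with a denominator flag.

    unbiased_estimator=True  -> divide by N-1
    unbiased_estimator=False -> divide by N
    """
    n = len(values)
    if n == 0:
        raise ValueError("need at least one value")
    if unbiased_estimator and n < 2:
        raise ValueError("need at least two values for the N-1 estimator")
    mean = sum(values) / n
    # Sum of squared deviations from the mean.
    ss = sum((v - mean) ** 2 for v in values)
    return ss / (n - 1) if unbiased_estimator else ss / n


data = [2, 4, 4, 4, 5, 5, 7, 9]
print(var(data, unbiased_estimator=False))
print(var(data, unbiased_estimator=True))
```

The two-function alternative favoured on the numpy list would simply split the two return branches into separately named functions.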

I'm rather unsure whether scipy.stats.std ought to have such a flag,
given the caveats (e.g. that there is no general unbiased estimator of
the standard deviation), but if that's what people want...
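
To illustrate the caveat about std, here is a small sketch that enumerates every equally likely sample of size 2 (drawn with replacement) from a toy two-value population. The population and the use of the standard-library statistics module are my own assumptions for illustration. The N-1 variance averages out to the population variance, but its square root does not average out to the population standard deviation.

```python
import itertools
import math
import statistics

# Toy population (made up for illustration).
population = [0.0, 1.0]
pop_var = statistics.pvariance(population)
pop_std = math.sqrt(pop_var)

# All equally likely samples of size 2, drawn with replacement.
samples = list(itertools.product(population, repeat=2))

# Average of the N-1 variance estimate over all samples.
mean_var = sum(statistics.variance(s) for s in samples) / len(samples)

# Average of the square root of that estimate over all samples.
mean_std = sum(math.sqrt(statistics.variance(s)) for s in samples) / len(samples)

print("population variance:", pop_var, " mean N-1 variance:", mean_var)
print("population std:     ", pop_std, " mean sqrt(N-1 variance):", mean_std)
```

So a flag on std would select the square root of the unbiased variance estimate, which is not itself an unbiased estimate of the standard deviation.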

Zach

>
>> I would propose to have:
>> (1) scipy.stats.var and scipy.stats.std -- use N as the denominator
>>
>> (2) scipy.stats.samplevar and scipy.stats.samplestd -- at least use
>> n-1 as the denominator. Better would be to deprecate / remove them
>> because as above "sample variance" is ambiguous.
>>
>> (3) scipy.stats.var_unbiased -- use n-1 as denominator. As per the
>> note below, there is no general unbiased estimator of the standard
>> deviation, and so there should be no scipy.stats.std_unbiased
>> function. (See the Wikipedia entry and also
>> http://www.itl.nist.gov/div898/handbook/pmc/section3/pmc32.htm )
>
>
>> I feel vaguely that the N-1 estimator is always problematic, because
>> if you have a small enough sample that it makes a difference, you've
>> got bigger problems than using N or N-1. Not that these problems are
>> insurmountable, but you've got to have some statistical savvy to deal
>> properly with them. As such, I think that the default functions (var
>> and std) should just return the population statistics. But reasonable
>> people may disagree.
>
> Whilst you might argue that N vs N-1 isn't going to make much of a
> difference on a large sample, I am still strongly of the opinion that
> it should be an option.
>
> why not simply have scipy.stats.var (and std) with an option for
> whether you want N or N-1?
>
> Matthew
>
> -- 
> Matthew Vernon MA VetMB LGSM MRCVS
> Farm Animal Epidemiology and Informatics Unit
> Department of Veterinary Medicine, University of Cambridge
> http://www.cus.cam.ac.uk/~mcv21/