[Numpy-discussion] non-standard standard deviation

Sun Dec 6 11:01:13 EST 2009

On 04-Dec-09 10:54 AM, Bruce Southey wrote:
> On 12/04/2009 06:18 AM, yogesh karpate wrote:
>> @ Pauli and @ Colin:
>> Sorry for the late reply. I was busy with some other assignments.
>> # As far as normalization by (n) is concerned, it is a common
>> assumption that the population is normally distributed and that the
>> population size is large enough to fit the normal distribution. But this
>> standard deviation, when applied to a small population, tends to be
>> too low; therefore it is called biased.
>> # The correction known as Bessel's correction exists for the standard
>> deviation of small samples, i.e. normalization by (n-1).
>> # In "electrical-and-electronic-measurements-and-instrumentation" by
>> A.K. Sawhney, in the 1st chapter of the book, "Fundamentals of
>> Measurements", it is shown that for N=16 the std. deviation
>> normalization was (n-1)=15.
>> # While I was learning statistics in my course, the instructor would
>> advise taking n=20 for normalization by (n-1).
>> # Probability and Statistics in the Schaum's series is good reading.
>> Regards
>> ~ymk
>>
> Hi,
> Basically, all that I see with these arbitrary values is that you are
> relying on the 'central limit theorem'
> (http://en.wikipedia.org/wiki/Central_limit_theorem). Really, the
> issue in using these values is how much statistical bias you will
> tolerate, especially in its impact on how the estimate is used, because
> the uses of variance (such as in statistical tests) tend to be more
> influenced by bias than the estimate of variance itself. (Of course, many
> features rely on asymptotic properties, so bias concerns are less
> apparent in large sample sizes.)
>
> Obviously the default relies on the developer's background and
> requirements. There are multiple valid variance estimators in
> statistics with different denominators, like N (maximum likelihood
> estimator), N-1 (restricted maximum likelihood estimator and certain
> Bayesian estimators) and Stein's
> (http://en.wikipedia.org/wiki/James%E2%80%93Stein_estimator). So
> the current default behavior is valid and documented. Consequently
> you can not just have one option or different functions (like certain
> programs), and Numpy's implementation actually allows you to do all of
> these in a single function. So I also see no reason to change, even if I
> have to add the ddof=1 argument; after all, 'Explicit is better than
> implicit' :-).
>
> Bruce
Bruce,

I suggest that the Central Limit Theorem is tied in with the Law of 
Large Numbers.

When one has a smallish sample size, what gives the best estimate of the 
variance?  Bessel's correction provides a rationale, based on 
expectations: (http://en.wikipedia.org/wiki/Bessel%27s_correction).
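
The expectation argument can be checked numerically. The following is only a rough sketch, not anything from the thread: the sample size, number of trials, seed and population variance are arbitrary assumptions, and the only NumPy feature relied on is the existing ddof argument of var. Averaged over many small samples, the divide-by-n estimate should come out close to (n-1)/n times the true variance, while the divide-by-(n-1) estimate should come out close to the true variance.

```python
import numpy as np

rng = np.random.default_rng(0)   # arbitrary seed (assumption)
n = 5                            # small sample size (assumption)
trials = 200_000                 # number of simulated samples (assumption)
true_var = 4.0                   # population variance, i.e. sigma = 2

# Each row is one small sample drawn from a normal population.
samples = rng.normal(loc=0.0, scale=np.sqrt(true_var), size=(trials, n))

# Average each estimator over all samples to approximate its expectation.
mean_var_n = samples.var(axis=1, ddof=0).mean()    # denominator n
mean_var_n1 = samples.var(axis=1, ddof=1).mean()   # denominator n - 1

print("divide by n:    ", mean_var_n)
print("divide by n - 1:", mean_var_n1)
print("expected ratio (n-1)/n:", (n - 1) / n)
```

For any single sample the two estimates differ exactly by the factor (n-1)/n, so the choice only matters noticeably when n is small.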

It is difficult to understand the proof of Stein: 
http://en.wikipedia.org/wiki/Proof_of_Stein%27s_example

The symbols used are not clearly stated.  He seems interested in a 
decision rule for the calculation of the mean of a sample and claims 
that his approach is better than the traditional Least Squares approach.

In most cases, the interest is likely to be in the variance, with a view 
to establishing a confidence interval.

In the widely used Analysis of Variance (ANOVA), the degrees of freedom 
are reduced for each mean estimated, see:
http://www.mnstate.edu/wasson/ed602lesson13.htm for the example below:

Analysis of Variance Table

Source of        Sum of    Degrees of   Mean
Variation        Squares   Freedom      Square   F Ratio   p
Between Groups   25.20      2           12.60    5.178     <.05
Within Groups    29.20     12            2.43
Total            54.40     14

There is a sample of 15 observations, which is divided into three 
groups, depending on the number of hours of therapy.
Thus, the Total degrees of freedom are 15-1 = 14, the Between Groups 
degrees of freedom are 3-1 = 2, and the Within Groups (residual) degrees 
of freedom are 14 - 2 = 12.
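
The degrees-of-freedom bookkeeping can be written out as a few lines of arithmetic. This is a small sketch using only the sums of squares and counts quoted in the table; the variable names are mine, chosen for illustration.

```python
# Values taken from the ANOVA table quoted in this message.
n_obs = 15          # total observations
n_groups = 3        # groups (hours of therapy)
ss_between = 25.20  # Between Groups sum of squares
ss_within = 29.20   # Within Groups sum of squares

# One degree of freedom is lost for each mean that is estimated.
df_total = n_obs - 1                # grand mean
df_between = n_groups - 1           # group means, given the grand mean
df_within = df_total - df_between   # what is left for the residual

# Mean squares are sums of squares divided by their degrees of freedom.
ms_between = ss_between / df_between
ms_within = ss_within / df_within
f_ratio = ms_between / ms_within

print(df_total, df_between, df_within)
print(round(ms_between, 2), round(ms_within, 2), round(f_ratio, 3))
```

The rounded results agree with the Mean Square and F Ratio columns of the table.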

Colin W.