[Numpy-discussion] non-standard standard deviation
Colin J. Williams
cjw at ncf.ca
Sun Dec 6 11:01:13 EST 2009
On 04-Dec-09 10:54 AM, Bruce Southey wrote:
> On 12/04/2009 06:18 AM, yogesh karpate wrote:
>> @ Pauli and @ Colin:
>> Sorry for the late reply. I was busy with some other assignments.
>> # As for normalization by (n), the common assumption is that the
>> population is normally distributed and large enough to fit the
>> normal distribution. But this standard deviation, when applied to a
>> small sample, tends to be too low and is therefore called biased.
>> # Bessel's correction, i.e. normalization by (n-1), exists for the
>> std. deviation of small samples.
>> # In "electrical-and-electronic-measurements-and-instrumentation" by
>> A.K. Sawhney, the first chapter, "Fundamentals of Measurements",
>> shows that for N=16 the std. deviation normalization was (n-1)=15.
>> # While I was learning statistics, my course instructor would
>> advise taking n=20 for normalization by (n-1).
>> # Probability and Statistics in the Schaum's Outline series is good
>> reading.
>> Regards
>> ~ymk
>>
>>
>> _______________________________________________
>> NumPy-Discussion mailing list
>> NumPy-Discussion at scipy.org
>> http://mail.scipy.org/mailman/listinfo/numpy-discussion
>>
> Hi,
> Basically, all that I see with these arbitrary values is that you are
> relying on the 'central limit theorem'
> (http://en.wikipedia.org/wiki/Central_limit_theorem). Really the
> issue in using these values is how much statistical bias you will
> tolerate, especially in its impact on how that estimate is used, because
> the usage of variance (such as in statistical tests) tends to be more
> influenced by bias than the estimate of variance itself. (Of course,
> many features rely on asymptotic properties, so bias concerns are less
> apparent in large sample sizes.)
>
> Obviously the default relies on the developers' background and
> requirements. There are multiple valid variance estimators in
> statistics with different denominators, like N (maximum likelihood
> estimator), N-1 (restricted maximum likelihood estimator and certain
> Bayesian estimators) and Stein's
> (http://en.wikipedia.org/wiki/James%E2%80%93Stein_estimator). So
> the current default behavior is valid and documented. Consequently
> you cannot just have one option or different functions (like certain
> programs), and NumPy's implementation actually allows you to do all of
> these in a single function. So I also see no reason to change, even if
> I have to add the ddof=1 argument; after all, 'Explicit is better than
> implicit' :-).
>
> Bruce
Bruce,
I suggest that the Central Limit Theorem is tied in with the Law of
Large Numbers.
When one has a smallish sample size, what gives the best estimate of the
variance? Bessel's correction provides a rationale, based on
expectations: (http://en.wikipedia.org/wiki/Bessel%27s_correction).
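
The expectation argument can be checked exactly on a toy case. The sketch below is only an illustration: the fair-die population, the sample size of 3 and the helper name `variance` are my own assumptions, not anything from NumPy. It enumerates every equally likely sample drawn with replacement and averages the two estimators; the `ddof` argument mirrors the meaning it has in NumPy, i.e. the divisor is (n - ddof).

```python
from fractions import Fraction
from itertools import product


def variance(sample, ddof):
    """Sum of squared deviations from the sample mean, divided by (n - ddof)."""
    n = len(sample)
    mean = Fraction(sum(sample), n)
    return sum((x - mean) ** 2 for x in sample) / (n - ddof)


# Hypothetical population: the six faces of a fair die.
population = [1, 2, 3, 4, 5, 6]
n = 3  # a deliberately small sample size

# Variance of the population itself (divisor N).
true_var = variance(population, ddof=0)

# Every equally likely sample of size n, drawn with replacement.
samples = list(product(population, repeat=n))

# Average of each estimator over all possible samples, i.e. its expectation.
mean_var_n = sum(variance(s, 0) for s in samples) / len(samples)
mean_var_n1 = sum(variance(s, 1) for s in samples) / len(samples)

print("population variance      :", true_var)
print("expected value, divisor n  :", mean_var_n)
print("expected value, divisor n-1:", mean_var_n1)
```

Under these assumptions the divisor n-1 recovers the population variance on average, while the divisor n falls short by the factor (n-1)/n, which matters most when n is small.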
It is difficult to understand the proof of Stein:
http://en.wikipedia.org/wiki/Proof_of_Stein%27s_example
The symbols used are not clearly stated. He seems interested in a
decision rule for the calculation of the mean of a sample and claims
that his approach is better than the traditional Least Squares approach.
In most cases, the interest is likely to be in the variance, with a view
to establishing a confidence interval.
In the widely used Analysis of Variance (ANOVA), the degrees of freedom
are reduced for each mean estimated, see:
http://www.mnstate.edu/wasson/ed602lesson13.htm for the example below:
*Analysis of Variance Table*

Source of        Sum of    Degrees of   Mean
Variation        Squares   Freedom      Square   F Ratio   p
Between Groups   25.20      2           12.60    5.178     <.05
Within Groups    29.20     12            2.43
Total            54.40     14

There is a sample of 15 observations, which is divided into three
groups, depending on the number of hours of therapy.
Thus, the Total degrees of freedom are 15-1 = 14, the Between Groups
3-1 = 2, and the Within Groups (residual) 14 - 2 = 12.
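
As a rough check of that arithmetic, the sketch below recomputes the degrees of freedom, mean squares and F ratio from the sums of squares in the table. The variable names are my own; only the numbers come from the example.

```python
# Figures taken from the ANOVA table above.
n_obs = 15          # total observations
n_groups = 3        # therapy groups
ss_between = 25.20  # between-groups sum of squares
ss_within = 29.20   # within-groups sum of squares

# One degree of freedom is lost for each mean that is estimated.
df_total = n_obs - 1               # grand mean estimated
df_between = n_groups - 1          # group means, relative to the grand mean
df_within = df_total - df_between  # what remains for the residual

# Mean square = sum of squares / degrees of freedom.
ms_between = ss_between / df_between
ms_within = ss_within / df_within
f_ratio = ms_between / ms_within

print(df_total, df_between, df_within)
print(round(ms_between, 2), round(ms_within, 2), round(f_ratio, 3))
```

The rounded values agree with the Mean Square and F Ratio columns of the table.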
Colin W.