[Numpy-discussion] non-standard standard deviation
Colin J. Williams
cjw at ncf.ca
Wed Dec 2 13:25:07 EST 2009
On 29-Nov-09 20:15 PM, Robin wrote:
> On Mon, Nov 30, 2009 at 12:30 AM, Colin J. Williams<cjw at ncf.ca> wrote:
>
>> On 29-Nov-09 17:13 PM, Dr. Phillip M. Feldman wrote:
>>
>>> All of the statistical packages that I am currently using and have used in
>>> the past (Matlab, Minitab, R, S-plus) calculate standard deviation using the
>>> sqrt(1/(n-1)) normalization, which gives a result that is unbiased when
>>> sampling from a normally-distributed population. NumPy uses the sqrt(1/n)
>>> normalization. I'm currently using the following code to calculate standard
>>> deviations, but would much prefer if this could be fixed in NumPy itself:
>>>
>>> import numpy
>>>
>>> def mystd(x=numpy.array([]), axis=None):
>>>     """This function calculates the standard deviation of the input using the
>>>     definition of standard deviation that gives an unbiased result for samples
>>>     from a normally-distributed population."""
>>>
>>>     # keepdims=True lets the mean broadcast against x for any axis
>>>     xd = x - x.mean(axis=axis, keepdims=True)
>>>     return numpy.sqrt((xd*xd).sum(axis=axis) / (numpy.size(x, axis=axis) - 1.0))
>>>
>>>
>> Anne Archibald has suggested a work-around. Perhaps ddof could be set to 1
>> by default, as other values are rarely required.
>>
>> Where the distribution of a variate is not known a priori, I believe it
>> can be shown that the n-1 divisor provides the best estimate of the variance.
>>
> There have been previous discussions on this (but I can't find them
> now) and I believe the current default was chosen deliberately. I
> think it is the view of the numpy developers that the n divisor has
> more desirable properties in most cases than the traditional n-1 -
> see this paper by Travis Oliphant for details:
> http://hdl.handle.net/1877/438
>
> Cheers
>
> Robin
The conventional approach, based on the notion of expected values, is
given here:
http://en.wikipedia.org/wiki/Variance#Distribution_of_the_sample_variance
I would suggest that numpy should stick with that until the approach
advocated in: http://hdl.handle.net/1877/438
is generally accepted.
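
As a rough sketch of what the two divisors do in practice (the sample data, seed, and variable names below are illustrative and not taken from this thread), the ddof argument of numpy.std and numpy.var selects the divisor n - ddof, and a small simulation suggests how the expected values of the two variance estimators differ:

```python
import numpy as np

# Illustrative data: mean 5, sum of squared deviations 32
x = np.array([2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0])

std_n = np.std(x)            # divisor n (NumPy default, ddof=0)
std_n1 = np.std(x, ddof=1)   # divisor n - 1

# Monte Carlo check of the expected value of each variance estimator,
# drawing many samples of size 5 from a population with variance 1.
rng = np.random.default_rng(0)
samples = rng.normal(loc=0.0, scale=1.0, size=(200_000, 5))
mean_var_n = samples.var(axis=1).mean()           # tends towards (n-1)/n = 0.8
mean_var_n1 = samples.var(axis=1, ddof=1).mean()  # tends towards 1.0

print(std_n, std_n1)
print(mean_var_n, mean_var_n1)
```

Under these assumptions the n-1 divisor recovers the population variance on average, while the n divisor underestimates it by the factor (n-1)/n.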
Thomas Bayes introduced some nebulous ideas that might not be relevant
for most cases when one is trying to find a confidence interval for a
mean: http://en.wikipedia.org/wiki/Thomas_bayes#Bayes.27_theorem
Colin W.