[Numpy-discussion] non-standard standard deviation

Wed Dec 2 13:25:07 EST 2009

On 29-Nov-09 20:15 PM, Robin wrote:
> On Mon, Nov 30, 2009 at 12:30 AM, Colin J. Williams<cjw at ncf.ca>  wrote:
>    
>> On 29-Nov-09 17:13 PM, Dr. Phillip M. Feldman wrote:
>>      
>>> All of the statistical packages that I am currently using and have used in
>>> the past (Matlab, Minitab, R, S-plus) calculate standard deviation using the
>>> sqrt(1/(n-1)) normalization, which gives a result that is unbiased when
>>> sampling from a normally-distributed population.  NumPy uses the sqrt(1/n)
>>> normalization.  I'm currently using the following code to calculate standard
>>> deviations, but would much prefer if this could be fixed in NumPy itself:
>>>
>>> import numpy
>>>
>>> def mystd(x, axis=None):
>>>      """This function calculates the standard deviation of the input using the
>>>      definition of standard deviation that gives an unbiased result for
>>>      samples from a normally-distributed population."""
>>>
>>>      x = numpy.asarray(x, dtype=float)
>>>      # keepdims=True lets the mean broadcast against x for any axis.
>>>      xd = x - x.mean(axis=axis, keepdims=True)
>>>      # Divide by n - 1 rather than n.
>>>      return numpy.sqrt((xd*xd).sum(axis=axis) / (numpy.size(x, axis=axis) - 1.0))
>>>
>>>        
>> Anne Archibald has suggested a work-around.  Perhaps ddof could be set
>> to 1 by default, as other values are rarely required.
>>
>> Where the distribution of a variate is not known a priori, I believe
>> that it can be shown that the n-1 divisor provides the best estimate
>> of the variance.
>>      
> There have been previous discussions on this (but I can't find them
> now) and I believe the current default was chosen deliberately. I
> think it is the view of the numpy developers that the n divisor has
> more desirable properties in most cases than the traditional n-1 -
> see this paper by Travis Oliphant for details:
> http://hdl.handle.net/1877/438
>
> Cheers
>
> Robin
The conventional approach, based on the notion of expected values, is
given here:
http://en.wikipedia.org/wiki/Variance#Distribution_of_the_sample_variance
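
As a small illustration of the expected-value argument, the sketch below
enumerates every possible sample of size 2 drawn with replacement from a
tiny population and averages the two variance estimates.  The population
values and the variable names are made up for the example; only
numpy.var and its ddof argument come from NumPy itself.

```python
import itertools
import numpy as np

# Hypothetical three-value population, chosen only to keep the numbers small.
population = np.array([0.0, 1.0, 2.0])
n = 2  # sample size

# Every equally likely sample of size n drawn with replacement (9 rows).
samples = np.array(list(itertools.product(population, repeat=n)))

pop_var = population.var()                        # true population variance
mean_var_n = samples.var(axis=1, ddof=0).mean()   # average of the n-divisor estimate
mean_var_n1 = samples.var(axis=1, ddof=1).mean()  # average of the (n-1)-divisor estimate

print(pop_var, mean_var_n, mean_var_n1)
```

Averaged over all samples, the n-1 estimate matches the population
variance, while the n estimate comes out smaller by the factor (n-1)/n.
Note that this concerns the variance; the square root of an unbiased
variance estimate is not itself an unbiased estimate of the standard
deviation.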

I would suggest that numpy should stick with that until the approach 
advocated in: http://hdl.handle.net/1877/438
is generally accepted.
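
In the meantime, the conventional n-1 result is available through the
ddof argument mentioned earlier in the thread.  A minimal sketch, with
arbitrary example data:

```python
import numpy as np

# Arbitrary example data; the mean is 5 and the squared deviations sum to 32.
x = np.array([2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0])

std_n = np.std(x)            # default ddof=0: divide by n
std_n1 = np.std(x, ddof=1)   # ddof=1: divide by n - 1

# The two results differ only by the factor sqrt(n / (n - 1)).
ratio = std_n1 / std_n

print(std_n, std_n1, ratio)
```

For large n the factor approaches 1, so the choice of divisor matters
mainly for small samples.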

Thomas Bayes introduced some nebulous ideas that might not be relevant 
for most cases when one is trying to find a confidence interval for a 
mean: http://en.wikipedia.org/wiki/Thomas_bayes#Bayes.27_theorem

Colin W.