[SciPy-user] error estimate in stats.linregress

Bruce Southey bsouthey at gmail.com
Mon Feb 23 15:01:45 EST 2009


Hi,
Yes, the formula is incorrect. The reason is that the sum of squares 
terms are not corrected by the means because the ss function just 
computes the uncorrected sum of squares.

Thus the correct formula should be:
sterrest = np.sqrt((1-r*r)*ss(y-ymean) / (df*ss(x-xmean)))
Alternatively:
sterrest = np.sqrt((1-r*r)*(ss(y)-n*ymean*ymean) / (ss(x)-n*xmean*xmean) / df)

Note that the formula is derived from the definition of R-squared:
the estimated variance of the slope = MSE/Sxx = ((1-R*R)*Syy)/(df*Sxx),
where Syy and Sxx are the corrected sums of squares for Y and X,
respectively.
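A quick numerical sketch (with made-up data, purely for illustration) checking that the two corrected expressions agree with each other and with the MSE/Sxx derivation; the ss helper below mimics the uncorrected sum of squares computed in stats.py:

```python
import numpy as np

def ss(a):
    # uncorrected sum of squares of the elements, as the ss() helper computes
    return np.sum(a * a)

# made-up sample data, for illustration only
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.1, 3.9, 6.2, 7.8, 10.1])

n = len(x)
df = n - 2
xmean, ymean = x.mean(), y.mean()
r = np.corrcoef(x, y)[0, 1]

# corrected formula, using mean-centered sums of squares
sterrest = np.sqrt((1 - r*r) * ss(y - ymean) / (df * ss(x - xmean)))

# equivalent form, rewriting Syy and Sxx via the uncorrected sums
sterrest2 = np.sqrt((1 - r*r) * (ss(y) - n*ymean*ymean)
                    / (ss(x) - n*xmean*xmean) / df)

# derivation check: MSE/Sxx, with MSE taken from the fitted residuals
slope, intercept = np.polyfit(x, y, 1)
mse = ss(y - slope*x - intercept) / df
print(sterrest, sterrest2, np.sqrt(mse / ss(x - xmean)))
```

All three printed values should coincide, since (1-r*r)*Syy equals the residual sum of squares for an ordinary least-squares fit.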

Regards
Bruce

Hi all,
I was working with linear regression in scipy and ran into problems with
the value of the standard error of the estimate returned by the
scipy.stats.linregress() function. It did not match the corresponding
output of other linear regression routines (for example, in Origin),
so I took a look at the source (stats.py).

In the source it is defined as
sterrest = np.sqrt((1-r*r)*ss(y) / ss(x) / df)
where r is the correlation coefficient, df is the degrees of freedom (N-2),
and ss() is the sum of squares of the elements.

After digging through the literature, the only formula that looked at all
similar was
stderrest = np.sqrt((1-r*r)*ss(y-y.mean())/df)
which gives the same result as the standard definition (in the notation of
the linregress source)
stderrest = np.sqrt(ss(y-slope*x-intercept)/df)
but the output of linregress is different.
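For what it's worth, a small check (again with made-up numbers) shows that the two formulas above agree with each other, while the expression currently in stats.py produces a different value:

```python
import numpy as np

def ss(a):
    # sum of squares of the elements, as in stats.py
    return np.sum(a * a)

# made-up data, for illustration only
x = np.array([0.0, 1.0, 2.0, 3.0, 4.0])
y = np.array([1.2, 2.9, 5.1, 7.2, 8.8])

n = len(x)
df = n - 2
r = np.corrcoef(x, y)[0, 1]
slope, intercept = np.polyfit(x, y, 1)

# the two equivalent definitions of the standard error of the estimate
se_lit = np.sqrt((1 - r*r) * ss(y - y.mean()) / df)
se_def = np.sqrt(ss(y - slope*x - intercept) / df)

# the expression currently used in linregress (uncorrected sums of squares)
se_src = np.sqrt((1 - r*r) * ss(y) / ss(x) / df)

print(se_lit, se_def, se_src)
```

The first two values agree; the third, built from uncorrected sums of squares, does not.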

I humbly suppose this is a bug, but perhaps somebody could explain what is
going on if I'm wrong...
Pavlo.
