numpy/scipy: correlation
robert
no-spam at no-spam-no-spam.invalid
Sun Nov 12 11:25:00 EST 2006
robert wrote:
> Robert Kern wrote:
> http://links.jstor.org/sici?sici=0162-1459(192906)24%3A166%3C170%3AFFPEOC%3E2.0.CO%3B2-Y
>
> tells:
> probable error of r = 0.6745*(1-r**2)/sqrt(N)
>
> A simple function of r and N - roughly what I expected above for the
> N-only dependence. But it is therefore not sensitive to the considerations
> about 'boring' data above. With the example above it would yield a decrease
> of this probable coefficient error from
> 0.0115628571429 to 0.00548453410954 !
Let me check whether this 1929 formula for estimating the error of the correlation coefficient makes sense, starting with r = 0.
I do a Monte Carlo on correlating random series:
>>> X=numpy.random.random(10000)
>>> l=[]
>>> for i in range(200):
... Z=numpy.random.random(10000)
... l.append( sd.correlation((X,Z))[2] )  # collect the coefficients
...
>>> mean(l)
0.000327657082234
>>> std(l)
0.0109120766158 # that's how the coefficient jitters
>>> std(l)/sqrt(len(l))
0.000771600337185
>>> len(l)
200
# now
# 0.6745*(1-r**2)/sqrt(N) = 0.0067440015079
# vs M.C. 0.0109120766158 ± 0.000771600337185
So for r = 0 the fancy factor of 0.6745 is just that - fancy: the formula significantly underestimates the observed jitter.
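For anyone without my sd module, the same r = 0 experiment can be run in plain numpy (assuming sd.correlation((X,Z))[2] is just the Pearson coefficient, so numpy.corrcoef should give the same), seeded to be repeatable:

```python
import numpy as np

rng = np.random.default_rng(0)           # seeded so the run is repeatable
N, trials = 10000, 200

X = rng.random(N)
# correlate X against 200 fresh random series and collect the coefficients
rs = [np.corrcoef(X, rng.random(N))[0, 1] for _ in range(trials)]

print(np.mean(rs))                         # close to 0
print(np.std(rs))                          # jitter of r, roughly 1/sqrt(N) = 0.01
print(0.6745 * (1 - 0.0**2) / np.sqrt(N))  # the formula at r = 0: 0.006745
```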
Then for a higher correlation (around 0.5):
>>> l=[]
>>> for i in range(200):
... Z=numpy.random.random(10000)+array(range(10000))/10000.0
... l.append( sd.correlation((X+array(range(10000))/10000.0,Z))[2] )
...
>>> mean(l)
0.498905642552
>>> std(l)
0.00546979583163
>>> std(l)/sqrt(len(l))
0.000386772972425
#now:
# 0.6745*(1-r**2)/sqrt(N) = 0.00512173224849
# vs M.C. 0.00546979583163 ± 0.000386772972425
=> Here the 0.6745 factor and the (1-r**2) term capture the main effect! There is something to it.
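For reference, 0.6745 is (to four digits) the 75th percentile of the standard normal distribution - it is what turns the large-sample standard error (1-r**2)/sqrt(N) into a 50% "probable" error. A minimal helper (the name probable_error is mine):

```python
import math

def probable_error(r, n):
    """Probable error of a correlation coefficient, per the 1929 formula.

    0.6745 is the 75th percentile of the standard normal, so this converts
    the large-sample standard error (1 - r**2)/sqrt(n) into a 50%
    ("probable") error bound.
    """
    return 0.6745 * (1 - r**2) / math.sqrt(n)

print(probable_error(0.0, 10000))   # 0.006745
print(probable_error(0.5, 10000))   # ~0.00506
```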
--
Now adding boring data:
>>> boring=ones(10001)*0.5
>>> X=numpy.random.random(10000)
>>> l=[]
>>> for i in range(200):
... Z=concatenate((numpy.random.random(10000)+array(range(10000))/10000.0,boring))
... l.append( sd.correlation((concatenate((X+array(range(10000))/10000.0,boring)),Z))[2] )
...
>>> mean(l)
0.712753628489 # r
>>> std(l)
0.00316163649888 # r_err
>>> std(l)/sqrt(len(l))
0.0002235614608
# now:
# 0.6745*(1-r**2)/sqrt(N) = 0.00234459971461 #N=20000
# vs M.C. scatter 0.00316163649888 ± 0.0002235614608
=> the boring data has an effect on the coefficient error which the formula 0.6745*(1-r**2)/sqrt(N) significantly fails to reflect
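The boring-data run, again in plain numpy for anyone who wants to repeat it (corrcoef standing in for sd.correlation, seeded, and boring of length 10000 instead of 10001 so that N = 20000 exactly):

```python
import numpy as np

rng = np.random.default_rng(1)
N, trials = 10000, 200
trend = np.arange(N) / N                 # linear trend shared by both series
boring = np.full(N, 0.5)                 # constant block appended to both series

X = np.concatenate((rng.random(N) + trend, boring))
rs = []
for _ in range(trials):
    Z = np.concatenate((rng.random(N) + trend, boring))
    rs.append(np.corrcoef(X, Z)[0, 1])

r = np.mean(rs)                          # around 0.71
mc = np.std(rs)                          # Monte Carlo jitter of r
formula = 0.6745 * (1 - r**2) / np.sqrt(2 * N)
print(mc, formula)                       # the jitter clearly exceeds the formula
```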
=> I'll use this formula, with the factor raised to 1.0, as a downside error estimate for the correlation coefficient:
------------------------------------------
| r_err_down ~= 1.0 * (1-r**2)/sqrt(N) |
------------------------------------------
(until I find a better one respecting the actual distribution of data)
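One candidate that does respect the actual distribution of the data is a plain bootstrap: resample the (x, y) pairs with replacement and take the spread of the resampled coefficients. A sketch (the function name and defaults are mine, not tested against the cases above):

```python
import numpy as np

def bootstrap_r_err(x, y, n_boot=1000, seed=0):
    """Bootstrap estimate of the error of the Pearson coefficient.

    Resamples (x, y) pairs with replacement, so trends, outliers and
    'boring' constant stretches in the data feed into the spread of the
    resampled coefficients instead of being averaged away by a closed
    formula.
    """
    rng = np.random.default_rng(seed)
    x, y = np.asarray(x), np.asarray(y)
    n = len(x)
    rs = []
    for _ in range(n_boot):
        idx = rng.integers(0, n, size=n)      # resample pair indices
        rs.append(np.corrcoef(x[idx], y[idx])[0, 1])
    return np.std(rs)
```

Being purely empirical, it has no (1-r**2) assumption baked in; it would be worth comparing against the Monte Carlo jitters above.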
It would be interesting to see what MATLAB & Octave say ...
-robert