numpy/scipy: correlation

robert no-spam at no-spam-no-spam.invalid
Sun Nov 12 11:25:00 EST 2006


robert wrote:
> Robert Kern wrote:
> http://links.jstor.org/sici?sici=0162-1459(192906)24%3A166%3C170%3AFFPEOC%3E2.0.CO%3B2-Y 
> 
> tells: 
> probable error of r = 0.6745*(1-r**2)/sqrt(N)
> 
> A simple function of r and N - roughly what I expected above for the 
> N-only dependence. But it is therefore not sensitive to the above 
> considerations about 'boring' data. With the above example it would 
> report a decrease of this probable coef-err from 
> 0.0115628571429 to 0.00548453410954 !

This 1929 formula for estimating the error of the correlation coefficient seems to make some sense for r = 0.
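For reference, the quoted formula as a tiny numpy helper (a minimal sketch; the function name is my own):

import numpy

def probable_error_r(r, n):
    # probable error of r per the quoted 1929 formula; 0.6745 is the
    # 50% point of the normal distribution in units of sigma
    return 0.6745 * (1.0 - r**2) / numpy.sqrt(n)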
I ran a Monte Carlo on correlating random series:

>>> import numpy
>>> from numpy import mean, std, sqrt, array, ones, concatenate
>>> # sd is an external stats module (not shown here);
>>> # sd.correlation((X,Z))[2] is presumably Pearson's r
>>> X = numpy.random.random(10000)
>>> l = []
>>> for i in range(200): 
... 	Z = numpy.random.random(10000)
... 	l.append( sd.correlation((X,Z))[2] )   # collect coef's
... 
>>> mean(l)
0.000327657082234
>>> std(l)
0.0109120766158          # that's how the coef jitters
>>> std(l)/sqrt(len(l))
0.000771600337185
>>> len(l)
200

# now:
# 0.6745*(1-r**2)/sqrt(N) = 0.0067440015079
# vs M.C.                   0.0109120766158 ± 0.000771600337185


But the fancy factor of 0.6745 makes the formula significantly too small for r = 0. (Note the 1929 paper quotes a *probable* error, i.e. 0.6745 standard deviations of a normal distribution, while the M.C. figure above is a plain standard deviation.)
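For anyone without the sd module, here is a self-contained version of this null-hypothesis Monte Carlo using numpy.corrcoef (exact figures will of course vary from run to run):

import numpy

N, runs = 10000, 200
X = numpy.random.random(N)
coefs = []
for _ in range(runs):
    Z = numpy.random.random(N)                 # independent of X, so true r = 0
    coefs.append(numpy.corrcoef(X, Z)[0, 1])   # Pearson's r
coefs = numpy.asarray(coefs)
print(coefs.mean(), coefs.std())               # jitter roughly 1/sqrt(N) = 0.01
print(0.6745 * (1 - coefs.mean()**2) / numpy.sqrt(N))   # the 1929 formula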

Then for a higher correlation (around 0.5):

>>> l=[]
>>> for i in range(200): 
... 	Z=numpy.random.random(10000)+array(range(10000))/10000.0
... 	l.append( sd.correlation((X+array(range(10000))/10000.0,Z))[2] )
... 
>>> mean(l)
0.498905642552
>>> std(l)
0.00546979583163
>>> std(l)/sqrt(len(l))
0.000386772972425

# now:
# 0.6745*(1-r**2)/sqrt(N) = 0.00512173224849
# vs M.C.                   0.00546979583163 ± 0.000386772972425

=> Here the 0.6745 factor and the (1-r**2) term seem to capture the main effect! There is something to it.
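The 0.5 itself is just what this construction predicts (a back-of-envelope check of my own, not from the thread): X and Z each get the same 0..1 ramp added, so the two series share exactly the ramp's variance:

# noise and ramp both spread uniformly over [0, 1), so each has variance 1/12
var_noise = 1.0 / 12      # Var of numpy.random.random(...)
var_ramp  = 1.0 / 12      # Var of array(range(N))/float(N)
# shared ramp => covariance = var_ramp; each series has var_noise + var_ramp
r_expected = var_ramp / (var_noise + var_ramp)
print(r_expected)         # 0.5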

--

Now adding boring data:

>>> boring=ones(10001)*0.5
>>> X=numpy.random.random(10000)
>>> l=[]
>>> for i in range(200): 
... 	Z=concatenate((numpy.random.random(10000)+array(range(10000))/10000.0,boring))
... 	l.append( sd.correlation((concatenate((X+array(range(10000))/10000.0,boring)),Z))[2] )
... 	
>>> mean(l)
0.712753628489             # r
>>> std(l)
0.00316163649888           # r_err
>>> std(l)/sqrt(len(l))
0.0002235614608

# now:
# 0.6745*(1-r**2)/sqrt(N) = 0.00234459971461       # N = 20000 (strictly 20001, from ones(10001))
# vs M.C. scatter            0.00316163649888 ± 0.0002235614608

=> the boring data has an effect on the coef-err that the formula 0.6745*(1-r**2)/sqrt(N) clearly fails to reflect: appending the constants raises r and doubles N, so the formula's estimate shrinks faster than the actual M.C. scatter does.

=> I'll use this formula to get a downside error estimate for the correlation coefficient:

------------------------------------------
|  r_err_down ~= 1.0 * (1-r**2)/sqrt(N)  |
------------------------------------------

(until I find a better one that respects the actual distribution of the data)
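As a function, the rule of thumb looks like this (a sketch; the name r_err_down is mine), checked against the boring-data run above:

import numpy

def r_err_down(r, n):
    # rough downside estimate of the error of a correlation coefficient
    return (1.0 - r**2) / numpy.sqrt(n)

print(r_err_down(0.712753628489, 20000))   # ~0.0035, vs M.C. scatter ~0.0032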

It would be interesting to hear what MATLAB & Octave say ...


-robert


