[SciPy-User] [R] Correlation coefficient of large data sets

Vincent Davis vincent at vincentdavis.net
Tue Mar 16 09:18:32 EDT 2010


@Dennis

> With 35000 variables at a time, the storage is under
> 20 Gb; you'd have to compute about 50 such chunks to get the entire matrix.


Is there a way to calculate one column or row of the correlation matrix at a
time? I am looking at how including an additional set of observations affects
the correlation. For example, if I have variables a, b, c, d, ... and a set of
observations 1-10, and the correlation is calculated for obs 1-5, I then add
observations 6-10 and want to know the average effect of this on the
correlation of c with (a, b, d, e, ...).
So I only need a column or a row at a time. It is just not clear to me how I
would do this.
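
A minimal NumPy sketch of one way this might be done: center the columns, then take the dot product of column c with every column, so the full matrix is never formed. The names `corr_with_column` and `obs` and the random data are made up for illustration, and the code assumes that rows are observations, columns are variables, and no column is constant.

```python
import numpy as np

def corr_with_column(data, j):
    """Pearson correlation of column j with every column of data.

    data: 2-D array, rows are observations, columns are variables.
    Returns a 1-D array of length data.shape[1]; entry j is 1.0.
    Assumes no column is constant (a constant column gives a zero norm).
    """
    data = np.asarray(data, dtype=float)
    centered = data - data.mean(axis=0)            # remove each column's mean
    norms = np.sqrt((centered ** 2).sum(axis=0))   # per-column root sum of squares
    # Dot column j against all columns, then scale to get correlations.
    return centered.T @ centered[:, j] / (norms * norms[j])

# Hypothetical data: 10 observations of 5 variables (a, b, c, d, e).
rng = np.random.default_rng(0)
obs = rng.normal(size=(10, 5))
c = 2                                      # column index of variable "c"

r_first5 = corr_with_column(obs[:5], c)    # observations 1-5 only
r_all10 = corr_with_column(obs, c)         # observations 1-10

others = np.arange(obs.shape[1]) != c      # skip c's correlation with itself
mean_change = np.mean(r_all10[others] - r_first5[others])
```

The work and memory grow with the number of variables rather than its square, so the same call should be usable on one chunk of columns at a time if the data do not fit in memory.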

@Joshua Wiley

> cor(my.data) # calculate the correlation matrix between all variables
> (columns) of my.data

Vincent Davis
720-301-3003
vincent at vincentdavis.net
my blog <http://vincentdavis.net> |
LinkedIn <http://www.linkedin.com/in/vincentdavis>


On Tue, Mar 16, 2010 at 12:06 AM, Joshua Wiley <jwiley.psych at gmail.com> wrote:

> I think what you have done should be fine.  read.table() will return a
> data frame, which cor() can handle happily.  For example:
>
> my.data <- read.table("file.csv", header = TRUE, row.names = 1,
> sep=",", strip.white = TRUE) # assign your data to "my.data"
>
> cor(my.data) # calculate the correlation matrix between all variables
> (columns) of my.data
>
> What happens if you try that?
>