[SciPy-User] Correlation coefficient of large arrays

Vincent Davis vincent at vincentdavis.net
Tue Mar 16 09:21:15 EDT 2010


Is there a way to calculate a column or row of the correlation matrix one at
a time? I am looking at how including an additional set of observations
affects the correlation. For example, if I have variables a, b, c, d, ... and
observations 1-10: if the correlation is calculated for observations 1-5, I
then add observations 6-10 and want to know the average effect of this on the
correlation of c with (a, b, d, e, ...).
So I only need a column or a row at a time.
It's just not clear to me how I would do this. I guess I just need to DO IT.
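
One way I picture doing it, as a rough sketch (this assumes data has shape
(nobs, nvars) with variables in columns; corr_row is just an illustrative
name, not an existing numpy function):

import numpy as np

def corr_row(data, j):
    # illustrative helper: row j of np.corrcoef(data, rowvar=0)
    # standardize each column: mean 0, (population) std 1
    x = (data - data.mean(axis=0)) / data.std(axis=0)
    # correlation of variable j with every variable
    return np.dot(x[:, j], x) / x.shape[0]

Then the average effect of adding observations 6-10 on variable c would be
something like (corr_row(data[:10], c) - corr_row(data[:5], c)).mean().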

Vincent Davis
720-301-3003
vincent at vincentdavis.net
my blog <http://vincentdavis.net> | LinkedIn <http://www.linkedin.com/in/vincentdavis>


On Mon, Mar 15, 2010 at 11:56 PM, <josef.pktd at gmail.com> wrote:

>
>
> On Tue, Mar 16, 2010 at 1:39 AM, Vincent Davis
> <vincent at vincentdavis.net> wrote:
>
>> @Josef
>>
>> how much memory does a 230000**2 = 52,900,000,000 element float (double)
>> array take?
>>
>> I guess I don't have a real appreciation for how large this is. I can do
>> numpy.ones((100000, 50000), dtype=np.float64) and it uses about 85% of the
>> memory I have available. But that's a long way from 230,000 x 230,000. Of
>> course the array is symmetric.
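>>
>> Doing the arithmetic: a 230,000 x 230,000 float64 array needs
>> 230000 * 230000 * 8 = 4.232e11 bytes, roughly 423 GB, while the
>> 100,000 x 50,000 array above is 100000 * 50000 * 8 = 4e10 bytes, about
>> 40 GB. Symmetry at best halves the former.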
>>
>> Is it feasible to do this by writing it to disk?
>> The end goal is to find the difference between two correlation arrays and
>> then calculate the mean of each column, which then leaves me with a
>> 1 x 230,000 array.
>>
>
> If you don't really care about the correlation matrix itself and only need
> the column (or row) sums, then I would just loop over it in batches and
> never construct the full matrix:
> e.g. take the first 1000 variables, calculate their correlation with all
> variables (a 1000 x 230000 block reduced to 1000 sums), and loop.
> Not using np.corrcoef would avoid some duplicate calculations, but several
> intermediate arrays are still necessary. So maybe using PyTables or
> something similar would still be better, to avoid duplicate calculations.
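>
> A minimal sketch of that loop (untested; corr_colsums is just an
> illustrative name; assumes data has observations in rows and variables in
> columns):
>
> import numpy as np
>
> def corr_colsums(data, batch=1000):
>     # illustrative helper, not an existing numpy/scipy function
>     # standardize each column: mean 0, (population) std 1
>     x = (data - data.mean(axis=0)) / data.std(axis=0)
>     nobs, nvars = x.shape
>     sums = np.empty(nvars)
>     for start in range(0, nvars, batch):
>         stop = min(start + batch, nvars)
>         # one (batch x nvars) block of the correlation matrix
>         block = np.dot(x[:, start:stop].T, x) / nobs
>         sums[start:stop] = block.sum(axis=1)
>     return sums
>
> Dividing sums by nvars then gives the column means.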
>
> Josef
>
>
>
>>
>> Vincent Davis
>> 720-301-3003
>> vincent at vincentdavis.net
>> my blog <http://vincentdavis.net> | LinkedIn <http://www.linkedin.com/in/vincentdavis>
>>
>>
>> On Mon, Mar 15, 2010 at 11:16 PM, <josef.pktd at gmail.com> wrote:
>>
>>>
>>>
>>> On Tue, Mar 16, 2010 at 1:04 AM, Vincent Davis
>>> <vincent at vincentdavis.net> wrote:
>>>
>>>> I have an array of 10 observations of 230,000 variables and want to
>>>> find the correlation coefficient between each pair of variables.
>>>> numpy.corrcoef(data) works, except I can only do it with about 30,000
>>>> variables at a time, numpy.corrcoef(data[:30000]), and it uses up a lot
>>>> of memory.
>>>> Is there a better way?
>>>>
>>>
>>>
>>> how much memory does a 230000**2 = 52,900,000,000 element float (double)
>>> array take?
>>>
>>> Josef
>>> (I'm not going to try)
>>>
>>>
>>>
>>>>
>>>> Vincent Davis
>>>> 720-301-3003
>>>> vincent at vincentdavis.net
>>>> my blog <http://vincentdavis.net> | LinkedIn <http://www.linkedin.com/in/vincentdavis>
>>>>
>>>>
>>>>
>>>
>>>
>>
>>
>
>