[SciPy-User] Correlation coefficient of large arrays

josef.pktd at gmail.com josef.pktd at gmail.com
Tue Mar 16 01:56:49 EDT 2010


On Tue, Mar 16, 2010 at 1:39 AM, Vincent Davis <vincent at vincentdavis.net> wrote:

> @Josef
>
> how much memory does a
>
> >>> 230000**2
> 52900000000L
>
> float (double) array take ?
>
>
>
> I guess I don't have a real appreciation for how large this is. I can do
> numpy.ones((100000,50000),dtype=np.float64) and it uses about 85% of
> the memory I have available. But that's a long way from 230,000 x 230,000.
> Of course the array is symmetric.
>
> Is it feasible to do this by writing it to disk?
> The end goal is to find the difference between two correlation arrays and
> then calculate the mean of each column, which leaves me with a
> 1 x 230,000 array.
>
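
For scale, a rough calculation, assuming dense storage at 8 bytes per float64
element and ignoring the symmetry: the full 230,000 x 230,000 matrix needs
about 423 GB (roughly 394 GiB), a bit more than ten times the 40 GB of the
100,000 x 50,000 test array. The helper name dense_float64_bytes is made up
for this sketch.

```python
import numpy as np

def dense_float64_bytes(n_rows, n_cols):
    """Bytes needed to hold a dense float64 array of the given shape."""
    return n_rows * n_cols * np.dtype(np.float64).itemsize

# Full correlation matrix for 230,000 variables.
full = dense_float64_bytes(230000, 230000)
# The test array from the question, numpy.ones((100000, 50000)).
test = dense_float64_bytes(100000, 50000)

print(full / 1e9, "GB for the full matrix")
print(test / 1e9, "GB for the test array")
```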

If you don't care about the correlation matrix itself and only need the
column (or row) sums, then I would just loop over it in batches and never
construct the full matrix.
E.g. take the first 1000 variables, calculate their correlation with all
variables (a 1000 x 230000 block), reduce that block to 1000 sums, and loop
over the remaining variables in the same way.
Not using np.corrcoef would avoid some duplicate calculations, but several
intermediate arrays are still necessary. So maybe using pytables or similar
to keep the results on disk would still be better to avoid duplicate
calculations.
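
A minimal sketch of the batch loop, not tested at full size. It assumes each
data set is laid out as np.corrcoef expects by default (one row per variable,
one column per observation) and that both data sets have the same number of
variables. The function names and the batch_size parameter are made up for
this example.

```python
import numpy as np

def standardize_rows(data):
    """Center each row (variable) and scale it to unit length, so that the
    dot product of two rows is their correlation coefficient."""
    data = np.asarray(data, dtype=np.float64)
    centered = data - data.mean(axis=1, keepdims=True)
    norms = np.sqrt((centered ** 2).sum(axis=1, keepdims=True))
    # A constant row has zero norm and yields nan, as it does in np.corrcoef.
    return centered / norms

def mean_corr_difference(data_a, data_b, batch_size=1000):
    """Column means of corrcoef(data_a) - corrcoef(data_b), computed one
    block of rows at a time so the full matrix is never held in memory."""
    za = standardize_rows(data_a)
    zb = standardize_rows(data_b)
    n_vars = za.shape[0]
    out = np.empty(n_vars)
    for start in range(0, n_vars, batch_size):
        stop = min(start + batch_size, n_vars)
        # Each block has shape (stop - start, n_vars).
        block_a = np.dot(za[start:stop], za.T)
        block_b = np.dot(zb[start:stop], zb.T)
        # The matrix is symmetric, so row means equal column means.
        out[start:stop] = (block_a - block_b).mean(axis=1)
    return out
```

With 230,000 variables and batch_size=1000, each block is about 1.8 GB as
float64, and the loop holds a few of them at once, so the batch size would
need to be chosen to fit the available memory.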

Josef



>
>   *Vincent Davis
> 720-301-3003 *
> vincent at vincentdavis.net
>  my blog <http://vincentdavis.net> | LinkedIn<http://www.linkedin.com/in/vincentdavis>
>
>
> On Mon, Mar 15, 2010 at 11:16 PM, <josef.pktd at gmail.com> wrote:
>
>>
>>
>> On Tue, Mar 16, 2010 at 1:04 AM, Vincent Davis <vincent at vincentdavis.net> wrote:
>>
>>> I have an array of 10 observations of 230,000 variables and want to find
>>> the correlation coefficient between each pair of variables.
>>> numpy.corrcoef(data) works, except I can only do it with about 30,000
>>> variables at a time: numpy.corrcoef(data[:30000]). It uses up a lot of
>>> memory.
>>> Is there a better way?
>>>
>>
>>
>> how much memory does a
>> >>> 230000**2
>> 52900000000L
>>
>> float (double) array take ?
>>
>> Josef
>> (I'm not going to try)
>>
>>
>>
>>>
>>> _______________________________________________
>>> SciPy-User mailing list
>>> SciPy-User at scipy.org
>>> http://mail.scipy.org/mailman/listinfo/scipy-user
>>>
>>>
>>