To apply PCA for a large CSV

Oscar Benjamin oscar.j.benjamin at gmail.com
Tue Apr 14 18:43:17 EDT 2020


On Tue, 14 Apr 2020 at 12:42, Rahul Gupta <rahulgupta100689 at gmail.com> wrote:
>
> Hello all, I have a 1 GB CSV that consists of 25000 columns and 20000 rows. I want to apply PCA, and I have seen that scikit-learn has inbuilt functionality for that. But to do so you have to load the data into a data frame, and my machine is an i5 with 8 GB of RAM, which fails to load all of this data into a data frame and shows a memory error. Is there any alternative way that I could still apply PCA to the same data set on the same machine?

Do you know how to compute a covariance matrix "manually"? If so then
it can be done while reading the data line by line without reading all
of the data into memory at once. The problem though is that your 25000
columns mean that the matrix itself will fill most of your memory
(25000**2*8 bytes == 5 GB using double precision floating point).
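A minimal sketch of the streaming idea: accumulate the column sums and the sum of outer products row by row, then form the covariance at the end, so only one row plus the accumulators are in memory at a time. The column count, the random stand-in rows, and the file handling are placeholders for illustration, not part of the original post.

```python
import numpy as np

n_cols = 5          # tiny example; the real dataset has 25000 columns
n = 0
s = np.zeros(n_cols)                 # running sum of each column
ss = np.zeros((n_cols, n_cols))      # running sum of outer products

rng = np.random.default_rng(0)
rows = rng.normal(size=(100, n_cols))   # stand-in for rows parsed from the CSV

for row in rows:    # in practice: read and parse one CSV line per iteration
    n += 1
    s += row
    ss += np.outer(row, row)

mean = s / n
cov = (ss - n * np.outer(mean, mean)) / (n - 1)   # sample covariance

# Agrees with the batch computation on this small example:
assert np.allclose(cov, np.cov(rows, rowvar=False))
```

With the covariance in hand, the principal components are its eigenvectors (e.g. via `np.linalg.eigh`), though with 25000 columns the 5 GB matrix itself remains the bottleneck.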

You can make life much easier for yourself by choosing a subset of the
columns that you are likely to be interested in and reducing the size
of your dataset before you begin.
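As a sketch of the column-subsetting suggestion, pandas can read only the columns you name via `usecols`, so memory scales with the columns you keep rather than all 25000. The inline CSV text and column names here are stand-ins for the real file.

```python
import io
import pandas as pd

# Stand-in for the real 1 GB file on disk:
csv_text = "a,b,c,d\n1,2,3,4\n5,6,7,8\n"
wanted = ["a", "c"]   # hypothetical columns of interest

# Only the requested columns are parsed and held in memory.
df = pd.read_csv(io.StringIO(csv_text), usecols=wanted)
print(df.shape)   # (2, 2): two rows, only the two requested columns
```

For the real file you would pass the filename instead of the `StringIO` buffer; `read_csv` also accepts `chunksize` if even the reduced data should be processed in pieces.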

--
Oscar


More information about the Python-list mailing list