[scikit-learn] Finding the PC that captures a specific variable

Oliver Tomic olivertomic at zoho.com
Sun Jan 24 06:52:57 EST 2021


Hi Mahmood,

the information you need is given by the individual explained variance for each variable / feature. You get that information from the hoggorm package (Python):

https://github.com/olivertomic/hoggorm
https://hoggorm.readthedocs.io/en/latest/index.html

Here is one of the PCA examples provided in a Jupyter notebook:
https://github.com/olivertomic/hoggorm/blob/master/examples/PCA/PCA_on_cancer_data.ipynb
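
For concreteness, here is a minimal sketch of fitting such a model, assuming the nipalsPCA interface used in that notebook (the array below is a stand-in for your own samples-by-variables matrix):

import numpy as np
import hoggorm as ho

# Stand-in data: 30 samples, 5 variables. Xstand=True standardizes each
# variable; cvType=["loo"] requests leave-one-out cross-validation.
data = np.random.default_rng(0).normal(size=(30, 5))
model = ho.nipalsPCA(arrX=data, numComp=4, Xstand=True, cvType=["loo"])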

When you do PCA, you get the information by calling, for example:

cumCalExplVar_individualVariable = model.X_cumCalExplVar_indVar()
(the cumulative calibrated explained variance for each variable; cell 21 in the notebook)

cumValExplVar_individualVariable = model.X_cumValExplVar_indVar()
(the cumulative validated explained variance for each variable; cell 30 in the notebook)

The component where you see the biggest jump for the variable of interest is the component you are looking for.
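
As a concrete sketch of that check, continuing from the model above (I'm assuming, as in the notebook, that the per-variable array has one row per number of components, starting at zero components):

# Rows: cumulative explained variance (%) after 0, 1, 2, ... components;
# columns: the individual variables.
cumExpl = np.array(model.X_cumCalExplVar_indVar())

j = 0                           # index of the variable of interest
jumps = np.diff(cumExpl[:, j])  # gain contributed by each successive PC
best = int(np.argmax(jumps))    # 0-based index of the biggest jump
print("Variable", j, "is best captured by component", best + 1)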

You could also have a look at the correlation loadings to identify the component you are looking for.
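
If you would rather stay in scikit-learn, correlation loadings can be computed by hand as the correlation between each original variable and each component's scores. A rough sketch (hoggorm also exposes these directly through its X_corrLoadings() method, if I recall the API correctly):

import numpy as np
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA

Xs = StandardScaler().fit_transform(X)  # X: your raw data matrix
scores = PCA().fit_transform(Xs)        # component scores per sample

# corr_loadings[i, j]: correlation of variable j with component i.
corr_loadings = np.array([
    [np.corrcoef(Xs[:, j], scores[:, i])[0, 1] for j in range(Xs.shape[1])]
    for i in range(scores.shape[1])
])

# For each variable, the component with the largest absolute correlation
# loading is the one that captures it best.
print(np.argmax(np.abs(corr_loadings), axis=0))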


cheers
Oliver

---- On Fri, 22 Jan 2021 21:48:46 +0100 Mahmood Naderan <mahmood.nt at gmail.com> wrote ----

Hi 
Thanks for the replies. I read about the available functions in the 
PCA section. Consider the following code:
 
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA
import pandas as pd

# x: (n_samples, n_features) array; targets: kernel labels (both defined earlier)
x = StandardScaler().fit_transform(x)
pca = PCA()
principalComponents = pca.fit_transform(x)
principalDf = pd.DataFrame(data=principalComponents)
loadings = pca.components_  # shape: (n_components, n_features)
finalDf = pd.concat([principalDf, pd.DataFrame(targets, columns=['kernel'])], axis=1)
print("First and second observations\n", finalDf.loc[0:1])
print("loadings[0:1]\n", loadings[0], loadings[1])
print("explained_variance_ratio_\n", pca.explained_variance_ratio_)
 
 
The output looks like 
 
First and second observations
          0         1         2         3         4  kernel
0  2.959846 -0.184307 -0.100236  0.533735 -0.002227   ELEC1
1  0.390313  1.805239  0.029688 -0.502359 -0.002350  ELECT2
loadings[0:1]
[ 0.21808984  0.49137412  0.46511098  0.49735819  0.49728754]
[-0.94878375 -0.01257726  0.29718078  0.07493325  0.07562934]
explained_variance_ratio_
[7.80626876e-01 1.79854061e-01 2.50729844e-02 1.44436687e-02 2.40984767e-06]
 
 
 
As you can see, for the two kernels named ELEC1 and ELEC2, there are five
PCs, numbered 0 to 4.
Now, based on the numbers in the loadings, I expect that loadings[0],
which is the first variable, is better shown in the PC1-PC2 plane
(0.49137412, 0.46511098). However, loadings[1], which is the second
variable, is better shown in the PC0-PC2 plane (-0.94878375, 0.29718078).
Is this understanding correct?

I don't understand what explained_variance_ratio_ is trying to say here.
 
 
Regards, 
Mahmood 
 
On Fri, Jan 22, 2021 at 11:52 AM Nicolas Hug <niourf at gmail.com> wrote:
> 
> Hi Mahmood, 
> 
> There are different pieces of info that you can get from PCA: 
> 
> 1. How important a given PC is for reconstructing the entire dataset -> this
> is given by explained_variance_ratio_, as Guillaume suggested
> 
> 2. What is the contribution of each feature to each PC (remember that a
> PC is a linear combination of all the features, i.e.
> PC_1 = X_1 * alpha_11 + X_2 * alpha_12 + ... + X_m * alpha_1m). The
> alpha_ij are what you're looking for, and they are given in the
> components_ matrix, which is an n_components x n_features matrix.
> 
> Nicolas 
> 
> On 1/22/21 9:13 AM, Mahmood Naderan wrote: 
> > Hi
> > I have a question about PCA, namely: how can we determine which
> > factor (principal component) best captures a given variable X?
> > For example, one variable may have a low weight in the first PC
> > but a higher weight in the fifth PC.
> > 
> > When I use the PCA from scikit-learn, I have to inspect the PCs manually;
> > therefore, I may miss the fact that although a variable is weak in the
> > PC1-PC2 plot, it may be strong in the PC4-PC5 plot.
> > 
> > Any comment on that? 
> > 
> > Regards, 
> > Mahmood 
_______________________________________________ 
scikit-learn mailing list 
scikit-learn at python.org
https://mail.python.org/mailman/listinfo/scikit-learn