[scikit-learn] Comparing Scikit and Xlstat for PCA analysis

Fri Dec 25 09:16:02 EST 2020

Hi,
I have a test CSV file and have written some code to run PCA on it.
I also use an Excel add-in (XLSTAT) to compare the results.
XLSTAT determines the number of components automatically; however, as far
as I understand, with scikit-learn I have to specify how many components
are needed. For example, XLSTAT shows 5 components (F1 to F5):

Factor scores:
        F1      F2      F3      F4      F5
A1  -1.293  -0.663  -0.462  -0.713   0.010
A2  -0.297   0.293  -1.429   0.397   0.056
A3   2.328   0.069   0.987  -0.108   0.062
A4  -0.556  -2.273   0.538   0.344  -0.032
A5   1.823   0.775  -0.597  -0.052  -0.085
A6  -2.005   1.799   0.963   0.133  -0.011

In the following code, I specified 2 components:

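from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

# x is the data read from the test CSV file
x = StandardScaler().fit_transform(x)
pca = PCA(n_components=2)
principalComponents = pca.fit_transform(x)
print(principalComponents)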

[[-1.29292842  0.66325508]
 [-0.29706395 -0.29346337]
 [ 2.32751305 -0.06850045]
 [-0.5558091   2.27288988]
 [ 1.82312052 -0.77527304]
 [-2.0048321  -1.7989081 ]]

As you can see, the first columns from XLSTAT and scikit-learn agree.
However, the second column has the opposite sign.
For example, taking F1 and F2 for the first row (A1), we see

XLSTAT => -1.293 -0.663
scikit =>  [-1.29292842 0.66325508]
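
The F1/F2 comparison above can also be checked numerically. The following is a minimal sketch, not part of the original script: it copies the first two XLSTAT columns and the scikit-learn output from this message into arrays (the names xlstat and sklearn_scores are made up for illustration) and tests whether the two agree once each column is allowed a sign flip. A principal axis is only determined up to sign, so such a flip does not change the fit.

```python
import numpy as np

# F1 and F2 factor scores as reported by XLSTAT (copied from this message)
xlstat = np.array([
    [-1.293, -0.663],
    [-0.297,  0.293],
    [ 2.328,  0.069],
    [-0.556, -2.273],
    [ 1.823,  0.775],
    [-2.005,  1.799],
])

# Output of pca.fit_transform(x) with n_components=2 (copied from this message)
sklearn_scores = np.array([
    [-1.29292842,  0.66325508],
    [-0.29706395, -0.29346337],
    [ 2.32751305, -0.06850045],
    [-0.5558091,   2.27288988],
    [ 1.82312052, -0.77527304],
    [-2.0048321,  -1.7989081],
])

# Per-column sign: +1 if the columns point the same way, -1 if they are flipped
signs = np.sign((xlstat * sklearn_scores).sum(axis=0))

# Flip the scikit-learn columns where needed and compare at the 3-decimal
# precision that XLSTAT prints
aligned = sklearn_scores * signs
same_up_to_sign = np.allclose(aligned, xlstat, atol=1e-3, rtol=0)

print(signs, same_up_to_sign)
```

For the numbers in this message the check passes with the first column kept and the second one flipped, so the two tools agree up to the sign of the second component.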

So, my questions are

1) Is there a way to use scikit-learn without specifying the number of
principal components in advance, so that I can query the number of
components and then draw a scree plot?

2) Treating F1 and F2 as the X and Y coordinates of a scatter plot, why do
the Y values from XLSTAT and scikit-learn have opposite signs?
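
On question 1, one thing that may be worth trying is to leave n_components unset. The following is a minimal sketch under stated assumptions: the data is a random placeholder with 6 rows and 5 columns, chosen only to mirror the size of the factor-score table above (the real CSV contents are not reproduced here). According to the scikit-learn documentation, PCA() with n_components=None keeps min(n_samples, n_features) components, and explained_variance_ratio_ then holds the values that a scree plot would show.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

# Placeholder data (assumption): 6 samples, 5 features, not the real CSV
rng = np.random.default_rng(0)
x = rng.normal(size=(6, 5))

# Standardize the columns, as in the original snippet
x_scaled = StandardScaler().fit_transform(x)

# No n_components given: all min(n_samples, n_features) components are kept
pca = PCA()
scores = pca.fit_transform(x_scaled)

# Number of components actually kept, available after fitting
n_kept = pca.n_components_

# Fraction of the total variance explained by each component, largest first;
# these are the y-values of a scree plot
ratios = pca.explained_variance_ratio_

for i, r in enumerate(ratios, start=1):
    print(f"F{i}: {r:.3f}")
```

The ratios can be passed to any plotting tool, with the component index on the x-axis, to draw the scree plot; the first columns of scores correspond to what PCA(n_components=2) would return, up to sign.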

The code I wrote is available at https://pastebin.com/ghJQ6L4C
Any ideas?
Regards,
Mahmood