[scikit-learn] Number of informative features vs total number of features
Benoît Presles
benoit.presles at u-bourgogne.fr
Fri Apr 3 07:00:36 EDT 2020
Dear sklearn users,
I have just checked whether the generated features are independent by
computing their covariance and correlation matrices, and it seems they are,
so I really do not understand my results.
Any ideas?
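
For reference, the independence check described above could look roughly like the sketch below. It is only an illustration, not the exact check that was run: the helper name max_abs_offdiag_correlation, the choice of n_informative=10 and the extra per-class check are assumptions added here. The per-class loop is included because pooled and within-class correlations can differ.

```python
import numpy as np
from sklearn.datasets import make_classification


def max_abs_offdiag_correlation(X):
    """Largest absolute off-diagonal entry of the feature correlation matrix."""
    corr = np.corrcoef(X, rowvar=False)  # shape (n_features, n_features)
    off_diag = corr - np.diag(np.diag(corr))  # zero out the diagonal
    return float(np.max(np.abs(off_diag)))


# Same generator settings as in the simulation; n_informative=10 is an
# arbitrary value picked for illustration (assumption).
X, y = make_classification(n_samples=2500, n_features=40, n_informative=10,
                           n_redundant=0, n_repeated=0, n_classes=2,
                           n_clusters_per_class=1, random_state=4,
                           shuffle=False)

# Pooled over both classes.
print("pooled:", max_abs_offdiag_correlation(X))

# Within each class separately (hypothetical extra check).
for label in np.unique(y):
    print("class", label, ":", max_abs_offdiag_correlation(X[y == label]))
```

With shuffle=False the informative features are the first n_informative columns, so X[:, :10] can be inspected on its own in the same way.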
Thanks for your help,
Best regards,
Ben
On 31/03/2020 at 15:48, Benoît Presles wrote:
> Dear sklearn users,
>
> I did some supervised classification simulations with the
> make_classification function from sklearn increasing the number of
> informative features from 1 out of 40 to 40 out of 40 (100%). I did
> not generate any repeated or redundant features. I fixed the number of
> classes to two and the number of clusters per class to one.
>
> I split the dataset 100 times using the StratifiedShuffleSplit
> function into two subsets: a training set and a test set (80% - 20%).
> I performed a logistic regression and calculated training and testing
> accuracies and averaged the results over the 100 splits leading to a
> mean training accuracy and a mean testing accuracy.
>
> I was expecting the accuracy score to increase with the number of
> informative features for both the training and the test sets. On the
> contrary, I got the best training and test scores with one
> informative feature. Why do I get these results?
>
> Thanks for your help,
> Best regards,
> Ben
>
> Below is the simulation code I have written:
>
> import numpy as np
> from sklearn.datasets import make_classification
> from sklearn.model_selection import StratifiedShuffleSplit
> from sklearn.preprocessing import StandardScaler
> from sklearn.linear_model import LogisticRegression
> from sklearn.metrics import accuracy_score
> import matplotlib.pyplot as plt
>
> RANDOM_SEED = 4
> n_inf = np.array([1, 5, 10, 15, 20, 25, 30, 35, 40])
>
> mean_training_score_array = np.array([])
> mean_testing_score_array = np.array([])
> for n_inf_value in n_inf:
>     # Simulate 40 features, n_inf_value of which are informative
>     X, y = make_classification(n_samples=2500,
>                                n_features=40,
>                                n_informative=n_inf_value,
>                                n_redundant=0,
>                                n_repeated=0,
>                                n_classes=2,
>                                n_clusters_per_class=1,
>                                random_state=RANDOM_SEED,
>                                shuffle=False)
>     print('Simulated data - number of informative features = ' +
>           str(n_inf_value))
>     # 100 stratified random 80% / 20% splits
>     sss = StratifiedShuffleSplit(n_splits=100, test_size=0.2,
>                                  random_state=RANDOM_SEED)
>     training_score_array = np.array([])
>     testing_score_array = np.array([])
>     for train_index_split, test_index_split in sss.split(X, y):
>         X_split_train, X_split_test = X[train_index_split], X[test_index_split]
>         y_split_train, y_split_test = y[train_index_split], y[test_index_split]
>         # Fit the scaler on the training part only
>         scaler = StandardScaler()
>         X_split_train = scaler.fit_transform(X_split_train)
>         X_split_test = scaler.transform(X_split_test)
>         # max_iter must be an integer, so 1e9 is cast explicitly
>         lr = LogisticRegression(fit_intercept=True, max_iter=int(1e9),
>                                 verbose=0,
>                                 random_state=RANDOM_SEED,
>                                 solver='lbfgs', tol=1e-6, C=10)
>         lr.fit(X_split_train, y_split_train)
>         y_pred_train = lr.predict(X_split_train)
>         y_pred_test = lr.predict(X_split_test)
>         accuracy_train_score = accuracy_score(y_split_train, y_pred_train)
>         accuracy_test_score = accuracy_score(y_split_test, y_pred_test)
>         training_score_array = np.append(training_score_array,
>                                          accuracy_train_score)
>         testing_score_array = np.append(testing_score_array,
>                                         accuracy_test_score)
>     # Average the accuracies over the 100 splits
>     mean_training_score_array = np.append(mean_training_score_array,
>                                           np.average(training_score_array))
>     mean_testing_score_array = np.append(mean_testing_score_array,
>                                          np.average(testing_score_array))
>
> print('mean_training_score_array=' + str(mean_training_score_array))
> print('mean_testing_score_array=' + str(mean_testing_score_array))
>
> plt.plot(n_inf, mean_training_score_array, 'r', label='mean training score')
> plt.plot(n_inf, mean_testing_score_array, 'g', label='mean testing score')
> plt.xlabel('number of informative features out of 40')
> plt.ylabel('accuracy')
> plt.legend()
> plt.show()
>
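
The same procedure (stratified 80% / 20% splits, scaling fitted on the training part only, logistic regression, accuracies averaged over the splits) could also be written more compactly with a pipeline and cross_validate. This is only a sketch of an equivalent setup, not the code that produced the results discussed here. The function name mean_accuracies and the value max_iter=10_000 are assumptions; the iteration cap is assumed to be large enough for lbfgs to converge on standardized data.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedShuffleSplit, cross_validate
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler


def mean_accuracies(n_informative, n_splits=100, seed=4):
    """Return (mean train accuracy, mean test accuracy) over the splits."""
    X, y = make_classification(n_samples=2500, n_features=40,
                               n_informative=n_informative, n_redundant=0,
                               n_repeated=0, n_classes=2,
                               n_clusters_per_class=1, random_state=seed,
                               shuffle=False)
    # The pipeline refits the scaler on each training split, so no
    # information from the test split leaks into the scaling.
    model = make_pipeline(
        StandardScaler(),
        LogisticRegression(C=10, tol=1e-6, max_iter=10_000),  # assumed cap
    )
    cv = StratifiedShuffleSplit(n_splits=n_splits, test_size=0.2,
                                random_state=seed)
    scores = cross_validate(model, X, y, cv=cv, scoring="accuracy",
                            return_train_score=True)
    return scores["train_score"].mean(), scores["test_score"].mean()


if __name__ == "__main__":
    for n_inf_value in [1, 5, 10, 15, 20, 25, 30, 35, 40]:
        train_acc, test_acc = mean_accuracies(n_inf_value)
        print(n_inf_value, train_acc, test_acc)
```

Keeping the scaler inside the pipeline avoids the manual fit_transform / transform bookkeeping in the inner loop.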