[scikit-learn] Number of informative features vs total number of features

Benoît Presles benoit.presles at u-bourgogne.fr
Fri Apr 3 07:00:36 EDT 2020


Dear sklearn users,

I have just checked whether the generated features are independent by 
computing the covariance and correlation matrices, and it seems they are, 
so I really do not understand my results.
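
For reference, here is roughly how such a check can be done (a minimal 
sketch; the make_classification parameters mirror the script below, with 
n_informative=20 chosen only as an example):

import numpy as np
from sklearn.datasets import make_classification

# generate one dataset with the same settings as in the script below
X, y = make_classification(n_samples=2500, n_features=40, n_informative=20,
                           n_redundant=0, n_repeated=0, n_classes=2,
                           n_clusters_per_class=1, random_state=4,
                           shuffle=False)

# correlation between features: off-diagonal entries near zero indicate
# (linearly) uncorrelated features
corr = np.corrcoef(X, rowvar=False)
off_diag = corr - np.diag(np.diag(corr))
print('max absolute off-diagonal correlation:', np.abs(off_diag).max())

# covariance matrix, for comparison
cov = np.cov(X, rowvar=False)
print('max absolute off-diagonal covariance:',
      np.abs(cov - np.diag(np.diag(cov))).max())
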
Any idea?

Thanks for your help,
Best regards,
Ben


On 31/03/2020 at 15:48, Benoît Presles wrote:
> Dear sklearn users,
>
> I ran some supervised classification simulations with the 
> make_classification function from sklearn, increasing the number of 
> informative features from 1 out of 40 up to 40 out of 40 (100%). I did 
> not generate any repeated or redundant features, and I fixed the number 
> of classes to two and the number of clusters per class to one.
>
> I split the dataset 100 times with the StratifiedShuffleSplit function 
> into two subsets: a training set and a test set (80% - 20%). For each 
> split, I fitted a logistic regression, calculated the training and 
> testing accuracies, and averaged the results over the 100 splits, 
> giving a mean training accuracy and a mean testing accuracy.
>
> I was expecting the accuracy scores to increase with the number of 
> informative features for both the training and the test sets. On the 
> contrary, I got the best training and test scores with one informative 
> feature. Why do I get these results?
>
> Thanks for your help,
> Best regards,
> Ben
>
> Below is the simulation code I have written:
>
> import numpy as np
> from sklearn.datasets import make_classification
> from sklearn.model_selection import StratifiedShuffleSplit
> from sklearn.preprocessing import StandardScaler
> from sklearn.linear_model import LogisticRegression
> from sklearn.metrics import accuracy_score
> import matplotlib.pyplot as plt
>
> RANDOM_SEED = 4
> n_inf = np.array([1, 5, 10, 15, 20, 25, 30, 35, 40])
>
> mean_training_score_array = np.array([])
> mean_testing_score_array = np.array([])
> for n_inf_value in n_inf:
>     X, y = make_classification(n_samples=2500,
>                                n_features=40,
>                                n_informative=n_inf_value,
>                                n_redundant=0,
>                                n_repeated=0,
>                                n_classes=2,
>                                n_clusters_per_class=1,
>                                random_state=RANDOM_SEED,
>                                shuffle=False)
>     #
>     print('Simulated data - number of informative features = ' + str(n_inf_value))
>     #
>     sss = StratifiedShuffleSplit(n_splits=100, test_size=0.2, random_state=RANDOM_SEED)
>     training_score_array = np.array([])
>     testing_score_array = np.array([])
>     for train_index_split, test_index_split in sss.split(X, y):
>         X_split_train, X_split_test = X[train_index_split], X[test_index_split]
>         y_split_train, y_split_test = y[train_index_split], y[test_index_split]
>         scaler = StandardScaler()
>         X_split_train = scaler.fit_transform(X_split_train)
>         X_split_test = scaler.transform(X_split_test)
>         lr = LogisticRegression(fit_intercept=True, max_iter=1e9, verbose=0,
>                                 random_state=RANDOM_SEED, solver='lbfgs',
>                                 tol=1e-6, C=10)
>         lr.fit(X_split_train, y_split_train)
>         y_pred_train = lr.predict(X_split_train)
>         y_pred_test = lr.predict(X_split_test)
>         accuracy_train_score = accuracy_score(y_split_train, y_pred_train)
>         accuracy_test_score = accuracy_score(y_split_test, y_pred_test)
>         training_score_array = np.append(training_score_array,
>                                          accuracy_train_score)
>         testing_score_array = np.append(testing_score_array,
>                                         accuracy_test_score)
>     mean_training_score_array = np.append(mean_training_score_array,
>                                           np.average(training_score_array))
>     mean_testing_score_array = np.append(mean_testing_score_array,
>                                          np.average(testing_score_array))
> #
> print('mean_training_score_array=' + str(mean_training_score_array))
> print('mean_testing_score_array=' + str(mean_testing_score_array))
> #
> plt.plot(n_inf, mean_training_score_array, 'r', label='mean training score')
> plt.plot(n_inf, mean_testing_score_array, 'g', label='mean testing score')
> plt.xlabel('number of informative features out of 40')
> plt.ylabel('accuracy')
> plt.legend()
> plt.show()
>
> _______________________________________________
> scikit-learn mailing list
> scikit-learn at python.org
> https://mail.python.org/mailman/listinfo/scikit-learn

