[scikit-learn] Number of informative features vs total number of features

Andreas Mueller t3kcit at gmail.com
Fri Apr 3 10:51:15 EDT 2020


Hi Ben.
I'd recommend you check the code to see how the data is generated.

Best,
Andy
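
One way to follow this suggestion without reading the whole source is to look at the generated data directly. The sketch below is only illustrative: the helper name inspect_generated_data and the choice of summary statistics are assumptions, not part of the original thread. It relies on the documented behaviour that, with shuffle=False, the informative features occupy the first n_informative columns. It reports the distance between the two class means and the largest absolute correlation between informative features within one class, rather than over the whole dataset.

```python
import numpy as np
from sklearn.datasets import make_classification


def inspect_generated_data(n_informative, n_features=40, seed=4):
    """Summarize class separation and within-class correlation.

    Hypothetical helper for illustration; same settings as the
    simulation quoted below.
    """
    X, y = make_classification(n_samples=2500, n_features=n_features,
                               n_informative=n_informative, n_redundant=0,
                               n_repeated=0, n_classes=2,
                               n_clusters_per_class=1, random_state=seed,
                               shuffle=False)
    # With shuffle=False the informative features are the first columns.
    X_inf = X[:, :n_informative]
    # Difference between the class means along each informative feature.
    mean_gap = X_inf[y == 1].mean(axis=0) - X_inf[y == 0].mean(axis=0)
    # Correlation among informative features inside a single class.
    if n_informative > 1:
        corr = np.corrcoef(X_inf[y == 0], rowvar=False)
        off_diag = np.abs(corr[~np.eye(n_informative, dtype=bool)])
        max_corr = float(off_diag.max())
    else:
        max_corr = 0.0  # a single feature has no off-diagonal terms
    return float(np.linalg.norm(mean_gap)), max_corr


for k in (1, 5, 20, 40):
    gap, max_corr = inspect_generated_data(k)
    print(f"n_informative={k:2d}  class-mean distance={gap:.2f}  "
          f"max |within-class corr|={max_corr:.2f}")
```

If the class-mean distance does not grow with n_informative while the within-class correlations do, that would be consistent with the accuracy not improving as more features are marked informative.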

On 4/3/20 7:00 AM, Benoît Presles wrote:
> Dear sklearn users,
>
> I have just checked whether the generated features are independent by 
> computing the covariance and correlation matrices, and it seems they 
> are, so I really do not understand my results.
> Any ideas?
>
> Thanks for your help,
> Best regards,
> Ben
>
>
> On 31/03/2020 at 15:48, Benoît Presles wrote:
>> Dear sklearn users,
>>
>> I ran some supervised classification simulations with the 
>> make_classification function from sklearn, increasing the number of 
>> informative features from 1 out of 40 to 40 out of 40 (100%). I did 
>> not generate any repeated or redundant features. I fixed the number 
>> of classes to two and the number of clusters per class to one.
>>
>> I split the dataset 100 times using the StratifiedShuffleSplit 
>> function into two subsets: a training set and a test set (80% - 20%). 
>> I performed a logistic regression and calculated training and testing 
>> accuracies and averaged the results over the 100 splits leading to a 
>> mean training accuracy and a mean testing accuracy.
>>
>> I was expecting the accuracy score to increase with the number of 
>> informative features for both the training and the test sets. On the 
>> contrary, I got the best training and test scores with one 
>> informative feature. Why do I get these results?
>>
>> Thanks for your help,
>> Best regards,
>> Ben
>>
>> Below is the simulation code I have written:
>>
>> import numpy as np
>> from sklearn.datasets import make_classification
>> from sklearn.model_selection import StratifiedShuffleSplit
>> from sklearn.preprocessing import StandardScaler
>> from sklearn.linear_model import LogisticRegression
>> from sklearn.metrics import accuracy_score
>> import matplotlib.pyplot as plt
>>
>> RANDOM_SEED = 4
>> n_inf = np.array([1, 5, 10, 15, 20, 25, 30, 35, 40])
>>
>> mean_training_score_array = np.array([])
>> mean_testing_score_array = np.array([])
>> for n_inf_value in n_inf:
>>     X, y = make_classification(n_samples=2500,
>>                                n_features=40,
>>                                n_informative=n_inf_value,
>>                                n_redundant=0,
>>                                n_repeated=0,
>>                                n_classes=2,
>>                                n_clusters_per_class=1,
>>                                random_state=RANDOM_SEED,
>>                                shuffle=False)
>>     #
>>     print('Simulated data - number of informative features = '
>>           + str(n_inf_value))
>>     #
>>     sss = StratifiedShuffleSplit(n_splits=100, test_size=0.2,
>>                                  random_state=RANDOM_SEED)
>>     training_score_array = np.array([])
>>     testing_score_array = np.array([])
>>     for train_index_split, test_index_split in sss.split(X, y):
>>         X_split_train = X[train_index_split]
>>         X_split_test = X[test_index_split]
>>         y_split_train = y[train_index_split]
>>         y_split_test = y[test_index_split]
>>         scaler = StandardScaler()
>>         X_split_train = scaler.fit_transform(X_split_train)
>>         X_split_test = scaler.transform(X_split_test)
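>>         # max_iter is given as an integer; newer scikit-learn releases
>>         # validate this parameter and may reject a float such as 1e9.
>>         lr = LogisticRegression(fit_intercept=True, max_iter=10**9,
>>                                 verbose=0, random_state=RANDOM_SEED,
>>                                 solver='lbfgs', tol=1e-6, C=10)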
>>         lr.fit(X_split_train, y_split_train)
>>         y_pred_train = lr.predict(X_split_train)
>>         y_pred_test = lr.predict(X_split_test)
>>         accuracy_train_score = accuracy_score(y_split_train,
>>                                               y_pred_train)
>>         accuracy_test_score = accuracy_score(y_split_test, y_pred_test)
>>         training_score_array = np.append(training_score_array,
>>                                          accuracy_train_score)
>>         testing_score_array = np.append(testing_score_array,
>>                                         accuracy_test_score)
>>     mean_training_score_array = np.append(mean_training_score_array,
>>                                           np.average(training_score_array))
>>     mean_testing_score_array = np.append(mean_testing_score_array,
>>                                          np.average(testing_score_array))
>> #
>> print('mean_training_score_array=' + str(mean_training_score_array))
>> print('mean_testing_score_array=' + str(mean_testing_score_array))
>> #
>> plt.plot(n_inf, mean_training_score_array, 'r',
>>          label='mean training score')
>> plt.plot(n_inf, mean_testing_score_array, 'g',
>>          label='mean testing score')
>> plt.xlabel('number of informative features out of 40')
>> plt.ylabel('accuracy')
>> plt.legend()
>> plt.show()
>>
>> _______________________________________________
>> scikit-learn mailing list
>> scikit-learn at python.org
>> https://mail.python.org/mailman/listinfo/scikit-learn