From benoit.presles at u-bourgogne.fr  Fri Apr  3 07:00:36 2020
From: benoit.presles at u-bourgogne.fr (=?UTF-8?Q?Beno=c3=aet_Presles?=)
Date: Fri, 3 Apr 2020 13:00:36 +0200
Subject: [scikit-learn] Number of informative features vs total number of features
In-Reply-To: <10c2473f-50e3-c959-b9f7-07c2b903c840@u-bourgogne.fr>
References: <10c2473f-50e3-c959-b9f7-07c2b903c840@u-bourgogne.fr>
Message-ID: <525257ad-dad6-cf48-9749-4461452a3a72@u-bourgogne.fr>

Dear sklearn users,

I have just checked if the generated features were independent by
computing the covariance and correlation matrices and it seems they are,
so I really do not understand my results.
Any idea ?

Thanks for your help,
Best regards,
Ben


Le 31/03/2020 à 15:48, Benoît Presles a écrit :
> Dear sklearn users,
>
> I did some supervised classification simulations with the
> make_classification function from sklearn increasing the number of
> informative features from 1 out of 40 to 40 out of 40 (100%). I did
> not generate any repeated or redundant features. I fixed the number of
> classes to two and the number of clusters per class to one.
>
> I split the dataset 100 times using the StratifiedShuffleSplit
> function into two subsets: a training set and a test set (80% - 20%).
> I performed a logistic regression and calculated training and testing
> accuracies and averaged the results over the 100 splits leading to a
> mean training accuracy and a mean testing accuracy.
>
> I was expecting to get an increasing accuracy score as a function of
> informative features for both the training and the test sets. On the
> contrary, I have got the best training and test scores for one
> informative feature. Why do I get these results ?
>
> Thanks for your help,
> Best regards,
> Ben
>
> Below the simulation code I have written:
>
> import numpy as np
> from sklearn.datasets import make_classification
> from sklearn.model_selection import StratifiedShuffleSplit
> from sklearn.preprocessing import StandardScaler
> from sklearn.linear_model import LogisticRegression
> from sklearn.metrics import accuracy_score
> import matplotlib.pyplot as plt
>
> RANDOM_SEED = 4
> n_inf = np.array([1, 5, 10, 15, 20, 25, 30, 35, 40])
>
> mean_training_score_array = np.array([])
> mean_testing_score_array = np.array([])
> for n_inf_value in n_inf:
>     X, y = make_classification(n_samples=2500,
>                                n_features=40,
>                                n_informative=n_inf_value,
>                                n_redundant=0,
>                                n_repeated=0,
>                                n_classes=2,
>                                n_clusters_per_class=1,
>                                random_state=RANDOM_SEED,
>                                shuffle=False)
>     #
>     print('Simulated data - number of informative features = ' +
>           str(n_inf_value))
>     #
>     sss = StratifiedShuffleSplit(n_splits=100, test_size=0.2,
>                                  random_state=RANDOM_SEED)
>     training_score_array = np.array([])
>     testing_score_array = np.array([])
>     for train_index_split, test_index_split in sss.split(X, y):
>         X_split_train, X_split_test = X[train_index_split], X[test_index_split]
>         y_split_train, y_split_test = y[train_index_split], y[test_index_split]
>         scaler = StandardScaler()
>         X_split_train = scaler.fit_transform(X_split_train)
>         X_split_test = scaler.transform(X_split_test)
>         lr = LogisticRegression(fit_intercept=True, max_iter=1e9, verbose=0,
>                                 random_state=RANDOM_SEED,
>                                 solver='lbfgs', tol=1e-6, C=10)
>         lr.fit(X_split_train, y_split_train)
>         y_pred_train = lr.predict(X_split_train)
>         y_pred_test = lr.predict(X_split_test)
>         accuracy_train_score = accuracy_score(y_split_train, y_pred_train)
>         accuracy_test_score = accuracy_score(y_split_test, y_pred_test)
>         training_score_array = np.append(training_score_array,
>                                          accuracy_train_score)
>         testing_score_array = np.append(testing_score_array,
>                                         accuracy_test_score)
>     mean_training_score_array = np.append(mean_training_score_array,
>                                           np.average(training_score_array))
>     mean_testing_score_array = np.append(mean_testing_score_array,
>                                          np.average(testing_score_array))
> #
> print('mean_training_score_array=' + str(mean_training_score_array))
> print('mean_testing_score_array=' + str(mean_testing_score_array))
> #
> plt.plot(n_inf, mean_training_score_array, 'r', label='mean training score')
> plt.plot(n_inf, mean_testing_score_array, 'g', label='mean testing score')
> plt.xlabel('number of informative features out of 40')
> plt.ylabel('accuracy')
> plt.legend()
> plt.show()
>
> _______________________________________________
> scikit-learn mailing list
> scikit-learn at python.org
> https://mail.python.org/mailman/listinfo/scikit-learn


From t3kcit at gmail.com  Fri Apr  3 10:51:15 2020
From: t3kcit at gmail.com (Andreas Mueller)
Date: Fri, 3 Apr 2020 10:51:15 -0400
Subject: [scikit-learn] Number of informative features vs total number of features
In-Reply-To: <525257ad-dad6-cf48-9749-4461452a3a72@u-bourgogne.fr>
References: <10c2473f-50e3-c959-b9f7-07c2b903c840@u-bourgogne.fr>
 <525257ad-dad6-cf48-9749-4461452a3a72@u-bourgogne.fr>
Message-ID: <809a5bde-f637-0a74-2d68-08f9b4d7ba7c@gmail.com>

Hi Ben.
I'd recommend you check the code to see how the data is generated.

Best,
Andy

On 4/3/20 7:00 AM, Benoît Presles wrote:
> Dear sklearn users,
>
> I have just checked if the generated features were independent by
> computing the covariance and correlation matrices and it seems they
> are, so I really do not understand my results.
> Any idea ?
>
> Thanks for your help,
> Best regards,
> Ben
>
>
> Le 31/03/2020 à 15:48, Benoît Presles a écrit :
>> Dear sklearn users,
>>
>> I did some supervised classification simulations with the
>> make_classification function from sklearn increasing the number of
>> informative features from 1 out of 40 to 40 out of 40 (100%). I did
>> not generate any repeated or redundant features. I fixed the number
>> of classes to two and the number of clusters per class to one.
>>
>> I split the dataset 100 times using the StratifiedShuffleSplit
>> function into two subsets: a training set and a test set (80% - 20%).
>> I performed a logistic regression and calculated training and testing
>> accuracies and averaged the results over the 100 splits leading to a
>> mean training accuracy and a mean testing accuracy.
>>
>> I was expecting to get an increasing accuracy score as a function of
>> informative features for both the training and the test sets. On the
>> contrary, I have got the best training and test scores for one
>> informative feature. Why do I get these results ?
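
One way to act on the suggestion to check how the data is generated is to look directly at the geometry of what make_classification returns for each n_informative value, for instance the distance between the two class centroids compared with the within-class spread. Below is a minimal sketch along those lines, reusing the generator settings from the script quoted below; the numbers it prints depend on the seed, so treat it only as a way of inspecting the data, not as an explanation of the scores.

import numpy as np
from sklearn.datasets import make_classification

RANDOM_SEED = 4
for n_inf_value in [1, 5, 10, 20, 40]:
    X, y = make_classification(n_samples=2500, n_features=40,
                               n_informative=n_inf_value, n_redundant=0,
                               n_repeated=0, n_classes=2,
                               n_clusters_per_class=1,
                               random_state=RANDOM_SEED, shuffle=False)
    # distance between the two class centroids in the full 40-feature space
    centroid_dist = np.linalg.norm(X[y == 0].mean(axis=0) - X[y == 1].mean(axis=0))
    # pooled within-class standard deviation, averaged over the features
    within_spread = 0.5 * (X[y == 0].std(axis=0).mean() + X[y == 1].std(axis=0).mean())
    print('n_informative=%2d  centroid distance=%.2f  within-class spread=%.2f'
          % (n_inf_value, centroid_dist, within_spread))
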
>>
>> Thanks for your help,
>> Best regards,
>> Ben
>>
>> Below the simulation code I have written:
>>
>> import numpy as np
>> from sklearn.datasets import make_classification
>> from sklearn.model_selection import StratifiedShuffleSplit
>> from sklearn.preprocessing import StandardScaler
>> from sklearn.linear_model import LogisticRegression
>> from sklearn.metrics import accuracy_score
>> import matplotlib.pyplot as plt
>>
>> RANDOM_SEED = 4
>> n_inf = np.array([1, 5, 10, 15, 20, 25, 30, 35, 40])
>>
>> mean_training_score_array = np.array([])
>> mean_testing_score_array = np.array([])
>> for n_inf_value in n_inf:
>>     X, y = make_classification(n_samples=2500,
>>                                n_features=40,
>>                                n_informative=n_inf_value,
>>                                n_redundant=0,
>>                                n_repeated=0,
>>                                n_classes=2,
>>                                n_clusters_per_class=1,
>>                                random_state=RANDOM_SEED,
>>                                shuffle=False)
>>     #
>>     print('Simulated data - number of informative features = ' +
>>           str(n_inf_value))
>>     #
>>     sss = StratifiedShuffleSplit(n_splits=100, test_size=0.2,
>>                                  random_state=RANDOM_SEED)
>>     training_score_array = np.array([])
>>     testing_score_array = np.array([])
>>     for train_index_split, test_index_split in sss.split(X, y):
>>         X_split_train, X_split_test = X[train_index_split], X[test_index_split]
>>         y_split_train, y_split_test = y[train_index_split], y[test_index_split]
>>         scaler = StandardScaler()
>>         X_split_train = scaler.fit_transform(X_split_train)
>>         X_split_test = scaler.transform(X_split_test)
>>         lr = LogisticRegression(fit_intercept=True, max_iter=1e9, verbose=0,
>>                                 random_state=RANDOM_SEED,
>>                                 solver='lbfgs', tol=1e-6, C=10)
>>         lr.fit(X_split_train, y_split_train)
>>         y_pred_train = lr.predict(X_split_train)
>>         y_pred_test = lr.predict(X_split_test)
>>         accuracy_train_score = accuracy_score(y_split_train, y_pred_train)
>>         accuracy_test_score = accuracy_score(y_split_test, y_pred_test)
>>         training_score_array = np.append(training_score_array,
>>                                          accuracy_train_score)
>>         testing_score_array = np.append(testing_score_array,
>>                                         accuracy_test_score)
>>     mean_training_score_array = np.append(mean_training_score_array,
>>                                           np.average(training_score_array))
>>    
mean_testing_score_array = np.append(mean_testing_score_array, >> np.average(testing_score_array)) >> # >> print('mean_training_score_array=' + str(mean_training_score_array)) >> print('mean_testing_score_array=' + str(mean_testing_score_array)) >> # >> plt.plot(n_inf, mean_training_score_array, 'r', label='mean training >> score') >> plt.plot(n_inf, mean_testing_score_array, 'g', label='mean testing >> score') >> plt.xlabel('number of informative features out of 40') >> plt.ylabel('accuracy') >> plt.legend() >> plt.show() >> >> _______________________________________________ >> scikit-learn mailing list >> scikit-learn at python.org >> https://mail.python.org/mailman/listinfo/scikit-learn > _______________________________________________ > scikit-learn mailing list > scikit-learn at python.org > https://mail.python.org/mailman/listinfo/scikit-learn From marmochiaskl at gmail.com Fri Apr 24 06:29:19 2020 From: marmochiaskl at gmail.com (Chiara Marmo) Date: Fri, 24 Apr 2020 12:29:19 +0200 Subject: [scikit-learn] April 27th scikit-learn monthly meeting Message-ID: Hi all, The next scikit-learn monthly meeting will take place on Monday April 27th at the usual time: https://www.timeanddate.com/worldclock/meetingdetails.html?year=2020&month=4&day=27&hour=12&min=0&sec=0&p1=240&p2=33&p3=37&p4=179&p5=195 While these meetings are mainly for core-devs to discuss the current topics, we're also happy to welcome non-core devs and other projects maintainers! Feel free to join, using the following link: https://anaconda.zoom.us/j/94399382811?pwd=cXBtQ2lTVEtVbFpVTkE3TVFxdEhqZz09 Meeting ID: 943 9938 2811 Password: 68473658 If you plan to attend and you would like to discuss something specific about your contribution please add your name (or github pseudo) in the "Issue and comments from contributors ", of the public pad: https://hackmd.io/5c6LxpnWSzeaBwJfuX5gPA *@core devs, please make sure to update your notes on Friday.* Best, Chiara -------------- next part -------------- An HTML attachment was scrubbed... URL: From paisanohermes at hotmail.com Fri Apr 24 06:36:32 2020 From: paisanohermes at hotmail.com (Hermes Morales) Date: Fri, 24 Apr 2020 10:36:32 +0000 Subject: [scikit-learn] April 27th scikit-learn monthly meeting In-Reply-To: References: Message-ID: Thank you Chiara Which is the usual time? Obtener Outlook para Android ________________________________ From: scikit-learn on behalf of Chiara Marmo Sent: Friday, April 24, 2020 7:29:19 AM To: Scikit-learn mailing list Subject: [scikit-learn] April 27th scikit-learn monthly meeting Hi all, The next scikit-learn monthly meeting will take place on Monday April 27th at the usual time: https://www.timeanddate.com/worldclock/meetingdetails.html?year=2020&month=4&day=27&hour=12&min=0&sec=0&p1=240&p2=33&p3=37&p4=179&p5=195 While these meetings are mainly for core-devs to discuss the current topics, we're also happy to welcome non-core devs and other projects maintainers! Feel free to join, using the following link: https://anaconda.zoom.us/j/94399382811?pwd=cXBtQ2lTVEtVbFpVTkE3TVFxdEhqZz09 Meeting ID: 943 9938 2811 Password: 68473658 If you plan to attend and you would like to discuss something specific about your contribution please add your name (or github pseudo) in the "Issue and comments from contributors", of the public pad: https://hackmd.io/5c6LxpnWSzeaBwJfuX5gPA @core devs, please make sure to update your notes on Friday. Best, Chiara -------------- next part -------------- An HTML attachment was scrubbed... 
URL: From faf96 at hotmail.it Fri Apr 24 06:37:43 2020 From: faf96 at hotmail.it (Francesco basciani) Date: Fri, 24 Apr 2020 10:37:43 +0000 Subject: [scikit-learn] Class weight SVC Message-ID: Hi, I have a question regarding the class weights in SVC. I have an imbalanced binary classification problem. In my case the ratio between the positive class and the negative class is 4:1. I just want to know if setting class weight to: class_weight = {1: 0.25, 0: 1} is the same as setting it to: class_weight = {1: 1, 0: 4}, because in my case I obtain different results using the two definitions of the class weight. Sent from Mail for Windows 10 -------------- next part -------------- An HTML attachment was scrubbed... URL: From adrin.jalali at gmail.com Fri Apr 24 06:40:47 2020 From: adrin.jalali at gmail.com (Adrin) Date: Fri, 24 Apr 2020 12:40:47 +0200 Subject: [scikit-learn] April 27th scikit-learn monthly meeting In-Reply-To: References: Message-ID: Hi Hermes, It's 12pm (noon) UTC Thanks for asking. Best, Adrin. On Fri, Apr 24, 2020 at 12:37 PM Hermes Morales wrote: > Thank you Chiara > Which is the usual time? > > Obtener Outlook para Android > > ------------------------------ > *From:* scikit-learn hotmail.com at python.org> on behalf of Chiara Marmo > *Sent:* Friday, April 24, 2020 7:29:19 AM > *To:* Scikit-learn mailing list > *Subject:* [scikit-learn] April 27th scikit-learn monthly meeting > > > Hi all, > > The next scikit-learn monthly meeting will take place on Monday April 27th > at the usual time: > https://www.timeanddate.com/worldclock/meetingdetails.html?year=2020&month=4&day=27&hour=12&min=0&sec=0&p1=240&p2=33&p3=37&p4=179&p5=195 > > While these meetings are mainly for core-devs to discuss the current > topics, we're also happy to welcome non-core devs and other projects > maintainers! Feel free to join, using the following link: > > https://anaconda.zoom.us/j/94399382811?pwd=cXBtQ2lTVEtVbFpVTkE3TVFxdEhqZz09 > > Meeting ID: 943 9938 2811 > Password: 68473658 > > If you plan to attend and you would like to discuss something specific > about your contribution please add your name (or github pseudo) in the "Issue > and comments from contributors > ", > of the public pad: > > https://hackmd.io/5c6LxpnWSzeaBwJfuX5gPA > > > > *@core devs, please make sure to update your notes on Friday. * > > > Best, > > Chiara > _______________________________________________ > scikit-learn mailing list > scikit-learn at python.org > https://mail.python.org/mailman/listinfo/scikit-learn > -------------- next part -------------- An HTML attachment was scrubbed... URL: From marmochiaskl at gmail.com Mon Apr 27 07:22:32 2020 From: marmochiaskl at gmail.com (Chiara Marmo) Date: Mon, 27 Apr 2020 13:22:32 +0200 Subject: [scikit-learn] April 27th scikit-learn monthly meeting In-Reply-To: References: Message-ID: Dear all, the zoom link used for the core-dev meeting had to be updated. The new link follows. Join the core-dev Zoom Meeting at https://us02web.zoom.us/j/2752786717 Meeting ID: 275 278 6717 See you there! Best, Chiara On Fri, Apr 24, 2020 at 12:29 PM Chiara Marmo wrote: > Hi all, > > The next scikit-learn monthly meeting will take place on Monday April 27th > at the usual time: > https://www.timeanddate.com/worldclock/meetingdetails.html?year=2020&month=4&day=27&hour=12&min=0&sec=0&p1=240&p2=33&p3=37&p4=179&p5=195 > > While these meetings are mainly for core-devs to discuss the current > topics, we're also happy to welcome non-core devs and other projects > maintainers!
Feel free to join, using the following link: > > https://anaconda.zoom.us/j/94399382811?pwd=cXBtQ2lTVEtVbFpVTkE3TVFxdEhqZz09 > > Meeting ID: 943 9938 2811 > Password: 68473658 > > If you plan to attend and you would like to discuss something specific > about your contribution please add your name (or github pseudo) in the "Issue > and comments from contributors > ", > of the public pad: > > https://hackmd.io/5c6LxpnWSzeaBwJfuX5gPA > > > > *@core devs, please make sure to update your notes on Friday.* > > > Best, > > Chiara > -------------- next part -------------- An HTML attachment was scrubbed... URL: From gael.varoquaux at normalesup.org Mon Apr 27 08:10:01 2020 From: gael.varoquaux at normalesup.org (Gael Varoquaux) Date: Mon, 27 Apr 2020 14:10:01 +0200 Subject: [scikit-learn] April 27th scikit-learn monthly meeting In-Reply-To: References: Message-ID: <20200427121001.grnvkrcylj7sohav@phare.normalesup.org> I seem to be failing to get this to work. Am I the only one? If not, we'll need a fallback. Any suggestions? We can use http://meet.jit.si/ or https://whereby.com/ but I don't know if they will handle the load. G On Mon, Apr 27, 2020 at 01:22:32PM +0200, Chiara Marmo wrote: > Dear all, > the zoom link used for the core-dev meeting had to be updated. > The new link follows. > Join the core-dev Zoom Meeting at > https://us02web.zoom.us/j/2752786717 > Meeting ID: 275 278 6717 > See you there! > Best, > Chiara > On Fri, Apr 24, 2020 at 12:29 PM Chiara Marmo wrote: > Hi all, > The next scikit-learn monthly meeting will take place on Monday April 27th > at the usual time: https://www.timeanddate.com/worldclock/ > meetingdetails.html?year=2020&month=4&day=27&hour=12&min=0&sec=0&p1=240&p2= > 33&p3=37&p4=179&p5=195 > While these meetings are mainly for core-devs to discuss the current > topics, we're also happy to welcome non-core devs and other projects > maintainers! Feel free to join, using the following link: > https://anaconda.zoom.us/j/94399382811?pwd=cXBtQ2lTVEtVbFpVTkE3TVFxdEhqZz09 > Meeting ID: 943 9938 2811 > Password: 68473658 > If you plan to attend and you would like to discuss something specific > about your contribution please add your name (or github pseudo) in the " > Issue and comments from contributors", of the public pad: > https://hackmd.io/5c6LxpnWSzeaBwJfuX5gPA > @core devs, please make sure to update your notes on Friday. > Best, > Chiara > _______________________________________________ > scikit-learn mailing list > scikit-learn at python.org > https://mail.python.org/mailman/listinfo/scikit-learn -- Gael Varoquaux Research Director, INRIA Visiting professor, McGill http://gael-varoquaux.info http://twitter.com/GaelVaroquaux From t3kcit at gmail.com Mon Apr 27 09:12:02 2020 From: t3kcit at gmail.com (Andreas Mueller) Date: Mon, 27 Apr 2020 09:12:02 -0400 Subject: [scikit-learn] Vote: Add Adrin Jalali to the scikit-learn technical committee Message-ID: <7d9ffac3-35d0-7e30-9c96-3c125b4f9fe7@gmail.com> Hi All. Given all his recent contributions, I want to nominate Adrin Jalali to the Technical Committee: https://scikit-learn.org/stable/governance.html#technical-committee According to the governance document, this will require a discussion and vote. I think we can move to the vote immediately unless someone objects. Thanks for all your work Adrin! 
Cheers, Andy From gael.varoquaux at normalesup.org Mon Apr 27 09:16:11 2020 From: gael.varoquaux at normalesup.org (Gael Varoquaux) Date: Mon, 27 Apr 2020 15:16:11 +0200 Subject: [scikit-learn] Vote: Add Adrin Jalali to the scikit-learn technical committee In-Reply-To: <7d9ffac3-35d0-7e30-9c96-3c125b4f9fe7@gmail.com> References: <7d9ffac3-35d0-7e30-9c96-3c125b4f9fe7@gmail.com> Message-ID: <20200427131611.yctxxzyeqq5tqc4g@phare.normalesup.org> +1 And thank you very much Adrin! On Mon, Apr 27, 2020 at 09:12:02AM -0400, Andreas Mueller wrote: > Hi All. > Given all his recent contributions, I want to nominate Adrin Jalali to the > Technical Committee: > https://scikit-learn.org/stable/governance.html#technical-committee > According to the governance document, this will require a discussion and > vote. > I think we can move to the vote immediately unless someone objects. > Thanks for all your work Adrin! > Cheers, > Andy > _______________________________________________ > scikit-learn mailing list > scikit-learn at python.org > https://mail.python.org/mailman/listinfo/scikit-learn -- Gael Varoquaux Research Director, INRIA Visiting professor, McGill http://gael-varoquaux.info http://twitter.com/GaelVaroquaux From niourf at gmail.com Mon Apr 27 09:18:58 2020 From: niourf at gmail.com (Nicolas Hug) Date: Mon, 27 Apr 2020 09:18:58 -0400 Subject: [scikit-learn] Vote: Add Adrin Jalali to the scikit-learn technical committee In-Reply-To: <20200427131611.yctxxzyeqq5tqc4g@phare.normalesup.org> References: <7d9ffac3-35d0-7e30-9c96-3c125b4f9fe7@gmail.com> <20200427131611.yctxxzyeqq5tqc4g@phare.normalesup.org> Message-ID: <19e1fa51-810a-1a3d-74c3-448182f1244a@gmail.com> +1 On 4/27/20 9:16 AM, Gael Varoquaux wrote: > +1 > > And thank you very much Adrin! > > On Mon, Apr 27, 2020 at 09:12:02AM -0400, Andreas Mueller wrote: >> Hi All. >> Given all his recent contributions, I want to nominate Adrin Jalali to the >> Technical Committee: >> https://scikit-learn.org/stable/governance.html#technical-committee >> According to the governance document, this will require a discussion and >> vote. >> I think we can move to the vote immediately unless someone objects. >> Thanks for all your work Adrin! >> Cheers, >> Andy >> _______________________________________________ >> scikit-learn mailing list >> scikit-learn at python.org >> https://mail.python.org/mailman/listinfo/scikit-learn From jeremie.du-boisberranger at inria.fr Mon Apr 27 09:20:42 2020 From: jeremie.du-boisberranger at inria.fr (Jeremie du Boisberranger) Date: Mon, 27 Apr 2020 15:20:42 +0200 Subject: [scikit-learn] Vote: Add Adrin Jalali to the scikit-learn technical committee In-Reply-To: <19e1fa51-810a-1a3d-74c3-448182f1244a@gmail.com> References: <7d9ffac3-35d0-7e30-9c96-3c125b4f9fe7@gmail.com> <20200427131611.yctxxzyeqq5tqc4g@phare.normalesup.org> <19e1fa51-810a-1a3d-74c3-448182f1244a@gmail.com> Message-ID: <596df3e1-e15a-4aae-dea9-e9d9935bda9b@inria.fr> +1 On 27/04/2020 15:18, Nicolas Hug wrote: > +1 > > On 4/27/20 9:16 AM, Gael Varoquaux wrote: >> +1 >> >> And thank you very much Adrin! >> >> On Mon, Apr 27, 2020 at 09:12:02AM -0400, Andreas Mueller wrote: >>> Hi All. >>> Given all his recent contributions, I want to nominate Adrin Jalali >>> to the >>> Technical Committee: >>> https://scikit-learn.org/stable/governance.html#technical-committee >>> According to the governance document, this will require a discussion >>> and >>> vote. >>> I think we can move to the vote immediately unless someone objects. >>> Thanks for all your work Adrin! 
>>> Cheers, >>> Andy >>> _______________________________________________ >>> scikit-learn mailing list >>> scikit-learn at python.org >>> https://mail.python.org/mailman/listinfo/scikit-learn > _______________________________________________ > scikit-learn mailing list > scikit-learn at python.org > https://mail.python.org/mailman/listinfo/scikit-learn From rth.yurchak at gmail.com Mon Apr 27 09:28:36 2020 From: rth.yurchak at gmail.com (Roman Yurchak) Date: Mon, 27 Apr 2020 15:28:36 +0200 Subject: [scikit-learn] Vote: Add Adrin Jalali to the scikit-learn technical committee In-Reply-To: <596df3e1-e15a-4aae-dea9-e9d9935bda9b@inria.fr> References: <7d9ffac3-35d0-7e30-9c96-3c125b4f9fe7@gmail.com> <20200427131611.yctxxzyeqq5tqc4g@phare.normalesup.org> <19e1fa51-810a-1a3d-74c3-448182f1244a@gmail.com> <596df3e1-e15a-4aae-dea9-e9d9935bda9b@inria.fr> Message-ID: <33f6d22c-4d60-e103-3b89-68e4b2b4f996@gmail.com> +1 On 27/04/2020 15:20, Jeremie du Boisberranger wrote: > +1 > > On 27/04/2020 15:18, Nicolas Hug wrote: >> +1 >> >> On 4/27/20 9:16 AM, Gael Varoquaux wrote: >>> +1 >>> >>> And thank you very much Adrin! >>> >>> On Mon, Apr 27, 2020 at 09:12:02AM -0400, Andreas Mueller wrote: >>>> Hi All. >>>> Given all his recent contributions, I want to nominate Adrin Jalali >>>> to the >>>> Technical Committee: >>>> https://scikit-learn.org/stable/governance.html#technical-committee >>>> According to the governance document, this will require a discussion >>>> and >>>> vote. >>>> I think we can move to the vote immediately unless someone objects. >>>> Thanks for all your work Adrin! >>>> Cheers, >>>> Andy >>>> _______________________________________________ >>>> scikit-learn mailing list >>>> scikit-learn at python.org >>>> https://mail.python.org/mailman/listinfo/scikit-learn >> _______________________________________________ >> scikit-learn mailing list >> scikit-learn at python.org >> https://mail.python.org/mailman/listinfo/scikit-learn > _______________________________________________ > scikit-learn mailing list > scikit-learn at python.org > https://mail.python.org/mailman/listinfo/scikit-learn From qinhanmin2005 at sina.com Mon Apr 27 09:29:00 2020 From: qinhanmin2005 at sina.com (Hanmin Qin) Date: Mon, 27 Apr 2020 21:29:00 +0800 Subject: [scikit-learn] Vote: Add Adrin Jalali to the scikit-learn technical committee Message-ID: <20200427132900.E38F02D0009D@webmail.sinamail.sina.com.cn> +1 Hanmin Qin ----- Original Message ----- From: Jeremie du Boisberranger To: scikit-learn at python.org Subject: Re: [scikit-learn] Vote: Add Adrin Jalali to the scikit-learn technical committee Date: 2020-04-27 21:23 +1 On 27/04/2020 15:18, Nicolas Hug wrote: > +1 > > On 4/27/20 9:16 AM, Gael Varoquaux wrote: >> +1 >> >> And thank you very much Adrin! >> >> On Mon, Apr 27, 2020 at 09:12:02AM -0400, Andreas Mueller wrote: >>> Hi All. >>> Given all his recent contributions, I want to nominate Adrin Jalali >>> to the >>> Technical Committee: >>> https://scikit-learn.org/stable/governance.html#technical-committee >>> According to the governance document, this will require a discussion >>> and >>> vote. >>> I think we can move to the vote immediately unless someone objects. >>> Thanks for all your work Adrin! 
>>> Cheers, >>> Andy >>> _______________________________________________ >>> scikit-learn mailing list >>> scikit-learn at python.org >>> https://mail.python.org/mailman/listinfo/scikit-learn > _______________________________________________ > scikit-learn mailing list > scikit-learn at python.org > https://mail.python.org/mailman/listinfo/scikit-learn _______________________________________________ scikit-learn mailing list scikit-learn at python.org https://mail.python.org/mailman/listinfo/scikit-learn -------------- next part -------------- An HTML attachment was scrubbed... URL: From bertrand.thirion at inria.fr Mon Apr 27 09:29:30 2020 From: bertrand.thirion at inria.fr (bthirion) Date: Mon, 27 Apr 2020 15:29:30 +0200 Subject: [scikit-learn] Vote: Add Adrin Jalali to the scikit-learn technical committee In-Reply-To: <33f6d22c-4d60-e103-3b89-68e4b2b4f996@gmail.com> References: <7d9ffac3-35d0-7e30-9c96-3c125b4f9fe7@gmail.com> <20200427131611.yctxxzyeqq5tqc4g@phare.normalesup.org> <19e1fa51-810a-1a3d-74c3-448182f1244a@gmail.com> <596df3e1-e15a-4aae-dea9-e9d9935bda9b@inria.fr> <33f6d22c-4d60-e103-3b89-68e4b2b4f996@gmail.com> Message-ID: +1 On 27/04/2020 15:28, Roman Yurchak wrote: > +1 > > On 27/04/2020 15:20, Jeremie du Boisberranger wrote: >> +1 >> >> On 27/04/2020 15:18, Nicolas Hug wrote: >>> +1 >>> >>> On 4/27/20 9:16 AM, Gael Varoquaux wrote: >>>> +1 >>>> >>>> And thank you very much Adrin! >>>> >>>> On Mon, Apr 27, 2020 at 09:12:02AM -0400, Andreas Mueller wrote: >>>>> Hi All. >>>>> Given all his recent contributions, I want to nominate Adrin >>>>> Jalali to the >>>>> Technical Committee: >>>>> https://scikit-learn.org/stable/governance.html#technical-committee >>>>> According to the governance document, this will require a >>>>> discussion and >>>>> vote. >>>>> I think we can move to the vote immediately unless someone objects. >>>>> Thanks for all your work Adrin! >>>>> Cheers, >>>>> Andy >>>>> _______________________________________________ >>>>> scikit-learn mailing list >>>>> scikit-learn at python.org >>>>> https://mail.python.org/mailman/listinfo/scikit-learn >>> _______________________________________________ >>> scikit-learn mailing list >>> scikit-learn at python.org >>> https://mail.python.org/mailman/listinfo/scikit-learn >> _______________________________________________ >> scikit-learn mailing list >> scikit-learn at python.org >> https://mail.python.org/mailman/listinfo/scikit-learn > > _______________________________________________ > scikit-learn mailing list > scikit-learn at python.org > https://mail.python.org/mailman/listinfo/scikit-learn From rth.yurchak at gmail.com Mon Apr 27 09:30:49 2020 From: rth.yurchak at gmail.com (Roman Yurchak) Date: Mon, 27 Apr 2020 15:30:49 +0200 Subject: [scikit-learn] Voting software In-Reply-To: <596df3e1-e15a-4aae-dea9-e9d9935bda9b@inria.fr> References: <596df3e1-e15a-4aae-dea9-e9d9935bda9b@inria.fr> Message-ID: <92e6e396-9439-269d-ee5e-59b47652191a@gmail.com> BTW, could we use some online voting software for votes? Just to avoid filling public email threads with +1s. For instance CPython uses https://www.python.org/dev/peps/pep-8001/ but it is anonymous. Does anyone know a simple non anonymous one preferably linked to Github authentication? On 27/04/2020 15:18, Nicolas Hug wrote: > +1 > > On 4/27/20 9:16 AM, Gael Varoquaux wrote: >> +1 >> >> And thank you very much Adrin!
>> >> On Mon, Apr 27, 2020 at 09:12:02AM -0400, Andreas Mueller wrote: >>> Hi All. >>> Given all his recent contributions, I want to nominate Adrin Jalali >>> to the >>> Technical Committee: >>> https://scikit-learn.org/stable/governance.html#technical-committee >>> According to the governance document, this will require a discussion >>> and >>> vote. >>> I think we can move to the vote immediately unless someone objects. >>> Thanks for all your work Adrin! >>> Cheers, >>> Andy >>> _______________________________________________ >>> scikit-learn mailing list >>> scikit-learn at python.org >>> https://mail.python.org/mailman/listinfo/scikit-learn > _______________________________________________ > scikit-learn mailing list > scikit-learn at python.org > https://mail.python.org/mailman/listinfo/scikit-learn _______________________________________________ scikit-learn mailing list scikit-learn at python.org https://mail.python.org/mailman/listinfo/scikit-learn From thomasjpfan at gmail.com Mon Apr 27 09:32:05 2020 From: thomasjpfan at gmail.com (Thomas J Fan) Date: Mon, 27 Apr 2020 09:32:05 -0400 Subject: [scikit-learn] Vote: Add Adrin Jalali to the scikit-learn technical committee In-Reply-To: <7d9ffac3-35d0-7e30-9c96-3c125b4f9fe7@gmail.com> References: <7d9ffac3-35d0-7e30-9c96-3c125b4f9fe7@gmail.com> Message-ID: <8f0f6d85-2d52-4bb5-bd5d-c13d60377364@Canary> +1 > On Monday, Apr 27, 2020 at 9:14 AM, Andreas Mueller wrote: > Hi All. > > Given all his recent contributions, I want to nominate Adrin Jalali to > the Technical Committee: > https://scikit-learn.org/stable/governance.html#technical-committee > > According to the governance document, this will require a discussion and > vote. > I think we can move to the vote immediately unless someone objects. > > Thanks for all your work Adrin! > > Cheers, > Andy > _______________________________________________ > scikit-learn mailing list > scikit-learn at python.org > https://mail.python.org/mailman/listinfo/scikit-learn -------------- next part -------------- An HTML attachment was scrubbed... URL: From alexandre.gramfort at inria.fr Mon Apr 27 09:59:23 2020 From: alexandre.gramfort at inria.fr (Alexandre Gramfort) Date: Mon, 27 Apr 2020 15:59:23 +0200 Subject: [scikit-learn] Vote: Add Adrin Jalali to the scikit-learn technical committee In-Reply-To: <8f0f6d85-2d52-4bb5-bd5d-c13d60377364@Canary> References: <7d9ffac3-35d0-7e30-9c96-3c125b4f9fe7@gmail.com> <8f0f6d85-2d52-4bb5-bd5d-c13d60377364@Canary> Message-ID: +1 -------------- next part -------------- An HTML attachment was scrubbed... URL: From paisanohermes at hotmail.com Mon Apr 27 11:09:12 2020 From: paisanohermes at hotmail.com (Hermes Morales) Date: Mon, 27 Apr 2020 15:09:12 +0000 Subject: [scikit-learn] Voting software In-Reply-To: <92e6e396-9439-269d-ee5e-59b47652191a@gmail.com> References: <596df3e1-e15a-4aae-dea9-e9d9935bda9b@inria.fr>, <92e6e396-9439-269d-ee5e-59b47652191a@gmail.com> Message-ID: https://doodle.com/es/ is not bad Obtener Outlook para Android ________________________________ From: scikit-learn on behalf of Roman Yurchak Sent: Monday, April 27, 2020 10:30:49 AM To: Scikit-learn user and developer mailing list Subject: Re: [scikit-learn] Voting software BTW, could we use some online voting software for votes? Just to avoid filling public email threads with +1s. For instance CPython uses https://www.python.org/dev/peps/pep-8001/ but it is anonymous. Does anyone know a simple non anonymous one preferably linked to Github authentication? 
On 27/04/2020 15:18, Nicolas Hug wrote: > +1 > > On 4/27/20 9:16 AM, Gael Varoquaux wrote: >> +1 >> >> And thank you very much Adrin! >> >> On Mon, Apr 27, 2020 at 09:12:02AM -0400, Andreas Mueller wrote: >>> Hi All. >>> Given all his recent contributions, I want to nominate Adrin Jalali >>> to the >>> Technical Committee: >>> https://scikit-learn.org/stable/governance.html#technical-committee >>> According to the governance document, this will require a discussion >>> and >>> vote. >>> I think we can move to the vote immediately unless someone objects. >>> Thanks for all your work Adrin! >>> Cheers, >>> Andy >>> _______________________________________________ >>> scikit-learn mailing list >>> scikit-learn at python.org >>> https://mail.python.org/mailman/listinfo/scikit-learn > _______________________________________________ > scikit-learn mailing list > scikit-learn at python.org > https://mail.python.org/mailman/listinfo/scikit-learn _______________________________________________ scikit-learn mailing list scikit-learn at python.org https://mail.python.org/mailman/listinfo/scikit-learn _______________________________________________ scikit-learn mailing list scikit-learn at python.org https://mail.python.org/mailman/listinfo/scikit-learn -------------- next part -------------- An HTML attachment was scrubbed... URL: From tom.duprelatour at orange.fr Mon Apr 27 12:21:54 2020 From: tom.duprelatour at orange.fr (Tom DLT) Date: Mon, 27 Apr 2020 09:21:54 -0700 Subject: [scikit-learn] Vote: Add Adrin Jalali to the scikit-learn technical committee In-Reply-To: References: <7d9ffac3-35d0-7e30-9c96-3c125b4f9fe7@gmail.com> <8f0f6d85-2d52-4bb5-bd5d-c13d60377364@Canary> Message-ID: +1 Le lun. 27 avr. 2020, ? 07 h 00, Alexandre Gramfort < alexandre.gramfort at inria.fr> a ?crit : > +1 > _______________________________________________ > scikit-learn mailing list > scikit-learn at python.org > https://mail.python.org/mailman/listinfo/scikit-learn > -------------- next part -------------- An HTML attachment was scrubbed... URL: From joel.nothman at gmail.com Mon Apr 27 19:34:08 2020 From: joel.nothman at gmail.com (Joel Nothman) Date: Tue, 28 Apr 2020 09:34:08 +1000 Subject: [scikit-learn] Vote: Add Adrin Jalali to the scikit-learn technical committee In-Reply-To: References: <7d9ffac3-35d0-7e30-9c96-3c125b4f9fe7@gmail.com> <8f0f6d85-2d52-4bb5-bd5d-c13d60377364@Canary> Message-ID: +1 On Tue, 28 Apr 2020 at 02:23, Tom DLT wrote: > +1 > > Le lun. 27 avr. 2020, ? 07 h 00, Alexandre Gramfort < > alexandre.gramfort at inria.fr> a ?crit : > >> +1 >> _______________________________________________ >> scikit-learn mailing list >> scikit-learn at python.org >> https://mail.python.org/mailman/listinfo/scikit-learn >> > _______________________________________________ > scikit-learn mailing list > scikit-learn at python.org > https://mail.python.org/mailman/listinfo/scikit-learn > -------------- next part -------------- An HTML attachment was scrubbed... 
URL: From dabruro at gmail.com Tue Apr 28 13:41:14 2020 From: dabruro at gmail.com (David R) Date: Tue, 28 Apr 2020 13:41:14 -0400 Subject: [scikit-learn] precision_recall_curve giving incorrect results on very small example Message-ID: Here is a very small example using precision_recall_curve(): from sklearn.metrics import precision_recall_curve, precision_score, recall_score y_true = [0, 1] y_predict_proba = [0.25,0.75] precision, recall, thresholds = precision_recall_curve(y_true, y_predict_proba) precision, recall which results in: (array([1., 1.]), array([1., 0.])) Now let's calculate manually to see whether that's correct. There are three possible class vectors depending on threshold: [0,0], [0,1], and [1,1]. We have to discard [0,0] because it gives an undefined precision (divide by zero). So, applying precision_score() and recall_score() to the other two: y_predict_class=[0,1] precision_score(y_true, y_predict_class), recall_score(y_true, y_predict_class) which gives: (1.0, 1.0) and y_predict_class=[1,1] precision_score(y_true, y_predict_class), recall_score(y_true, y_predict_class) which gives (0.5, 1.0) This seems not to match the output of precision_recall_curve() (which for example did not produce a 0.5 precision value). Am I missing something? -------------- next part -------------- An HTML attachment was scrubbed... URL: From jorisvandenbossche at gmail.com Tue Apr 28 14:56:11 2020 From: jorisvandenbossche at gmail.com (Joris Van den Bossche) Date: Tue, 28 Apr 2020 20:56:11 +0200 Subject: [scikit-learn] Vote: Add Adrin Jalali to the scikit-learn technical committee In-Reply-To: References: <7d9ffac3-35d0-7e30-9c96-3c125b4f9fe7@gmail.com> <8f0f6d85-2d52-4bb5-bd5d-c13d60377364@Canary> Message-ID: +1 On Tue, 28 Apr 2020 at 01:34, Joel Nothman wrote: > +1 > > On Tue, 28 Apr 2020 at 02:23, Tom DLT wrote: > >> +1 >> >> Le lun. 27 avr. 2020, ? 07 h 00, Alexandre Gramfort < >> alexandre.gramfort at inria.fr> a ?crit : >> >>> +1 >>> _______________________________________________ >>> scikit-learn mailing list >>> scikit-learn at python.org >>> https://mail.python.org/mailman/listinfo/scikit-learn >>> >> _______________________________________________ >> scikit-learn mailing list >> scikit-learn at python.org >> https://mail.python.org/mailman/listinfo/scikit-learn >> > _______________________________________________ > scikit-learn mailing list > scikit-learn at python.org > https://mail.python.org/mailman/listinfo/scikit-learn > -------------- next part -------------- An HTML attachment was scrubbed... URL: From tjkeding at gmail.com Tue Apr 28 15:06:00 2020 From: tjkeding at gmail.com (Taylor J Keding) Date: Tue, 28 Apr 2020 14:06:00 -0500 Subject: [scikit-learn] MLPClassifier/Regressor and Kernel Processes when Multiprocessing Message-ID: Hi SciKit-Learn folks, I am building a stacked generalization classifier using the multilayer perceptron classifier as one of it's submodels. All data have been preprocessed appropriately and I am tuning each submodel's hyperparameters with a customized randomized search protocol (very similar to sklearn's RandomizedSearchCV). Importantly, I am using Python's Multiprocessing.Pool() to parallelize this search. When I start the hyperparameter search, jobs/threads do indeed spawn appropriately. Tuning other submodels (RandomForestClassifier, SVC, GradientBoostingClassifier, SDGClassifier) works perfectly, which each job (model with particular randomized parameters) being scored with cross_val_score and returning when the Pool of workers is complete. 
All is well until I reach the MLPClassifier model. Jobs spawn as with the other models, however, System CPU (Linux Kernel) processes surge and overwhelm my server. Approximately 20% of the CPUs are running User processes, while the other 80% of CPUS are running System/Kernel processes, causing immense slow-down. Again, this only happens with the MLPClassifier - all other models run appropriately with ~98% User processes and ~2% System/Kernel processes. Is there something unique in the MLPClassifier/Regressor models that causes increased System/Kernel processes compared to other models? In an attempt to troubleshoot, I used sklearn's RandomizedSearchCV instead of my custom implementation and the same problems happen (with n_jobs specified in the same way). Any help with why the MLP models are behaving this way during multiprocessing is much appreciated. Best, Taylor Keding -------------- next part -------------- An HTML attachment was scrubbed... URL: From rth.yurchak at gmail.com Tue Apr 28 15:21:48 2020 From: rth.yurchak at gmail.com (Roman Yurchak) Date: Tue, 28 Apr 2020 21:21:48 +0200 Subject: [scikit-learn] Voting software In-Reply-To: References: <596df3e1-e15a-4aae-dea9-e9d9935bda9b@inria.fr> <92e6e396-9439-269d-ee5e-59b47652191a@gmail.com> Message-ID: <50978a09-c9aa-6ac5-98d6-7eaf05a35f4e@gmail.com> True, but ideally it would need to be something more voting oriented that cannot be modified later on and archives a history of past decisions. On 27/04/2020 17:09, Hermes Morales wrote: > https://doodle.com/es/ is not bad > > Obtener Outlook para Android > > ------------------------------------------------------------------------ > *From:* scikit-learn > on behalf of > Roman Yurchak > *Sent:* Monday, April 27, 2020 10:30:49 AM > *To:* Scikit-learn user and developer mailing list > *Subject:* Re: [scikit-learn] Voting software > BTW, could we use some online voting software for votes? Just to avoid > filling public email threads with +1s. For instance CPython uses > https://www.python.org/dev/peps/pep-8001/ but it is anonymous. Does > anyone know a simple non anonymous one preferably linked to Github > authentication? > > On 27/04/2020 15:18, Nicolas Hug wrote: >> +1 >> >> On 4/27/20 9:16 AM, Gael Varoquaux wrote: >>> +1 >>> >>> And thank you very much Adrin! >>> >>> On Mon, Apr 27, 2020 at 09:12:02AM -0400, Andreas Mueller wrote: >>>> Hi All. >>>> Given all his recent contributions, I want to nominate Adrin Jalali >>>> to the >>>> Technical Committee: >>>> https://scikit-learn.org/stable/governance.html#technical-committee >>>> According to the governance document, this will require a discussion >>>> and >>>> vote. >>>> I think we can move to the vote immediately unless someone objects. >>>> Thanks for all your work Adrin! 
>>>> Cheers, >>>> Andy >>>> _______________________________________________ >>>> scikit-learn mailing list >>>> scikit-learn at python.org >>>> https://mail.python.org/mailman/listinfo/scikit-learn >> _______________________________________________ >> scikit-learn mailing list >> scikit-learn at python.org >> https://mail.python.org/mailman/listinfo/scikit-learn > _______________________________________________ > scikit-learn mailing list > scikit-learn at python.org > https://mail.python.org/mailman/listinfo/scikit-learn > _______________________________________________ > scikit-learn mailing list > scikit-learn at python.org > https://mail.python.org/mailman/listinfo/scikit-learn > > _______________________________________________ > scikit-learn mailing list > scikit-learn at python.org > https://mail.python.org/mailman/listinfo/scikit-learn > From gael.varoquaux at normalesup.org Tue Apr 28 18:18:25 2020 From: gael.varoquaux at normalesup.org (Gael Varoquaux) Date: Wed, 29 Apr 2020 00:18:25 +0200 Subject: [scikit-learn] MLPClassifier/Regressor and Kernel Processes when Multiprocessing In-Reply-To: References: Message-ID: <20200428221825.wb4j4nlbzpdgshnx@phare.normalesup.org> Hi, I cannot look too much in details. However, I would advice you to try using loky or joblib instead of multiprocessing, as a lot of work has been put in them to protect against problems that can arise in multi-process parallel computing (for instance the underlying numerical libraries may not be fork safe, or they may have parallel computing abilities themselves). Hope this helps, Ga?l On Tue, Apr 28, 2020 at 02:06:00PM -0500, Taylor J Keding wrote: > Hi SciKit-Learn folks, > I am building a stacked generalization classifier using the multilayer > perceptron classifier?as one of it's submodels. All data have been preprocessed > appropriately and I am tuning each submodel's?hyperparameters with a customized > randomized search protocol (very similar to sklearn's RandomizedSearchCV). > Importantly, I am using Python's Multiprocessing.Pool() to parallelize this > search. > When I start the hyperparameter search, jobs/threads do indeed spawn > appropriately. Tuning other submodels (RandomForestClassifier, SVC, > GradientBoostingClassifier, SDGClassifier) works perfectly, which each job > (model with particular randomized parameters) being scored with cross_val_score > and returning when the Pool of workers is complete. All is well until I reach > the MLPClassifier model. Jobs spawn as with the other models, however, System > CPU (Linux Kernel) processes surge and overwhelm my server. Approximately 20% > of the CPUs are running User processes, while the other 80% of CPUS are running > System/Kernel processes,?causing immense slow-down. Again, this only happens > with the MLPClassifier?- all other models run appropriately with ~98% User > processes and ~2% System/Kernel processes. > Is there something unique in the MLPClassifier/Regressor models that causes > increased System/Kernel processes compared to other models? In an attempt to > troubleshoot, I used sklearn's?RandomizedSearchCV instead of my custom > implementation and the same problems happen (with n_jobs specified in the same > way). > Any help with why the MLP models are behaving this way during multiprocessing > is much appreciated. 
> Best, > Taylor Keding > _______________________________________________ > scikit-learn mailing list > scikit-learn at python.org > https://mail.python.org/mailman/listinfo/scikit-learn -- Gael Varoquaux Research Director, INRIA Visiting professor, McGill http://gael-varoquaux.info http://twitter.com/GaelVaroquaux From ahowe42 at gmail.com Thu Apr 30 13:05:48 2020 From: ahowe42 at gmail.com (Andrew Howe) Date: Thu, 30 Apr 2020 18:05:48 +0100 Subject: [scikit-learn] StackingClassifier Message-ID: Hi All Quick question about the stacking classifier . How do I know the order of the features that the final estimator uses? I've got an example which I've created like this (the LGRG and KSVM objects were previously defined, but as they seem they would be): passThrough = True finalEstim = DecisionTreeClassifier(random_state=42) stkClas = StackingClassifier(estimators=[('Logistic Regression', LGRG), ('Kernel SVM', KSVM)], cv=crossValInput, passthrough=passThrough, final_estimator=finalEstim, n_jobs=-1) Given this setup, I *think* the features input to the final estimator are - Logistic regression prediction probabilities for all classes - Kernel SVM prediction probabilities for all classes - original features of data passed into the stacking classifier I can find no documentation on this, though, and don't know of any relevant attribute on the final estimator. I need this to help interpret the final estimator tree - and specifically to provide feature labels for plot_tree. Thanks! Andrew <~~~~~~~~~~~~~~~~~~~~~~~~~~~> J. Andrew Howe, PhD LinkedIn Profile ResearchGate Profile Open Researcher and Contributor ID (ORCID) Github Profile Personal Website I live to learn, so I can learn to live. - me <~~~~~~~~~~~~~~~~~~~~~~~~~~~> -------------- next part -------------- An HTML attachment was scrubbed... URL: From tmrsg11 at gmail.com Thu Apr 30 15:55:00 2020 From: tmrsg11 at gmail.com (C W) Date: Thu, 30 Apr 2020 15:55:00 -0400 Subject: [scikit-learn] Why does sklearn require one-hot-encoding for categorical features? Can we have a "factor" data type? Message-ID: Hello everyone, I am frustrated with the one-hot-encoding requirement for categorical feature. Why? I've used R and Stata software, none needs such transformation. They have a data type called "factors", which is different from "numeric". My problem with OHE: One-hot-encoding results in large number of features. This really blows up quickly. And I have to fight curse of dimensionality with PCA reduction. That's not cool! Can sklearn have a "factor" data type in the future? It would make life so much easier. Thanks a lot! -------------- next part -------------- An HTML attachment was scrubbed... URL: From michael.eickenberg at gmail.com Thu Apr 30 16:06:09 2020 From: michael.eickenberg at gmail.com (Michael Eickenberg) Date: Thu, 30 Apr 2020 16:06:09 -0400 Subject: [scikit-learn] Why does sklearn require one-hot-encoding for categorical features? Can we have a "factor" data type? In-Reply-To: References: Message-ID: Hi, I think there are many reasons that have led to the current situation. One is that scikit-learn is based on numpy arrays, which do not offer categorical data types (yet: ideas are being discussed https://numpy.org/neps/nep-0041-improved-dtype-support.html Pandas already has a categorical data type https://pandas.pydata.org/pandas-docs/stable/user_guide/categorical.html) For algorithms like random forests, having categorical variables would be absolutely great. 
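
As a point of reference, a tiny sketch of the pandas categorical dtype just mentioned: the column is stored as small integer codes plus an index of categories, rather than as one 0/1 column per level. The values shown in the comments are indicative only.

import pandas as pd

s = pd.Series(['red', 'green', 'red', 'blue'], dtype='category')
print(s.cat.categories)      # e.g. Index(['blue', 'green', 'red'], dtype='object')
print(s.cat.codes.tolist())  # one integer code per row, e.g. [2, 1, 2, 0]
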
Another reason might be different communities handling categorical data in different ways traditionally. One-hot-encoding is more common on the ML side than on the stats side for instance. To your point: > One-hot-encoding results in large number of features. This really blows up quickly. And I have to fight curse of dimensionality with PCA reduction. That's not cool! Depending on the algorithm being used, a categorical variable may or may not need to be expanded into one-hot dimension encoding under the hood, so the potential gain of having such a data encoding method is highly dependent on the algorithms used. Hope this helps! Michael On Thu, Apr 30, 2020 at 3:57 PM C W wrote: > Hello everyone, > > I am frustrated with the one-hot-encoding requirement for categorical > feature. Why? > > I've used R and Stata software, none needs such transformation. They have > a data type called "factors", which is different from "numeric". > > My problem with OHE: > One-hot-encoding results in large number of features. This really blows up > quickly. And I have to fight curse of dimensionality with PCA reduction. > That's not cool! > > Can sklearn have a "factor" data type in the future? It would make life so > much easier. > > Thanks a lot! > > > > > _______________________________________________ > scikit-learn mailing list > scikit-learn at python.org > https://mail.python.org/mailman/listinfo/scikit-learn > -------------- next part -------------- An HTML attachment was scrubbed... URL: From gael.varoquaux at normalesup.org Thu Apr 30 16:12:06 2020 From: gael.varoquaux at normalesup.org (Gael Varoquaux) Date: Thu, 30 Apr 2020 22:12:06 +0200 Subject: [scikit-learn] Why does sklearn require one-hot-encoding for categorical features? Can we have a "factor" data type? In-Reply-To: References: Message-ID: <20200430201206.75tl2ohkxo5yerlo@phare.normalesup.org> On Thu, Apr 30, 2020 at 03:55:00PM -0400, C W wrote: > I've used R and Stata software, none needs such?transformation. They have a > data type called "factors", which is different from "numeric". > My problem with OHE: > One-hot-encoding results in large number of features. This really blows up > quickly. And I have to fight curse of dimensionality with PCA reduction. That's > not cool! Most statistical models still not one-hot encoding behind the hood. So, R and stata do it too. Typically, tree-based models can be adapted to work directly on categorical data. Ours don't. It's work in progress. G From paisanohermes at hotmail.com Thu Apr 30 18:15:12 2020 From: paisanohermes at hotmail.com (Hermes Morales) Date: Thu, 30 Apr 2020 22:15:12 +0000 Subject: [scikit-learn] Why does sklearn require one-hot-encoding for categorical features? Can we have a "factor" data type? In-Reply-To: <20200430201206.75tl2ohkxo5yerlo@phare.normalesup.org> References: , <20200430201206.75tl2ohkxo5yerlo@phare.normalesup.org> Message-ID: Perhaps pd.factorize could hello? Obtener Outlook para Android ________________________________ From: scikit-learn on behalf of Gael Varoquaux Sent: Thursday, April 30, 2020 5:12:06 PM To: Scikit-learn mailing list Subject: Re: [scikit-learn] Why does sklearn require one-hot-encoding for categorical features? Can we have a "factor" data type? On Thu, Apr 30, 2020 at 03:55:00PM -0400, C W wrote: > I've used R and Stata software, none needs such transformation. They have a > data type called "factors", which is different from "numeric". > My problem with OHE: > One-hot-encoding results in large number of features. 
This really blows up > quickly. And I have to fight curse of dimensionality with PCA reduction. That's > not cool! Most statistical models still not one-hot encoding behind the hood. So, R and stata do it too. Typically, tree-based models can be adapted to work directly on categorical data. Ours don't. It's work in progress. G _______________________________________________ scikit-learn mailing list scikit-learn at python.org https://nam04.safelinks.protection.outlook.com/?url=https%3A%2F%2Fmail.python.org%2Fmailman%2Flistinfo%2Fscikit-learn&data=02%7C01%7C%7Ce7aa6f99b7914a1f84b208d7ed430801%7C84df9e7fe9f640afb435aaaaaaaaaaaa%7C1%7C0%7C637238744453345410&sdata=e3BfHB4v5VFteeZ0Zh3FJ9Wcz9KmkUwur5i8Reue3mc%3D&reserved=0 -------------- next part -------------- An HTML attachment was scrubbed... URL: From tmrsg11 at gmail.com Thu Apr 30 23:08:44 2020 From: tmrsg11 at gmail.com (C W) Date: Thu, 30 Apr 2020 23:08:44 -0400 Subject: [scikit-learn] Why does sklearn require one-hot-encoding for categorical features? Can we have a "factor" data type? In-Reply-To: References: <20200430201206.75tl2ohkxo5yerlo@phare.normalesup.org> Message-ID: Hermes, That's an interesting function. Does it work with sklearn after factorize? Is there any example? Thanks! On Thu, Apr 30, 2020 at 6:51 PM Hermes Morales wrote: > Perhaps pd.factorize could hello? > > Obtener Outlook para Android > > ------------------------------ > *From:* scikit-learn hotmail.com at python.org> on behalf of Gael Varoquaux < > gael.varoquaux at normalesup.org> > *Sent:* Thursday, April 30, 2020 5:12:06 PM > *To:* Scikit-learn mailing list > *Subject:* Re: [scikit-learn] Why does sklearn require one-hot-encoding > for categorical features? Can we have a "factor" data type? > > On Thu, Apr 30, 2020 at 03:55:00PM -0400, C W wrote: > > I've used R and Stata software, none needs such transformation. They > have a > > data type called "factors", which is different from "numeric". > > > My problem with OHE: > > One-hot-encoding results in large number of features. This really blows > up > > quickly. And I have to fight curse of dimensionality with PCA reduction. > That's > > not cool! > > Most statistical models still not one-hot encoding behind the hood. So, R > and stata do it too. > > Typically, tree-based models can be adapted to work directly on > categorical data. Ours don't. It's work in progress. > > G > _______________________________________________ > scikit-learn mailing list > scikit-learn at python.org > > https://nam04.safelinks.protection.outlook.com/?url=https%3A%2F%2Fmail.python.org%2Fmailman%2Flistinfo%2Fscikit-learn&data=02%7C01%7C%7Ce7aa6f99b7914a1f84b208d7ed430801%7C84df9e7fe9f640afb435aaaaaaaaaaaa%7C1%7C0%7C637238744453345410&sdata=e3BfHB4v5VFteeZ0Zh3FJ9Wcz9KmkUwur5i8Reue3mc%3D&reserved=0 > _______________________________________________ > scikit-learn mailing list > scikit-learn at python.org > https://mail.python.org/mailman/listinfo/scikit-learn > -------------- next part -------------- An HTML attachment was scrubbed... URL:
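
Since the thread ends on "Is there any example?", here is a minimal sketch of the pattern being discussed: pd.factorize turns a string column into integer codes before fitting a scikit-learn estimator. The toy DataFrame is made up for illustration, and, per the remarks above, the trees still treat the codes as ordered numbers rather than as true categories.

import pandas as pd
from sklearn.ensemble import RandomForestClassifier

df = pd.DataFrame({'color': ['red', 'green', 'blue', 'red', 'green', 'blue'],
                   'size':  [3.1,   2.4,     5.0,    3.3,   2.9,     4.8]})
y = [0, 1, 1, 0, 1, 1]

# one integer-coded column instead of one 0/1 column per level
codes, uniques = pd.factorize(df['color'])
X = df.assign(color=codes)

clf = RandomForestClassifier(n_estimators=50, random_state=0).fit(X, y)
print(clf.predict(X))
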