From tevang3 at gmail.com Thu Mar 1 08:27:14 2018
From: tevang3 at gmail.com (Thomas Evangelidis)
Date: Thu, 1 Mar 2018 14:27:14 +0100
Subject: [scikit-learn] custom loss function in RandomForestRegressor
References: <20180215182849.5115986.63202.48707@gmail.com>

Hi again,

I am currently revisiting this problem after familiarizing myself with Cython and scikit-learn's code, and I have an important question.

Looking at the class MSE(RegressionCriterion), the node impurity is defined as the variance of the target values Y on that node. The predictions X are nowhere involved in the computations. This contradicts my notion of a "loss function", which quantifies the discrepancy between predicted and target values. Am I looking at the wrong class, or is what I want to do just not feasible with Random Forests? For example, I would like to modify the RandomForestRegressor code to minimize the Pearson's R between predicted and target values.

I thank you in advance for any clarification.
Thomas

> On 02/15/2018 01:28 PM, Guillaume Lemaitre wrote:
>
> Yes, you are right: pxd files are the headers and pyx files the definitions.
> You need to write a class like MSE. Criterion is an abstract class or base
> class (I don't have the code in front of me).
>
> @Andy: if I recall the PR, we made the classes public to enable such
> custom criteria. However, it is not documented since we were not
> officially supporting it. So this is a hidden feature. We could always
> discuss making this feature more visible and documenting it.

-- 
======================================================================
Dr Thomas Evangelidis
Post-doctoral Researcher
CEITEC - Central European Institute of Technology
Masaryk University
Kamenice 5/A35/2S049,
62500 Brno, Czech Republic

email: tevang at pharm.uoa.gr
       tevang3 at gmail.com

website: https://sites.google.com/site/thomasevangelidishomepage/

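A minimal numeric sketch of the point raised above, assuming only that a regression tree leaf predicts the mean of the training targets that reach it. Under that assumption the prediction is implicit in the variance: the variance of the targets in a node is the MSE between those targets and the leaf's constant prediction. The function names and the sample targets are hypothetical and are not taken from scikit-learn's Cython code.

```python
import numpy as np


def node_variance(y):
    """Variance of the targets in a node (the impurity described above)."""
    y = np.asarray(y, dtype=float)
    return float(np.mean((y - y.mean()) ** 2))


def node_mse(y, prediction):
    """MSE between the targets in a node and one constant prediction."""
    y = np.asarray(y, dtype=float)
    return float(np.mean((y - prediction) ** 2))


# Hypothetical targets of the samples that fell into one node.
y_node = np.array([1.0, 2.0, 4.0, 9.0])

# The leaf predicts the mean of its targets, so the "prediction"
# is already contained in the variance formula.
leaf_prediction = y_node.mean()

variance = node_variance(y_node)
mse_at_mean = node_mse(y_node, leaf_prediction)
# Any other constant prediction gives a larger MSE.
mse_elsewhere = node_mse(y_node, leaf_prediction + 0.5)

print(variance, mse_at_mean, mse_elsewhere)
```

The two first values agree, and shifting the prediction away from the mean only increases the error, which is why the criterion never needs the predictions as a separate input.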
From se.raschka at gmail.com Thu Mar 1 08:55:48 2018
From: se.raschka at gmail.com (Sebastian Raschka)
Date: Thu, 1 Mar 2018 08:55:48 -0500
Subject: [scikit-learn] custom loss function in RandomForestRegressor
References: <20180215182849.5115986.63202.48707@gmail.com>

Hi, Thomas,

in regression trees, minimizing the variance among the target values is equivalent to minimizing the MSE between targets and predicted values. This is also called variance reduction: https://en.wikipedia.org/wiki/Decision_tree_learning#Variance_reduction

Best,
Sebastian

From tevang3 at gmail.com Thu Mar 1 09:39:43 2018
From: tevang3 at gmail.com (Thomas Evangelidis)
Date: Thu, 1 Mar 2018 15:39:43 +0100
Subject: [scikit-learn] custom loss function in RandomForestRegressor
References: <20180215182849.5115986.63202.48707@gmail.com>

Hi Sebastian,

Going back to the Pearson's R loss function, does this imply that I must add an abstract "init2" method to RegressionCriterion (that's where the MSE class inherits from) where I will add the target values X as an extra argument? And then the node impurity will be 1-R (the lower the better)? What about the impurities of the left and right split? In the MSE class they are (sum_i^n y_i)**2 where n is the number of samples in the respective split. It is not clear how this is related to variance in order to adapt it for my purpose.

Best,
Thomas

From se.raschka at gmail.com Thu Mar 1 09:47:54 2018
From: se.raschka at gmail.com (Sebastian Raschka)
Date: Thu, 1 Mar 2018 09:47:54 -0500
Subject: [scikit-learn] custom loss function in RandomForestRegressor
References: <20180215182849.5115986.63202.48707@gmail.com>

Hi, Thomas,

as far as I know, it's all the same and doesn't matter, and you would get the same splits, since R^2 is just a rescaled MSE.

Best,
Sebastian

From tevang3 at gmail.com Thu Mar 1 09:59:25 2018
From: tevang3 at gmail.com (Thomas Evangelidis)
Date: Thu, 1 Mar 2018 15:59:25 +0100
Subject: [scikit-learn] custom loss function in RandomForestRegressor
References: <20180215182849.5115986.63202.48707@gmail.com>

Does this generalize to any loss function? For example, I also want to implement Kendall's tau correlation coefficient and a combination of R, tau and RMSE. :)

From se.raschka at gmail.com Thu Mar 1 10:03:45 2018
From: se.raschka at gmail.com (Sebastian Raschka)
Date: Thu, 1 Mar 2018 10:03:45 -0500
Subject: [scikit-learn] custom loss function in RandomForestRegressor
References: <20180215182849.5115986.63202.48707@gmail.com>
Message-ID: <7C6BFE3A-F0F1-4F04-BA50-97C326BB2701@gmail.com>

Unfortunately (or maybe fortunately :)) no, maximizing variance reduction & minimizing MSE are just special cases :)

Best,
Sebastian

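On the question in this thread of how the (sum_i^n y_i)**2 terms relate to variance, here is a hedged sketch of the underlying identity. For each child node, the sum of squared errors around the mean equals sum(y**2) - (sum(y))**2 / n, so minimizing the summed squared error of the two children is the same as maximizing (sum_left)**2 / n_left + (sum_right)**2 / n_right. The helper names and data below are hypothetical; this illustrates the algebra only and is not a copy of scikit-learn's Cython implementation, which may differ in detail (sample weights, for instance).

```python
import numpy as np


def sse(y):
    """Sum of squared errors around the mean (n times the variance)."""
    return float(np.sum((y - y.mean()) ** 2))


def proxy(y_left, y_right):
    """Sum-based score: (sum of y)**2 / n for each child, added up."""
    return float(y_left.sum() ** 2 / len(y_left)
                 + y_right.sum() ** 2 / len(y_right))


# Hypothetical targets, already sorted by the feature being split on.
y = np.array([1.0, 2.0, 4.0, 9.0, 10.0, 12.0])
total_sq = float(np.sum(y ** 2))

children_sse = []
proxies = []
for cut in range(1, len(y)):
    left, right = y[:cut], y[cut:]
    children_sse.append(sse(left) + sse(right))
    proxies.append(proxy(left, right))

# Both rankings select the same cut position.
best_by_sse = int(np.argmin(children_sse))
best_by_proxy = int(np.argmax(proxies))
print(best_by_sse, best_by_proxy)
```

This shortcut exists because squared error decomposes into per-node sums. A criterion built on Pearson's R or Kendall's tau has no such decomposition in general, which is consistent with the remark that variance reduction and MSE are special cases.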
From ranjanagirish30 at gmail.com Mon Mar 5 09:18:26 2018
From: ranjanagirish30 at gmail.com (Ranjana Girish)
Date: Mon, 5 Mar 2018 19:48:26 +0530
Subject: [scikit-learn] help-Renaming features in Sckit-learn's CountVectorizer()

Hi all,

I have a very large pandas dataframe. Below is a sample:

Id  description
1   switvch for air conditioner transformer..............
2   control tfrmr...........
3   coling pad.................
4   DRLG machine
5   hair smothing kit...............

For further processing, I will construct the document-term matrix of the above data using scikit-learn's CountVectorizer:

    countvec = CountVectorizer()
    documenttermmatrix = countvec.fit_transform(dataset['description'])

I have to correct misspelled features in the description column. Replacing each wrongly spelled word with the correctly spelled word across a large dataset takes a lot of time. So I thought of correcting the features using the feature list of the count vectorizer, given by:

    features_names = countvec.get_feature_names()

Is it possible to rename features using the above list and then use the result for classification?

Thanks
Ranjana

From chethanmuralisv at gmail.com Mon Mar 5 11:19:39 2018
From: chethanmuralisv at gmail.com (CHETHAN MURALI)
Date: Mon, 5 Mar 2018 21:49:39 +0530
Subject: [scikit-learn] Need help in dealing with large dataset

Dear All,

I am working on building a CNN model for an image classification problem. As part of it I have converted all my test images to a numpy array.

Now when I try to split the array into training and test sets I get a memory error. Details are below:

    import numpy as np

    # Memory-map the file so that it is not read into RAM at once.
    X = np.load("./data/X_train.npy", mmap_mode='r')
    train_pct_index = int(0.8 * len(X))
    X_train, X_test = X[:train_pct_index], X[train_pct_index:]
    X_train = X_train.reshape(X_train.shape[0], 256, 256, 3)

    # This line raises the MemoryError shown below.
    X_train = X_train.astype('float32')

    -------------------------------------------------
    MemoryError                     Traceback (most recent call last)
     in ()
          2 print("Normalizing Data")
          3
    ----> 4 X_train = X_train.astype('float32')

More information:

1. My Python version is:

    python --version
    Python 3.6.4 :: Anaconda custom (64-bit)

2. I am running the code on Ubuntu 16.04.

3. I have 32GB RAM.

4. The X_train.npy file that I have loaded into np.array is 20GB in size:

    print("X_train Shape: ", X_train.shape)
    X_train Shape:  (85108, 256, 256, 3)

I would be really glad if you can help me to overcome this problem.

Regards,
-
Chethan

From g.lemaitre58 at gmail.com Mon Mar 5 12:13:33 2018
From: g.lemaitre58 at gmail.com (Guillaume Lemaître)
Date: Mon, 5 Mar 2018 18:13:33 +0100
Subject: [scikit-learn] Need help in dealing with large dataset

If you work with deep nets, you need to check the utilities of the deep learning library. For instance, in Keras you should create a batch generator if you need to deal with a large dataset. In PyTorch you can use the DataLoader and the ImageFolder from torchvision, which manage the loading for you.

-- 
Guillaume Lemaitre
INRIA Saclay - Parietal team
Center for Data Science Paris-Saclay
https://glemaitre.github.io/

From se.raschka at gmail.com Mon Mar 5 12:28:27 2018
From: se.raschka at gmail.com (Sebastian Raschka)
Date: Mon, 5 Mar 2018 12:28:27 -0500
Subject: [scikit-learn] Need help in dealing with large dataset

Like Guillaume suggested, you don't want to load the whole array into memory if it's that large. There are many different ways to deal with this. The most naive way would be to break up your NumPy array into smaller NumPy arrays and load them iteratively with a running accuracy calculation. My suggestion would be to create an HDF5 file from the NumPy array where each entry is an image. If it's just the test images, you can also save a batch of them as one entry because you don't need to shuffle them anyway.

Ultimately, the recommendation based on the sweet spot between performance and convenience depends on what DL framework you use. Since this is a scikit-learn forum, I suppose you are using sklearn objects (although I am not aware that sklearn has CNNs). The DataLoader in PyTorch is universally useful though and can come in handy no matter what CNN implementation you use. I have some examples here if that helps:

- https://github.com/rasbt/deep-learning-book/blob/master/code/model_zoo/pytorch_ipynb/custom-data-loader-celeba.ipynb
- https://github.com/rasbt/deep-learning-book/blob/master/code/model_zoo/pytorch_ipynb/custom-data-loader-csv.ipynb

Best,
Sebastian

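A rough sketch of the "load it iteratively" idea from this thread, using only NumPy. It assumes the images are stored as uint8, in which case an array of shape (85108, 256, 256, 3) holds about 16.7 billion values and needs roughly 67 GB as float32, well beyond 32 GB of RAM. The helper name, the batch size, the 0-255 scaling and the tiny stand-in file are all hypothetical; a real pipeline would more likely use the batch utilities of the deep learning library, as suggested above.

```python
import os
import tempfile

import numpy as np


def iter_float32_batches(path, batch_size):
    """Yield float32 batches from a .npy file without loading it whole."""
    X = np.load(path, mmap_mode="r")  # memory-mapped, data stays on disk
    for start in range(0, len(X), batch_size):
        # Only this slice is copied into RAM and converted.
        batch = np.asarray(X[start:start + batch_size], dtype=np.float32)
        yield batch / 255.0  # assumed 0-255 pixel scaling


# Tiny stand-in for X_train.npy (hypothetical data: 10 images of 4x4x3).
tmp_dir = tempfile.mkdtemp()
path = os.path.join(tmp_dir, "X_small.npy")
np.save(path, np.random.randint(0, 256, size=(10, 4, 4, 3), dtype=np.uint8))

batch_shapes = []
n_seen = 0
for batch in iter_float32_batches(path, batch_size=4):
    batch_shapes.append(batch.shape)
    n_seen += len(batch)

print(batch_shapes, n_seen)
```

Each batch can be passed to the model and then discarded, so peak memory depends on the batch size rather than on the size of the file.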
From joel.nothman at gmail.com Mon Mar 5 17:08:32 2018
From: joel.nothman at gmail.com (Joel Nothman)
Date: Tue, 6 Mar 2018 09:08:32 +1100
Subject: [scikit-learn] help-Renaming features in Sckit-learn's CountVectorizer()

You can effectively merge features through matrix multiplication: multiply the CountVectorizer output by a sparse matrix of shape (n_features_in, n_features_out) which has 1 where the output feature corresponds to an input feature. Your spelling correction then consists of building this mapping matrix.

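The matrix-multiplication approach described above can be sketched as follows. The documents and the `corrections` table are hypothetical; in practice the table would come from whatever spell-checking step is used. The input feature order is read from the fitted `vocabulary_` attribute, which maps each term to its column index.

```python
import numpy as np
from scipy import sparse
from sklearn.feature_extraction.text import CountVectorizer

# Hypothetical documents containing misspelled and correct variants.
docs = [
    "switvch for air conditioner",
    "switch for transformer",
    "coling pad",
    "cooling pad",
]

# Hypothetical correction table: misspelled feature -> corrected feature.
corrections = {"switvch": "switch", "coling": "cooling"}

countvec = CountVectorizer()
X = countvec.fit_transform(docs)  # shape (n_docs, n_features_in)

# Input feature names ordered by their column index.
features_in = sorted(countvec.vocabulary_, key=countvec.vocabulary_.get)

# Output vocabulary: corrected names with duplicates merged.
features_out = sorted({corrections.get(f, f) for f in features_in})
col_of = {name: j for j, name in enumerate(features_out)}

# mapping[i, j] = 1 when input feature i corresponds to output feature j.
rows = np.arange(len(features_in))
cols = np.array([col_of[corrections.get(f, f)] for f in features_in])
mapping = sparse.csr_matrix(
    (np.ones(len(features_in), dtype=X.dtype), (rows, cols)),
    shape=(len(features_in), len(features_out)),
)

X_merged = X @ mapping  # shape (n_docs, n_features_out)

print(features_out)
print(X_merged.toarray())
```

Because the correction is applied once per vocabulary entry rather than once per document, the cost depends on the number of distinct features, not on the number of rows. The merged matrix can then be passed to a classifier in place of the original one.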
From t3kcit at gmail.com Mon Mar 5 18:21:53 2018
From: t3kcit at gmail.com (Andreas Mueller)
Date: Mon, 5 Mar 2018 18:21:53 -0500
Subject: [scikit-learn] transfer-learning for random forests
Message-ID: <9d10d61a-b043-3180-36e5-a48d6719c9a8@gmail.com>

http://scikit-learn.org/dev/faq.html#what-are-the-inclusion-criteria-for-new-algorithms

On 02/16/2018 04:51 AM, peignier sergio wrote:
> Hello,
>
> I recently began a research project on Transfer Learning with some
> colleagues. We would like to contribute to scikit-learn by incorporating
> Transfer Learning functions for Random Forests as described in this
> recent paper:
>
> https://arxiv.org/abs/1511.01258
>
> Before starting we would like to ensure that no existing project is
> ongoing.
>
> Thanks!
>
> BR,
>
> Sergio Peignier

From suprajasankari at gmail.com Thu Mar 8 02:28:54 2018
From: suprajasankari at gmail.com (SJ JV)
Date: Thu, 8 Mar 2018 16:28:54 +0900
Subject: [scikit-learn] Custom kernel for PLS Regression (using NIPALS algorithm)

I have to provide a list of customized kernels to the PLSRegression API. Similar to the custom kernel support for SVM, is there support for providing kernels to PLSRegression? If not, can you make this available?

Thanks
SV

-- 
U

From bertrand.thirion at inria.fr Thu Mar 8 15:55:14 2018
From: bertrand.thirion at inria.fr (bthirion)
Date: Thu, 8 Mar 2018 21:55:14 +0100
Subject: [scikit-learn] Custom kernel for PLS Regression (using NIPALS algorithm)
Message-ID: <803a54ac-81b2-4fdb-cf51-122f209e5a24@inria.fr>

No, this does not exist. It may be a good addition to the library, but could you elaborate a bit on the use case?

A workaround could be to provide PLSRegression with a feature representation that implicitly embodies the kernel similarity. Depending on the chosen kernel, this can be easy or not.

Best,

Bertrand Thirion

From suprajasankari at gmail.com Fri Mar 9 03:47:10 2018
From: suprajasankari at gmail.com (SJ JV)
Date: Fri, 9 Mar 2018 17:47:10 +0900
Subject: [scikit-learn] Custom kernel for PLS Regression (using NIPALS algorithm)
In-Reply-To: <803a54ac-81b2-4fdb-cf51-122f209e5a24@inria.fr>
References: <803a54ac-81b2-4fdb-cf51-122f209e5a24@inria.fr>

Hi Bertrand,

Thanks for the reply. What I have is n correlation matrices of the brain (n is the number of participants in the study). The simplest kernel computes the dot product between the n matrices. The kernel is further optimized using the NIPALS algorithm (as in Rosipal, Trejo 2002). The output y is multivariate, with values indicating test scores from ADOS evaluations.

> Best, > > Bertrand Thirion > > On 08/03/2018 08:28, SJ JV wrote: > > I have to provide a list of customized kernels to the PLSRegression api. > Similar to the custom kernel support for SVM, is there support for > providing kernels to PLSRegression ? Can you make this available, if not ? > > Thanks > SV > > -- > U > > > _______________________________________________ > scikit-learn mailing listscikit-learn at python.orghttps://mail.python.org/mailman/listinfo/scikit-learn > > > > _______________________________________________ > scikit-learn mailing list > scikit-learn at python.org > https://mail.python.org/mailman/listinfo/scikit-learn > > -- U -------------- next part -------------- An HTML attachment was scrubbed... URL: From stephen_j_jeffrey at yahoo.com.au Fri Mar 9 16:58:49 2018 From: stephen_j_jeffrey at yahoo.com.au (Stephen Jeffrey) Date: Fri, 9 Mar 2018 21:58:49 +0000 (UTC) Subject: [scikit-learn] OOB decision function in RandomForestClassifier References: <164754999.16859.1520632729278.ref@mail.yahoo.com> Message-ID: <164754999.16859.1520632729278@mail.yahoo.com> Hi, When using RFC on a multiclass problem with a large number of trees, would you expect the prediction for a given sample to match the OOB decision function i.e. should the prediction match the class with the highest OOB value for the given sample, when n_estimators is large? On my 3-class problem, the oob_decision_function_ for a given sample is [ 0.31091392? 0.2982096? ?0.39087648] but the prediction for that sample is the middle class (OOB=0.29), whereas I thought it should have been the last class (which has the higher OOB value of 0.39).? According to the docs:1. The ensemble prediction is a weighted average of the prediction from each individual tree:In contrast to the original publication [B2001], the scikit-learn implementation combines classifiers by averaging their probabilistic prediction, instead of letting each classifier vote for a single class. 
(taken from section 1.11.2.1 in 1.11. Ensemble methods - scikit-learn 0.19.1 documentation) 2. The OOB values for a given sample are the fraction of out-of-bag predictions for each class (see http://scikit-learn.org/stable/auto_examples/ensemble/plot_ensemble_oob.html) I thought the prediction for a given sample would converge to the class with the highest OOB value as the number of trees increases, and consequently thought that I could interpret the OOB values for a given sample as the probability of that sample belonging to the various classes. Is this incorrect? Regards Steve -------------- next part -------------- An HTML attachment was scrubbed... URL: From princegosavi12 at gmail.com Sat Mar 10 06:32:06 2018 From: princegosavi12 at gmail.com (prince gosavi) Date: Sat, 10 Mar 2018 17:02:06 +0530 Subject: [scikit-learn] KMeans default distance function Message-ID: Hi, I am using KMeans to cluster my data. I am interested in the distance function used by KMeans for creating clusters and determining the cluster points. I have read the http://scikit-learn.org/stable/modules/generated/sklearn.cluster.KMeans.html#sklearn-cluster-kmeans documentation but was not able to find the method used. If possible, please give an example of how it is done. from sklearn.cluster import KMeans KMeans(max_iter=4,n_clusters=10,n_init=10).fit(X) where X has 14 features, let's say for example [0, 0, 2, 8, 0, 0, 3, 16, 8, 39, 1, 0, 0, 2] [0, 0, 3, 9, 0, 0, 3, 1, 8, 9, 1, 0, 0, 1] Also, if you can show me how KMeans can be implemented on my data it would certainly help. -- Regards -------------- next part -------------- An HTML attachment was scrubbed...
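A minimal sketch related to the question above, offered as an illustration rather than as code from the thread. scikit-learn's KMeans minimizes within-cluster squared Euclidean distance, so predict() assigns a sample to the centroid that is nearest in the Euclidean sense. The check below assumes synthetic Poisson counts standing in for the 14-feature rows quoted in the message; n_clusters=3 and random_state=0 are arbitrary choices made only for the example.

```python
import numpy as np
from sklearn.cluster import KMeans

# Synthetic stand-in for the poster's data (assumption): 60 rows, 14 count-like features.
rng = np.random.RandomState(0)
X = rng.poisson(lam=3.0, size=(60, 14)).astype(float)

km = KMeans(n_clusters=3, n_init=10, random_state=0).fit(X)

# Euclidean distance from every sample to every centroid, shape (60, 3).
dists = np.linalg.norm(X[:, None, :] - km.cluster_centers_[None, :, :], axis=2)

# The nearest centroid under Euclidean distance should be what predict() returns.
nearest = dists.argmin(axis=1)
labels_match = bool(np.array_equal(nearest, km.predict(X)))

# transform() exposes the same sample-to-centroid distances.
transform_match = bool(np.allclose(km.transform(X), dists, atol=1e-6))

print("predict() agrees with nearest Euclidean centroid:", labels_match)
print("transform() agrees with manual distances:", transform_match)
```

If a different metric is needed, KMeans itself offers no parameter for it; a different clustering algorithm would have to be used instead.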
URL: From joel.nothman at gmail.com Sat Mar 10 19:47:41 2018 From: joel.nothman at gmail.com (Joel Nothman) Date: Sun, 11 Mar 2018 11:47:41 +1100 Subject: [scikit-learn] KMeans default distance function In-Reply-To: References: Message-ID: kmeans necessarily uses Euclidean distance, and a patch to the documentation is welcome. -------------- next part -------------- An HTML attachment was scrubbed... URL: From princegosavi12 at gmail.com Sun Mar 11 03:53:56 2018 From: princegosavi12 at gmail.com (prince gosavi) Date: Sun, 11 Mar 2018 13:23:56 +0530 Subject: [scikit-learn] KMeans default distance function In-Reply-To: References: <1520729790665.111908960@boxbe> Message-ID: Ok will look forward to do it. Thank you. On Mar 11, 2018 06:22, "Joel Nothman" wrote: kmeans necessarily uses Euclidean distance, and a patch to the documentation is welcome. _______________________________________________ scikit-learn mailing list scikit-learn at python.org https://mail.python.org/mailman/listinfo/scikit-learn -------------- next part -------------- An HTML attachment was scrubbed... URL: From princegosavi12 at gmail.com Mon Mar 12 02:55:17 2018 From: princegosavi12 at gmail.com (prince gosavi) Date: Mon, 12 Mar 2018 12:25:17 +0530 Subject: [scikit-learn] Using KMeans cluster labels in KNN Message-ID: Hi, I have generated clusters using the KMeans algorithm and would like to use the labels of the model in the KNN. I don't have the implementation idea but I can visualize it as KNNmodel = KNN.fit(X, KMeansModel.labels_) Such that the KNN will predict the cluster the new point belong to. -- Regards -------------- next part -------------- An HTML attachment was scrubbed...
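The idea sketched in the message above can be tried directly. The following is a hedged illustration, not code from the thread: it fits a KNeighborsClassifier on the KMeans labels and compares its predictions on new points with KMeans' own predict(), which assigns the nearest centroid. The blob data, the 200/100 split, n_clusters=4 and n_neighbors=5 are all assumptions chosen only to make the example self-contained.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.neighbors import KNeighborsClassifier

# Synthetic data (assumption): 300 points around 4 centers, shuffled by make_blobs.
X, _ = make_blobs(n_samples=300, centers=4, cluster_std=1.0, random_state=0)
X_train, X_new = X[:200], X[200:]

km = KMeans(n_clusters=4, n_init=10, random_state=0).fit(X_train)

# Option 1: nearest-centroid assignment provided by KMeans itself.
direct = km.predict(X_new)

# Option 2: a KNN classifier trained on the cluster labels, as in the message.
knn = KNeighborsClassifier(n_neighbors=5).fit(X_train, km.labels_)
via_knn = knn.predict(X_new)

# Fraction of new points on which the two approaches agree.
agreement = float(np.mean(direct == via_knn))
print("agreement between KMeans.predict and KNN on labels:", agreement)
```

Any disagreement tends to sit near the borders between clusters, where the vote of the k nearest training points need not follow the nearest-centroid rule.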
URL: From se.raschka at gmail.com Mon Mar 12 06:46:21 2018 From: se.raschka at gmail.com (Sebastian Raschka) Date: Mon, 12 Mar 2018 06:46:21 -0400 Subject: [scikit-learn] Using KMeans cluster labels in KNN In-Reply-To: References: Message-ID: <8E0E36CF-A26F-4322-8A39-6CFBF51DDD55@gmail.com> Hi, If you want to predict the Kmeans cluster membership, you can use Kmeans' predict method instead of training a KNN model on the cluster assignments. This will be computationally more efficient and give you the correct assignment at the borders between clusters. Best, Sebastian > On Mar 12, 2018, at 2:55 AM, prince gosavi wrote: > > Hi, > I have generated clusters using the KMeans algorithm and would like to use the labels of the model in the KNN. > > I don't have the implementation idea but I can visualize it as > > KNNmodel = KNN.fit(X, KMeansModel.labels_) > > Such that the KNN will predict the cluster the new point belong to. > > -- > Regards > _______________________________________________ > scikit-learn mailing list > scikit-learn at python.org > https://mail.python.org/mailman/listinfo/scikit-learn From princegosavi12 at gmail.com Mon Mar 12 11:34:10 2018 From: princegosavi12 at gmail.com (prince gosavi) Date: Mon, 12 Mar 2018 21:04:10 +0530 Subject: [scikit-learn] Using KMeans cluster labels in KNN In-Reply-To: <1520852855192.376601853@boxbe> References: <1520852855192.376601853@boxbe> Message-ID: Hi, Thank you for reply. I was exploring the possibility that given well formed KMean clusters using an additional KNN we can simply increase the accuracy that the data point enters the right cluster. Also I would like to know whether if it's possible to do such thing(out of curiosity)? On Mon, Mar 12, 2018 at 4:16 PM, Sebastian Raschka wrote:
> Hi, > If you want to predict the Kmeans cluster membership, you can use Kmeans' > predict method instead of training a KNN model on the cluster assignments. > This will be computationally more efficient and give you the correct > assignment at the borders between clusters. > > Best, > Sebastian > > > On Mar 12, 2018, at 2:55 AM, prince gosavi > wrote: > > > > Hi, > > I have generated clusters using the KMeans algorithm and would like to > use the labels of the model in the KNN. > > > > I don't have the implementation idea but I can visualize it as > > > > KNNmodel = KNN.fit(X, KMeansModel.labels_) > > > > Such that the KNN will predict the cluster the new point belong to. > > > > -- > > Regards > > _______________________________________________ > > scikit-learn mailing list > > scikit-learn at python.org > > https://mail.python.org/mailman/listinfo/scikit-learn > > _______________________________________________ > scikit-learn mailing list > scikit-learn at python.org > https://mail.python.org/mailman/listinfo/scikit-learn > > -- Regards -------------- next part -------------- An HTML attachment was scrubbed... URL: From joel.nothman at gmail.com Mon Mar 12 15:31:09 2018 From: joel.nothman at gmail.com (Joel Nothman) Date: Tue, 13 Mar 2018 06:31:09 +1100 Subject: [scikit-learn] Using KMeans cluster labels in KNN In-Reply-To: References: <1520852855192.376601853@boxbe> Message-ID: A meta-estimator for this (generic to which classifier / clusterer) is coded up at https://github.com/scikit-learn/scikit-learn/issues/4543#issuecomment-91073246 We even have a pull request that made an example of this sort of thing at https://github.com/scikit-learn/scikit-learn/pull/6478, but the original contributor never responded to comments on it. If someone would like to make it more persuasive and complete it, ... On 13 March 2018 at 02:34, prince gosavi wrote: > Hi, > Thank you for reply.
> > I was exploring the possibility that given well formed KMean clusters > using an additional KNN we can simply increase the accuracy that the data > point enters the right cluster. > > Also I would like to know whether if it's possible to do such thing(out of > curiosity)? > > > On Mon, Mar 12, 2018 at 4:16 PM, Sebastian Raschka > wrote: > >> Hi, >> If you want to predict the Kmeans cluster membership, you can use Kmeans' >> predict method instead of training a KNN model on the cluster assignments. >> This will be computationally more efficient and give you the correct >> assignment at the borders between clusters. >> >> Best, >> Sebastian >> >> > On Mar 12, 2018, at 2:55 AM, prince gosavi >> wrote: >> > >> > Hi, >> > I have generated clusters using the KMeans algorithm and would like to >> use the labels of the model in the KNN. >> > >> > I don't have the implementation idea but I can visualize it as >> > >> > KNNmodel = KNN.fit(X, KMeansModel.labels_) >> > >> > Such that the KNN will predict the cluster the new point belong to. >> > >> > -- >> > Regards >> > _______________________________________________ >> > scikit-learn mailing list >> > scikit-learn at python.org >> > https://mail.python.org/mailman/listinfo/scikit-learn >> >> _______________________________________________ >> scikit-learn mailing list >> scikit-learn at python.org >> https://mail.python.org/mailman/listinfo/scikit-learn >> >> > > > -- > Regards > > _______________________________________________ > scikit-learn mailing list > scikit-learn at python.org > https://mail.python.org/mailman/listinfo/scikit-learn > > -------------- next part -------------- An HTML attachment was scrubbed...
URL: From princegosavi12 at gmail.com Mon Mar 12 15:37:57 2018 From: princegosavi12 at gmail.com (prince gosavi) Date: Tue, 13 Mar 2018 01:07:57 +0530 Subject: [scikit-learn] Using KMeans cluster labels in KNN In-Reply-To: References: <1520852855192.376601853@boxbe> Message-ID: Hi Thank you I will definitely look into it. On Mar 13, 2018 01:03, "Joel Nothman" wrote: > A meta-estimator for this (generic to which classifier / clusterer) is > coded up at https://github.com/scikit-learn/scikit-learn/issues/ > 4543#issuecomment-91073246 > > We even have a pull request that made an example of this sort of thing at > https://github.com/scikit-learn/scikit-learn/pull/6478, but the original > contributor never responded to comments on it. If someone would like to > make it more persuasive and complete it, ... > > On 13 March 2018 at 02:34, prince gosavi wrote: > >> Hi, >> Thank you for reply. >> >> I was exploring the possibility that given well formed KMean clusters >> using an additional KNN we can simply increase the accuracy that the data >> point enters the right cluster. >> >> Also I would like to know whether if it's possible to do such thing(out >> of curiosity)? >> >> >> On Mon, Mar 12, 2018 at 4:16 PM, Sebastian Raschka >> wrote: >> >>> Hi, >>> If you want to predict the Kmeans cluster membership, you can use >>> Kmeans' predict method instead of training a KNN model on the cluster >>> assignments. This will be computationally more efficient and give you the >>> correct assignment at the borders between clusters. >>> >>> Best, >>> Sebastian >>> >>> > On Mar 12, 2018, at 2:55 AM, prince gosavi >>> wrote: >>> > >>> > Hi, >>> > I have generated clusters using the KMeans algorithm and would like to >>> use the labels of the model in the KNN.
>>> > >>> > I don't have the implementation idea but I can visualize it as >>> > >>> > KNNmodel = KNN.fit(X, KMeansModel.labels_) >>> > >>> > Such that the KNN will predict the cluster the new point belong to. >>> > >>> > -- >>> > Regards >>> > _______________________________________________ >>> > scikit-learn mailing list >>> > scikit-learn at python.org >>> > https://mail.python.org/mailman/listinfo/scikit-learn >>> >>> _______________________________________________ >>> scikit-learn mailing list >>> scikit-learn at python.org >>> https://mail.python.org/mailman/listinfo/scikit-learn >>> >>> >> >> >> -- >> Regards >> >> _______________________________________________ >> scikit-learn mailing list >> scikit-learn at python.org >> https://mail.python.org/mailman/listinfo/scikit-learn >> >> > > _______________________________________________ > scikit-learn mailing list > scikit-learn at python.org > https://mail.python.org/mailman/listinfo/scikit-learn > > -------------- next part -------------- An HTML attachment was scrubbed... URL: From david.mo.burns at gmail.com Tue Mar 13 23:28:32 2018 From: david.mo.burns at gmail.com (David Burns) Date: Tue, 13 Mar 2018 23:28:32 -0400 Subject: [scikit-learn] seglearn: package for time series and sequence learning Message-ID: <9e438401-f339-31fb-1b13-d130b7d63b44@gmail.com> I implemented a meta-estimator and transformers for time series / sequence learning with sliding window segmentation. It can be used for classification, regression, or forecasting - supporting multivariate time series / sequences and contextual (time-independent) data. It can learn time series or contextual targets. It is (mostly) compatible with the sklearn model evaluation and selection tools - despite changing the number of samples and the target vector mid pipeline (during segmentation). I've created a pull request on related_projects.rst - but thought I would share it here for those of you interested in this area. 
https://github.com/dmbee/seglearn Cheers, David Burns From nadim.farhat at gmail.com Thu Mar 15 21:28:55 2018 From: nadim.farhat at gmail.com (Nadim Farhat) Date: Fri, 16 Mar 2018 01:28:55 +0000 Subject: [scikit-learn] Equivalent to Cost Matrix sklearn Message-ID: Dear All, I have a *screening* lab test and I am trying to minimize the False negative value in the recall (TP/(TP+FN)) therefore I want to increase the cost whenever an FN is found in the training. I understand that in R they have some kind of loss matrix that penalize the FN during fitting. My positive-class percentage is 30 % On the forums and StackOverflow, they suggest using class_weight=balanced in the decision tree which oversamples the class with the lowest frequency. However, I don't see how that helps in minimizing the FN. Any suggestions? Bests Nadim -- Nadim Farhat -------------- next part -------------- An HTML attachment was scrubbed... URL: From t3kcit at gmail.com Fri Mar 16 00:25:17 2018 From: t3kcit at gmail.com (Andreas Mueller) Date: Fri, 16 Mar 2018 00:25:17 -0400 Subject: [scikit-learn] Equivalent to Cost Matrix sklearn In-Reply-To: References: Message-ID: <1a69a18c-dd3a-0171-5f6f-639512e1fbc9@gmail.com> Hi. Unfortunately we don't have an implementation of a cost matrix in sklearn directly, but you can change the threshold of the model prediction, by using something like y_pred = tree.predict_proba(X_test)[:, 1] > 0.6 What trade-off of precision and recall do you want? Have you looked at the precision_recall_curve? Andy On 03/15/2018 09:28 PM, Nadim Farhat wrote: > Dear All, > > I have a *screening* lab test and I am trying to minimize the False > negative value in the recall (TP/(TP+FN)) therefore I want to increase > the cost whenever an FN is found in the training. I understand that in > R they have some kind of loss matrix that penalize the FN during > fitting.
My positive-class percentage is 30 % > On the forums and StackOverflow, they suggest using > class_weight=balanced in the decision tree which oversamples the class > with the lowest frequency. However, I don't see how that helps in > minimizing the FN. > > Any suggestions? > > > Bests > > Nadim > > > > > > > > > > -- > Nadim Farhat > > > _______________________________________________ > scikit-learn mailing list > scikit-learn at python.org > https://mail.python.org/mailman/listinfo/scikit-learn -------------- next part -------------- An HTML attachment was scrubbed... URL: From manuel.castejon at gmail.com Sat Mar 17 14:01:42 2018 From: manuel.castejon at gmail.com (Manuel Castejón Limas) Date: Sat, 17 Mar 2018 19:01:42 +0100 Subject: [scikit-learn] PipeGraph users guide Message-ID: Dear all, we have written a users guide to PipeGraph in order to help the interested readers to better understand how it works. While we improve the rst export (the figures are missing) the best version is the original jupyter notebook: https://github.com/mcasl/PipeGraph/blob/master/doc/User_Guide.ipynb Best Manolo -------------- next part -------------- An HTML attachment was scrubbed... URL: From goix.nicolas at gmail.com Sun Mar 18 08:26:16 2018 From: goix.nicolas at gmail.com (Nicolas Goix) Date: Sun, 18 Mar 2018 13:26:16 +0100 Subject: [scikit-learn] Equivalent to Cost Matrix sklearn In-Reply-To: <1a69a18c-dd3a-0171-5f6f-639512e1fbc9@gmail.com> References: <1a69a18c-dd3a-0171-5f6f-639512e1fbc9@gmail.com> Message-ID: Hi Nadim, you may also want to take a look at *skope-rules* (https://github.com/scikit-learn-contrib/skope-rules), which has recently been added to scikit-learn-contrib. The main goal of this package is to provide logical rules verifying precision and recall conditions, by extracting them from a fitted tree ensemble and evaluating them out of bag. Nicolas On Fri, Mar 16, 2018 at 5:25 AM, Andreas Mueller wrote: > Hi.
> > Unfortunately we don't have an implementation of a cost matrix in sklearn > directly, but you can change the threshold of the model prediction, > by using something like y_pred = tree.predict_proba(X_test)[:, 1] > 0.6 > > What trade-off of precision and recall do you want? Have you looked at the > precision_recall_curve? > > Andy > > > On 03/15/2018 09:28 PM, Nadim Farhat wrote: > > Dear All, > > I have a *screening* lab test and I am trying to minimize the False > negative value in the recall (TP/(TP+FN)) therefore I want to increase the > cost whenever an FN is found in the training. I understand that in R they > have some kind of loss matrix that penalize the FN during fitting. My > positive-class percentage is 30 % > On the forums and StackOverflow, they suggest using class_weight=balanced > in the decision tree which oversamples the class with the lowest > frequency. However, I don't see how that helps in minimizing the FN. > > Any suggestions? > > > Bests > > Nadim > > > > > > > > > > -- > Nadim Farhat > > > _______________________________________________ > scikit-learn mailing list > scikit-learn at python.org > https://mail.python.org/mailman/listinfo/scikit-learn > > > > _______________________________________________ > scikit-learn mailing list > scikit-learn at python.org > https://mail.python.org/mailman/listinfo/scikit-learn > > -------------- next part -------------- An HTML attachment was scrubbed... URL: From joaquin.vanschoren at gmail.com Tue Mar 20 06:23:01 2018 From: joaquin.vanschoren at gmail.com (Joaquin Vanschoren) Date: Tue, 20 Mar 2018 10:23:01 +0000 Subject: [scikit-learn] OpenML workshop in Porto, 3-7 April In-Reply-To: References: Message-ID: Dear all, We're organizing a small workshop/hackathon in a few weeks about OpenML and the OpenML/scikit-learn integration. It will be in sunny Porto, hosted by the Porto Business School in the week after Easter.
If you'd like to work on anything related to running scikit-learn experiments and storing the results on OpenML, or simply fetching OpenML datasets into sklearn scripts, let me know! More info on OpenML: https://docs.openml.org/site/ Small guide and API reference: https://docs.openml.org/site/Python-guide/ GitHub page: https://github.com/openml/openml-python GitHub issue and pull request for the openml_fetch call: https://github.com/scikit-learn/scikit-learn/issues/9543 https://github.com/scikit-learn/scikit-learn/pull/9908 Cheers, Joaquin -- Thank you, Joaquin -------------- next part -------------- An HTML attachment was scrubbed... URL: From anael.beaugnon at ssi.gouv.fr Tue Mar 20 12:30:19 2018 From: anael.beaugnon at ssi.gouv.fr (Beaugnon Anael) Date: Tue, 20 Mar 2018 17:30:19 +0100 Subject: [scikit-learn] scikit-learn with celery (joblib issue) Message-ID: Dear all, When I run scikit-learn (version 0.19) inside celery tasks, I cannot use scikit-learn multi-threading. The joblib library raises the following warning: UserWarning: Multiprocessing-backend parallel loops cannot be nested, setting n_jobs=1 I attended a presentation about the Loky project at PyParis in June 2017. It was presented as a solution to this issue. Is there any plan in the near future to solve this issue directly in scikit-learn? Cheers, Anaël Beaugnon -------------- next part -------------- An HTML attachment was scrubbed... URL: From g.lemaitre58 at gmail.com Tue Mar 20 12:49:33 2018 From: g.lemaitre58 at gmail.com (Guillaume Lemaitre) Date: Tue, 20 Mar 2018 17:49:33 +0100 Subject: [scikit-learn] scikit-learn with celery (joblib issue) In-Reply-To: References: Message-ID: <20180320164933.5075028.23782.51468@gmail.com> An HTML attachment was scrubbed...
URL: From shubhamashokgandhi at gmail.com Wed Mar 21 02:22:03 2018 From: shubhamashokgandhi at gmail.com (Shubham Ashok Gandhi) Date: Wed, 21 Mar 2018 11:52:03 +0530 Subject: [scikit-learn] Run time complexity of algorithms Message-ID: Hello, Hope you guys are doing well. I need information on the time complexity of the models under the supervised learning title. I am looking for this information because we (my team) are building a platform that allows a user to run multiple models. The models we select to build are based on a couple of user requirements such as time, accuracy and model interpretability. This information will help us in understanding what models not to select for large datasets and so on. More specifically, I am looking for information on these algorithms: OLS, Elastic Net, LARS, Bayesian regression, Linear Discriminant Analysis, SVM (all kernels), Nearest Neighbors regression, Decision Trees, Random Forest, AdaBoost, Gradient Tree Boosting. Let me know if there's additional information or details you need me to provide -- Regards, Shubham Ashok Gandhi Ph: (+91) 8987419771 -------------- next part -------------- An HTML attachment was scrubbed... URL: From joel.nothman at gmail.com Wed Mar 21 05:26:10 2018 From: joel.nothman at gmail.com (Joel Nothman) Date: Wed, 21 Mar 2018 20:26:10 +1100 Subject: [scikit-learn] Run time complexity of algorithms In-Reply-To: References: Message-ID: If you produce a catalogue of their runtime complexities, it would be great if you could contribute them back to the project's documentation. Thanks! However, I suspect you'll find that the theoretical worst-case asymptotic runtime is often not what you're most interested in. -------------- next part -------------- An HTML attachment was scrubbed...
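Following the remark that worst-case asymptotic runtime is often not the quantity of practical interest, one alternative is to measure fit times empirically on data of increasing size. The sketch below is only an illustration under stated assumptions: the helper names time_fit and empirical_exponent are hypothetical, the two estimators are picked from the list in the question, and the sample sizes and 20 features are arbitrary. Timings are noisy and machine-dependent, so the fitted exponent is a rough indication at best.

```python
import time

import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import LinearRegression
from sklearn.tree import DecisionTreeRegressor


def time_fit(estimator, n_samples, n_features=20, seed=0):
    """Return wall-clock seconds needed to fit `estimator` on synthetic data."""
    X, y = make_regression(n_samples=n_samples, n_features=n_features,
                           random_state=seed)
    start = time.perf_counter()
    estimator.fit(X, y)
    return time.perf_counter() - start


def empirical_exponent(sizes, times):
    """Slope of log(time) against log(n): a rough empirical scaling exponent."""
    slope, _ = np.polyfit(np.log(sizes), np.log(times), 1)
    return float(slope)


sizes = [500, 1000, 2000, 4000]
timings = {
    "OLS": [time_fit(LinearRegression(), n) for n in sizes],
    "DecisionTree": [time_fit(DecisionTreeRegressor(random_state=0), n)
                     for n in sizes],
}

for name, times in timings.items():
    print(name, [round(t, 4) for t in times],
          "exponent ~", round(empirical_exponent(sizes, times), 2))
```

Repeating each measurement several times and taking the median would make such numbers more trustworthy than a single run.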
URL: From joel.nothman at gmail.com Wed Mar 21 05:31:47 2018 From: joel.nothman at gmail.com (Joel Nothman) Date: Wed, 21 Mar 2018 20:31:47 +1100 Subject: [scikit-learn] Run time complexity of algorithms In-Reply-To: References: Message-ID: You may also be interested in the work at https://github.com/scikit-learn/scikit-learn/issues/10289 and perhaps interested in helping give feedback towards finishing it off. -------------- next part -------------- An HTML attachment was scrubbed... URL: From m.ali.jamaoui at gmail.com Sun Mar 25 10:41:39 2018 From: m.ali.jamaoui at gmail.com (Mohamed Ali Jamaoui) Date: Sun, 25 Mar 2018 16:41:39 +0200 Subject: [scikit-learn] Add MAPE as a new metric Message-ID: Dear all, Following @amuller's suggestion in issue 10708, I submitted a PR to add MAPE as a new metric. @jnothman has already approved it. It would be great if someone else could review it as well to prepare for a potential merge. - PR link: https://github.com/scikit-learn/scikit-learn/pull/10711 Thanks & regards, Mohamed Ali JAMAOUI -------------- next part -------------- An HTML attachment was scrubbed... URL: From shalu.ashu50 at gmail.com Wed Mar 28 22:50:04 2018 From: shalu.ashu50 at gmail.com (Vishal Singh) Date: Thu, 29 Mar 2018 10:50:04 +0800 Subject: [scikit-learn] How to apply LR in gridded (for multiple location datasets) time series datasets? Message-ID: Hello, This code is written for multivariate (multiple independent variables x1,x2,x3..xn and a dependent variable y) time series analysis using logistic regression (correlation and prediction).

# Import Libraries
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd

# Import Dataset
dataset = pd.read_csv('precipitation.csv')
x = dataset.iloc[:, [2, 3]].values
y = dataset.iloc[:, 4].values

# Split Training Set and Testing Set
# (sklearn.cross_validation is deprecated; train_test_split lives in sklearn.model_selection)
from sklearn.model_selection import train_test_split
x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.25)

# Feature Scaling
from sklearn.preprocessing import StandardScaler
sc_X = StandardScaler()
x_train = sc_X.fit_transform(x_train)
x_test = sc_X.transform(x_test)

# Training the Logistic Model
from sklearn.linear_model import LogisticRegression
classifier = LogisticRegression()
classifier.fit(x_train, y_train)

# Predicting the Test Set Result
y_pred = classifier.predict(x_test)

This code is based on a dataset for one point location (one lat/long). Suppose I have gridded datasets (with many points/locations, lat/long, varying in space and time): how would I implement this code? I am not an expert in Python. Could somebody help me with this, or give me an example or idea so that I can adapt this code to my requirements? Thank you in advance. Vishu -------------- next part -------------- An HTML attachment was scrubbed... URL: From jinwoo412 at gmail.com Wed Mar 28 23:49:25 2018 From: jinwoo412 at gmail.com (PARK Jinwoo) Date: Thu, 29 Mar 2018 12:49:25 +0900 Subject: [scikit-learn] (no subject) Message-ID: Dear scikit-learn experts Hello, I am a graduate school student majoring in doping control analysis in Korea. Now I'm in a research institute that carries out doping control analyses. I received a project from my advising doctor. It's about operating an AI project. A workshop is scheduled in April, so it needs to be done in a month. However, I haven't learned computer science at all and I'm totally ignorant of it. So I desperately need your advice. To be specific, the 3 xml files shown in the picture are analysis results named positive, negative, and unknown from top to bottom. We'd like to let AI learn positive and negative data, input unknown datum, and then see what result will turn out.
I came to know that there's a module called 'iris classification' in scikit-learn and I'm thinking of utilizing that as it seems similar to my assignment. However, while the database of iris is a csv file with 150 data and labels inside, what I have are 3 xml files each one of which represents one data, which are stored in C:\Users\Jinwoo\Documents\Python Scripts\mzdata. The training process is not shuffling randomly the 150 data and dividing into training set and test set. The data are already assigned into training ones and testing one. Also, when training the program, training labels naming positive and negative should be inserted on my own. All I know is that it will be appropriate to use fit() function and predict() function to train and test. But I have no idea on what to import, how to write codes correctly, and so on. I would be thankful for some help. -------------- next part -------------- A non-text attachment was scrubbed... Name: MZdata.png Type: image/png Size: 12778 bytes Desc: not available URL: From jinwoo412 at gmail.com Thu Mar 29 00:06:04 2018 From: jinwoo412 at gmail.com (PARK Jinwoo) Date: Thu, 29 Mar 2018 04:06:04 +0000 Subject: [scikit-learn] I’m in trouble and I need your advice on operating scikit-learn Message-ID: Dear scikit-learn experts Hello, I am a graduate school student majoring in doping control analysis in Korea. Now I'm in a research institute that carries out doping control analyses. I received a project from my advising doctor. It's about operating an AI project. A workshop is scheduled in April, so it needs to be done in a month. However, I haven't learned computer science at all and I'm totally ignorant of it. So I desperately need your advice. To be specific, the 3 xml files shown in the picture are analysis results named positive, negative, and unknown from top to bottom. We'd like to let AI learn positive and negative data, input unknown datum, and then see what result will turn out.
I came to know that there's a module called 'iris classification' in scikit-learn and I'm thinking of utilizing that as it seems similar to my assignment. However, while the database of iris is a csv file with 150 data and labels inside, what I have are 3 xml files each one of which represents one data, which are stored in C:\Users\Jinwoo\Documents\Python Scripts\mzdata. The training process is not shuffling randomly the 150 data and dividing into training set and test set. The data are already assigned into training ones and testing one. Also, when training the program, training labels naming positive and negative should be inserted on my own. All I know is that it will be appropriate to use fit() function and predict() function to train and test. But I have no idea on what to import, how to write codes correctly, and so on. I would be thankful for some help. -------------- next part -------------- An HTML attachment was scrubbed... URL: From ahowe42 at gmail.com Thu Mar 29 00:47:36 2018 From: ahowe42 at gmail.com (Andrew Howe) Date: Thu, 29 Mar 2018 07:47:36 +0300 Subject: [scikit-learn] I’m in trouble and I need your advice on operating scikit-learn In-Reply-To: References: Message-ID: Hi Jinwoo It is true that scikit-learn has many models for supervised classification tasks, and it should be relatively trivial for you to munge your 3 data files into the X (data) y (labels) format required for these methods. Examples are k-nearest neighbors, Support Vector Machines, Decision Trees, and Discriminant Analysis. However, these are typically considered "machine learning" techniques; when someone says "AI", they typically mean a Neural Network. If you wish to use scikit-learn for Neural Network classification, you are limited to the Multilayer Perceptron: http://scikit-learn.org/stable/modules/neural_networks_supervised.html#.
If you want to be able to use more advanced Neural Networks, here are some options: *Deep neural networks etc.* - pylearn2 A deep learning and neural network library build on theano with scikit-learn like interface. - sklearn_theano scikit-learn compatible estimators, transformers, and datasets which use Theano internally - nolearn A number of wrappers and abstractions around existing neural network libraries - keras Deep Learning library capable of running on top of either TensorFlow or Theano. - lasagne A lightweight library to build and train neural networks in Theano. I personally use Google's TensorFlow. Hope this helps. Andrew <~~~~~~~~~~~~~~~~~~~~~~~~~~~> J. Andrew Howe, PhD LinkedIn Profile ResearchGate Profile Open Researcher and Contributor ID (ORCID) Github Profile Personal Website I live to learn, so I can learn to live. - me <~~~~~~~~~~~~~~~~~~~~~~~~~~~> On Thu, Mar 29, 2018 at 7:06 AM, PARK Jinwoo wrote: > Dear scikit-learn experts > > Hello, I am a graduate school student majoring in doping control > analysis in Korea. > Now I'm in a research institute that carries out doping control analyses. > > I received a project by my advising doctor. It's about operating an AI > project. > A workshop is scheduled in April, so it needs to be done in a month. > However, I haven't learn computer science at all and I'm totally ignorant > of it. > So I desperately need your advice. > > To be specific, the 3 xml files shown in the picture are analysis results > named positive, negative, and unknown from top to bottom. > We'd like to let AI learn positive and negative data, > input unknown datum, and then see what result will turn out. 
> > I came to know that there's a module called 'iris calssification' in > scikit-learn > and I'm thinking of utilizing that as it seems similar with my assignment > However, while the database of iris is a csv file with 150 data and > labels inside, > what I have are 3 xml files each one of which represents one data, > which are stored in C:\Users\Jinwoo\Documents\Python Scripts\mzdata > The training process is not shuffling randomly the 150 data and > dividing into training set and test set. The data are already assigned > into training ones and testing one. > Also, when training the program, training labels naming positive and > negative should be inserted on my own. > > What I know all is that it will be appropriate to use fit() function > and predict() function to train and test. > But I have no idea on what to import, how to write codes correctly, and so > on > > It will be thankful to give me some help. > > > _______________________________________________ > scikit-learn mailing list > scikit-learn at python.org > https://mail.python.org/mailman/listinfo/scikit-learn > > -------------- next part -------------- An HTML attachment was scrubbed... URL: From hamidizade.s at gmail.com Thu Mar 29 02:53:45 2018 From: hamidizade.s at gmail.com (S Hamidizade) Date: Thu, 29 Mar 2018 11:23:45 +0430 Subject: [scikit-learn] Validation curve - Learning curve Message-ID: Dear Mr. / Ms. 
I would appreciate your help with two questions about the following example code:

from collections import Counter
from sklearn.datasets import make_classification
from sklearn.model_selection import (train_test_split, StratifiedKFold,
                                     learning_curve, validation_curve,
                                     GridSearchCV)
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.metrics import classification_report
import numpy as np
import matplotlib.pyplot as plt


def plot_learning_curve(train_sizes, train_scores, test_scores, title, alpha=0.1):
    train_mean = np.mean(train_scores, axis=1)
    train_std = np.std(train_scores, axis=1)
    test_mean = np.mean(test_scores, axis=1)
    test_std = np.std(test_scores, axis=1)
    plt.plot(train_sizes, train_mean, label='train score', color='blue', marker='o')
    plt.fill_between(train_sizes, train_mean + train_std,
                     train_mean - train_std, color='blue', alpha=alpha)
    plt.plot(train_sizes, test_mean, label='test score', color='red', marker='o')
    plt.fill_between(train_sizes, test_mean + test_std,
                     test_mean - test_std, color='red', alpha=alpha)
    plt.title(title)
    plt.xlabel('Number of training points')
    plt.ylabel('F-measure')
    plt.grid(ls='--')
    plt.legend(loc='best')
    plt.show()


def plot_validation_curve(param_range, train_scores, test_scores, title, alpha=0.1):
    train_mean = np.mean(train_scores, axis=1)
    train_std = np.std(train_scores, axis=1)
    test_mean = np.mean(test_scores, axis=1)
    test_std = np.std(test_scores, axis=1)
    plt.plot(param_range, train_mean, label='train score', color='blue', marker='o')
    plt.fill_between(param_range, train_mean + train_std,
                     train_mean - train_std, color='blue', alpha=alpha)
    plt.plot(param_range, test_mean, label='test score', color='red', marker='o')
    plt.fill_between(param_range, test_mean + test_std,
                     test_mean - test_std, color='red', alpha=alpha)
    plt.title(title)
    plt.grid(ls='--')
    plt.xlabel('Parameter value')
    plt.ylabel('F-measure')
    plt.legend(loc='best')
    plt.show()


X, y = make_classification(n_classes=2, class_sep=2, weights=[0.9, 0.1],
                           n_informative=3, n_redundant=1, flip_y=0,
                           n_features=20, n_clusters_per_class=1,
                           n_samples=1000, random_state=10)
print('Original dataset shape {}'.format(Counter(y)))
ln = X.shape
names = ["x%s" % i for i in range(1, ln[1] + 1)]
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)
st = StandardScaler()
rg = LogisticRegression(class_weight={0: 1, 1: 6.5}, random_state=42,
                        solver='saga', max_iter=100, n_jobs=-1)
param_grid = {'clf__C': [0.001, 0.01, 0.1, 0.002, 0.02, 0.005, 0.0007, .0006, 0.0005],
              'clf__class_weight': [{0: 1, 1: 6}, {0: 1, 1: 4}, {0: 1, 1: 5.5},
                                    {0: 1, 1: 4.5}, {0: 1, 1: 5}]}
pipeline = Pipeline(steps=[('scaler', st), ('clf', rg)])
cv = StratifiedKFold(n_splits=5, random_state=42)
rg_cv = GridSearchCV(pipeline, param_grid, cv=cv, scoring='f1')
rg_cv.fit(X_train, y_train)
print("Tuned rg best params: {}".format(rg_cv.best_params_))
ypred = rg_cv.predict(X_train)
print(classification_report(y_train, ypred))
print('######################')
ypred2 = rg_cv.predict(X_test)
print(classification_report(y_test, ypred2))
plt.figure(figsize=(9, 6))
param_range1 = [i / 10000.0 for i in range(1, 11)]
param_range2 = [{0: 1, 1: 6}, {0: 1, 1: 4}, {0: 1, 1: 5.5}, {0: 1, 1: 4.5}, {0: 1, 1: 5}]

if __name__ == '__main__':
    train_sizes, train_scores, test_scores = learning_curve(
        estimator=rg_cv.best_estimator_, X=X_train, y=y_train,
        train_sizes=np.arange(0.1, 1.1, 0.1), cv=cv, scoring='f1', n_jobs=-1)
    plot_learning_curve(train_sizes, train_scores, test_scores,
                        title='Learning curve for Logistic Regression')

    train_scores, test_scores = validation_curve(
        estimator=rg_cv.best_estimator_, X=X_train, y=y_train,
        param_name="clf__C", param_range=param_range1,
        cv=cv, scoring="f1", n_jobs=-1)
    plot_validation_curve(param_range1, train_scores, test_scores,
                          title="Validation Curve for C", alpha=0.1)

    train_scores, test_scores = validation_curve(
        estimator=rg_cv.best_estimator_, X=X_train, y=y_train,
        param_name="clf__class_weight", param_range=param_range2,
        cv=cv, scoring="f1", n_jobs=-1)
    plot_validation_curve(param_range2, train_scores, test_scores,
                          title="Validation Curve for class_weight", alpha=0.1)

1- Why, when the best estimator of GridSearchCV is passed to the learning_curve function, are all the previous print lines printed several times (run on Windows)?
2- How can I plot a validation curve for class_weight? I get:
TypeError: float() argument must be a string or a number, not 'dict'

Thanks in advance.
Best regards,

From g.lemaitre58 at gmail.com Thu Mar 29 04:50:08 2018
From: g.lemaitre58 at gmail.com (Guillaume Lemaître)
Date: Thu, 29 Mar 2018 10:50:08 +0200
Subject: [scikit-learn] I’m in trouble and I need your advice on operating scikit-learn

> However, these are typically considered "machine learning" techniques; when someone says "AI", they typically mean a Neural Network.

I am sorry but I disagree: https://en.wikipedia.org/wiki/Artificial_intelligence

--
Guillaume Lemaitre
INRIA Saclay - Parietal team
Center for Data Science Paris-Saclay
https://glemaitre.github.io/

From robbenson18 at gmail.com Thu Mar 29 05:16:32 2018
From: robbenson18 at gmail.com (Roberto Guidotti)
Date: Thu, 29 Mar 2018 11:16:32 +0200
Subject: [scikit-learn] Get parameters of classes in a Pipeline within cross_validate

Hi scikit-learners,

I have a simple Pipeline with feature selection and an SVC classifier, and I use it in a cross-validation scheme with the cross_validate / cross_val_score functions.
I need to extract the selected features for each fold of the CV and, more generally, get information about the fitted elements of the pipeline in each CV fold. Is there a way to get this information (e.g. fs.get_support() or fs.scores_), or do I need to build my own cross_validate function?

Thank you,
Roberto

--
Ing. Roberto Guidotti, PhD.
PostDoc Fellow
Institute for Advanced Biomedical Technologies - ITAB
Department of Neuroscience and Imaging
University of Chieti "G. D'Annunzio"
Via dei Vestini, 33
66013 Chieti, Italy
tel: +39 0871 3556919
e-mail: r.guidotti at unich.it; rguidotti at acm.org
linkedin: http://it.linkedin.com/in/robertogui/
twitter: @robbisg
github: https://github.com/robbisg

From william.de-vazelhes at inria.fr Thu Mar 29 07:15:42 2018
From: william.de-vazelhes at inria.fr (wdevazel)
Date: Thu, 29 Mar 2018 13:15:42 +0200
Subject: [scikit-learn] Problem with check_estimator for distance metric learning

Hi all,

We are currently trying to add to the metric-learn package (https://github.com/metric-learn/metric-learn) a feature that would allow cross-validation of Weakly Supervised Metric Learners using scikit-learn's cross-validation routines.

Distance Metric Learning algorithms learn distance metrics between samples, using some supervised information about similarity between training samples. Some Metric Learning algorithms are weakly supervised (Weakly Supervised Metric Learners), i.e. they do not train on labeled samples, but for instance on labeled *pairs* of samples (the label telling whether the pair is made of similar or dissimilar samples). To cross-validate these algorithms, we make a train set and a test set by splitting on the pairs. Indeed, one use case of metric learning is to classify unseen pairs at test time as similar or dissimilar (those pairs can involve already seen samples).

For that, we made a dataset representation that makes it easy to slice on pairs of samples: we mock a 3D array containing pairs of samples, which would be of shape (n_constraints, 2, n_features) (each row is a pair of samples). We do so with an object that we called ConstrainedDataset, which is more memory efficient than the described array (because samples would be duplicated across pairs).

Now we have a problem when running scikit-learn's *check_estimator* on these algorithms, because it launches a series of tests where the estimator takes regular arrays as input, whereas Weakly Supervised Metric Learners always learn on ConstrainedDatasets (or more generally on pairs, or tuples for some other algorithms).

We therefore thought of two main possibilities (that could be combined) to solve this problem:

- taking the maximum number of tests yielded by check_estimator that pass in our setting, and modifying the others by replacing array inputs with ConstrainedDatasets
- wrapping a Weakly Supervised Metric Learner into a MockSklearnEstimator that would transform any array input into a ConstrainedDataset before passing it to the underlying Weakly Supervised Metric Learner

However, these options are not really satisfying: the first one will create a lot of code, after which one cannot see at a glance whether the estimator passes scikit-learn's check_estimator, and the second adds so much wrapping that we are not even really testing the Weakly Supervised Metric Learner.

For more information, see this PR where the new feature is being implemented, including the constraints.ConstrainedDataset object, as well as a comment on what is problematic when using scikit-learn's check_estimator: https://github.com/metric-learn/metric-learn/pull/85#issuecomment-375659820

Any advice about how to design the weakly supervised algorithms, the data structure containing the pairs of samples, or how to use scikit-learn's check_estimator anyway would be appreciated! Thanks!
Best regards,
William

From william.de-vazelhes at inria.fr Thu Mar 29 07:49:34 2018
From: william.de-vazelhes at inria.fr (wdevazel)
Date: Thu, 29 Mar 2018 13:49:34 +0200
Subject: [scikit-learn] problem with check_estimator for distance metric learning
Message-ID: <9c4b353d-c6a0-e048-1678-7ab5cba69cd5@inria.fr>

(Sorry, I sent this mail as a reply instead of starting a new thread... Here is the new thread.)

Hi all,

We are currently trying to add to the metric-learn package (https://github.com/metric-learn/metric-learn) a feature that would allow cross-validation of Weakly Supervised Metric Learners using scikit-learn's cross-validation routines.

Distance Metric Learning algorithms learn distance metrics between samples, using some supervised information about similarity between training samples. Some Metric Learning algorithms are weakly supervised (Weakly Supervised Metric Learners), i.e. they do not train on labeled samples, but for instance on labeled *pairs* of samples (the label telling whether the pair is made of similar or dissimilar samples). To cross-validate these algorithms, we make a train set and a test set by splitting on the pairs. Indeed, one use case of metric learning is to classify unseen pairs at test time as similar or dissimilar (those pairs can involve already seen samples).

For that, we made a dataset representation that makes it easy to slice on pairs of samples: we mock a 3D array containing pairs of samples, which would be of shape (n_constraints, 2, n_features) (each row is a pair of samples). We do so with an object that we called ConstrainedDataset, which is more memory efficient than the described array (because samples would be duplicated across pairs).

Now we have a problem when running scikit-learn's *check_estimator* on these algorithms, because it launches a series of tests where the estimator takes regular arrays as input, whereas Weakly Supervised Metric Learners always learn on ConstrainedDatasets (or more generally on pairs, or tuples for some other algorithms).

We therefore thought of two main possibilities (that could be combined) to solve this problem:

- taking the maximum number of tests yielded by check_estimator that pass in our setting, and modifying the others by replacing array inputs with ConstrainedDatasets
- wrapping a Weakly Supervised Metric Learner into a MockSklearnEstimator that would transform any array input into a ConstrainedDataset before passing it to the underlying Weakly Supervised Metric Learner

However, these options are not really satisfying: the first one will create a lot of code, after which one cannot see at a glance whether the estimator passes scikit-learn's check_estimator, and the second adds so much wrapping that we are not even really testing the Weakly Supervised Metric Learner.

For more information, see this PR where the new feature is being implemented, including the constraints.ConstrainedDataset object, as well as a comment on what is problematic when using scikit-learn's check_estimator: https://github.com/metric-learn/metric-learn/pull/85#issuecomment-375659820

Any advice about how to design the weakly supervised algorithms, the data structure containing the pairs of samples, or how to use scikit-learn's check_estimator anyway would be appreciated! Thanks!

Best regards,
William
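
A minimal sketch of the pair representation described above, for illustration only: it is not the ConstrainedDataset from the metric-learn PR. It assumes the samples are stored once in a 2D array and the pairs are stored as integer row indices into it; the class name PairsView, its attributes, and the toy data are all hypothetical. Indexing acts on pairs, and the 3D array of shape (n_constraints, 2, n_features) is only built on request.

```python
import numpy as np


class PairsView:
    """Hypothetical stand-in for a pair dataset; not metric-learn's ConstrainedDataset."""

    def __init__(self, X, pairs):
        self.X = np.asarray(X)                      # (n_samples, n_features), stored once
        self.pairs = np.asarray(pairs, dtype=int)   # (n_constraints, 2) row indices into X

    def __len__(self):
        return len(self.pairs)

    def __getitem__(self, idx):
        # idx is a slice or an index array: it selects pairs, never single samples.
        # The sample array X is shared with the parent object, not copied.
        return PairsView(self.X, self.pairs[idx])

    def toarray(self):
        # Fancy indexing builds the (n_constraints, 2, n_features) array on demand.
        return self.X[self.pairs]


X = np.arange(12, dtype=float).reshape(4, 3)        # 4 samples, 3 features
pairs = np.array([[0, 1], [1, 2], [2, 3], [0, 3], [1, 3]])
y_pairs = np.array([1, 1, -1, -1, 1])               # similar (+1) / dissimilar (-1)

data = PairsView(X, pairs)
train_idx, test_idx = np.array([0, 1, 2]), np.array([3, 4])   # a split made on pairs
train, test = data[train_idx], data[test_idx]
print(train.toarray().shape, test.toarray().shape)
```

The index arrays here are written by hand; in practice they could come from a splitter such as KFold applied to np.arange(len(data)), so that the split is made on pairs as the message describes.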
From nicholdav at gmail.com Thu Mar 29 13:44:38 2018
From: nicholdav at gmail.com (David Nicholson)
Date: Thu, 29 Mar 2018 13:44:38 -0400
Subject: [scikit-learn] I’m in trouble and I need your advice on operating scikit-learn

Hey Jinwoo,

Sounds like you're in a tough situation. Not sure why people are responding with discussions of the true meaning of AI.

As far as getting things out of xml goes, you can use the ElementTree module (xml.etree.ElementTree) that's in the Python standard library.
https://eli.thegreenplace.net/2012/03/15/processing-xml-in-python-with-elementtree/

As far as learning how to write code and which import statements to use, there are a lot of free resources on the web:
https://github.com/jakevdp/PythonDataScienceHandbook

There are also video tutorials on YouTube such as https://youtu.be/2kT6QOVSgSg

I don't want to speak for others, but I think this listserv would not be the best place to get help with learning how to write code and which import statements to use. You might start with a Stack Overflow post and tag it with 'scikit-learn' if you want to get help faster.

Hope that helps,
David

--
David Nicholson, Ph.D.
nickledave.github.io
https://github.com/NickleDave
Sober Lab
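
As a rough illustration of the ElementTree suggestion above, the sketch below parses small XML strings into fixed-length feature vectors plus hand-written labels. The element and attribute names (peak, mz, intensity), the binning scheme, and the helper xml_to_features are assumptions made up for this example; real mzData files have a different and more involved layout that would need its own parsing.

```python
import xml.etree.ElementTree as ET

# Toy documents standing in for the real files (their structure is an assumption).
POSITIVE = '<spectrum><peak mz="100.2" intensity="5.0"/><peak mz="250.7" intensity="9.0"/></spectrum>'
NEGATIVE = '<spectrum><peak mz="100.4" intensity="4.0"/></spectrum>'
UNKNOWN = '<spectrum><peak mz="251.0" intensity="8.0"/></spectrum>'


def xml_to_features(xml_text, n_bins=4, mz_max=400.0):
    """Sum peak intensities into n_bins equal-width m/z bins."""
    features = [0.0] * n_bins
    root = ET.fromstring(xml_text)        # for a file: ET.parse(path).getroot()
    for peak in root.iter("peak"):
        mz = float(peak.get("mz"))
        intensity = float(peak.get("intensity"))
        # Clamp to the last bin so values at or above mz_max stay in range.
        bin_index = min(int(mz / mz_max * n_bins), n_bins - 1)
        features[bin_index] += intensity
    return features


X_train = [xml_to_features(POSITIVE), xml_to_features(NEGATIVE)]
y_train = ["positive", "negative"]         # labels supplied by hand
X_unknown = [xml_to_features(UNKNOWN)]
print(X_train, X_unknown)
```

Lists built this way have the X / y layout that scikit-learn estimators take, so a call along the lines of KNeighborsClassifier(n_neighbors=1).fit(X_train, y_train).predict(X_unknown) would be the next step. With a single example per class, such a prediction only demonstrates the mechanics.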
From mkselvak at sfu.ca Fri Mar 30 22:38:03 2018
From: mkselvak at sfu.ca (Manoj Karthick)
Date: Fri, 30 Mar 2018 19:38:03 -0700
Subject: [scikit-learn] Error random_state parameter changed by estimator

I am working on adding a new estimator to the scikit-learn library, but the make command always exits with the below error message:

AssertionError: Estimator XYZ should not change or mutate the parameter random_state from 0 to during fit.

Can you help me understand what the issue is?

Error log:

self =
msg = 'Estimator XYZ should not change or mutate the parameter random_state from 0 to during fit.'

    def fail(self, msg=None):
        """Fail immediately, with the given message."""
>       raise self.failureException(msg)
E       AssertionError: Estimator XYZ should not change or mutate the parameter random_state from 0 to during fit.

msg = 'Estimator XYZ should not change or mutate the parameter random_state from 0 to during fit.'
self =

Thanks in advance,
Manoj Karthick Selva Kumar
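
One common cause of this assertion, offered as a guess since the estimator's code is not shown, is that fit overwrites self.random_state with a RandomState object. The sketch below uses a made-up toy estimator (named XYZ only to mirror the message; its noise_ attribute is hypothetical) that leaves the constructor parameter untouched and resolves the generator into a local variable instead.

```python
import numpy as np
from sklearn.base import BaseEstimator
from sklearn.utils import check_random_state


class XYZ(BaseEstimator):
    """Toy estimator; only the random_state handling matters here."""

    def __init__(self, random_state=None):
        self.random_state = random_state   # stored exactly as given, never modified

    def fit(self, X, y=None):
        # Problematic pattern (mutates the parameter):
        #     self.random_state = check_random_state(self.random_state)
        rng = check_random_state(self.random_state)   # local RandomState instance
        # Anything learned during fit goes into attributes with a trailing underscore.
        self.noise_ = rng.normal(size=np.asarray(X).shape[0])
        return self


est = XYZ(random_state=0).fit(np.zeros((5, 2)))
print(est.get_params())
```

Because the parameter is never reassigned, get_params() returns the same value before and after fit, which is the property the failing check compares.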