From t3kcit at gmail.com Tue Nov 1 10:05:54 2016 From: t3kcit at gmail.com (Andy) Date: Tue, 1 Nov 2016 10:05:54 -0400 Subject: [scikit-learn] creating a custom scoring function for cross-validation of classification In-Reply-To: References: Message-ID: Hi. If you want to pass a custom scorer, you need to pass the scorer, not a string with the scorer name. Andy On 10/31/2016 04:28 PM, Sumeet Sandhu wrote: > Hi, > > I've been staring at various doc pages for a while to create a custom > scorer that uses predict_proba output of a multi-class SGDClassifier : > http://scikit-learn.org/stable/modules/generated/sklearn.model_selection.cross_val_score.html#sklearn.model_selection.cross_val_score > http://scikit-learn.org/stable/modules/model_evaluation.html#scoring-parameter > http://scikit-learn.org/stable/modules/generated/sklearn.metrics.make_scorer.html#sklearn.metrics.make_scorer > > I got the impression I could customize the "scoring'' parameter in > cross_val_score directly, but that didn't work. > Then I tried customizing the "score_func" parameter in make_scorer, > but that didn't work either. Both errors are ValuErrors : > > Traceback (most recent call last): > File "", line 3, in > accuracy = mean(cross_val_score(LRclassifier, trainPatentVecs, > trainLabelVecs, cv=10, scoring = 'topNscorer')) > File > "/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/site-packages/sklearn/cross_validation.py", > line 1425, in cross_val_score > scorer = check_scoring(estimator, scoring=scoring) > File > "/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/site-packages/sklearn/metrics/scorer.py", > line 238, in check_scoring > return get_scorer(scoring) > File > "/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/site-packages/sklearn/metrics/scorer.py", > line 197, in get_scorer > % (scoring, sorted(SCORERS.keys()))) > ValueError: 'topNscorer' is not a valid scoring value. Valid options > are ['accuracy', 'adjusted_rand_score', 'average_precision', 'f1', > 'f1_macro', 'f1_micro', 'f1_samples', 'f1_weighted', 'log_loss', > 'mean_absolute_error', 'mean_squared_error', 'median_absolute_error', > 'precision', 'precision_macro', 'precision_micro', > 'precision_samples', 'precision_weighted', 'r2', 'recall', > 'recall_macro', 'recall_micro', 'recall_samples', 'recall_weighted', > 'roc_auc'] > > At a high level, I want to find out if the true label was found in the > top N multi-class labels coming out of an SGD classifier. Built-in > scores like "accuracy" only look at N=1. > > Here is the code using make_scorer : > LRclassifier = SGDClassifier(loss='log') > topNscorer = make_scorer(topNscoring, greater_is_better=True, > needs_proba=True) > accuracyN = mean(cross_val_score(LRclassifier, Data, Labels, > scoring = 'topNscorer')) > > Here is the code for the custom scoring function : > def topNscoring(y, yp): > ## Inputs y = true label per sample, yp = predict_proba > probabilities of all labels per sample > N = 5 > foundN = [] > for ii in xrange(0,shape(yp)[0]): > indN = [ w[0] for w in > sorted(enumerate(list(yp[ii,:])),key=lambda w:w[1],reverse=True)[0:N] ] > if y[ii] in indN: foundN.append(1) > else: foundN.append(0) > return mean(foundN) > > Any help will be greatly appreciated. > > best regards, > Sumeet > > > > _______________________________________________ > scikit-learn mailing list > scikit-learn at python.org > https://mail.python.org/mailman/listinfo/scikit-learn -------------- next part -------------- An HTML attachment was scrubbed... 
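In code, the fix Andy describes looks roughly like the following sketch (this assumes scikit-learn 0.18's model_selection module and reuses the topNscoring function and the Data/Labels arrays from the original post):

import numpy as np
from sklearn.linear_model import SGDClassifier
from sklearn.metrics import make_scorer
from sklearn.model_selection import cross_val_score

# Wrap the custom metric; needs_proba=True makes the scorer feed
# predict_proba output (not hard class predictions) into topNscoring.
topNscorer = make_scorer(topNscoring, greater_is_better=True, needs_proba=True)

LRclassifier = SGDClassifier(loss='log')
# Pass the scorer object itself; a string is only valid for the built-in metric names.
accuracyN = np.mean(cross_val_score(LRclassifier, Data, Labels, cv=10, scoring=topNscorer))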
URL: From m.waseem.ahmad at gmail.com Tue Nov 1 12:50:36 2016 From: m.waseem.ahmad at gmail.com (muhammad waseem) Date: Tue, 1 Nov 2016 16:50:36 +0000 Subject: [scikit-learn] SVM number of support vectors Message-ID: Hello All, I am trying to replicate the below figure and wanted to confirm that number of support vectors can be calculated by *support_vectors_* attribute in scikitlearn? [image: Inline image 1] Regards Waseem -------------- next part -------------- An HTML attachment was scrubbed... URL: -------------- next part -------------- A non-text attachment was scrubbed... Name: image.png Type: image/png Size: 24829 bytes Desc: not available URL: From sumeet.k.sandhu at gmail.com Tue Nov 1 12:52:35 2016 From: sumeet.k.sandhu at gmail.com (Sumeet Sandhu) Date: Tue, 1 Nov 2016 09:52:35 -0700 Subject: [scikit-learn] creating a custom scoring function for cross-validation of classification In-Reply-To: References: Message-ID: ahha - thanks Andy ! that works... On Tue, Nov 1, 2016 at 7:05 AM, Andy wrote: > Hi. > If you want to pass a custom scorer, you need to pass the scorer, not a > string with the scorer name. > Andy > > > On 10/31/2016 04:28 PM, Sumeet Sandhu wrote: > > Hi, > > I've been staring at various doc pages for a while to create a custom > scorer that uses predict_proba output of a multi-class SGDClassifier : > http://scikit-learn.org/stable/modules/generated/ > sklearn.model_selection.cross_val_score.html#sklearn.model_ > selection.cross_val_score > http://scikit-learn.org/stable/modules/model_evaluation.html#scoring- > parameter > http://scikit-learn.org/stable/modules/generated/ > sklearn.metrics.make_scorer.html#sklearn.metrics.make_scorer > > I got the impression I could customize the "scoring'' parameter in > cross_val_score directly, but that didn't work. > Then I tried customizing the "score_func" parameter in make_scorer, but > that didn't work either. Both errors are ValuErrors : > > Traceback (most recent call last): > File "", line 3, in > accuracy = mean(cross_val_score(LRclassifier, trainPatentVecs, > trainLabelVecs, cv=10, scoring = 'topNscorer')) > File "/Library/Frameworks/Python.framework/Versions/2.7/lib/ > python2.7/site-packages/sklearn/cross_validation.py", line 1425, in > cross_val_score > scorer = check_scoring(estimator, scoring=scoring) > File "/Library/Frameworks/Python.framework/Versions/2.7/lib/ > python2.7/site-packages/sklearn/metrics/scorer.py", line 238, in > check_scoring > return get_scorer(scoring) > File "/Library/Frameworks/Python.framework/Versions/2.7/lib/ > python2.7/site-packages/sklearn/metrics/scorer.py", line 197, in > get_scorer > % (scoring, sorted(SCORERS.keys()))) > ValueError: 'topNscorer' is not a valid scoring value. Valid options are > ['accuracy', 'adjusted_rand_score', 'average_precision', 'f1', 'f1_macro', > 'f1_micro', 'f1_samples', 'f1_weighted', 'log_loss', 'mean_absolute_error', > 'mean_squared_error', 'median_absolute_error', 'precision', > 'precision_macro', 'precision_micro', 'precision_samples', > 'precision_weighted', 'r2', 'recall', 'recall_macro', 'recall_micro', > 'recall_samples', 'recall_weighted', 'roc_auc'] > > At a high level, I want to find out if the true label was found in the top > N multi-class labels coming out of an SGD classifier. Built-in scores like > "accuracy" only look at N=1. 
> > Here is the code using make_scorer : > LRclassifier = SGDClassifier(loss='log') > topNscorer = make_scorer(topNscoring, greater_is_better=True, > needs_proba=True) > accuracyN = mean(cross_val_score(LRclassifier, Data, Labels, > scoring = 'topNscorer')) > > Here is the code for the custom scoring function : > def topNscoring(y, yp): > ## Inputs y = true label per sample, yp = predict_proba probabilities > of all labels per sample > N = 5 > foundN = [] > for ii in xrange(0,shape(yp)[0]): > indN = [ w[0] for w in sorted(enumerate(list(yp[ii,:])),key=lambda > w:w[1],reverse=True)[0:N] ] > if y[ii] in indN: foundN.append(1) > else: foundN.append(0) > return mean(foundN) > > Any help will be greatly appreciated. > > best regards, > Sumeet > > > > > _______________________________________________ > scikit-learn mailing listscikit-learn at python.orghttps://mail.python.org/mailman/listinfo/scikit-learn > > > > _______________________________________________ > scikit-learn mailing list > scikit-learn at python.org > https://mail.python.org/mailman/listinfo/scikit-learn > > -------------- next part -------------- An HTML attachment was scrubbed... URL: From t3kcit at gmail.com Wed Nov 2 12:10:40 2016 From: t3kcit at gmail.com (Andy) Date: Wed, 2 Nov 2016 12:10:40 -0400 Subject: [scikit-learn] Fwd: libmf bindings In-Reply-To: References: Message-ID: <6f7aa766-a6e3-ae85-94b7-5d113b56ae55@gmail.com> -------- Forwarded Message -------- Subject: libmf bindings Date: Wed, 2 Nov 2016 11:38:00 -0400 From: sam royston To: scikit-learn-owner at python.org Hi, Thanks for all your hard work on this useful tool! I'm hoping to contribute bindings to Chih-Jen Lin's libmf: https://www.csie.ntu.edu.tw/~cjlin/libmf/ . It looks like you guys have functionality for NMF, but used only in the decomposition/ dimensionality reduction setting (and obviously only with non-negative values). Id like to add functionality in the form python wrappers for libmf, much like you have for Chih-Jen Lin's other libraries libsvm and liblinear. Libmf is very efficient and offers great functionality for missing data imputation, recommendation systems and more. I have already written bindings using ctypes, but I see that you have you Cython for libsvm and liblinear - is it necessary that I switch to that interface? Let me know what you think of a contribution like this. Thanks, Sam -------------- next part -------------- An HTML attachment was scrubbed... URL: From drraph at gmail.com Wed Nov 2 12:25:46 2016 From: drraph at gmail.com (Raphael C) Date: Wed, 2 Nov 2016 16:25:46 +0000 Subject: [scikit-learn] Fwd: libmf bindings In-Reply-To: <6f7aa766-a6e3-ae85-94b7-5d113b56ae55@gmail.com> References: <6f7aa766-a6e3-ae85-94b7-5d113b56ae55@gmail.com> Message-ID: (I am not a scikit learn dev.) This is a great idea and I for one look forward to using it. My understanding is that libmf optimises only over the observed values (that is the explicitly given values in a sparse matrix) as is typically needed for recommender system whereas the scikit learn NMF code assumes that any non-specified value in a sparse matrix is zero. It is worth bearing that in mind in any comparison that is carried out. Raphael On 2 November 2016 at 16:10, Andy wrote: > > > > -------- Forwarded Message -------- > Subject: libmf bindings > Date: Wed, 2 Nov 2016 11:38:00 -0400 > From: sam royston > To: scikit-learn-owner at python.org > > Hi, > > Thanks for all your hard work on this useful tool! 
I'm hoping to > contribute bindings to Chih-Jen Lin's libmf: https://www.csie.ntu. > edu.tw/~cjlin/libmf/. It looks like you guys have functionality for NMF, > but used only in the decomposition/ dimensionality reduction setting (and > obviously only with non-negative values). Id like to add functionality in > the form python wrappers for libmf, much like you have for Chih-Jen Lin's > other libraries libsvm and liblinear. > > Libmf is very efficient and offers great functionality for missing data > imputation, recommendation systems and more. > > I have already written bindings using ctypes, but I see that you have you > Cython for libsvm and liblinear - is it necessary that I switch to that > interface? > > Let me know what you think of a contribution like this. > > Thanks, > Sam > > _______________________________________________ > scikit-learn mailing list > scikit-learn at python.org > https://mail.python.org/mailman/listinfo/scikit-learn > > -------------- next part -------------- An HTML attachment was scrubbed... URL: From gael.varoquaux at normalesup.org Wed Nov 2 12:32:15 2016 From: gael.varoquaux at normalesup.org (Gael Varoquaux) Date: Wed, 2 Nov 2016 17:32:15 +0100 Subject: [scikit-learn] Fwd: libmf bindings In-Reply-To: <6f7aa766-a6e3-ae85-94b7-5d113b56ae55@gmail.com> References: <6f7aa766-a6e3-ae85-94b7-5d113b56ae55@gmail.com> Message-ID: <20161102163215.GF3067723@phare.normalesup.org> Given that we'd love to get rid of our libsvm/liblinear biddings, I would be more in favor of improving our matrix factorization code rather than including this code. That said, +1 for missing data imputation with matrix factorization, once we're done with the current PRs on missing data. Ga?l On Wed, Nov 02, 2016 at 12:10:40PM -0400, Andy wrote: > -------- Forwarded Message -------- > Subject: libmf bindings > Date: Wed, 2 Nov 2016 11:38:00 -0400 > From: sam royston > To: scikit-learn-owner at python.org > Hi, > Thanks for all your hard work on this useful tool! I'm hoping to contribute > bindings to Chih-Jen Lin's libmf:?https://www.csie.ntu.edu.tw/~cjlin/libmf/. It > looks like you guys have functionality for NMF, but used only in the > decomposition/ dimensionality reduction setting (and obviously only with > non-negative values). Id like to add functionality in the form python wrappers > for libmf, much like you have for Chih-Jen Lin's other libraries libsvm and > liblinear. > Libmf is very efficient and offers great functionality for missing data > imputation, recommendation systems and more. > I have already written bindings using ctypes, but I see that you have you > Cython for libsvm and liblinear - is it necessary that I switch to that > interface? > Let me know what you think of a contribution like this. > Thanks, > Sam > _______________________________________________ > scikit-learn mailing list > scikit-learn at python.org > https://mail.python.org/mailman/listinfo/scikit-learn -- Gael Varoquaux Researcher, INRIA Parietal NeuroSpin/CEA Saclay , Bat 145, 91191 Gif-sur-Yvette France Phone: ++ 33-1-69-08-79-68 http://gael-varoquaux.info http://twitter.com/GaelVaroquaux From jalopcar at gmail.com Thu Nov 3 11:16:39 2016 From: jalopcar at gmail.com (Jaime Lopez Carvajal) Date: Thu, 3 Nov 2016 10:16:39 -0500 Subject: [scikit-learn] hierarchical clustering Message-ID: Hi there, I am trying to do image classification using hierarchical clustering. 
So, I have my data, and apply these steps: from scipy.cluster.hierarchy import dendrogram, linkage data1 = np.array(data) Z = linkage(data, 'ward') dendrogram(Z, truncate_mode='lastp', p=12, show_leaf_counts=False, leaf_rotation=90., leaf_font_size=12.,show_contracted=True) plt.show() So, I can see the dendrogram with 12 clusters as I want, but I don't know how to use this to classify the image. Also, I understand that the function cluster.hierarchy.cut_tree(Z, n_clusters) cuts the tree at that number of clusters, but again I don't know how to proceed from there. I would like to have something like: cluster = predict(Z, instance) Any advice or direction would be really appreciated, Thanks in advance, Jaime -- *Jaime Lopez Carvajal* -------------- next part -------------- An HTML attachment was scrubbed... URL: From jni.soma at gmail.com Thu Nov 3 18:00:27 2016 From: jni.soma at gmail.com (Juan Nunez-Iglesias) Date: Fri, 4 Nov 2016 09:00:27 +1100 Subject: [scikit-learn] hierarchical clustering In-Reply-To: References: Message-ID: Hi Jaime, From *Elegant SciPy*: """ The *fcluster* function takes a linkage matrix, as returned by linkage, and a threshold, and returns cluster identities. It's difficult to know a-priori what the threshold should be, but we can obtain the appropriate threshold for a fixed number of clusters by checking the distances in the linkage matrix. from scipy.cluster.hierarchy import fcluster n_clusters = 3 threshold_distance = (Z[-n_clusters, 2] + Z[-n_clusters+1, 2]) / 2 clusters = fcluster(Z, threshold_distance, 'distance') """ As an aside, I imagine this question is better placed in the SciPy mailing list than scikit-learn (which has its own hierarchical clustering API). Juan. On Fri, Nov 4, 2016 at 2:16 AM, Jaime Lopez Carvajal wrote: > Hi there, > > I am trying to do image classification using hierarchical clustering. > So, I have my data, and apply this steps: > > from scipy.cluster.hierarchy import dendrogram, linkage > > data1 = np.array(data) > Z = linkage(data, 'ward') > dendrogram(Z, truncate_mode='lastp', p=12, show_leaf_counts=False, > leaf_rotation=90., leaf_font_size=12.,show_contracted=True) > plt.show() > > So, I can see the dendrogram with 12 clusters as I want, but I dont know > how to use this to classify the image. > Also, I understand that funtion cluster.hierarchy.cut_tree(Z, n_clusters), > that cut the tree at that number of clusters, but again I dont know how to > procedd from there. I would like to have something like: cluster = > predict(Z, instance) > > Any advice or direction would be really appreciate, > > Thanks in advance, Jaime > > > -- > > *Jaime Lopez Carvajal* > > _______________________________________________ > scikit-learn mailing list > scikit-learn at python.org > https://mail.python.org/mailman/listinfo/scikit-learn > > -------------- next part -------------- An HTML attachment was scrubbed... URL: From jalopcar at gmail.com Thu Nov 3 18:12:55 2016 From: jalopcar at gmail.com (Jaime Lopez Carvajal) Date: Thu, 3 Nov 2016 17:12:55 -0500 Subject: [scikit-learn] hierarchical clustering In-Reply-To: References: Message-ID: Hi Juan, The fcluster function was that I needed. I can now proceed from here to classify images. Thank you very much, Jaime On Thu, Nov 3, 2016 at 5:00 PM, Juan Nunez-Iglesias wrote: > Hi Jaime, > > From *Elegant SciPy*: > > """ > The *fcluster* function takes a linkage matrix, as returned by linkage, > and a threshold, and returns cluster identities.
It's difficult to know > a-priori what the threshold should be, but we can obtain the appropriate > threshold for a fixed number of clusters by checking the distances in the > linkage matrix. > > from scipy.cluster.hierarchy import fcluster > n_clusters = 3 > threshold_distance = (Z[-n_clusters, 2] + Z[-n_clusters+1, 2]) / 2 > clusters = fcluster(Z, threshold_distance, 'distance') > > """ > > As an aside, I imagine this question is better placed in the SciPy mailing > list than scikit-learn (which has its own hierarchical clustering API). > > Juan. > > On Fri, Nov 4, 2016 at 2:16 AM, Jaime Lopez Carvajal > wrote: > >> Hi there, >> >> I am trying to do image classification using hierarchical clustering. >> So, I have my data, and apply this steps: >> >> from scipy.cluster.hierarchy import dendrogram, linkage >> >> data1 = np.array(data) >> Z = linkage(data, 'ward') >> dendrogram(Z, truncate_mode='lastp', p=12, show_leaf_counts=False, >> leaf_rotation=90., leaf_font_size=12.,show_contracted=True) >> plt.show() >> >> So, I can see the dendrogram with 12 clusters as I want, but I dont know >> how to use this to classify the image. >> Also, I understand that funtion cluster.hierarchy.cut_tree(Z, >> n_clusters), that cut the tree at that number of clusters, but again I dont >> know how to procedd from there. I would like to have something like: >> cluster = predict(Z, instance) >> >> Any advice or direction would be really appreciate, >> >> Thanks in advance, Jaime >> >> >> -- >> >> *Jaime Lopez Carvajal* >> >> _______________________________________________ >> scikit-learn mailing list >> scikit-learn at python.org >> https://mail.python.org/mailman/listinfo/scikit-learn >> >> > > _______________________________________________ > scikit-learn mailing list > scikit-learn at python.org > https://mail.python.org/mailman/listinfo/scikit-learn > > -- *Jaime Lopez Carvajal* -------------- next part -------------- An HTML attachment was scrubbed... URL: From rth.yurchak at gmail.com Fri Nov 4 05:28:13 2016 From: rth.yurchak at gmail.com (Roman Yurchak) Date: Fri, 4 Nov 2016 10:28:13 +0100 Subject: [scikit-learn] hierarchical clustering In-Reply-To: References: Message-ID: <581C54AD.8040803@gmail.com> Hi Jaime, Alternatively, in scikit learn I think, you could use hac = AgglomerativeClustering(n_clusters, linkage="ward") hac.fit(data) clusters = hac.labels_ there in an example on how to plot a dendrogram from this in https://github.com/scikit-learn/scikit-learn/pull/3464 AgglomerativeClustering internally calls scikit learn's version of cut_tree. I would be curious to know whether this is equivalent to scipy's fcluster. Roman On 03/11/16 23:12, Jaime Lopez Carvajal wrote: > Hi Juan, > > The fcluster function was that I needed. I can now proceed from here to > classify images. > Thank you very much, > > Jaime > > On Thu, Nov 3, 2016 at 5:00 PM, Juan Nunez-Iglesias > wrote: > > Hi Jaime, > > From /Elegant SciPy/: > > """ > The *fcluster* function takes a linkage matrix, as returned by > linkage, and a threshold, and returns cluster identities. It's > difficult to know a-priori what the threshold should be, but we can > obtain the appropriate threshold for a fixed number of clusters by > checking the distances in the linkage matrix. 
> > from scipy.cluster.hierarchy import fcluster > n_clusters = 3 > threshold_distance = (Z[-n_clusters, 2] + Z[-n_clusters+1, 2]) / 2 > clusters = fcluster(Z, threshold_distance, 'distance') > > """ > > As an aside, I imagine this question is better placed in the SciPy > mailing list than scikit-learn (which has its own hierarchical > clustering API). > > Juan. > > On Fri, Nov 4, 2016 at 2:16 AM, Jaime Lopez Carvajal > > wrote: > > Hi there, > > I am trying to do image classification using hierarchical > clustering. > So, I have my data, and apply this steps: > > from scipy.cluster.hierarchy import dendrogram, linkage > > data1 = np.array(data) > Z = linkage(data, 'ward') > dendrogram(Z, truncate_mode='lastp', p=12, > show_leaf_counts=False, leaf_rotation=90., > leaf_font_size=12.,show_contracted=True) > plt.show() > > So, I can see the dendrogram with 12 clusters as I want, but I > dont know how to use this to classify the image. > Also, I understand that funtion cluster.hierarchy.cut_tree(Z, > n_clusters), that cut the tree at that number of clusters, but > again I dont know how to procedd from there. I would like to > have something like: cluster = predict(Z, instance) > > Any advice or direction would be really appreciate, > > Thanks in advance, Jaime > > > -- > /*Jaime Lopez Carvajal > */ > > _______________________________________________ > scikit-learn mailing list > scikit-learn at python.org > https://mail.python.org/mailman/listinfo/scikit-learn > > > > > _______________________________________________ > scikit-learn mailing list > scikit-learn at python.org > https://mail.python.org/mailman/listinfo/scikit-learn > > > > > > -- > /*Jaime Lopez Carvajal > */ > > > _______________________________________________ > scikit-learn mailing list > scikit-learn at python.org > https://mail.python.org/mailman/listinfo/scikit-learn > From gael.varoquaux at normalesup.org Fri Nov 4 05:36:49 2016 From: gael.varoquaux at normalesup.org (Gael Varoquaux) Date: Fri, 4 Nov 2016 10:36:49 +0100 Subject: [scikit-learn] hierarchical clustering In-Reply-To: <581C54AD.8040803@gmail.com> References: <581C54AD.8040803@gmail.com> Message-ID: <20161104093649.GA137008@phare.normalesup.org> > AgglomerativeClustering internally calls scikit learn's version of > cut_tree. I would be curious to know whether this is equivalent to > scipy's fcluster. It differs in that it enables adding connectivity constraints. From m.marcinmichal at gmail.com Fri Nov 4 06:45:39 2016 From: m.marcinmichal at gmail.com (Marcin Mirończuk) Date: Fri, 4 Nov 2016 11:45:39 +0100 Subject: [scikit-learn] Naive Bayes - Multinomial Naive Bayes tf-idf Message-ID: Hi, In our experiments, we use a Multinomial Naive Bayes (MNB). The traditional MNB implies the TF weight of the words. We read in the documentation http://scikit-learn.org/stable/modules/naive_bayes.html, which says of Multinomial Naive Bayes that "... where the data are typically represented as word vector counts, although tf-idf vectors are also known to work well in practice". The "word vector counts" part is TF and is well known. Our problem is with the "tf-idf vectors" part. In this case, i.e. tf-idf, is it the approach of D. M. Rennie et al., Tackling the Poor Assumptions of Naive Bayes Text Classification, that was implemented? The documentation gives no citation for this solution. Best, -- Marcin M. -------------- next part -------------- An HTML attachment was scrubbed...
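As a rough, self-contained sketch of the tf-idf variant asked about above -- plain MultinomialNB fed tf-idf weights by swapping TfidfVectorizer in for CountVectorizer, which is not the weight-normalized scheme of Rennie et al. -- something like this (the toy corpus is made up for illustration):

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

# Toy documents and labels standing in for a real corpus.
docs = ["good movie", "bad movie", "great plot", "terrible acting", "great movie"]
labels = [1, 0, 1, 0, 1]

# MultinomialNB only requires non-negative features, so tf-idf weights can be used directly.
model = make_pipeline(TfidfVectorizer(), MultinomialNB())
model.fit(docs, labels)
print(model.predict(["good plot"]))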
URL: From jalopcar at gmail.com Fri Nov 4 09:15:37 2016 From: jalopcar at gmail.com (Jaime Lopez Carvajal) Date: Fri, 4 Nov 2016 08:15:37 -0500 Subject: [scikit-learn] hierarchical clustering In-Reply-To: <581C54AD.8040803@gmail.com> References: <581C54AD.8040803@gmail.com> Message-ID: Hi Roman, I will check that function too. Thanks for help. Have a good day, Jaime On Fri, Nov 4, 2016 at 4:28 AM, Roman Yurchak wrote: > Hi Jaime, > > Alternatively, in scikit learn I think, you could use > hac = AgglomerativeClustering(n_clusters, linkage="ward") > hac.fit(data) > clusters = hac.labels_ > there in an example on how to plot a dendrogram from this in > https://github.com/scikit-learn/scikit-learn/pull/3464 > > AgglomerativeClustering internally calls scikit learn's version of > cut_tree. I would be curious to know whether this is equivalent to > scipy's fcluster. > > Roman > > On 03/11/16 23:12, Jaime Lopez Carvajal wrote: > > Hi Juan, > > > > The fcluster function was that I needed. I can now proceed from here to > > classify images. > > Thank you very much, > > > > Jaime > > > > On Thu, Nov 3, 2016 at 5:00 PM, Juan Nunez-Iglesias > > wrote: > > > > Hi Jaime, > > > > From /Elegant SciPy/: > > > > """ > > The *fcluster* function takes a linkage matrix, as returned by > > linkage, and a threshold, and returns cluster identities. It's > > difficult to know a-priori what the threshold should be, but we can > > obtain the appropriate threshold for a fixed number of clusters by > > checking the distances in the linkage matrix. > > > > from scipy.cluster.hierarchy import fcluster > > n_clusters = 3 > > threshold_distance = (Z[-n_clusters, 2] + Z[-n_clusters+1, 2]) / 2 > > clusters = fcluster(Z, threshold_distance, 'distance') > > > > """ > > > > As an aside, I imagine this question is better placed in the SciPy > > mailing list than scikit-learn (which has its own hierarchical > > clustering API). > > > > Juan. > > > > On Fri, Nov 4, 2016 at 2:16 AM, Jaime Lopez Carvajal > > > wrote: > > > > Hi there, > > > > I am trying to do image classification using hierarchical > > clustering. > > So, I have my data, and apply this steps: > > > > from scipy.cluster.hierarchy import dendrogram, linkage > > > > data1 = np.array(data) > > Z = linkage(data, 'ward') > > dendrogram(Z, truncate_mode='lastp', p=12, > > show_leaf_counts=False, leaf_rotation=90., > > leaf_font_size=12.,show_contracted=True) > > plt.show() > > > > So, I can see the dendrogram with 12 clusters as I want, but I > > dont know how to use this to classify the image. > > Also, I understand that funtion cluster.hierarchy.cut_tree(Z, > > n_clusters), that cut the tree at that number of clusters, but > > again I dont know how to procedd from there. 
I would like to > > have something like: cluster = predict(Z, instance) > > > > Any advice or direction would be really appreciate, > > > > Thanks in advance, Jaime > > > > > > -- > > /*Jaime Lopez Carvajal > > */ > > > > _______________________________________________ > > scikit-learn mailing list > > scikit-learn at python.org > > https://mail.python.org/mailman/listinfo/scikit-learn > > > > > > > > > > _______________________________________________ > > scikit-learn mailing list > > scikit-learn at python.org > > https://mail.python.org/mailman/listinfo/scikit-learn > > > > > > > > > > > > -- > > /*Jaime Lopez Carvajal > > */ > > > > > > _______________________________________________ > > scikit-learn mailing list > > scikit-learn at python.org > > https://mail.python.org/mailman/listinfo/scikit-learn > > > > _______________________________________________ > scikit-learn mailing list > scikit-learn at python.org > https://mail.python.org/mailman/listinfo/scikit-learn > -- *Jaime Lopez Carvajal* -------------- next part -------------- An HTML attachment was scrubbed... URL: From t3kcit at gmail.com Fri Nov 4 10:43:36 2016 From: t3kcit at gmail.com (Andy) Date: Fri, 4 Nov 2016 09:43:36 -0500 Subject: [scikit-learn] Naive Bayes - Multinomial Naive Bayes tf-idf In-Reply-To: References: Message-ID: <68852b61-b1be-7e76-31e9-b5d8caac9b9f@gmail.com> On 11/04/2016 05:45 AM, Marcin Miro?czuk wrote: > Hi, > In our experiments, we use a Multinomial Naive Bayes (MNB). The > traditional MNB implies the TF weight of the words. We read in > documentation http://scikit-learn.org/stable/modules/naive_bayes.html > which describes Multinomial Naive Bayes that "... where the data are > typically represented as word vector counts, although tf-idf vectors > are also known to work well in practice". The "word vector counts" is > a TF and it is well known. We have a problem which the "tf-idf > vectors". In this case, i.e. tf-idf it was implemented the approach > of the D. M. Rennie et all Tackling the Poor Assumptions of Naive > Bayes Text Classification? In the documentation, there are not any > citation of this solution. No, I think that paper implements something slightly different. The documentation says that you can apply the TfidfVectorizer instead of CountVectorizer and it can still work. From brookm291 at gmail.com Fri Nov 4 16:43:59 2016 From: brookm291 at gmail.com (KevNo) Date: Sat, 05 Nov 2016 05:43:59 +0900 Subject: [scikit-learn] Recurrent Decision Tree Message-ID: <581CF30F.9040802@gmail.com> Just wondering if Recurrent Decision Tree has been investigated by Scikit previously. Main interest is in path dependant (time series data) problems, the recurrence is often necessary to model the path dependent state. In other words, wrong prediction will affect the subsequent predictions. Here, a research paper on Recurrent Decision Tree, from Walt Disney Research (!) https://goo.gl/APGpvM Any thought is welcome. Thanks Brookm scikit-learn-request at python.org wrote: > Send scikit-learn mailing list submissions to > scikit-learn at python.org > > To subscribe or unsubscribe via the World Wide Web, visit > https://mail.python.org/mailman/listinfo/scikit-learn > or, via email, send a message with subject or body 'help' to > scikit-learn-request at python.org > > You can reach the person managing the list at > scikit-learn-owner at python.org > > When replying, please edit your Subject line so it is more specific > than "Re: Contents of scikit-learn digest..." > > > Today's Topics: > > 1. 
Re: hierarchical clustering (Gael Varoquaux) > 2. Naive Bayes - Multinomial Naive Bayes tf-idf (Marcin Miro?czuk) > 3. Re: hierarchical clustering (Jaime Lopez Carvajal) > 4. Re: Naive Bayes - Multinomial Naive Bayes tf-idf (Andy) > > > ---------------------------------------------------------------------- > > Message: 1 > Date: Fri, 4 Nov 2016 10:36:49 +0100 > From: Gael Varoquaux > To: Scikit-learn user and developer mailing list > > Subject: Re: [scikit-learn] hierarchical clustering > Message-ID:<20161104093649.GA137008 at phare.normalesup.org> > Content-Type: text/plain; charset=us-ascii > >> AgglomerativeClustering internally calls scikit learn's version of >> cut_tree. I would be curious to know whether this is equivalent to >> scipy's fcluster. > > It differs in that it enable to add connectivity contraints. > > > ------------------------------ > > Message: 2 > Date: Fri, 4 Nov 2016 11:45:39 +0100 > From: Marcin Miro?czuk > To: scikit-learn at python.org > Subject: [scikit-learn] Naive Bayes - Multinomial Naive Bayes tf-idf > Message-ID: > > Content-Type: text/plain; charset="utf-8" > > Hi, > In our experiments, we use a Multinomial Naive Bayes (MNB). The traditional > MNB implies the TF weight of the words. We read in documentation > http://scikit-learn.org/stable/modules/naive_bayes.html which describes > Multinomial Naive Bayes that "... where the data are typically represented > as word vector counts, although tf-idf vectors are also known to work well > in practice". The "word vector counts" is a TF and it is well known. We > have a problem which the "tf-idf vectors". In this case, i.e. tf-idf it > was implemented the approach of the D. M. Rennie et all Tackling the Poor > Assumptions of Naive Bayes Text Classification? In the documentation, there > are not any citation of this solution. > > Best, > -------------- next part -------------- An HTML attachment was scrubbed... URL: From Dale.T.Smith at macys.com Mon Nov 7 08:10:03 2016 From: Dale.T.Smith at macys.com (Dale T Smith) Date: Mon, 7 Nov 2016 13:10:03 +0000 Subject: [scikit-learn] Recurrent Decision Tree In-Reply-To: <581CF30F.9040802@gmail.com> References: <581CF30F.9040802@gmail.com> Message-ID: Searching the mailing list would be the best way to find out this information. It may be in the contrib packages on github ? have you checked? __________________________________________________________________________________________________________________________________________ Dale T. Smith | Macy's Systems and Technology | IFS eCom CSE Data Science 5985 State Bridge Road, Johns Creek, GA 30097 | dale.t.smith at macys.com From: scikit-learn [mailto:scikit-learn-bounces+dale.t.smith=macys.com at python.org] On Behalf Of KevNo Sent: Friday, November 4, 2016 4:44 PM To: scikit-learn at python.org Subject: [scikit-learn] Recurrent Decision Tree ? EXT MSG: Just wondering if Recurrent Decision Tree has been investigated by Scikit previously. Main interest is in path dependant (time series data) problems, the recurrence is often necessary to model the path dependent state. In other words, wrong prediction will affect the subsequent predictions. Here, a research paper on Recurrent Decision Tree, from Walt Disney Research (!) https://goo.gl/APGpvM Any thought is welcome. 
Thanks Brookm scikit-learn-request at python.org wrote: Send scikit-learn mailing list submissions to scikit-learn at python.org To subscribe or unsubscribe via the World Wide Web, visit https://mail.python.org/mailman/listinfo/scikit-learn or, via email, send a message with subject or body 'help' to scikit-learn-request at python.org You can reach the person managing the list at scikit-learn-owner at python.org When replying, please edit your Subject line so it is more specific than "Re: Contents of scikit-learn digest..." Today's Topics: 1. Re: hierarchical clustering (Gael Varoquaux) 2. Naive Bayes - Multinomial Naive Bayes tf-idf (Marcin Miro?czuk) 3. Re: hierarchical clustering (Jaime Lopez Carvajal) 4. Re: Naive Bayes - Multinomial Naive Bayes tf-idf (Andy) ---------------------------------------------------------------------- Message: 1 Date: Fri, 4 Nov 2016 10:36:49 +0100 From: Gael Varoquaux To: Scikit-learn user and developer mailing list Subject: Re: [scikit-learn] hierarchical clustering Message-ID: <20161104093649.GA137008 at phare.normalesup.org> Content-Type: text/plain; charset=us-ascii AgglomerativeClustering internally calls scikit learn's version of cut_tree. I would be curious to know whether this is equivalent to scipy's fcluster. It differs in that it enable to add connectivity contraints. ------------------------------ Message: 2 Date: Fri, 4 Nov 2016 11:45:39 +0100 From: Marcin Miro?czuk To: scikit-learn at python.org Subject: [scikit-learn] Naive Bayes - Multinomial Naive Bayes tf-idf Message-ID: Content-Type: text/plain; charset="utf-8" Hi, In our experiments, we use a Multinomial Naive Bayes (MNB). The traditional MNB implies the TF weight of the words. We read in documentation http://scikit-learn.org/stable/modules/naive_bayes.html which describes Multinomial Naive Bayes that "... where the data are typically represented as word vector counts, although tf-idf vectors are also known to work well in practice". The "word vector counts" is a TF and it is well known. We have a problem which the "tf-idf vectors". In this case, i.e. tf-idf it was implemented the approach of the D. M. Rennie et all Tackling the Poor Assumptions of Naive Bayes Text Classification? In the documentation, there are not any citation of this solution. Best, * This is an EXTERNAL EMAIL. Stop and think before clicking a link or opening attachments. -------------- next part -------------- An HTML attachment was scrubbed... URL: From ragvrv at gmail.com Mon Nov 7 09:51:11 2016 From: ragvrv at gmail.com (Raghav R V) Date: Mon, 7 Nov 2016 15:51:11 +0100 Subject: [scikit-learn] Recurrent Decision Tree In-Reply-To: References: <581CF30F.9040802@gmail.com> Message-ID: Hi, The reference paper seems pretty new with very few citations. Check our FAQ on inclusion criterion - http://scikit-learn.org/stable/faq.html#what-are-the-inclusion-criteria-for-new-algorithms On Mon, Nov 7, 2016 at 2:10 PM, Dale T Smith wrote: > Searching the mailing list would be the best way to find out this > information. > > > > It may be in the contrib packages on github ? have you checked? > > > > > > ____________________________________________________________ > ____________________________________________________________ > __________________ > *Dale T. 
Smith* *|* Macy's Systems and Technology *|* IFS eCom CSE Data > Science > 5985 State Bridge Road, Johns Creek, GA 30097 *|* dale.t.smith at macys.com > > > > *From:* scikit-learn [mailto:scikit-learn-bounces+dale.t.smith= > macys.com at python.org] *On Behalf Of *KevNo > *Sent:* Friday, November 4, 2016 4:44 PM > *To:* scikit-learn at python.org > *Subject:* [scikit-learn] Recurrent Decision Tree > > > > ? EXT MSG: > > Just wondering if Recurrent Decision Tree has been investigated > by Scikit previously. > > Main interest is in path dependant (time series data) problems, > the recurrence is often necessary to model the path dependent state. > In other words, wrong prediction will affect the subsequent predictions. > > Here, a research paper on Recurrent Decision Tree, > from Walt Disney Research (!) > > https://goo.gl/APGpvM > > > Any thought is welcome. > Thanks > Brookm > > > > > > scikit-learn-request at python.org wrote: > > Send scikit-learn mailing list submissions to > > scikit-learn at python.org > > > > To subscribe or unsubscribe via the World Wide Web, visit > > https://mail.python.org/mailman/listinfo/scikit-learn > > or, via email, send a message with subject or body 'help' to > > scikit-learn-request at python.org > > > > You can reach the person managing the list at > > scikit-learn-owner at python.org > > > > When replying, please edit your Subject line so it is more specific > > than "Re: Contents of scikit-learn digest..." > > > > > > Today's Topics: > > > > 1. Re: hierarchical clustering (Gael Varoquaux) > > 2. Naive Bayes - Multinomial Naive Bayes tf-idf (Marcin Miro?czuk) > > 3. Re: hierarchical clustering (Jaime Lopez Carvajal) > > 4. Re: Naive Bayes - Multinomial Naive Bayes tf-idf (Andy) > > > > > > ---------------------------------------------------------------------- > > > > Message: 1 > > Date: Fri, 4 Nov 2016 10:36:49 +0100 > > From: Gael Varoquaux > > To: Scikit-learn user and developer mailing list > > > > Subject: Re: [scikit-learn] hierarchical clustering > > Message-ID: <20161104093649.GA137008 at phare.normalesup.org> <20161104093649.GA137008 at phare.normalesup.org> > > Content-Type: text/plain; charset=us-ascii > > > > AgglomerativeClustering internally calls scikit learn's version of > > cut_tree. I would be curious to know whether this is equivalent to > > scipy's fcluster. > > > > It differs in that it enable to add connectivity contraints. > > > > > > ------------------------------ > > > > Message: 2 > > Date: Fri, 4 Nov 2016 11:45:39 +0100 > > From: Marcin Miro?czuk > > To: scikit-learn at python.org > > Subject: [scikit-learn] Naive Bayes - Multinomial Naive Bayes tf-idf > > Message-ID: > > > > Content-Type: text/plain; charset="utf-8" > > > > Hi, > > In our experiments, we use a Multinomial Naive Bayes (MNB). The traditional > > MNB implies the TF weight of the words. We read in documentation > > http://scikit-learn.org/stable/modules/naive_bayes.html which describes > > Multinomial Naive Bayes that "... where the data are typically represented > > as word vector counts, although tf-idf vectors are also known to work well > > in practice". The "word vector counts" is a TF and it is well known. We > > have a problem which the "tf-idf vectors". In this case, i.e. tf-idf it > > was implemented the approach of the D. M. Rennie et all Tackling the Poor > > Assumptions of Naive Bayes Text Classification? In the documentation, there > > are not any citation of this solution. > > > > Best, > > > > * This is an EXTERNAL EMAIL. 
Stop and think before clicking a link or > opening attachments. > > _______________________________________________ > scikit-learn mailing list > scikit-learn at python.org > https://mail.python.org/mailman/listinfo/scikit-learn > > -- Raghav RV https://github.com/raghavrv -------------- next part -------------- An HTML attachment was scrubbed... URL: From brookm291 at gmail.com Mon Nov 7 12:17:56 2016 From: brookm291 at gmail.com (KevNo) Date: Tue, 08 Nov 2016 02:17:56 +0900 Subject: [scikit-learn] Recurrent Decision Tree In-Reply-To: References: Message-ID: <5820B744.9080800@gmail.com> This is nothing to do with Scikit guidelines criteria.... This is about scientific/mathematic view Recurrent Decision Tree which is a specific tree by nature (you cannot apply standard algos on this). Suppose very little number of people has experience with recurrence in Decision Tree... scikit-learn-request at python.org wrote: > Send scikit-learn mailing list submissions to > scikit-learn at python.org > > To subscribe or unsubscribe via the World Wide Web, visit > https://mail.python.org/mailman/listinfo/scikit-learn > or, via email, send a message with subject or body 'help' to > scikit-learn-request at python.org > > You can reach the person managing the list at > scikit-learn-owner at python.org > > When replying, please edit your Subject line so it is more specific > than "Re: Contents of scikit-learn digest..." > > > Today's Topics: > > 1. Re: Recurrent Decision Tree (Raghav R V) > > > ---------------------------------------------------------------------- > > Message: 1 > Date: Mon, 7 Nov 2016 15:51:11 +0100 > From: Raghav R V > To: Scikit-learn user and developer mailing list > > Subject: Re: [scikit-learn] Recurrent Decision Tree > Message-ID: > > Content-Type: text/plain; charset="utf-8" > > Hi, > > The reference paper seems pretty new with very few citations. Check our FAQ > on inclusion criterion - > http://scikit-learn.org/stable/faq.html#what-are-the-inclusion-criteria-for-new-algorithms > > > On Mon, Nov 7, 2016 at 2:10 PM, Dale T Smith wrote: > >> Searching the mailing list would be the best way to find out this >> information. >> >> >> >> It may be in the contrib packages on github ? have you checked? >> >> >> >> >> >> ____________________________________________________________ >> ____________________________________________________________ >> __________________ >> *Dale T. Smith* *|* Macy's Systems and Technology *|* IFS eCom CSE Data >> Science >> 5985 State Bridge Road, Johns Creek, GA 30097 *|* dale.t.smith at macys.com >> >> >> >> *From:* scikit-learn [mailto:scikit-learn-bounces+dale.t.smith= >> macys.com at python.org] *On Behalf Of *KevNo >> *Sent:* Friday, November 4, 2016 4:44 PM >> *To:* scikit-learn at python.org >> *Subject:* [scikit-learn] Recurrent Decision Tree >> >> >> >> ? EXT MSG: >> >> Just wondering if Recurrent Decision Tree has been investigated >> by Scikit previously. >> >> Main interest is in path dependant (time series data) problems, >> the recurrence is often necessary to model the path dependent state. >> In other words, wrong prediction will affect the subsequent predictions. >> >> Here, a research paper on Recurrent Decision Tree, >> from Walt Disney Research (!) >> >> https://goo.gl/APGpvM >> >> >> Any thought is welcome. 
>> Thanks >> Brookm >> >> >> >> >> >> scikit-learn-request at python.org wrote: >> >> Send scikit-learn mailing list submissions to >> >> scikit-learn at python.org >> >> >> >> To subscribe or unsubscribe via the World Wide Web, visit >> >> https://mail.python.org/mailman/listinfo/scikit-learn >> >> or, via email, send a message with subject or body 'help' to >> >> scikit-learn-request at python.org >> >> >> >> You can reach the person managing the list at >> >> scikit-learn-owner at python.org >> >> >> >> When replying, please edit your Subject line so it is more specific >> >> than "Re: Contents of scikit-learn digest..." >> >> >> >> >> >> Today's Topics: >> >> >> >> 1. Re: hierarchical clustering (Gael Varoquaux) >> >> 2. Naive Bayes - Multinomial Naive Bayes tf-idf (Marcin Miro?czuk) >> >> 3. Re: hierarchical clustering (Jaime Lopez Carvajal) >> >> 4. Re: Naive Bayes - Multinomial Naive Bayes tf-idf (Andy) >> >> >> >> >> >> ---------------------------------------------------------------------- >> >> >> >> Message: 1 >> >> Date: Fri, 4 Nov 2016 10:36:49 +0100 >> >> From: Gael Varoquaux >> >> To: Scikit-learn user and developer mailing list >> >> >> >> Subject: Re: [scikit-learn] hierarchical clustering >> >> Message-ID:<20161104093649.GA137008 at phare.normalesup.org> <20161104093649.GA137008 at phare.normalesup.org> >> >> Content-Type: text/plain; charset=us-ascii >> >> >> >> AgglomerativeClustering internally calls scikit learn's version of >> >> cut_tree. I would be curious to know whether this is equivalent to >> >> scipy's fcluster. >> >> >> >> It differs in that it enable to add connectivity contraints. >> >> >> >> >> >> ------------------------------ >> >> >> >> Message: 2 >> >> Date: Fri, 4 Nov 2016 11:45:39 +0100 >> >> From: Marcin Miro?czuk >> >> To: scikit-learn at python.org >> >> Subject: [scikit-learn] Naive Bayes - Multinomial Naive Bayes tf-idf >> >> Message-ID: >> >> >> >> Content-Type: text/plain; charset="utf-8" >> >> >> >> Hi, >> >> In our experiments, we use a Multinomial Naive Bayes (MNB). The traditional >> >> MNB implies the TF weight of the words. We read in documentation >> >> http://scikit-learn.org/stable/modules/naive_bayes.html which describes >> >> Multinomial Naive Bayes that "... where the data are typically represented >> >> as word vector counts, although tf-idf vectors are also known to work well >> >> in practice". The "word vector counts" is a TF and it is well known. We >> >> have a problem which the "tf-idf vectors". In this case, i.e. tf-idf it >> >> was implemented the approach of the D. M. Rennie et all Tackling the Poor >> >> Assumptions of Naive Bayes Text Classification? In the documentation, there >> >> are not any citation of this solution. >> >> >> >> Best, >> >> >> >> * This is an EXTERNAL EMAIL. Stop and think before clicking a link or >> opening attachments. >> >> _______________________________________________ >> scikit-learn mailing list >> scikit-learn at python.org >> https://mail.python.org/mailman/listinfo/scikit-learn >> >> > > -------------- next part -------------- An HTML attachment was scrubbed... URL: From jmschreiber91 at gmail.com Mon Nov 7 13:08:51 2016 From: jmschreiber91 at gmail.com (Jacob Schreiber) Date: Mon, 7 Nov 2016 10:08:51 -0800 Subject: [scikit-learn] Recurrent Decision Tree In-Reply-To: <5820B744.9080800@gmail.com> References: <5820B744.9080800@gmail.com> Message-ID: It hasn't been investigated by the sklearn team to my knowledge. 
As Dale said, there may be an independent implementation out there but not officially related to sklearn. On Mon, Nov 7, 2016 at 9:17 AM, KevNo wrote: > This is nothing to do with Scikit guidelines criteria.... > > This is about scientific/mathematic view Recurrent Decision Tree which is > a specific tree by nature > (you cannot apply standard algos on this). > > Suppose very little number of people has experience with recurrence in > Decision Tree... > > > > > > > > > scikit-learn-request at python.org wrote: > > Send scikit-learn mailing list submissions to > scikit-learn at python.org > > To subscribe or unsubscribe via the World Wide Web, visit > https://mail.python.org/mailman/listinfo/scikit-learn > or, via email, send a message with subject or body 'help' to > scikit-learn-request at python.org > > You can reach the person managing the list at > scikit-learn-owner at python.org > > When replying, please edit your Subject line so it is more specific > than "Re: Contents of scikit-learn digest..." > > > Today's Topics: > > 1. Re: Recurrent Decision Tree (Raghav R V) > > > ---------------------------------------------------------------------- > > Message: 1 > Date: Mon, 7 Nov 2016 15:51:11 +0100 > From: Raghav R V > To: Scikit-learn user and developer mailing list > > Subject: Re: [scikit-learn] Recurrent Decision Tree > Message-ID: > > Content-Type: text/plain; charset="utf-8" > > Hi, > > The reference paper seems pretty new with very few citations. Check our FAQ > on inclusion criterion -http://scikit-learn.org/stable/faq.html#what-are-the-inclusion-criteria-for-new-algorithms > > > On Mon, Nov 7, 2016 at 2:10 PM, Dale T Smith wrote: > > > Searching the mailing list would be the best way to find out this > information. > > > > It may be in the contrib packages on github ? have you checked? > > > > > > ____________________________________________________________ > ____________________________________________________________ > __________________ > *Dale T. Smith* *|* Macy's Systems and Technology *|* IFS eCom CSE Data > Science > 5985 State Bridge Road, Johns Creek, GA 30097 *|* dale.t.smith at macys.com > > > > *From:* scikit-learn [mailto:scikit-learn-bounces+dale.t.smith= macys.com at python.org] *On Behalf Of *KevNo > *Sent:* Friday, November 4, 2016 4:44 PM > *To:* scikit-learn at python.org > *Subject:* [scikit-learn] Recurrent Decision Tree > > > > ? EXT MSG: > > Just wondering if Recurrent Decision Tree has been investigated > by Scikit previously. > > Main interest is in path dependant (time series data) problems, > the recurrence is often necessary to model the path dependent state. > In other words, wrong prediction will affect the subsequent predictions. > > Here, a research paper on Recurrent Decision Tree, > from Walt Disney Research (!) > > https://goo.gl/APGpvM > > > Any thought is welcome. > Thanks > Brookm > > > > > scikit-learn-request at python.org wrote: > > Send scikit-learn mailing list submissions to > > scikit-learn at python.org > > > > To subscribe or unsubscribe via the World Wide Web, visit > > https://mail.python.org/mailman/listinfo/scikit-learn > > or, via email, send a message with subject or body 'help' to > > scikit-learn-request at python.org > > > > You can reach the person managing the list at > > scikit-learn-owner at python.org > > > > When replying, please edit your Subject line so it is more specific > > than "Re: Contents of scikit-learn digest..." > > > > > > Today's Topics: > > > > 1. Re: hierarchical clustering (Gael Varoquaux) > > 2. 
Naive Bayes - Multinomial Naive Bayes tf-idf (Marcin Miro?czuk) > > 3. Re: hierarchical clustering (Jaime Lopez Carvajal) > > 4. Re: Naive Bayes - Multinomial Naive Bayes tf-idf (Andy) > > > > > > ------------------------------------------------------------ > ---------- > > > > Message: 1 > > Date: Fri, 4 Nov 2016 10:36:49 +0100 > > From: Gael Varoquaux > > To: Scikit-learn user and developer mailing list > > > > Subject: Re: [scikit-learn] hierarchical clustering > > Message-ID: <20161104093649.GA137008 at phare.normalesup.org> <20161104093649.GA137008 at phare.normalesup.org> <20161104093649.GA137008 at phare.normalesup.org> <20161104093649.GA137008 at phare.normalesup.org> > > Content-Type: text/plain; charset=us-ascii > > > > AgglomerativeClustering internally calls scikit learn's version of > > cut_tree. I would be curious to know whether this is equivalent to > > scipy's fcluster. > > > > It differs in that it enable to add connectivity contraints. > > > > > > ------------------------------ > > > > Message: 2 > > Date: Fri, 4 Nov 2016 11:45:39 +0100 > > From: Marcin Miro?czuk > > To: scikit-learn at python.org > > Subject: [scikit-learn] Naive Bayes - Multinomial Naive Bayes tf-idf > > Message-ID: > > > > Content-Type: text/plain; charset="utf-8" > > > > Hi, > > In our experiments, we use a Multinomial Naive Bayes (MNB). The traditional > > MNB implies the TF weight of the words. We read in documentation > http://scikit-learn.org/stable/modules/naive_bayes.html which describes > > Multinomial Naive Bayes that "... where the data are typically represented > > as word vector counts, although tf-idf vectors are also known to work well > > in practice". The "word vector counts" is a TF and it is well known. We > > have a problem which the "tf-idf vectors". In this case, i.e. tf-idf it > > was implemented the approach of the D. M. Rennie et all Tackling the Poor > > Assumptions of Naive Bayes Text Classification? In the documentation, there > > are not any citation of this solution. > > > > Best, > > > > * This is an EXTERNAL EMAIL. Stop and think before clicking a link or > opening attachments. > > _______________________________________________ > scikit-learn mailing listscikit-learn at python.orghttps://mail.python.org/mailman/listinfo/scikit-learn > > > _______________________________________________ > scikit-learn mailing list > scikit-learn at python.org > https://mail.python.org/mailman/listinfo/scikit-learn > > -------------- next part -------------- An HTML attachment was scrubbed... URL: From alessio.quaglino at usi.ch Tue Nov 8 10:10:07 2016 From: alessio.quaglino at usi.ch (Quaglino Alessio) Date: Tue, 8 Nov 2016 15:10:07 +0000 Subject: [scikit-learn] GPR intervals and MCMC Message-ID: <03BFB7AA-8ED7-487E-A257-CB028F2BF99B@usi.ch> Hello, I am using scikit-learn 0.18 for doing GP regressions. I really like it and all works great, but I am having doubts concerning the confidence intervals computed by predict(X,return_std=True): - Are they true confidence intervals (i.e. of the mean / latent function) or they are in fact prediction intervals? I tried computing the prediction intervals using sample_y(X) and I get the same answer as that returned by predict(X,return_std=True). - My understanding is therefore that scikit-learn is not fully Bayesian, i.e. it does not compute probability distributions for the parameters, but rather the values that maximize the likelihood? - If I want the confidence interval, is my best option to use an external MCMC optimizer such as PyMC? 
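(For concreteness, the comparison described in the first point amounts to something like the sketch below, assuming a fitted GaussianProcessRegressor named gpr and query points X:)

import numpy as np

# Mean and standard deviation reported by the GP at the query points.
y_mean, y_std = gpr.predict(X, return_std=True)

# Empirical spread of draws from the same GP posterior.
samples = gpr.sample_y(X, n_samples=1000)
emp_std = np.std(samples, axis=1)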
Thank you in advance! Regards, ------------------------------------------------- Dr. Alessio Quaglino Postdoctoral Researcher Institute of Computational Science Universit? della Svizzera Italiana -------------- next part -------------- An HTML attachment was scrubbed... URL: From vaggi.federico at gmail.com Tue Nov 8 10:19:35 2016 From: vaggi.federico at gmail.com (federico vaggi) Date: Tue, 08 Nov 2016 15:19:35 +0000 Subject: [scikit-learn] GPR intervals and MCMC In-Reply-To: <03BFB7AA-8ED7-487E-A257-CB028F2BF99B@usi.ch> References: <03BFB7AA-8ED7-487E-A257-CB028F2BF99B@usi.ch> Message-ID: Hi, if you want to have the full posterior distribution over the values of the hyper parameters, there is a good example on how to do that with George + emcee, another GP package for Python. http://dan.iel.fm/george/current/user/hyper/ On Tue, 8 Nov 2016 at 16:10 Quaglino Alessio wrote: > Hello, > > I am using scikit-learn 0.18 for doing GP regressions. I really like it > and all works great, but I am having doubts concerning the confidence > intervals computed by predict(X,return_std=True): > > - Are they true confidence intervals (i.e. of the mean / latent function) > or they are in fact prediction intervals? I tried computing the prediction > intervals using sample_y(X) and I get the same answer as that returned by > predict(X,return_std=True). > > - My understanding is therefore that scikit-learn is not fully Bayesian, > i.e. it does not compute probability distributions for the parameters, but > rather the values that maximize the likelihood? > > - If I want the confidence interval, is my best option to use an external > MCMC optimizer such as PyMC? > > Thank you in advance! > > Regards, > ------------------------------------------------- > Dr. Alessio Quaglino > Postdoctoral Researcher > Institute of Computational Science > Universit? della Svizzera Italiana > > > > > > > _______________________________________________ > scikit-learn mailing list > scikit-learn at python.org > https://mail.python.org/mailman/listinfo/scikit-learn > -------------- next part -------------- An HTML attachment was scrubbed... URL: From michael.eickenberg at gmail.com Tue Nov 8 10:24:01 2016 From: michael.eickenberg at gmail.com (Michael Eickenberg) Date: Tue, 8 Nov 2016 16:24:01 +0100 Subject: [scikit-learn] GPR intervals and MCMC In-Reply-To: <03BFB7AA-8ED7-487E-A257-CB028F2BF99B@usi.ch> References: <03BFB7AA-8ED7-487E-A257-CB028F2BF99B@usi.ch> Message-ID: Dear Alessio, if it helps, the implementation quite strictly follows what is described in GPML: http://www.gaussianprocess.org/gpml/chapters/ https://github.com/scikit-learn/scikit-learn/blob/412996f09b6756752dfd3736c306d46fca8f1aa1/sklearn/gaussian_process/gpr.py#L23 Hyperparameter optimization is done by gradient descent. Michael On Tue, Nov 8, 2016 at 4:10 PM, Quaglino Alessio wrote: > Hello, > > I am using scikit-learn 0.18 for doing GP regressions. I really like it > and all works great, but I am having doubts concerning the confidence > intervals computed by predict(X,return_std=True): > > - Are they true confidence intervals (i.e. of the mean / latent function) > or they are in fact prediction intervals? I tried computing the prediction > intervals using sample_y(X) and I get the same answer as that returned by > predict(X,return_std=True). > > - My understanding is therefore that scikit-learn is not fully Bayesian, > i.e. it does not compute probability distributions for the parameters, but > rather the values that maximize the likelihood? 
> > - If I want the confidence interval, is my best option to use an external > MCMC optimizer such as PyMC? > > Thank you in advance! > > Regards, > ------------------------------------------------- > Dr. Alessio Quaglino > Postdoctoral Researcher > Institute of Computational Science > Universit? della Svizzera Italiana > > > > > > > > _______________________________________________ > scikit-learn mailing list > scikit-learn at python.org > https://mail.python.org/mailman/listinfo/scikit-learn > > -------------- next part -------------- An HTML attachment was scrubbed... URL: From a.suchaneck at gmail.com Fri Nov 11 05:23:12 2016 From: a.suchaneck at gmail.com (Anton Suchaneck) Date: Fri, 11 Nov 2016 11:23:12 +0100 Subject: [scikit-learn] Automatic ThresholdClassifier based on cost-function - Classifier Interface? Message-ID: Hi! I tried writing a ThresholdClassifier, that wraps any classifier with predict_proba() and based on a cost function adjusts the threshold for predict(). This helps for imbalanced data. My current cost function assigns cost +cost for a true positive and -1 for a false positive. It seems to run, but I'm not sure if I got the API for a classifier right. Can you tell me whether this is how the functions should be implemented to play together with other parts of sklearn? Especially parameter settings for base.clone both in klass.__init__ and .set_params() seemed weird. Here is the code. The class ThresholdClassifier wraps a clf. RandomForest in this case. Anton from sklearn.base import BaseEstimator, ClassifierMixin from functools import partial def find_threshold_cost_factor(clf, X, y, cost_factor): y_pred = clf.predict_proba(X) top_score = 0 top_threshold = None cur_score=0 for y_pred_el, y_el in sorted(zip(y_pred[:, 1], y), reverse=True): # FIXME: assumes 2 classes if y_el == 0: cur_score -= 1 if y_el == 1: cur_score += cost_factor if cur_score > top_score: top_score = cur_score top_threshold = y_pred_el return top_threshold, top_score class ThresholdClassifier(BaseEstimator, ClassifierMixin): def __init__(self, clf, find_threshold, **params): self.clf = clf self.find_threshold = find_threshold self.threshold = None self.set_params(**params) def score(self, X, y, sample_weight=None): _threshold, score = self.find_threshold(self.clf, X, y) return score def fit(self, X, y): self.clf.fit(X, y) self.threshold, _score=self.find_threshold(self.clf, X, y) self.classes_ = self.clf.classes_ def predict(self, X): y_score=self.clf.predict_proba(X) return np.array(y_score[:,1]>=self.threshold) # FIXME assumes 2 classes def predict_proba(self, X): return self.clf.predict_proba(X) def set_params(self, **params): for param_name in ["clf", "find_threshold", "threshold"]: if param_name in params: setattr(self, param_name, params[param_name]) del params[param_name] self.clf.set_params(**params) return self def get_params(self, deep=True): params={"clf":self.clf, "find_threshold": self.find_threshold, "threshold":self.threshold} params.update(self.clf.get_params(deep)) return params if __name__ == '__main__': import numpy as np import random from sklearn.grid_search import RandomizedSearchCV from sklearn.ensemble import RandomForestClassifier from sklearn.datasets import make_classification from sklearn.cross_validation import train_test_split from sklearn.metrics import make_scorer, classification_report, confusion_matrix np.random.seed(111) random.seed(111) X, y = make_classification(1000, n_features=20, n_informative=4, n_redundant=0, n_repeated=0, n_clusters_per_class=4, # 
class_sep=0.5, weights=[0.90] ) X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, stratify=y) for cost in [10]: find_threshold=partial(find_threshold_cost_factor, cost_factor=10) def scorer(clf, X, y): return find_threshold(clf, X, y)[1] clfs = [RandomizedSearchCV( ThresholdClassifier(RandomForestClassifier(), find_threshold), {"n_estimators": [100, 200], "criterion": ["entropy"], "min_samples_leaf": [1, 5], "class_weight": ["balanced", None], }, cv=3, scoring=scorer, # Get rid of this, by letting classifier tell it's cost-bsed score? n_iter=8, n_jobs=4), ] for clf in clfs: clf.fit(X_train, y_train) clf_best = clf.best_estimator_ print(clf_best, cost, clf_best.score(X_test, y_test)) print(confusion_matrix(y_test, clf_best.predict(X_test))) #print(find_threshold(clf_best, X_train, y_train)) #print(clf_best.threshold, sorted(zip(clf_best.predict_proba(X_train)[:,1], y_train), reverse=True)[:20]) -------------- next part -------------- An HTML attachment was scrubbed... URL: From t3kcit at gmail.com Fri Nov 11 13:09:32 2016 From: t3kcit at gmail.com (Andreas Mueller) Date: Fri, 11 Nov 2016 13:09:32 -0500 Subject: [scikit-learn] Automatic ThresholdClassifier based on cost-function - Classifier Interface? In-Reply-To: References: Message-ID: <89f50f08-5fa4-4ad6-8d15-3e5e70d15087@gmail.com> Hi. You don't have to implement set_params and get_params if you inherit from BaseEstimator. I find it weird that you pass find_threshold_cost_function as a constructor parameter but otherwise the API looks ok. You are not allowed to use **kwargs in __init___, though. Andy On 11/11/2016 05:23 AM, Anton Suchaneck wrote: > Hi! > > I tried writing a ThresholdClassifier, that wraps any classifier with > predict_proba() and based on a cost function adjusts the threshold for > predict(). This helps for imbalanced data. > My current cost function assigns cost +cost for a true positive and -1 > for a false positive. > It seems to run, but I'm not sure if I got the API for a classifier right. > > Can you tell me whether this is how the functions should be > implemented to play together with other parts of sklearn? > > Especially parameter settings for base.clone both in klass.__init__ > and .set_params() seemed weird. > > Here is the code. The class ThresholdClassifier wraps a clf. > RandomForest in this case. 
> > Anton > > from sklearn.base import BaseEstimator, ClassifierMixin > from functools import partial > > def find_threshold_cost_factor(clf, X, y, cost_factor): > y_pred = clf.predict_proba(X) > > top_score = 0 > top_threshold = None > cur_score=0 > for y_pred_el, y_el in sorted(zip(y_pred[:, 1], y), reverse=True): > # FIXME: assumes 2 classes > if y_el == 0: > cur_score -= 1 > if y_el == 1: > cur_score += cost_factor > if cur_score > top_score: > top_score = cur_score > top_threshold = y_pred_el > return top_threshold, top_score > > > class ThresholdClassifier(BaseEstimator, ClassifierMixin): > def __init__(self, clf, find_threshold, **params): > self.clf = clf > self.find_threshold = find_threshold > self.threshold = None > self.set_params(**params) > > def score(self, X, y, sample_weight=None): > _threshold, score = self.find_threshold(self.clf, X, y) > return score > > def fit(self, X, y): > self.clf.fit(X, y) > self.threshold, _score=self.find_threshold(self.clf, X, y) > self.classes_ = self.clf.classes_ > > def predict(self, X): > y_score=self.clf.predict_proba(X) > return np.array(y_score[:,1]>=self.threshold) # FIXME assumes > 2 classes > > def predict_proba(self, X): > return self.clf.predict_proba(X) > > def set_params(self, **params): > for param_name in ["clf", "find_threshold", "threshold"]: > if param_name in params: > setattr(self, param_name, params[param_name]) > del params[param_name] > self.clf.set_params(**params) > return self > > def get_params(self, deep=True): > params={"clf":self.clf, "find_threshold": self.find_threshold, > "threshold":self.threshold} > params.update(self.clf.get_params(deep)) > return params > > > if __name__ == '__main__': > import numpy as np > import random > from sklearn.grid_search import RandomizedSearchCV > from sklearn.ensemble import RandomForestClassifier > from sklearn.datasets import make_classification > from sklearn.cross_validation import train_test_split > from sklearn.metrics import make_scorer, classification_report, > confusion_matrix > > np.random.seed(111) > random.seed(111) > > X, y = make_classification(1000, > n_features=20, > n_informative=4, > n_redundant=0, > n_repeated=0, > n_clusters_per_class=4, > # class_sep=0.5, > weights=[0.90] > ) > > X_train, X_test, y_train, y_test = train_test_split(X, y, > test_size=0.3, stratify=y) > > for cost in [10]: > find_threshold=partial(find_threshold_cost_factor, cost_factor=10) > > def scorer(clf, X, y): > return find_threshold(clf, X, y)[1] > > clfs = [RandomizedSearchCV( > ThresholdClassifier(RandomForestClassifier(), find_threshold), > {"n_estimators": [100, 200], > "criterion": ["entropy"], > "min_samples_leaf": [1, 5], > "class_weight": ["balanced", None], > }, > cv=3, > scoring=scorer, # Get rid of this, by letting > classifier tell it's cost-bsed score? > n_iter=8, > n_jobs=4), > ] > > for clf in clfs: > clf.fit(X_train, y_train) > clf_best = clf.best_estimator_ > print(clf_best, cost, clf_best.score(X_test, y_test)) > print(confusion_matrix(y_test, clf_best.predict(X_test))) > #print(find_threshold(clf_best, X_train, y_train)) > #print(clf_best.threshold, > sorted(zip(clf_best.predict_proba(X_train)[:,1], y_train), > reverse=True)[:20]) > > > > _______________________________________________ > scikit-learn mailing list > scikit-learn at python.org > https://mail.python.org/mailman/listinfo/scikit-learn -------------- next part -------------- An HTML attachment was scrubbed... 
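A minimal sketch of the wrapper rewritten along the lines Andy suggests (assuming a binary problem with the positive class in column 1 of predict_proba): every constructor argument is stored unchanged, there is no **kwargs, and get_params/set_params are inherited from BaseEstimator, so the inner estimator's parameters are addressed in a search with the usual "clf__<name>" syntax, e.g. {"clf__n_estimators": [100, 200]}.

from sklearn.base import BaseEstimator, ClassifierMixin, clone


class ThresholdClassifier(BaseEstimator, ClassifierMixin):
    def __init__(self, clf, find_threshold):
        # store constructor arguments as-is; no validation, no **kwargs
        self.clf = clf
        self.find_threshold = find_threshold

    def fit(self, X, y):
        # fit a clone so the constructor argument itself is left untouched
        self.clf_ = clone(self.clf).fit(X, y)
        self.threshold_, _ = self.find_threshold(self.clf_, X, y)
        self.classes_ = self.clf_.classes_
        return self

    def predict(self, X):
        proba = self.clf_.predict_proba(X)[:, 1]
        return self.classes_[(proba >= self.threshold_).astype(int)]

    def predict_proba(self, X):
        return self.clf_.predict_proba(X)

With this layout, RandomizedSearchCV can clone and reconfigure the wrapper without any custom get_params/set_params.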
URL: From a.suchaneck at gmail.com Sat Nov 12 04:17:29 2016 From: a.suchaneck at gmail.com (Anton) Date: Sat, 12 Nov 2016 10:17:29 +0100 Subject: [scikit-learn] Automatic ThresholdClassifier based on cost-function - Classifier Interface? In-Reply-To: <89f50f08-5fa4-4ad6-8d15-3e5e70d15087@gmail.com> References: <89f50f08-5fa4-4ad6-8d15-3e5e70d15087@gmail.com> Message-ID: <1478942249.8133.0@smtp.gmail.com> Hi Andy! Thank you for your feedback! You say I shouldn't use __init__(**params) and it makes totally sense and would make my code much simpler. However, sklearn 0.18, base.clone, line 70: new_object = klass(**new_object_params) (called from RandomizedSearchCV) screws you over since it passes the parameters to __init__(). I expected the usage of set_params() here, but I'm getting my gridsearch parameters passed to __init__(). Is this intended? Note that I'm just wrapping a clf, so that I have to pass through the parameters to self.clf, right? No-one can know that I'm storing it in self.clf. Therefore set_params needs to be implemented and cannot be inherited?! My meta-classifier will find the optimal threshold upon .fit(). This procedure depends on how to interpret what is optimal and this is what find_threshold_cost_function is for. One last question: Is self.classes_ a necessary part of the API (I realize I forget the underscore) and am I missing any other API detail I need to add for a binary classifier? Regards, Anton Am Fr, 11. Nov, 2016 um 7:09 schrieb Andreas Mueller : > Hi. > You don't have to implement set_params and get_params if you inherit > from BaseEstimator. > I find it weird that you pass find_threshold_cost_function as a > constructor parameter but otherwise the API looks ok. > You are not allowed to use **kwargs in __init___, though. > > Andy > > On 11/11/2016 05:23 AM, Anton Suchaneck wrote: >> Hi! >> >> I tried writing a ThresholdClassifier, that wraps any classifier >> with predict_proba() and based on a cost function adjusts the >> threshold for predict(). This helps for imbalanced data. >> My current cost function assigns cost +cost for a true positive and >> -1 for a false positive. >> It seems to run, but I'm not sure if I got the API for a classifier >> right. >> >> Can you tell me whether this is how the functions should be >> implemented to play together with other parts of sklearn? >> >> Especially parameter settings for base.clone both in klass.__init__ >> and .set_params() seemed weird. >> >> Here is the code. The class ThresholdClassifier wraps a clf. >> RandomForest in this case. 
>> >> Anton >> >> from sklearn.base import BaseEstimator, ClassifierMixin >> from functools import partial >> >> def find_threshold_cost_factor(clf, X, y, cost_factor): >> y_pred = clf.predict_proba(X) >> >> top_score = 0 >> top_threshold = None >> cur_score=0 >> for y_pred_el, y_el in sorted(zip(y_pred[:, 1], y), >> reverse=True): # FIXME: assumes 2 classes >> if y_el == 0: >> cur_score -= 1 >> if y_el == 1: >> cur_score += cost_factor >> if cur_score > top_score: >> top_score = cur_score >> top_threshold = y_pred_el >> return top_threshold, top_score >> >> >> class ThresholdClassifier(BaseEstimator, ClassifierMixin): >> def __init__(self, clf, find_threshold, **params): >> self.clf = clf >> self.find_threshold = find_threshold >> self.threshold = None >> self.set_params(**params) >> >> def score(self, X, y, sample_weight=None): >> _threshold, score = self.find_threshold(self.clf, X, y) >> return score >> >> def fit(self, X, y): >> self.clf.fit(X, y) >> self.threshold, _score=self.find_threshold(self.clf, X, y) >> self.classes_ = self.clf.classes_ >> >> def predict(self, X): >> y_score=self.clf.predict_proba(X) >> return np.array(y_score[:,1]>=self.threshold) # FIXME >> assumes 2 classes >> >> def predict_proba(self, X): >> return self.clf.predict_proba(X) >> >> def set_params(self, **params): >> for param_name in ["clf", "find_threshold", "threshold"]: >> if param_name in params: >> setattr(self, param_name, params[param_name]) >> del params[param_name] >> self.clf.set_params(**params) >> return self >> >> def get_params(self, deep=True): >> params={"clf":self.clf, "find_threshold": >> self.find_threshold, "threshold":self.threshold} >> params.update(self.clf.get_params(deep)) >> return params >> >> >> if __name__ == '__main__': >> import numpy as np >> import random >> from sklearn.grid_search import RandomizedSearchCV >> from sklearn.ensemble import RandomForestClassifier >> from sklearn.datasets import make_classification >> from sklearn.cross_validation import train_test_split >> from sklearn.metrics import make_scorer, classification_report, >> confusion_matrix >> >> np.random.seed(111) >> random.seed(111) >> >> X, y = make_classification(1000, >> n_features=20, >> n_informative=4, >> n_redundant=0, >> n_repeated=0, >> n_clusters_per_class=4, >> # class_sep=0.5, >> weights=[0.90] >> ) >> >> X_train, X_test, y_train, y_test = train_test_split(X, y, >> test_size=0.3, stratify=y) >> >> for cost in [10]: >> find_threshold=partial(find_threshold_cost_factor, >> cost_factor=10) >> >> def scorer(clf, X, y): >> return find_threshold(clf, X, y)[1] >> >> clfs = [RandomizedSearchCV( >> ThresholdClassifier(RandomForestClassifier(), >> find_threshold), >> {"n_estimators": [100, 200], >> "criterion": ["entropy"], >> "min_samples_leaf": [1, 5], >> "class_weight": ["balanced", None], >> }, >> cv=3, >> scoring=scorer, # Get rid of this, by letting >> classifier tell it's cost-bsed score? 
>> n_iter=8, >> n_jobs=4), >> ] >> >> for clf in clfs: >> clf.fit(X_train, y_train) >> clf_best = clf.best_estimator_ >> print(clf_best, cost, clf_best.score(X_test, y_test)) >> print(confusion_matrix(y_test, clf_best.predict(X_test))) >> #print(find_threshold(clf_best, X_train, y_train)) >> #print(clf_best.threshold, >> sorted(zip(clf_best.predict_proba(X_train)[:,1], y_train), >> reverse=True)[:20]) >> >> >> >> _______________________________________________ >> scikit-learn mailing list >> scikit-learn at python.org >> https://mail.python.org/mailman/listinfo/scikit-learn > -------------- next part -------------- An HTML attachment was scrubbed... URL: From t3kcit at gmail.com Sun Nov 13 17:37:17 2016 From: t3kcit at gmail.com (Andreas Mueller) Date: Sun, 13 Nov 2016 17:37:17 -0500 Subject: [scikit-learn] Automatic ThresholdClassifier based on cost-function - Classifier Interface? In-Reply-To: <1478942249.8133.0@smtp.gmail.com> References: <89f50f08-5fa4-4ad6-8d15-3e5e70d15087@gmail.com> <1478942249.8133.0@smtp.gmail.com> Message-ID: On 11/12/2016 04:17 AM, Anton wrote: > screws you over since it passes the parameters to __init__(). I > expected the usage of set_params() here, but I'm getting my gridsearch > parameters passed to __init__(). > Is this intended? > I don't know what you mean by "screws you over". You just have to explicitly list all parameters. > Note that I'm just wrapping a clf, so that I have to pass through the > parameters to self.clf, right? No-one can know that I'm storing it in > self.clf. > Therefore set_params needs to be implemented and cannot be inherited?! right, if you don't want to use ``clf__params`` -------------- next part -------------- An HTML attachment was scrubbed... URL: From t3kcit at gmail.com Sun Nov 13 18:21:05 2016 From: t3kcit at gmail.com (Andreas Mueller) Date: Sun, 13 Nov 2016 18:21:05 -0500 Subject: [scikit-learn] Announcement: Scikit-learn 0.18.1 released! Message-ID: Hey all. I just published the 0.18.1 wheels and source tarball to pypi. The 0.18.1 release is a bugfix release, resolving some issues introduced in 0.18 and also some earlier issues. In particular there were some important relating to the new model_selection module. You can find the whole changelog (which I just realized does not contain all the fixes) here: http://scikit-learn.org/stable/whats_new.html#version-0-18-1 Best, Andy From joel.nothman at gmail.com Sun Nov 13 18:42:54 2016 From: joel.nothman at gmail.com (Joel Nothman) Date: Mon, 14 Nov 2016 10:42:54 +1100 Subject: [scikit-learn] Announcement: Scikit-learn 0.18.1 released! In-Reply-To: References: Message-ID: Thanks, Andy. As Andy said, this upgrade is strongly recommended. Due to a long-term bug in Numpy (and insufficient testing on our part), the new model_selection.GridSearchCV etc could not be pickled. There were also issues with the use of iterators for cross-validation splitters. But there are a lot of other valuable fixes in there too. Please everyone, tell us there are no more bugs! :P On 14 November 2016 at 10:21, Andreas Mueller wrote: > Hey all. > I just published the 0.18.1 wheels and source tarball to pypi. > The 0.18.1 release is a bugfix release, resolving some issues introduced > in 0.18 and also some earlier issues. > In particular there were some important relating to the new > model_selection module. 
> > You can find the whole changelog (which I just realized does not contain > all the fixes) here: > http://scikit-learn.org/stable/whats_new.html#version-0-18-1 > > Best, > Andy > _______________________________________________ > scikit-learn mailing list > scikit-learn at python.org > https://mail.python.org/mailman/listinfo/scikit-learn > -------------- next part -------------- An HTML attachment was scrubbed... URL: From t3kcit at gmail.com Sun Nov 13 20:51:01 2016 From: t3kcit at gmail.com (Andreas Mueller) Date: Sun, 13 Nov 2016 17:51:01 -0800 Subject: [scikit-learn] Announcement: Scikit-learn 0.18.1 released! In-Reply-To: References: Message-ID: Yeah it would be great if someone could update the whatsnew with the complete list of fixed issues. Unfortunately I'm a bit overloaded right now. Sent from phone. Please excuse spelling and brevity. On Nov 13, 2016 18:44, "Joel Nothman" wrote: > Thanks, Andy. > > As Andy said, this upgrade is strongly recommended. Due to a long-term bug > in Numpy (and insufficient testing on our part), the new > model_selection.GridSearchCV etc could not be pickled. There were also > issues with the use of iterators for cross-validation splitters. But there > are a lot of other valuable fixes in there too. > > Please everyone, tell us there are no more bugs! :P > > On 14 November 2016 at 10:21, Andreas Mueller wrote: > >> Hey all. >> I just published the 0.18.1 wheels and source tarball to pypi. >> The 0.18.1 release is a bugfix release, resolving some issues introduced >> in 0.18 and also some earlier issues. >> In particular there were some important relating to the new >> model_selection module. >> >> You can find the whole changelog (which I just realized does not contain >> all the fixes) here: >> http://scikit-learn.org/stable/whats_new.html#version-0-18-1 >> >> Best, >> Andy >> _______________________________________________ >> scikit-learn mailing list >> scikit-learn at python.org >> https://mail.python.org/mailman/listinfo/scikit-learn >> > > > _______________________________________________ > scikit-learn mailing list > scikit-learn at python.org > https://mail.python.org/mailman/listinfo/scikit-learn > > -------------- next part -------------- An HTML attachment was scrubbed... URL: From a.suchaneck at gmail.com Mon Nov 14 01:29:12 2016 From: a.suchaneck at gmail.com (Anton) Date: Mon, 14 Nov 2016 07:29:12 +0100 Subject: [scikit-learn] Automatic ThresholdClassifier based on cost-function - Classifier Interface? In-Reply-To: References: <89f50f08-5fa4-4ad6-8d15-3e5e70d15087@gmail.com> <1478942249.8133.0@smtp.gmail.com> Message-ID: <1479104952.9218.0@smtp.gmail.com> > >> screws you over since it passes the parameters to __init__(). I >> expected the usage of set_params() here, but I'm getting my >> gridsearch parameters passed to __init__(). >> Is this intended? >> > I don't know what you mean by "screws you over". You just have to > explicitly list all parameters. There is a hidden assumption that next to set_param() some methods may alternatively use __init__() to set parameters. That's why I had to jump through multiple hoops to get a meta-classifier which transparently shadows all variables. Usually you would expect that all parts stick to the convention of using one way to set parameters only (set_params()) -------------- next part -------------- An HTML attachment was scrubbed... 
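For reference, a simplified sketch of why the constructor is involved at all: clone reads the constructor arguments back via get_params(deep=False) and passes them to __init__ of the same class, so __init__ has to accept every parameter explicitly and store it unmodified.

from sklearn.ensemble import RandomForestClassifier

est = RandomForestClassifier(n_estimators=200)
params = est.get_params(deep=False)
rebuilt = type(est)(**params)    # essentially what sklearn.base.clone(est) does
print(rebuilt.n_estimators)      # 200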
URL: From tevang3 at gmail.com Mon Nov 14 06:14:06 2016 From: tevang3 at gmail.com (Thomas Evangelidis) Date: Mon, 14 Nov 2016 12:14:06 +0100 Subject: [scikit-learn] suggested classification algorithm Message-ID: Greetings, I want to design a program that can deal with classification problems of the same type, where the number of positive observations is small but the number of negative much larger. Speaking with numbers, the number of positive observations could range usually between 2 to 20 and the number of negative could be at least x30 times larger. The number of features could be between 2 and 20 too, but that could be reduced using feature selection and elimination algorithms. I 've read in the documentation that some algorithms like the SVM are still effective when the number of dimensions is greater than the number of samples, but I am not sure if they are suitable for my case. Moreover, according to this Figure, the Nearest Neighbors is the best and second is the RBF SVM: http://scikit-learn.org/stable/_images/sphx_glr _plot_classifier_comparison_001.png However, I assume that Nearest Neighbors would not be effective in my case where the number of positive observations is very low. For these reasons I would like to know your expert opinion about which classification algorithm should I try first. thanks in advance Thomas -- ====================================================================== Thomas Evangelidis Research Specialist CEITEC - Central European Institute of Technology Masaryk University Kamenice 5/A35/1S081, 62500 Brno, Czech Republic email: tevang at pharm.uoa.gr tevang3 at gmail.com website: https://sites.google.com/site/thomasevangelidishomepage/ -------------- next part -------------- An HTML attachment was scrubbed... URL: From joel.nothman at gmail.com Mon Nov 14 06:20:22 2016 From: joel.nothman at gmail.com (Joel Nothman) Date: Mon, 14 Nov 2016 22:20:22 +1100 Subject: [scikit-learn] suggested classification algorithm In-Reply-To: References: Message-ID: http://contrib.scikit-learn.org/imbalanced-learn/ might be of interest to you. On 14 November 2016 at 22:14, Thomas Evangelidis wrote: > Greetings, > > I want to design a program that can deal with classification problems of > the same type, where the number of positive observations is small but the > number of negative much larger. Speaking with numbers, the number of > positive observations could range usually between 2 to 20 and the number of > negative could be at least x30 times larger. The number of features could > be between 2 and 20 too, but that could be reduced using feature selection > and elimination algorithms. I 've read in the documentation that some > algorithms like the SVM are still effective when the number of dimensions > is greater than the number of samples, but I am not sure if they are > suitable for my case. Moreover, according to this Figure, the Nearest > Neighbors is the best and second is the RBF SVM: > > http://scikit-learn.org/stable/_images/sphx_glr_plot_ > classifier_comparison_001.png > > However, I assume that Nearest Neighbors would not be effective in my > case where the number of positive observations is very low. For these > reasons I would like to know your expert opinion about which classification > algorithm should I try first. 
> > thanks in advance > Thomas > > > -- > > ====================================================================== > > Thomas Evangelidis > > Research Specialist > CEITEC - Central European Institute of Technology > Masaryk University > Kamenice 5/A35/1S081, > 62500 Brno, Czech Republic > > email: tevang at pharm.uoa.gr > > tevang3 at gmail.com > > > website: https://sites.google.com/site/thomasevangelidishomepage/ > > > _______________________________________________ > scikit-learn mailing list > scikit-learn at python.org > https://mail.python.org/mailman/listinfo/scikit-learn > > -------------- next part -------------- An HTML attachment was scrubbed... URL: From t3kcit at gmail.com Mon Nov 14 08:39:47 2016 From: t3kcit at gmail.com (Andy) Date: Mon, 14 Nov 2016 08:39:47 -0500 Subject: [scikit-learn] Automatic ThresholdClassifier based on cost-function - Classifier Interface? In-Reply-To: <1479104952.9218.0@smtp.gmail.com> References: <89f50f08-5fa4-4ad6-8d15-3e5e70d15087@gmail.com> <1478942249.8133.0@smtp.gmail.com> <1479104952.9218.0@smtp.gmail.com> Message-ID: <61e3dba3-f315-452d-9280-7f1f0cf06cb1@gmail.com> On 11/14/2016 01:29 AM, Anton wrote: >> >>> screws you over since it passes the parameters to __init__(). I >>> expected the usage of set_params() here, but I'm getting my >>> gridsearch parameters passed to __init__(). >>> Is this intended? >>> >> I don't know what you mean by "screws you over". You just have to >> explicitly list all parameters. > > There is a hidden assumption that next to set_param() some methods may > alternatively use __init__() to set parameters. That's why I had to > jump through multiple hoops to get a meta-classifier which > transparently shadows all variables. Usually you would expect that all > parts stick to the convention of using one way to set parameters only > (set_params()) > Why would you expect that? Given the way clone works, basically no part of scikit-learn does that. Have you read http://scikit-learn.org/dev/developers/contributing.html#rolling-your-own-estimator ? -------------- next part -------------- An HTML attachment was scrubbed... URL: From gael.varoquaux at normalesup.org Mon Nov 14 12:29:16 2016 From: gael.varoquaux at normalesup.org (Gael Varoquaux) Date: Mon, 14 Nov 2016 18:29:16 +0100 Subject: [scikit-learn] Announcement: Scikit-learn 0.18.1 released! In-Reply-To: References: Message-ID: <20161114172916.GJ1918706@phare.normalesup.org> Thank you so much Andy and the others that made this .1 release possible. It brings huge value in ensuring quality. Ga?l On Sun, Nov 13, 2016 at 06:21:05PM -0500, Andreas Mueller wrote: > Hey all. > I just published the 0.18.1 wheels and source tarball to pypi. > The 0.18.1 release is a bugfix release, resolving some issues introduced in > 0.18 and also some earlier issues. > In particular there were some important relating to the new model_selection > module. 
> You can find the whole changelog (which I just realized does not contain all > the fixes) here: > http://scikit-learn.org/stable/whats_new.html#version-0-18-1 > Best, > Andy > _______________________________________________ > scikit-learn mailing list > scikit-learn at python.org > https://mail.python.org/mailman/listinfo/scikit-learn -- Gael Varoquaux Researcher, INRIA Parietal NeuroSpin/CEA Saclay , Bat 145, 91191 Gif-sur-Yvette France Phone: ++ 33-1-69-08-79-68 http://gael-varoquaux.info http://twitter.com/GaelVaroquaux From ragvrv at gmail.com Tue Nov 15 10:30:53 2016 From: ragvrv at gmail.com (Raghav R V) Date: Tue, 15 Nov 2016 16:30:53 +0100 Subject: [scikit-learn] Announcement: Scikit-learn 0.18.1 released! In-Reply-To: <20161114172916.GJ1918706@phare.normalesup.org> References: <20161114172916.GJ1918706@phare.normalesup.org> Message-ID: Hurray :D Thanks heaps Andy, Joel and the whole team! On Mon, Nov 14, 2016 at 6:29 PM, Gael Varoquaux < gael.varoquaux at normalesup.org> wrote: > Thank you so much Andy and the others that made this .1 release possible. > It brings huge value in ensuring quality. > > Ga?l > > On Sun, Nov 13, 2016 at 06:21:05PM -0500, Andreas Mueller wrote: > > Hey all. > > I just published the 0.18.1 wheels and source tarball to pypi. > > The 0.18.1 release is a bugfix release, resolving some issues introduced > in > > 0.18 and also some earlier issues. > > In particular there were some important relating to the new > model_selection > > module. > > > You can find the whole changelog (which I just realized does not contain > all > > the fixes) here: > > http://scikit-learn.org/stable/whats_new.html#version-0-18-1 > > > Best, > > Andy > > _______________________________________________ > > scikit-learn mailing list > > scikit-learn at python.org > > https://mail.python.org/mailman/listinfo/scikit-learn > > -- > Gael Varoquaux > Researcher, INRIA Parietal > NeuroSpin/CEA Saclay , Bat 145, 91191 Gif-sur-Yvette France > Phone: ++ 33-1-69-08-79-68 > http://gael-varoquaux.info http://twitter.com/GaelVaroquaux > _______________________________________________ > scikit-learn mailing list > scikit-learn at python.org > https://mail.python.org/mailman/listinfo/scikit-learn > -- Raghav RV https://github.com/raghavrv -------------- next part -------------- An HTML attachment was scrubbed... URL: From avn at mccme.ru Wed Nov 16 05:58:26 2016 From: avn at mccme.ru (avn at mccme.ru) Date: Wed, 16 Nov 2016 13:58:26 +0300 Subject: [scikit-learn] Including figures from scikit-learn documentation in scientific publications Message-ID: Hello, I'm writing a paper meant for submission to TPAMI and would like to include that wonderful figure of clustering algorithms comparison found in scikit-learn documentation (http://scikit-learn.org/stable/_images/sphx_glr_plot_cluster_comparison_001.png). So, my question is: can I include the figure PNG file in my paper directly (with proper reference, of course) or should I only provide a reference to this figure? With best regards, -- Valery From gael.varoquaux at normalesup.org Wed Nov 16 06:08:32 2016 From: gael.varoquaux at normalesup.org (Gael Varoquaux) Date: Wed, 16 Nov 2016 12:08:32 +0100 Subject: [scikit-learn] Including figures from scikit-learn documentation in scientific publications In-Reply-To: References: Message-ID: <20161116110832.GD3227973@phare.normalesup.org> Grabbing the PNG and including a reference is perfectly fine. I think that the right way would be to cite the paper and the URL of the page where the figure is. 
From avn at mccme.ru Wed Nov 16 06:34:17 2016 From: avn at mccme.ru (avn at mccme.ru) Date: Wed, 16 Nov 2016 14:34:17 +0300 Subject: [scikit-learn] Including figures from scikit-learn documentation in scientific publications In-Reply-To: <20161116110832.GD3227973@phare.normalesup.org> References: <20161116110832.GD3227973@phare.normalesup.org> Message-ID: Ok, I'll provide a reference to http://scikit-learn.org/stable/auto_examples/cluster/plot_cluster_comparison.html (the paper is anyway cited since scikit-learn is used in my work). Hope that this URL is not going to change in future releases of scikit-learn. Thanks for the answer, Gael! Gael Varoquaux ????? 2016-11-16 14:08: > Grabbing the PNG and including a reference is perfectly fine. > > I think that the right way would be to cite the paper and the URL of > the > page where the figure is. > _______________________________________________ > scikit-learn mailing list > scikit-learn at python.org > https://mail.python.org/mailman/listinfo/scikit-learn From nfliu at uw.edu Wed Nov 16 12:32:19 2016 From: nfliu at uw.edu (Nelson Liu) Date: Wed, 16 Nov 2016 09:32:19 -0800 Subject: [scikit-learn] Including figures from scikit-learn documentation in scientific publications In-Reply-To: References: <20161116110832.GD3227973@phare.normalesup.org> Message-ID: It might be worthwhile to put a reference to http://scikit-learn.org/0.18/auto_examples/cluster/plot_cluster_comparison.html instead, in case the figure changes in future versions. Nelson On Wed, Nov 16, 2016 at 3:34 AM, wrote: > Ok, I'll provide a reference to http://scikit-learn.org/stable > /auto_examples/cluster/plot_cluster_comparison.html (the paper is anyway > cited since scikit-learn is used in my work). > Hope that this URL is not going to change in future releases of > scikit-learn. > > Thanks for the answer, Gael! > > Gael Varoquaux ????? 2016-11-16 14:08: > > Grabbing the PNG and including a reference is perfectly fine. >> >> I think that the right way would be to cite the paper and the URL of the >> page where the figure is. >> _______________________________________________ >> scikit-learn mailing list >> scikit-learn at python.org >> https://mail.python.org/mailman/listinfo/scikit-learn >> > _______________________________________________ > scikit-learn mailing list > scikit-learn at python.org > https://mail.python.org/mailman/listinfo/scikit-learn > -------------- next part -------------- An HTML attachment was scrubbed... URL: From avn at mccme.ru Wed Nov 16 14:14:52 2016 From: avn at mccme.ru (avn at mccme.ru) Date: Wed, 16 Nov 2016 22:14:52 +0300 Subject: [scikit-learn] Including figures from scikit-learn documentation in scientific publications In-Reply-To: References: <20161116110832.GD3227973@phare.normalesup.org> Message-ID: Yes, it seems to be more adequate URL. Nelson Liu ????? 2016-11-16 20:32: > It might be worthwhile to put a reference to > http://scikit-learn.org/0.18/auto_examples/cluster/plot_cluster_comparison.html > instead, in case the figure changes in future versions. > > Nelson > > On Wed, Nov 16, 2016 at 3:34 AM, wrote: > >> Ok, I'll provide a reference to >> > http://scikit-learn.org/stable/auto_examples/cluster/plot_cluster_comparison.html >> [2] (the paper is anyway cited since scikit-learn is used in my >> work). >> Hope that this URL is not going to change in future releases of >> scikit-learn. >> >> Thanks for the answer, Gael! >> >> Gael Varoquaux ????? 2016-11-16 14:08: >> >>> Grabbing the PNG and including a reference is perfectly fine. 
>>> >>> I think that the right way would be to cite the paper and the URL >>> of the >>> page where the figure is. >>> _______________________________________________ >>> scikit-learn mailing list >>> scikit-learn at python.org >>> https://mail.python.org/mailman/listinfo/scikit-learn [1] >> _______________________________________________ >> scikit-learn mailing list >> scikit-learn at python.org >> https://mail.python.org/mailman/listinfo/scikit-learn [1] > > > > Links: > ------ > [1] https://mail.python.org/mailman/listinfo/scikit-learn > [2] > http://scikit-learn.org/stable/auto_examples/cluster/plot_cluster_comparison.html > _______________________________________________ > scikit-learn mailing list > scikit-learn at python.org > https://mail.python.org/mailman/listinfo/scikit-learn From fernando.wittmann at gmail.com Wed Nov 16 15:10:48 2016 From: fernando.wittmann at gmail.com (Fernando Marcos Wittmann) Date: Wed, 16 Nov 2016 18:10:48 -0200 Subject: [scikit-learn] suggested classification algorithm In-Reply-To: References: Message-ID: Three based algorithms (like Random Forest) usually work well for imbalanced datasets. You can also take a look at the SMOTE technique ( http://jair.org/media/953/live-953-2037-jair.pdf) which you can use for over-sampling the positive observations. On Mon, Nov 14, 2016 at 9:14 AM, Thomas Evangelidis wrote: > Greetings, > > I want to design a program that can deal with classification problems of > the same type, where the number of positive observations is small but the > number of negative much larger. Speaking with numbers, the number of > positive observations could range usually between 2 to 20 and the number of > negative could be at least x30 times larger. The number of features could > be between 2 and 20 too, but that could be reduced using feature selection > and elimination algorithms. I 've read in the documentation that some > algorithms like the SVM are still effective when the number of dimensions > is greater than the number of samples, but I am not sure if they are > suitable for my case. Moreover, according to this Figure, the Nearest > Neighbors is the best and second is the RBF SVM: > > http://scikit-learn.org/stable/_images/sphx_glr_plot_ > classifier_comparison_001.png > > However, I assume that Nearest Neighbors would not be effective in my > case where the number of positive observations is very low. For these > reasons I would like to know your expert opinion about which classification > algorithm should I try first. > > thanks in advance > Thomas > > > -- > > ====================================================================== > > Thomas Evangelidis > > Research Specialist > CEITEC - Central European Institute of Technology > Masaryk University > Kamenice 5/A35/1S081, > 62500 Brno, Czech Republic > > email: tevang at pharm.uoa.gr > > tevang3 at gmail.com > > > website: https://sites.google.com/site/thomasevangelidishomepage/ > > > _______________________________________________ > scikit-learn mailing list > scikit-learn at python.org > https://mail.python.org/mailman/listinfo/scikit-learn > > -- Fernando Marcos Wittmann MS Student - Energy Systems Dept. School of Electrical and Computer Engineering, FEEC University of Campinas, UNICAMP, Brazil +55 (19) 987-211302 -------------- next part -------------- An HTML attachment was scrubbed... 
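A small sketch of the suggestions above on synthetic data shaped roughly like the problem described (a couple of dozen positives against well over a thousand negatives): a tree-based model, scikit-learn's class_weight option for re-weighting the minority class, and over-sampling with SMOTE from the separate imbalanced-learn package (whose resampling method is named fit_sample in releases of that period and fit_resample in later ones).

from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

X, y = make_classification(n_samples=1300, n_features=19, n_informative=5,
                           weights=[0.98], random_state=0)

# Option 1: keep the data as-is and let the forest re-weight the classes.
clf = RandomForestClassifier(n_estimators=200, class_weight='balanced',
                             random_state=0).fit(X, y)

# Option 2: over-sample the minority class first (requires imbalanced-learn;
# the method is fit_sample in older releases, fit_resample in newer ones).
from imblearn.over_sampling import SMOTE
X_res, y_res = SMOTE(random_state=0).fit_sample(X, y)
clf_smote = RandomForestClassifier(n_estimators=200,
                                   random_state=0).fit(X_res, y_res)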
URL: From Dale.T.Smith at macys.com Wed Nov 16 15:54:21 2016 From: Dale.T.Smith at macys.com (Dale T Smith) Date: Wed, 16 Nov 2016 20:54:21 +0000 Subject: [scikit-learn] suggested classification algorithm In-Reply-To: References: Message-ID: Unbalanced class classification has been a topic here in past years, and there are posts if you search the archives. There are also plenty of resources available to help you, from actual code on Stackoverflow, to papers that address various ideas. I don?t think it?s necessary to repeat any of this on the mailing list. __________________________________________________________________________________________________________________________________________ Dale T. Smith | Macy's Systems and Technology | IFS eCom CSE Data Science 5985 State Bridge Road, Johns Creek, GA 30097 | dale.t.smith at macys.com From: scikit-learn [mailto:scikit-learn-bounces+dale.t.smith=macys.com at python.org] On Behalf Of Fernando Marcos Wittmann Sent: Wednesday, November 16, 2016 3:11 PM To: Scikit-learn user and developer mailing list Subject: Re: [scikit-learn] suggested classification algorithm ? EXT MSG: Three based algorithms (like Random Forest) usually work well for imbalanced datasets. You can also take a look at the SMOTE technique (http://jair.org/media/953/live-953-2037-jair.pdf) which you can use for over-sampling the positive observations. On Mon, Nov 14, 2016 at 9:14 AM, Thomas Evangelidis > wrote: Greetings, I want to design a program that can deal with classification problems of the same type, where the number of positive observations is small but the number of negative much larger. Speaking with numbers, the number of positive observations could range usually between 2 to 20 and the number of negative could be at least x30 times larger. The number of features could be between 2 and 20 too, but that could be reduced using feature selection and elimination algorithms. I 've read in the documentation that some algorithms like the SVM are still effective when the number of dimensions is greater than the number of samples, but I am not sure if they are suitable for my case. Moreover, according to this Figure, the Nearest Neighbors is the best and second is the RBF SVM: http://scikit-learn.org/stable/_images/sphx_glr_plot_classifier_comparison_001.png However, I assume that Nearest Neighbors would not be effective in my case where the number of positive observations is very low. For these reasons I would like to know your expert opinion about which classification algorithm should I try first. thanks in advance Thomas -- ====================================================================== Thomas Evangelidis Research Specialist CEITEC - Central European Institute of Technology Masaryk University Kamenice 5/A35/1S081, 62500 Brno, Czech Republic email: tevang at pharm.uoa.gr tevang3 at gmail.com website: https://sites.google.com/site/thomasevangelidishomepage/ _______________________________________________ scikit-learn mailing list scikit-learn at python.org https://mail.python.org/mailman/listinfo/scikit-learn -- Fernando Marcos Wittmann MS Student - Energy Systems Dept. School of Electrical and Computer Engineering, FEEC University of Campinas, UNICAMP, Brazil +55 (19) 987-211302 * This is an EXTERNAL EMAIL. Stop and think before clicking a link or opening attachments. -------------- next part -------------- An HTML attachment was scrubbed... 
URL: From se.raschka at gmail.com Wed Nov 16 16:20:17 2016 From: se.raschka at gmail.com (Sebastian Raschka) Date: Wed, 16 Nov 2016 16:20:17 -0500 Subject: [scikit-learn] suggested classification algorithm In-Reply-To: References: Message-ID: <0C66AA1E-D7FC-4DCD-9DBD-FED8020A0296@gmail.com> Yeah, there are many useful resources and implementations scattered around the web. However, a good, brief overview of the general ideas and concepts would be this one, for example: http://www.svds.com/learning-imbalanced-classes/ > On Nov 16, 2016, at 3:54 PM, Dale T Smith wrote: > > Unbalanced class classification has been a topic here in past years, and there are posts if you search the archives. There are also plenty of resources available to help you, from actual code on Stackoverflow, to papers that address various ideas. I don?t think it?s necessary to repeat any of this on the mailing list. > > > __________________________________________________________________________________________________________________________________________ > Dale T. Smith | Macy's Systems and Technology | IFS eCom CSE Data Science > 5985 State Bridge Road, Johns Creek, GA 30097 | dale.t.smith at macys.com > > From: scikit-learn [mailto:scikit-learn-bounces+dale.t.smith=macys.com at python.org] On Behalf Of Fernando Marcos Wittmann > Sent: Wednesday, November 16, 2016 3:11 PM > To: Scikit-learn user and developer mailing list > Subject: Re: [scikit-learn] suggested classification algorithm > > ? EXT MSG: > Three based algorithms (like Random Forest) usually work well for imbalanced datasets. You can also take a look at the SMOTE technique (http://jair.org/media/953/live-953-2037-jair.pdf) which you can use for over-sampling the positive observations. > > On Mon, Nov 14, 2016 at 9:14 AM, Thomas Evangelidis wrote: > Greetings, > > I want to design a program that can deal with classification problems of the same type, where the number of positive observations is small but the number of negative much larger. Speaking with numbers, the number of positive observations could range usually between 2 to 20 and the number of negative could be at least x30 times larger. The number of features could be between 2 and 20 too, but that could be reduced using feature selection and elimination algorithms. I 've read in the documentation that some algorithms like the SVM are still effective when the number of dimensions is greater than the number of samples, but I am not sure if they are suitable for my case. Moreover, according to this Figure, the Nearest Neighbors is the best and second is the RBF SVM: > > http://scikit-learn.org/stable/_images/sphx_glr_plot_classifier_comparison_001.png > > However, I assume that Nearest Neighbors would not be effective in my case where the number of positive observations is very low. For these reasons I would like to know your expert opinion about which classification algorithm should I try first. 
> > thanks in advance > Thomas > > > -- > ====================================================================== > Thomas Evangelidis > Research Specialist > CEITEC - Central European Institute of Technology > Masaryk University > Kamenice 5/A35/1S081, > 62500 Brno, Czech Republic > > email: tevang at pharm.uoa.gr > tevang3 at gmail.com > > website: https://sites.google.com/site/thomasevangelidishomepage/ > > > _______________________________________________ > scikit-learn mailing list > scikit-learn at python.org > https://mail.python.org/mailman/listinfo/scikit-learn > > > > > -- > > Fernando Marcos Wittmann > MS Student - Energy Systems Dept. > School of Electrical and Computer Engineering, FEEC > University of Campinas, UNICAMP, Brazil > +55 (19) 987-211302 > > * This is an EXTERNAL EMAIL. Stop and think before clicking a link or opening attachments. > _______________________________________________ > scikit-learn mailing list > scikit-learn at python.org > https://mail.python.org/mailman/listinfo/scikit-learn From tevang3 at gmail.com Thu Nov 17 09:00:33 2016 From: tevang3 at gmail.com (Thomas Evangelidis) Date: Thu, 17 Nov 2016 15:00:33 +0100 Subject: [scikit-learn] suggested classification algorithm In-Reply-To: <0C66AA1E-D7FC-4DCD-9DBD-FED8020A0296@gmail.com> References: <0C66AA1E-D7FC-4DCD-9DBD-FED8020A0296@gmail.com> Message-ID: Guys thank you all for your hints! Practical experience is irreplaceable that's why I posted this query here. I could read all week the mailing list archives and the respective internet resources but still not find the key info I could potentially get by someone here. I did PCA on my training set (this one has 24 positive and 1278 negative observation) and projected the 19 features on the first 2 PCs, which explain 87.6 % of the variance in the data. Does this plot help to decide which classification algorithms and/or over- or under-sampling would be more suitable? https://dl.dropboxusercontent.com/u/48168252/PCA_of_features.png thanks for your advices Thomas On 16 November 2016 at 22:20, Sebastian Raschka wrote: > Yeah, there are many useful resources and implementations scattered around > the web. However, a good, brief overview of the general ideas and concepts > would be this one, for example: http://www.svds.com/learning- > imbalanced-classes/ > > > > On Nov 16, 2016, at 3:54 PM, Dale T Smith > wrote: > > > > Unbalanced class classification has been a topic here in past years, and > there are posts if you search the archives. There are also plenty of > resources available to help you, from actual code on Stackoverflow, to > papers that address various ideas. I don?t think it?s necessary to repeat > any of this on the mailing list. > > > > > > ____________________________________________________________ > ____________________________________________________________ > __________________ > > Dale T. Smith | Macy's Systems and Technology | IFS eCom CSE Data Science > > 5985 State Bridge Road, Johns Creek, GA 30097 | dale.t.smith at macys.com > > > > From: scikit-learn [mailto:scikit-learn-bounces+dale.t.smith= > macys.com at python.org] On Behalf Of Fernando Marcos Wittmann > > Sent: Wednesday, November 16, 2016 3:11 PM > > To: Scikit-learn user and developer mailing list > > Subject: Re: [scikit-learn] suggested classification algorithm > > > > ? EXT MSG: > > Three based algorithms (like Random Forest) usually work well for > imbalanced datasets. 
You can also take a look at the SMOTE technique ( > http://jair.org/media/953/live-953-2037-jair.pdf) which you can use for > over-sampling the positive observations. > > > > On Mon, Nov 14, 2016 at 9:14 AM, Thomas Evangelidis > wrote: > > Greetings, > > > > I want to design a program that can deal with classification problems of > the same type, where the number of positive observations is small but the > number of negative much larger. Speaking with numbers, the number of > positive observations could range usually between 2 to 20 and the number of > negative could be at least x30 times larger. The number of features could > be between 2 and 20 too, but that could be reduced using feature selection > and elimination algorithms. I 've read in the documentation that some > algorithms like the SVM are still effective when the number of dimensions > is greater than the number of samples, but I am not sure if they are > suitable for my case. Moreover, according to this Figure, the Nearest > Neighbors is the best and second is the RBF SVM: > > > > http://scikit-learn.org/stable/_images/sphx_glr_plot_ > classifier_comparison_001.png > > > > However, I assume that Nearest Neighbors would not be effective in my > case where the number of positive observations is very low. For these > reasons I would like to know your expert opinion about which classification > algorithm should I try first. > > > > thanks in advance > > Thomas > > > > > > -- > > ====================================================================== > > Thomas Evangelidis > > Research Specialist > > CEITEC - Central European Institute of Technology > > Masaryk University > > Kamenice 5/A35/1S081, > > 62500 Brno, Czech Republic > > > > email: tevang at pharm.uoa.gr > > tevang3 at gmail.com > > > > website: https://sites.google.com/site/thomasevangelidishomepage/ > > > > > > _______________________________________________ > > scikit-learn mailing list > > scikit-learn at python.org > > https://mail.python.org/mailman/listinfo/scikit-learn > > > > > > > > > > -- > > > > Fernando Marcos Wittmann > > MS Student - Energy Systems Dept. > > School of Electrical and Computer Engineering, FEEC > > University of Campinas, UNICAMP, Brazil > > +55 (19) 987-211302 > > > > * This is an EXTERNAL EMAIL. Stop and think before clicking a link or > opening attachments. > > _______________________________________________ > > scikit-learn mailing list > > scikit-learn at python.org > > https://mail.python.org/mailman/listinfo/scikit-learn > > _______________________________________________ > scikit-learn mailing list > scikit-learn at python.org > https://mail.python.org/mailman/listinfo/scikit-learn > -- ====================================================================== Thomas Evangelidis Research Specialist CEITEC - Central European Institute of Technology Masaryk University Kamenice 5/A35/1S081, 62500 Brno, Czech Republic email: tevang at pharm.uoa.gr tevang3 at gmail.com website: https://sites.google.com/site/thomasevangelidishomepage/ -------------- next part -------------- An HTML attachment was scrubbed... URL: -------------- next part -------------- A non-text attachment was scrubbed... 
Name: PCA_of_features.png Type: image/png Size: 106770 bytes Desc: not available URL: From Dale.T.Smith at macys.com Thu Nov 17 09:10:38 2016 From: Dale.T.Smith at macys.com (Dale T Smith) Date: Thu, 17 Nov 2016 14:10:38 +0000 Subject: [scikit-learn] suggested classification algorithm In-Reply-To: References: <0C66AA1E-D7FC-4DCD-9DBD-FED8020A0296@gmail.com> Message-ID: The problem with your analysis is it doesn?t include anything but features. You may want to look at Nina Zumel and John Mount?s work on y-aware PCR and PCA, as well as y-aware feature scaling. http://www.win-vector.com/blog/2016/05/pcr_part1_xonly/ http://www.win-vector.com/blog/2016/05/pcr_part2_yaware/ http://www.win-vector.com/blog/2016/06/y-aware-scaling-in-context/ __________________________________________________________________________________________________________________________________________ Dale T. Smith | Macy's Systems and Technology | IFS eCom CSE Data Science 5985 State Bridge Road, Johns Creek, GA 30097 | dale.t.smith at macys.com From: scikit-learn [mailto:scikit-learn-bounces+dale.t.smith=macys.com at python.org] On Behalf Of Thomas Evangelidis Sent: Thursday, November 17, 2016 9:01 AM To: Scikit-learn user and developer mailing list Subject: Re: [scikit-learn] suggested classification algorithm ? EXT MSG: Guys thank you all for your hints! Practical experience is irreplaceable that's why I posted this query here. I could read all week the mailing list archives and the respective internet resources but still not find the key info I could potentially get by someone here. I did PCA on my training set (this one has 24 positive and 1278 negative observation) and projected the 19 features on the first 2 PCs, which explain 87.6 % of the variance in the data. Does this plot help to decide which classification algorithms and/or over- or under-sampling would be more suitable? https://dl.dropboxusercontent.com/u/48168252/PCA_of_features.png thanks for your advices Thomas On 16 November 2016 at 22:20, Sebastian Raschka > wrote: Yeah, there are many useful resources and implementations scattered around the web. However, a good, brief overview of the general ideas and concepts would be this one, for example: http://www.svds.com/learning-imbalanced-classes/ > On Nov 16, 2016, at 3:54 PM, Dale T Smith > wrote: > > Unbalanced class classification has been a topic here in past years, and there are posts if you search the archives. There are also plenty of resources available to help you, from actual code on Stackoverflow, to papers that address various ideas. I don?t think it?s necessary to repeat any of this on the mailing list. > > > __________________________________________________________________________________________________________________________________________ > Dale T. Smith | Macy's Systems and Technology | IFS eCom CSE Data Science > 5985 State Bridge Road, Johns Creek, GA 30097 | dale.t.smith at macys.com > > From: scikit-learn [mailto:scikit-learn-bounces+dale.t.smith=macys.com at python.org] On Behalf Of Fernando Marcos Wittmann > Sent: Wednesday, November 16, 2016 3:11 PM > To: Scikit-learn user and developer mailing list > Subject: Re: [scikit-learn] suggested classification algorithm > > ? EXT MSG: > Three based algorithms (like Random Forest) usually work well for imbalanced datasets. You can also take a look at the SMOTE technique (http://jair.org/media/953/live-953-2037-jair.pdf) which you can use for over-sampling the positive observations. 
> > On Mon, Nov 14, 2016 at 9:14 AM, Thomas Evangelidis > wrote: > Greetings, > > I want to design a program that can deal with classification problems of the same type, where the number of positive observations is small but the number of negative much larger. Speaking with numbers, the number of positive observations could range usually between 2 to 20 and the number of negative could be at least x30 times larger. The number of features could be between 2 and 20 too, but that could be reduced using feature selection and elimination algorithms. I 've read in the documentation that some algorithms like the SVM are still effective when the number of dimensions is greater than the number of samples, but I am not sure if they are suitable for my case. Moreover, according to this Figure, the Nearest Neighbors is the best and second is the RBF SVM: > > http://scikit-learn.org/stable/_images/sphx_glr_plot_classifier_comparison_001.png > > However, I assume that Nearest Neighbors would not be effective in my case where the number of positive observations is very low. For these reasons I would like to know your expert opinion about which classification algorithm should I try first. > > thanks in advance > Thomas > > > -- > ====================================================================== > Thomas Evangelidis > Research Specialist > CEITEC - Central European Institute of Technology > Masaryk University > Kamenice 5/A35/1S081, > 62500 Brno, Czech Republic > > email: tevang at pharm.uoa.gr > tevang3 at gmail.com > > website: https://sites.google.com/site/thomasevangelidishomepage/ > > > _______________________________________________ > scikit-learn mailing list > scikit-learn at python.org > https://mail.python.org/mailman/listinfo/scikit-learn > > > > > -- > > Fernando Marcos Wittmann > MS Student - Energy Systems Dept. > School of Electrical and Computer Engineering, FEEC > University of Campinas, UNICAMP, Brazil > +55 (19) 987-211302 > > * This is an EXTERNAL EMAIL. Stop and think before clicking a link or opening attachments. > _______________________________________________ > scikit-learn mailing list > scikit-learn at python.org > https://mail.python.org/mailman/listinfo/scikit-learn _______________________________________________ scikit-learn mailing list scikit-learn at python.org https://mail.python.org/mailman/listinfo/scikit-learn -- ====================================================================== Thomas Evangelidis Research Specialist CEITEC - Central European Institute of Technology Masaryk University Kamenice 5/A35/1S081, 62500 Brno, Czech Republic email: tevang at pharm.uoa.gr tevang3 at gmail.com website: https://sites.google.com/site/thomasevangelidishomepage/ * This is an EXTERNAL EMAIL. Stop and think before clicking a link or opening attachments. -------------- next part -------------- An HTML attachment was scrubbed... URL: From se.raschka at gmail.com Thu Nov 17 10:17:32 2016 From: se.raschka at gmail.com (Sebastian Raschka) Date: Thu, 17 Nov 2016 10:17:32 -0500 Subject: [scikit-learn] suggested classification algorithm In-Reply-To: References: <0C66AA1E-D7FC-4DCD-9DBD-FED8020A0296@gmail.com> Message-ID: One problem with the PCA approach is also that it doesn?t tell you how ?discriminative? these features are in a >2 dimensional space, e.g., by a nonlinear model. Or in other words, I think it is hard to tell whether the class imbalance is a big problem in this task just from looking at a linear transformation and compression of the dataset. 
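One way to make that concrete, sketched here on synthetic data with roughly the stated shape (24 positives, 1278 negatives, 19 features): look at cross-validated scores of an actual nonlinear model, using stratified folds so that every fold contains some positive observations.

import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import StratifiedKFold, cross_val_score, cross_val_predict
from sklearn.metrics import confusion_matrix

X, y = make_classification(n_samples=1302, n_features=19, n_informative=5,
                           weights=[0.98], random_state=0)

clf = RandomForestClassifier(n_estimators=200, class_weight='balanced',
                             random_state=0)
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)

print(np.round(cross_val_score(clf, X, y, cv=cv, scoring='roc_auc'), 3))
print(confusion_matrix(y, cross_val_predict(clf, X, y, cv=cv)))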
I think looking at confusion matrices and ROC curves for some models could help to determine if the class imbalance is a challenge for a learning algorithm in higher dimensional space? > On Nov 17, 2016, at 9:00 AM, Thomas Evangelidis wrote: > > > Guys thank you all for your hints! Practical experience is irreplaceable that's why I posted this query here. I could read all week the mailing list archives and the respective internet resources but still not find the key info I could potentially get by someone here. > > I did PCA on my training set (this one has 24 positive and 1278 negative observation) and projected the 19 features on the first 2 PCs, which explain 87.6 % of the variance in the data. Does this plot help to decide which classification algorithms and/or over- or under-sampling would be more suitable? > > https://dl.dropboxusercontent.com/u/48168252/PCA_of_features.png > > thanks for your advices > Thomas > > > On 16 November 2016 at 22:20, Sebastian Raschka wrote: > Yeah, there are many useful resources and implementations scattered around the web. However, a good, brief overview of the general ideas and concepts would be this one, for example: http://www.svds.com/learning-imbalanced-classes/ > > > > On Nov 16, 2016, at 3:54 PM, Dale T Smith wrote: > > > > Unbalanced class classification has been a topic here in past years, and there are posts if you search the archives. There are also plenty of resources available to help you, from actual code on Stackoverflow, to papers that address various ideas. I don?t think it?s necessary to repeat any of this on the mailing list. > > > > > > __________________________________________________________________________________________________________________________________________ > > Dale T. Smith | Macy's Systems and Technology | IFS eCom CSE Data Science > > 5985 State Bridge Road, Johns Creek, GA 30097 | dale.t.smith at macys.com > > > > From: scikit-learn [mailto:scikit-learn-bounces+dale.t.smith=macys.com at python.org] On Behalf Of Fernando Marcos Wittmann > > Sent: Wednesday, November 16, 2016 3:11 PM > > To: Scikit-learn user and developer mailing list > > Subject: Re: [scikit-learn] suggested classification algorithm > > > > ? EXT MSG: > > Three based algorithms (like Random Forest) usually work well for imbalanced datasets. You can also take a look at the SMOTE technique (http://jair.org/media/953/live-953-2037-jair.pdf) which you can use for over-sampling the positive observations. > > > > On Mon, Nov 14, 2016 at 9:14 AM, Thomas Evangelidis wrote: > > Greetings, > > > > I want to design a program that can deal with classification problems of the same type, where the number of positive observations is small but the number of negative much larger. Speaking with numbers, the number of positive observations could range usually between 2 to 20 and the number of negative could be at least x30 times larger. The number of features could be between 2 and 20 too, but that could be reduced using feature selection and elimination algorithms. I 've read in the documentation that some algorithms like the SVM are still effective when the number of dimensions is greater than the number of samples, but I am not sure if they are suitable for my case. 
Moreover, according to this Figure, the Nearest Neighbors is the best and second is the RBF SVM: > > > > http://scikit-learn.org/stable/_images/sphx_glr_plot_classifier_comparison_001.png > > > > However, I assume that Nearest Neighbors would not be effective in my case where the number of positive observations is very low. For these reasons I would like to know your expert opinion about which classification algorithm should I try first. > > > > thanks in advance > > Thomas > > > > > > -- > > ====================================================================== > > Thomas Evangelidis > > Research Specialist > > CEITEC - Central European Institute of Technology > > Masaryk University > > Kamenice 5/A35/1S081, > > 62500 Brno, Czech Republic > > > > email: tevang at pharm.uoa.gr > > tevang3 at gmail.com > > > > website: https://sites.google.com/site/thomasevangelidishomepage/ > > > > > > _______________________________________________ > > scikit-learn mailing list > > scikit-learn at python.org > > https://mail.python.org/mailman/listinfo/scikit-learn > > > > > > > > > > -- > > > > Fernando Marcos Wittmann > > MS Student - Energy Systems Dept. > > School of Electrical and Computer Engineering, FEEC > > University of Campinas, UNICAMP, Brazil > > +55 (19) 987-211302 > > > > * This is an EXTERNAL EMAIL. Stop and think before clicking a link or opening attachments. > > _______________________________________________ > > scikit-learn mailing list > > scikit-learn at python.org > > https://mail.python.org/mailman/listinfo/scikit-learn > > _______________________________________________ > scikit-learn mailing list > scikit-learn at python.org > https://mail.python.org/mailman/listinfo/scikit-learn > > > > -- > ====================================================================== > Thomas Evangelidis > Research Specialist > CEITEC - Central European Institute of Technology > Masaryk University > Kamenice 5/A35/1S081, > 62500 Brno, Czech Republic > > email: tevang at pharm.uoa.gr > tevang3 at gmail.com > > website: https://sites.google.com/site/thomasevangelidishomepage/ > > > _______________________________________________ > scikit-learn mailing list > scikit-learn at python.org > https://mail.python.org/mailman/listinfo/scikit-learn From t3kcit at gmail.com Thu Nov 17 14:59:53 2016 From: t3kcit at gmail.com (Andreas Mueller) Date: Thu, 17 Nov 2016 14:59:53 -0500 Subject: [scikit-learn] Development workflow proposal: merge master instead of rebasing Message-ID: <05697773-f028-6cb6-67b8-7621239ed94f@gmail.com> Hi all. I think we should change our development practices for resolving merge-conflicts from rebasing to merging. The "squash and merge" button of github gets rid of any merge commits and results in a clean history in any case. The benefit of merging instead of rebasing is that github is able to track comments much better if you don't force-push. In particular the links in notification emails might work better when not doing force-pushes. I'm not entirely sure how the mechanism works, but I think it's worth giving it a go. When merging master it's also harder to screw up an PR entirely (I think) which would make it easier to people new to git. Wdyt? 
Andy From gael.varoquaux at normalesup.org Thu Nov 17 15:10:43 2016 From: gael.varoquaux at normalesup.org (Gael Varoquaux) Date: Thu, 17 Nov 2016 21:10:43 +0100 Subject: [scikit-learn] Development workflow proposal: merge master instead of rebasing In-Reply-To: <05697773-f028-6cb6-67b8-7621239ed94f@gmail.com> References: <05697773-f028-6cb6-67b8-7621239ed94f@gmail.com> Message-ID: <71dd9713-0e8a-41d5-801b-8bd2fe1b636f@typeapp.com> Can the squash and merge button of github actually deal with this?? It's not obvious to me that it is even possible. G ?Sent from my phone. Please forgive brevity and mis spelling? On Nov 17, 2016, 21:02, at 21:02, Andreas Mueller wrote: >Hi all. > >I think we should change our development practices for resolving >merge-conflicts from rebasing to merging. >The "squash and merge" button of github gets rid of any merge commits >and results in a clean history in any case. > >The benefit of merging instead of rebasing is that github is able to >track comments much better if you don't force-push. >In particular the links in notification emails might work better when >not doing force-pushes. >I'm not entirely sure how the mechanism works, but I think it's worth >giving it a go. >When merging master it's also harder to screw up an PR entirely (I >think) which would make it easier to people new to git. > >Wdyt? > >Andy >_______________________________________________ >scikit-learn mailing list >scikit-learn at python.org >https://mail.python.org/mailman/listinfo/scikit-learn -------------- next part -------------- An HTML attachment was scrubbed... URL: From joel.nothman at gmail.com Thu Nov 17 15:38:42 2016 From: joel.nothman at gmail.com (Joel Nothman) Date: Fri, 18 Nov 2016 07:38:42 +1100 Subject: [scikit-learn] Development workflow proposal: merge master instead of rebasing In-Reply-To: <71dd9713-0e8a-41d5-801b-8bd2fe1b636f@typeapp.com> References: <05697773-f028-6cb6-67b8-7621239ed94f@gmail.com> <71dd9713-0e8a-41d5-801b-8bd2fe1b636f@typeapp.com> Message-ID: Of course it can deal with this: "Squash and merge" just takes the diff between the master and the branch merged with master, and applies it as a fresh patch on master (borrowing author and timestamp). Think `git merge --squash` more than the squash feature of `git rebase --interactive`. On 18 November 2016 at 07:10, Gael Varoquaux wrote: > Can the squash and merge button of github actually deal with this? It's > not obvious to me that it is even possible. > > G > > Sent from my phone. Please forgive brevity and mis spelling > On Nov 17, 2016, at 21:02, Andreas Mueller wrote: >> >> Hi all. >> >> I think we should change our development practices for resolving >> merge-conflicts from rebasing to merging. >> The "squash and merge" button of github gets rid of any merge commits >> and results in a clean history in any case. >> >> The benefit of merging instead of rebasing is that github is able to >> track comments much better if you don't force-push. >> In particular the links in notification emails might work better when >> not doing force-pushes. >> I'm not entirely sure how the mechanism works, but I think it's worth >> giving it a go. >> When merging master it's also harder to screw up an PR entirely (I >> think) which would make it easier to people new to git. >> >> Wdyt? 
>> >> Andy >> ------------------------------ >> >> scikit-learn mailing list >> scikit-learn at python.org >> https://mail.python.org/mailman/listinfo/scikit-learn >> >> > _______________________________________________ > scikit-learn mailing list > scikit-learn at python.org > https://mail.python.org/mailman/listinfo/scikit-learn > > -------------- next part -------------- An HTML attachment was scrubbed... URL: From t3kcit at gmail.com Fri Nov 18 08:44:36 2016 From: t3kcit at gmail.com (Andreas Mueller) Date: Fri, 18 Nov 2016 08:44:36 -0500 Subject: [scikit-learn] Development workflow proposal: merge master instead of rebasing In-Reply-To: <71dd9713-0e8a-41d5-801b-8bd2fe1b636f@typeapp.com> References: <05697773-f028-6cb6-67b8-7621239ed94f@gmail.com> <71dd9713-0e8a-41d5-801b-8bd2fe1b636f@typeapp.com> Message-ID: <5586bc91-d79c-c575-9920-57ba3bdc78c3@gmail.com> On 11/17/2016 03:10 PM, Gael Varoquaux wrote: > Can the squash and merge button of github actually deal with this? > It's not obvious to me that it is even possible. Yeah I was wondering about that, but it totally works. From olivier.grisel at ensta.org Sun Nov 20 09:29:38 2016 From: olivier.grisel at ensta.org (Olivier Grisel) Date: Sun, 20 Nov 2016 15:29:38 +0100 Subject: [scikit-learn] Development workflow proposal: merge master instead of rebasing In-Reply-To: <5586bc91-d79c-c575-9920-57ba3bdc78c3@gmail.com> References: <05697773-f028-6cb6-67b8-7621239ed94f@gmail.com> <71dd9713-0e8a-41d5-801b-8bd2fe1b636f@typeapp.com> <5586bc91-d79c-c575-9920-57ba3bdc78c3@gmail.com> Message-ID: If it works, +1 on my side. I think I have never used `git merge --rebase` in the past. -- Olivier From linjia at ruijie.com.cn Wed Nov 23 04:27:28 2016 From: linjia at ruijie.com.cn (linjia at ruijie.com.cn) Date: Wed, 23 Nov 2016 09:27:28 +0000 Subject: [scikit-learn] question about using sklearn.neural_network.MLPClassifier? Message-ID: <265E382B26F78742B972BE038FD292C6A79A86@fzex2.ruijie.com.cn> Hi everyone, I tried to use sklearn.neural_network.MLPClassifier to test the XOR operation, but I found the result is not satisfactory. The following is the code; can you tell me if I am using the library incorrectly? from sklearn.neural_network import MLPClassifier X = [[0, 0], [0, 1], [1, 0], [1, 1]] y = [0, 1, 1, 0] clf = MLPClassifier(solver='adam', activation='logistic', alpha=1e-3, hidden_layer_sizes=(2,), max_iter=1000) clf.fit(X, y) res = clf.predict([[0, 0], [0, 1], [1, 0], [1, 1]]) print(res) #result is [0 0 0 0], score is 0.5 -------------- next part -------------- An HTML attachment was scrubbed... URL: From deshpande.jaidev at gmail.com Wed Nov 23 05:15:07 2016 From: deshpande.jaidev at gmail.com (Jaidev Deshpande) Date: Wed, 23 Nov 2016 10:15:07 +0000 Subject: [scikit-learn] Specifying exceptions to ParameterGrid Message-ID: Hi, Sometimes when using GridSearchCV, I realize that in the grid there are certain combinations of hyperparameters that are either incompatible or redundant.
For example, when using an MLP, if I specify the following grid: grid = {'solver': ['sgd', 'adam'], 'learning_rate': ['constant', 'invscaling', 'adaptive']} then it yields the following ParameterGrid: [{'learning_rate': 'constant', 'solver': 'sgd'}, {'learning_rate': 'constant', 'solver': 'adam'}, {'learning_rate': 'invscaling', 'solver': 'sgd'}, {'learning_rate': 'invscaling', 'solver': 'adam'}, {'learning_rate': 'adaptive', 'solver': 'sgd'}, {'learning_rate': 'adaptive', 'solver': 'adam'}] Now, three of these are redundant, since learning_rate is used only for the sgd solver. Ideally I'd like to specify these cases upfront, and for that I have a simple hack ( https://github.com/jaidevd/jarvis/blob/master/jarvis/cross_validation.py#L38). Using that yields a ParameterGrid as follows: [{'learning_rate': 'constant', 'solver': 'adam'}, {'learning_rate': 'invscaling', 'solver': 'adam'}, {'learning_rate': 'adaptive', 'solver': 'adam'}] which is then simply removed from the original ParameterGrid. I wonder if there's a simpler way of doing this. Would it help if we had an additional parameter (something like "grid_exceptions") in GridSearchCV, which would remove these dicts from the list of parameters? Thanks -------------- next part -------------- An HTML attachment was scrubbed... URL: From ragvrv at gmail.com Wed Nov 23 05:57:14 2016 From: ragvrv at gmail.com (Raghav R V) Date: Wed, 23 Nov 2016 11:57:14 +0100 Subject: [scikit-learn] Specifying exceptions to ParameterGrid In-Reply-To: References: Message-ID: Hi! What you could do is specify lists of dicts to group the parameters which apply together in one dict... [{'learning_rate': ['constant', 'invscaling', 'adaptive'], 'solver': 'sgd'}, {'solver': 'adam'}] ```py from sklearn.neural_network import MLPClassifier from sklearn.model_selection import GridSearchCV from sklearn.datasets import make_classification from pandas import DataFrame X, y = make_classification(random_state=42) gs = GridSearchCV(MLPClassifier(random_state=42), param_grid=[{'learning_rate': ['constant', 'invscaling', 'adaptive'], 'solver': ['sgd',]}, {'solver': ['adam',]}]) DataFrame(gs.fit(X, y).cv_results_) ``` Would give [image: Inline image 1] HTH :) On Wed, Nov 23, 2016 at 11:15 AM, Jaidev Deshpande < deshpande.jaidev at gmail.com> wrote: > Hi, > > Sometimes when using GridSearchCV, I realize that in the grid there are > certain combinations of hyperparameters that are either incompatible or > redundant. For example, when using an MLP, if I specify the following grid: > > grid = {'solver': ['sgd', 'adam'], 'learning_rate': ['constant', > 'invscaling', 'adaptive']} > > then it yields the following ParameterGrid: > > [{'learning_rate': 'constant', 'solver': 'sgd'}, > {'learning_rate': 'constant', 'solver': 'adam'}, > {'learning_rate': 'invscaling', 'solver': 'sgd'}, > {'learning_rate': 'invscaling', 'solver': 'adam'}, > {'learning_rate': 'adaptive', 'solver': 'sgd'}, > {'learning_rate': 'adaptive', 'solver': 'adam'}] > > Now, three of these are redundant, since learning_rate is used only for > the sgd solver. Ideally I'd like to specify these cases upfront, and for > that I have a simple hack (https://github.com/jaidevd/ja > rvis/blob/master/jarvis/cross_validation.py#L38). Using that yields a > ParameterGrid as follows: > > [{'learning_rate': 'constant', 'solver': 'adam'}, > {'learning_rate': 'invscaling', 'solver': 'adam'}, > {'learning_rate': 'adaptive', 'solver': 'adam'}] > > which is then simply removed from the original ParameterGrid. 
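To make the mechanics concrete, a minimal sketch of how scikit-learn expands such a list of dicts (note that every value has to be given as a list, a point that comes up again further down the thread):

```py
from sklearn.model_selection import ParameterGrid

# A list of grids: solver-specific parameters are only combined with the
# solvers they actually apply to, instead of the full cross-product.
grid = [{'learning_rate': ['constant', 'invscaling', 'adaptive'], 'solver': ['sgd']},
        {'solver': ['adam']}]

for params in ParameterGrid(grid):
    print(params)
# Four candidates in total: three 'sgd' settings plus a single {'solver': 'adam'},
# rather than the six combinations the flat grid would produce.
```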
> > I wonder if there's a simpler way of doing this. Would it help if we had > an additional parameter (something like "grid_exceptions") in GridSearchCV, > which would remove these dicts from the list of parameters? > > Thanks > > _______________________________________________ > scikit-learn mailing list > scikit-learn at python.org > https://mail.python.org/mailman/listinfo/scikit-learn > > -- Raghav RV https://github.com/raghavrv -------------- next part -------------- An HTML attachment was scrubbed... URL: -------------- next part -------------- A non-text attachment was scrubbed... Name: image.png Type: image/png Size: 52450 bytes Desc: not available URL: From ragvrv at gmail.com Wed Nov 23 06:04:15 2016 From: ragvrv at gmail.com (Raghav R V) Date: Wed, 23 Nov 2016 12:04:15 +0100 Subject: [scikit-learn] =?utf-8?q?question_about_using_sklearn=2Eneural?= =?utf-8?q?=5Fnetwork=2EMLPClassifier=EF=BC=9F?= In-Reply-To: <265E382B26F78742B972BE038FD292C6A79A86@fzex2.ruijie.com.cn> References: <265E382B26F78742B972BE038FD292C6A79A86@fzex2.ruijie.com.cn> Message-ID: Hi, If you keep everything at their default values, it seems to work - ```py from sklearn.neural_network import MLPClassifier X = [[0, 0], [0, 1], [1, 0], [1, 1]] y = [0, 1, 1, 0] clf = MLPClassifier(max_iter=1000) clf.fit(X, y) res = clf.predict([[0, 0], [0, 1], [1, 0], [1, 1]]) print(res) ``` On Wed, Nov 23, 2016 at 10:27 AM, wrote: > Hi everyone > > > > I try to use sklearn.neural_network.MLPClassifier to test the XOR > operation, but I found the result is not satisfied. The following is code, > can you tell me if I use the lib incorrectly? > > > > from sklearn.neural_network import MLPClassifier > > X = [[0, 0], [0, 1], [1, 0], [1, 1]] > > y = [0, 1, 1, 0] > > clf = MLPClassifier(solver='adam', activation='logistic', alpha=1e-3, > hidden_layer_sizes=(2,), max_iter=1000) > > clf.fit(X, y) > > res = clf.predict([[0, 0], [0, 1], [1, 0], [1, 1]]) > > print(res) > > > > > > #result is [0 0 0 0], score is 0.5 > > _______________________________________________ > scikit-learn mailing list > scikit-learn at python.org > https://mail.python.org/mailman/listinfo/scikit-learn > > -- Raghav RV https://github.com/raghavrv -------------- next part -------------- An HTML attachment was scrubbed... URL: From linjia at ruijie.com.cn Wed Nov 23 06:26:26 2016 From: linjia at ruijie.com.cn (linjia at ruijie.com.cn) Date: Wed, 23 Nov 2016 11:26:26 +0000 Subject: [scikit-learn] =?utf-8?b?562U5aSNOiAgCXF1ZXN0aW9uIGFib3V0IHVzaW5n?= =?utf-8?q?_sklearn=2Eneural=5Fnetwork=2EMLPClassifier=EF=BC=9F?= In-Reply-To: References: <265E382B26F78742B972BE038FD292C6A79A86@fzex2.ruijie.com.cn> Message-ID: <265E382B26F78742B972BE038FD292C6A79DA5@fzex2.ruijie.com.cn> Yes?you are right @ Raghav R V, thx! However, i found the key param is ?hidden_layer_sizes=[2]?, I wonder if I misunderstand the meaning of parameter of hidden_layer_sizes? Is it related to the topic : http://stackoverflow.com/questions/36819287/mlp-classifier-of-scikit-neuralnetwork-not-working-for-xor ???: scikit-learn [mailto:scikit-learn-bounces+linjia=ruijie.com.cn at python.org] ?? Raghav R V ????: 2016?11?23? 19:04 ???: Scikit-learn user and developer mailing list ??: Re: [scikit-learn] question about using sklearn.neural_network.MLPClassifier? 
Hi, If you keep everything at their default values, it seems to work - ```py from sklearn.neural_network import MLPClassifier X = [[0, 0], [0, 1], [1, 0], [1, 1]] y = [0, 1, 1, 0] clf = MLPClassifier(max_iter=1000) clf.fit(X, y) res = clf.predict([[0, 0], [0, 1], [1, 0], [1, 1]]) print(res) ``` On Wed, Nov 23, 2016 at 10:27 AM, > wrote: Hi everyone I try to use sklearn.neural_network.MLPClassifier to test the XOR operation, but I found the result is not satisfied. The following is code, can you tell me if I use the lib incorrectly? from sklearn.neural_network import MLPClassifier X = [[0, 0], [0, 1], [1, 0], [1, 1]] y = [0, 1, 1, 0] clf = MLPClassifier(solver='adam', activation='logistic', alpha=1e-3, hidden_layer_sizes=(2,), max_iter=1000) clf.fit(X, y) res = clf.predict([[0, 0], [0, 1], [1, 0], [1, 1]]) print(res) #result is [0 0 0 0], score is 0.5 _______________________________________________ scikit-learn mailing list scikit-learn at python.org https://mail.python.org/mailman/listinfo/scikit-learn -- Raghav RV https://github.com/raghavrv -------------- next part -------------- An HTML attachment was scrubbed... URL: From deshpande.jaidev at gmail.com Wed Nov 23 06:52:37 2016 From: deshpande.jaidev at gmail.com (Jaidev Deshpande) Date: Wed, 23 Nov 2016 11:52:37 +0000 Subject: [scikit-learn] Specifying exceptions to ParameterGrid In-Reply-To: References: Message-ID: On Wed, 23 Nov 2016 at 16:29 Raghav R V wrote: > Hi! > > What you could do is specify lists of dicts to group the parameters which > apply together in one dict... > > [{'learning_rate': ['constant', 'invscaling', 'adaptive'], 'solver': > 'sgd'}, {'solver': 'adam'}] > > ```py > from sklearn.neural_network import MLPClassifier > from sklearn.model_selection import GridSearchCV > from sklearn.datasets import make_classification > > from pandas import DataFrame > > X, y = make_classification(random_state=42) > > gs = GridSearchCV(MLPClassifier(random_state=42), > param_grid=[{'learning_rate': ['constant', 'invscaling', > 'adaptive'], > 'solver': ['sgd',]}, > {'solver': ['adam',]}]) > > DataFrame(gs.fit(X, y).cv_results_) > ``` > > Would give > > [image: image.png] > > HTH :) > Haha, this is perfect. I didn't know you could pass a list of dicts to param_grid. Thanks! > > On Wed, Nov 23, 2016 at 11:15 AM, Jaidev Deshpande < > deshpande.jaidev at gmail.com> wrote: > > Hi, > > Sometimes when using GridSearchCV, I realize that in the grid there are > certain combinations of hyperparameters that are either incompatible or > redundant. For example, when using an MLP, if I specify the following grid: > > grid = {'solver': ['sgd', 'adam'], 'learning_rate': ['constant', > 'invscaling', 'adaptive']} > > then it yields the following ParameterGrid: > > [{'learning_rate': 'constant', 'solver': 'sgd'}, > {'learning_rate': 'constant', 'solver': 'adam'}, > {'learning_rate': 'invscaling', 'solver': 'sgd'}, > {'learning_rate': 'invscaling', 'solver': 'adam'}, > {'learning_rate': 'adaptive', 'solver': 'sgd'}, > {'learning_rate': 'adaptive', 'solver': 'adam'}] > > Now, three of these are redundant, since learning_rate is used only for > the sgd solver. Ideally I'd like to specify these cases upfront, and for > that I have a simple hack ( > https://github.com/jaidevd/jarvis/blob/master/jarvis/cross_validation.py#L38). 
> Using that yields a ParameterGrid as follows: > > [{'learning_rate': 'constant', 'solver': 'adam'}, > {'learning_rate': 'invscaling', 'solver': 'adam'}, > {'learning_rate': 'adaptive', 'solver': 'adam'}] > > which is then simply removed from the original ParameterGrid. > > I wonder if there's a simpler way of doing this. Would it help if we had > an additional parameter (something like "grid_exceptions") in GridSearchCV, > which would remove these dicts from the list of parameters? > > Thanks > > _______________________________________________ > scikit-learn mailing list > scikit-learn at python.org > https://mail.python.org/mailman/listinfo/scikit-learn > > > > > -- > Raghav RV > https://github.com/raghavrv > > _______________________________________________ > scikit-learn mailing list > scikit-learn at python.org > https://mail.python.org/mailman/listinfo/scikit-learn > -------------- next part -------------- An HTML attachment was scrubbed... URL: -------------- next part -------------- A non-text attachment was scrubbed... Name: image.png Type: image/png Size: 52450 bytes Desc: not available URL: From joel.nothman at gmail.com Wed Nov 23 06:59:20 2016 From: joel.nothman at gmail.com (Joel Nothman) Date: Wed, 23 Nov 2016 22:59:20 +1100 Subject: [scikit-learn] Specifying exceptions to ParameterGrid In-Reply-To: References: Message-ID: Raghav's example of [{'learning_rate': ['constant', 'invscaling', 'adaptive'], 'solver': 'sgd'}, {'solver': 'adam'}] was not correct. Should be [{'learning_rate': ['constant', 'invscaling', 'adaptive'], 'solver': ['sgd']}, {'solver': ['adam']}] (Note all values of dicts are lists) On 23 November 2016 at 22:52, Jaidev Deshpande wrote: > > > On Wed, 23 Nov 2016 at 16:29 Raghav R V wrote: > >> Hi! >> >> What you could do is specify lists of dicts to group the parameters which >> apply together in one dict... >> >> [{'learning_rate': ['constant', 'invscaling', 'adaptive'], 'solver': >> 'sgd'}, {'solver': 'adam'}] >> >> ```py >> from sklearn.neural_network import MLPClassifier >> from sklearn.model_selection import GridSearchCV >> from sklearn.datasets import make_classification >> >> from pandas import DataFrame >> >> X, y = make_classification(random_state=42) >> >> gs = GridSearchCV(MLPClassifier(random_state=42), >> param_grid=[{'learning_rate': ['constant', >> 'invscaling', 'adaptive'], >> 'solver': ['sgd',]}, >> {'solver': ['adam',]}]) >> >> DataFrame(gs.fit(X, y).cv_results_) >> ``` >> >> Would give >> >> [image: image.png] >> >> HTH :) >> > > Haha, this is perfect. I didn't know you could pass a list of dicts to > param_grid. > > Thanks! > > >> >> On Wed, Nov 23, 2016 at 11:15 AM, Jaidev Deshpande < >> deshpande.jaidev at gmail.com> wrote: >> >> Hi, >> >> Sometimes when using GridSearchCV, I realize that in the grid there are >> certain combinations of hyperparameters that are either incompatible or >> redundant. For example, when using an MLP, if I specify the following grid: >> >> grid = {'solver': ['sgd', 'adam'], 'learning_rate': ['constant', >> 'invscaling', 'adaptive']} >> >> then it yields the following ParameterGrid: >> >> [{'learning_rate': 'constant', 'solver': 'sgd'}, >> {'learning_rate': 'constant', 'solver': 'adam'}, >> {'learning_rate': 'invscaling', 'solver': 'sgd'}, >> {'learning_rate': 'invscaling', 'solver': 'adam'}, >> {'learning_rate': 'adaptive', 'solver': 'sgd'}, >> {'learning_rate': 'adaptive', 'solver': 'adam'}] >> >> Now, three of these are redundant, since learning_rate is used only for >> the sgd solver. 
Ideally I'd like to specify these cases upfront, and for >> that I have a simple hack (https://github.com/jaidevd/ >> jarvis/blob/master/jarvis/cross_validation.py#L38). Using that yields a >> ParameterGrid as follows: >> >> [{'learning_rate': 'constant', 'solver': 'adam'}, >> {'learning_rate': 'invscaling', 'solver': 'adam'}, >> {'learning_rate': 'adaptive', 'solver': 'adam'}] >> >> which is then simply removed from the original ParameterGrid. >> >> I wonder if there's a simpler way of doing this. Would it help if we had >> an additional parameter (something like "grid_exceptions") in GridSearchCV, >> which would remove these dicts from the list of parameters? >> >> Thanks >> >> _______________________________________________ >> scikit-learn mailing list >> scikit-learn at python.org >> https://mail.python.org/mailman/listinfo/scikit-learn >> >> >> >> >> -- >> Raghav RV >> https://github.com/raghavrv >> >> _______________________________________________ >> scikit-learn mailing list >> scikit-learn at python.org >> https://mail.python.org/mailman/listinfo/scikit-learn >> > > _______________________________________________ > scikit-learn mailing list > scikit-learn at python.org > https://mail.python.org/mailman/listinfo/scikit-learn > > -------------- next part -------------- An HTML attachment was scrubbed... URL: -------------- next part -------------- A non-text attachment was scrubbed... Name: image.png Type: image/png Size: 52450 bytes Desc: not available URL: From deshpande.jaidev at gmail.com Wed Nov 23 07:05:23 2016 From: deshpande.jaidev at gmail.com (Jaidev Deshpande) Date: Wed, 23 Nov 2016 12:05:23 +0000 Subject: [scikit-learn] Specifying exceptions to ParameterGrid In-Reply-To: References: Message-ID: On Wed, 23 Nov 2016 at 17:31 Joel Nothman wrote: > Raghav's example of > > > [{'learning_rate': ['constant', 'invscaling', 'adaptive'], 'solver': > 'sgd'}, {'solver': 'adam'}] > > was not correct. > > Should be > > > [{'learning_rate': ['constant', 'invscaling', 'adaptive'], 'solver': > ['sgd']}, {'solver': ['adam']}] > > (Note all values of dicts are lists) > Ah, thanks! (Just ran into an error as it started iterating over the "sgd".) > > On 23 November 2016 at 22:52, Jaidev Deshpande > wrote: > > > > On Wed, 23 Nov 2016 at 16:29 Raghav R V wrote: > > Hi! > > What you could do is specify lists of dicts to group the parameters which > apply together in one dict... > > [{'learning_rate': ['constant', 'invscaling', 'adaptive'], 'solver': > 'sgd'}, {'solver': 'adam'}] > > ```py > from sklearn.neural_network import MLPClassifier > from sklearn.model_selection import GridSearchCV > from sklearn.datasets import make_classification > > from pandas import DataFrame > > X, y = make_classification(random_state=42) > > gs = GridSearchCV(MLPClassifier(random_state=42), > param_grid=[{'learning_rate': ['constant', 'invscaling', > 'adaptive'], > 'solver': ['sgd',]}, > {'solver': ['adam',]}]) > > DataFrame(gs.fit(X, y).cv_results_) > ``` > > Would give > > [image: image.png] > > HTH :) > > > Haha, this is perfect. I didn't know you could pass a list of dicts to > param_grid. > > Thanks! > > > > On Wed, Nov 23, 2016 at 11:15 AM, Jaidev Deshpande < > deshpande.jaidev at gmail.com> wrote: > > Hi, > > Sometimes when using GridSearchCV, I realize that in the grid there are > certain combinations of hyperparameters that are either incompatible or > redundant. 
For example, when using an MLP, if I specify the following grid: > > grid = {'solver': ['sgd', 'adam'], 'learning_rate': ['constant', > 'invscaling', 'adaptive']} > > then it yields the following ParameterGrid: > > [{'learning_rate': 'constant', 'solver': 'sgd'}, > {'learning_rate': 'constant', 'solver': 'adam'}, > {'learning_rate': 'invscaling', 'solver': 'sgd'}, > {'learning_rate': 'invscaling', 'solver': 'adam'}, > {'learning_rate': 'adaptive', 'solver': 'sgd'}, > {'learning_rate': 'adaptive', 'solver': 'adam'}] > > Now, three of these are redundant, since learning_rate is used only for > the sgd solver. Ideally I'd like to specify these cases upfront, and for > that I have a simple hack ( > https://github.com/jaidevd/jarvis/blob/master/jarvis/cross_validation.py#L38). > Using that yields a ParameterGrid as follows: > > [{'learning_rate': 'constant', 'solver': 'adam'}, > {'learning_rate': 'invscaling', 'solver': 'adam'}, > {'learning_rate': 'adaptive', 'solver': 'adam'}] > > which is then simply removed from the original ParameterGrid. > > I wonder if there's a simpler way of doing this. Would it help if we had > an additional parameter (something like "grid_exceptions") in GridSearchCV, > which would remove these dicts from the list of parameters? > > Thanks > > _______________________________________________ > scikit-learn mailing list > scikit-learn at python.org > https://mail.python.org/mailman/listinfo/scikit-learn > > > > > -- > Raghav RV > https://github.com/raghavrv > > _______________________________________________ > scikit-learn mailing list > scikit-learn at python.org > https://mail.python.org/mailman/listinfo/scikit-learn > > > _______________________________________________ > scikit-learn mailing list > scikit-learn at python.org > https://mail.python.org/mailman/listinfo/scikit-learn > > > _______________________________________________ > scikit-learn mailing list > scikit-learn at python.org > https://mail.python.org/mailman/listinfo/scikit-learn > -------------- next part -------------- An HTML attachment was scrubbed... URL: -------------- next part -------------- A non-text attachment was scrubbed... Name: image.png Type: image/png Size: 52450 bytes Desc: not available URL: From ragvrv at gmail.com Wed Nov 23 07:58:51 2016 From: ragvrv at gmail.com (Raghav R V) Date: Wed, 23 Nov 2016 13:58:51 +0100 Subject: [scikit-learn] Specifying exceptions to ParameterGrid In-Reply-To: References: Message-ID: On Wed, Nov 23, 2016 at 12:59 PM, Joel Nothman wrote: > Raghav's example of > > > [{'learning_rate': ['constant', 'invscaling', 'adaptive'], 'solver': > 'sgd'}, {'solver': 'adam'}] > > was not correct. > Oops sorry. Ah I ran into that, corrected it in the snipped but forgot to update the line before the snippet... :) > Should be > > > [{'learning_rate': ['constant', 'invscaling', 'adaptive'], 'solver': > ['sgd']}, {'solver': ['adam']}] > > (Note all values of dicts are lists) > > On 23 November 2016 at 22:52, Jaidev Deshpande > wrote: > >> >> >> On Wed, 23 Nov 2016 at 16:29 Raghav R V wrote: >> >>> Hi! >>> >>> What you could do is specify lists of dicts to group the parameters >>> which apply together in one dict... 
>>> >>> [{'learning_rate': ['constant', 'invscaling', 'adaptive'], 'solver': >>> 'sgd'}, {'solver': 'adam'}] >>> >>> ```py >>> from sklearn.neural_network import MLPClassifier >>> from sklearn.model_selection import GridSearchCV >>> from sklearn.datasets import make_classification >>> >>> from pandas import DataFrame >>> >>> X, y = make_classification(random_state=42) >>> >>> gs = GridSearchCV(MLPClassifier(random_state=42), >>> param_grid=[{'learning_rate': ['constant', >>> 'invscaling', 'adaptive'], >>> 'solver': ['sgd',]}, >>> {'solver': ['adam',]}]) >>> >>> DataFrame(gs.fit(X, y).cv_results_) >>> ``` >>> >>> Would give >>> >>> [image: image.png] >>> >>> HTH :) >>> >> >> Haha, this is perfect. I didn't know you could pass a list of dicts to >> param_grid. >> >> Thanks! >> >> >>> >>> On Wed, Nov 23, 2016 at 11:15 AM, Jaidev Deshpande < >>> deshpande.jaidev at gmail.com> wrote: >>> >>> Hi, >>> >>> Sometimes when using GridSearchCV, I realize that in the grid there are >>> certain combinations of hyperparameters that are either incompatible or >>> redundant. For example, when using an MLP, if I specify the following grid: >>> >>> grid = {'solver': ['sgd', 'adam'], 'learning_rate': ['constant', >>> 'invscaling', 'adaptive']} >>> >>> then it yields the following ParameterGrid: >>> >>> [{'learning_rate': 'constant', 'solver': 'sgd'}, >>> {'learning_rate': 'constant', 'solver': 'adam'}, >>> {'learning_rate': 'invscaling', 'solver': 'sgd'}, >>> {'learning_rate': 'invscaling', 'solver': 'adam'}, >>> {'learning_rate': 'adaptive', 'solver': 'sgd'}, >>> {'learning_rate': 'adaptive', 'solver': 'adam'}] >>> >>> Now, three of these are redundant, since learning_rate is used only for >>> the sgd solver. Ideally I'd like to specify these cases upfront, and for >>> that I have a simple hack (https://github.com/jaidevd/ja >>> rvis/blob/master/jarvis/cross_validation.py#L38). Using that yields a >>> ParameterGrid as follows: >>> >>> [{'learning_rate': 'constant', 'solver': 'adam'}, >>> {'learning_rate': 'invscaling', 'solver': 'adam'}, >>> {'learning_rate': 'adaptive', 'solver': 'adam'}] >>> >>> which is then simply removed from the original ParameterGrid. >>> >>> I wonder if there's a simpler way of doing this. Would it help if we had >>> an additional parameter (something like "grid_exceptions") in GridSearchCV, >>> which would remove these dicts from the list of parameters? >>> >>> Thanks >>> >>> _______________________________________________ >>> scikit-learn mailing list >>> scikit-learn at python.org >>> https://mail.python.org/mailman/listinfo/scikit-learn >>> >>> >>> >>> >>> -- >>> Raghav RV >>> https://github.com/raghavrv >>> >>> _______________________________________________ >>> scikit-learn mailing list >>> scikit-learn at python.org >>> https://mail.python.org/mailman/listinfo/scikit-learn >>> >> >> _______________________________________________ >> scikit-learn mailing list >> scikit-learn at python.org >> https://mail.python.org/mailman/listinfo/scikit-learn >> >> > > _______________________________________________ > scikit-learn mailing list > scikit-learn at python.org > https://mail.python.org/mailman/listinfo/scikit-learn > > -- Raghav RV https://github.com/raghavrv -------------- next part -------------- An HTML attachment was scrubbed... URL: -------------- next part -------------- A non-text attachment was scrubbed... 
Name: image.png Type: image/png Size: 52450 bytes Desc: not available URL: From rth.yurchak at gmail.com Wed Nov 23 05:31:12 2016 From: rth.yurchak at gmail.com (Roman Yurchak) Date: Wed, 23 Nov 2016 11:31:12 +0100 Subject: [scikit-learn] Specifying exceptions to ParameterGrid In-Reply-To: References: Message-ID: <58356FF0.4030209@gmail.com> Hi Jaidev, well, `param_grid` in GridSearchCV can also be a list of dictionaries, so you could directly specify the cases you are interested in (instead of the full grid - exceptions), which might be simpler? On 23/11/16 11:15, Jaidev Deshpande wrote: > Hi, > > Sometimes when using GridSearchCV, I realize that in the grid there are > certain combinations of hyperparameters that are either incompatible or > redundant. For example, when using an MLP, if I specify the following grid: > > grid = {'solver': ['sgd', 'adam'], 'learning_rate': ['constant', > 'invscaling', 'adaptive']} > > then it yields the following ParameterGrid: > > [{'learning_rate': 'constant', 'solver': 'sgd'}, > {'learning_rate': 'constant', 'solver': 'adam'}, > {'learning_rate': 'invscaling', 'solver': 'sgd'}, > {'learning_rate': 'invscaling', 'solver': 'adam'}, > {'learning_rate': 'adaptive', 'solver': 'sgd'}, > {'learning_rate': 'adaptive', 'solver': 'adam'}] > > Now, three of these are redundant, since learning_rate is used only for > the sgd solver. Ideally I'd like to specify these cases upfront, and for > that I have a simple hack > (https://github.com/jaidevd/jarvis/blob/master/jarvis/cross_validation.py#L38). > Using that yields a ParameterGrid as follows: > > [{'learning_rate': 'constant', 'solver': 'adam'}, > {'learning_rate': 'invscaling', 'solver': 'adam'}, > {'learning_rate': 'adaptive', 'solver': 'adam'}] > > which is then simply removed from the original ParameterGrid. > > I wonder if there's a simpler way of doing this. Would it help if we had > an additional parameter (something like "grid_exceptions") in > GridSearchCV, which would remove these dicts from the list of parameters? > > Thanks > > > _______________________________________________ > scikit-learn mailing list > scikit-learn at python.org > https://mail.python.org/mailman/listinfo/scikit-learn > From se.raschka at gmail.com Wed Nov 23 14:06:06 2016 From: se.raschka at gmail.com (Sebastian Raschka) Date: Wed, 23 Nov 2016 14:06:06 -0500 Subject: [scikit-learn] =?utf-8?q?question_about_using_sklearn=2Eneural?= =?utf-8?q?=5Fnetwork=2EMLPClassifier=EF=BC=9F?= In-Reply-To: <265E382B26F78742B972BE038FD292C6A79DA5@fzex2.ruijie.com.cn> References: <265E382B26F78742B972BE038FD292C6A79A86@fzex2.ruijie.com.cn> <265E382B26F78742B972BE038FD292C6A79DA5@fzex2.ruijie.com.cn> Message-ID: <4E63AF55-B6EA-4505-827F-B1D69D3F458B@gmail.com> > If you keep everything at their default values, it seems to work - > > ```py > from sklearn.neural_network import MLPClassifier > X = [[0, 0], [0, 1], [1, 0], [1, 1]] > y = [0, 1, 1, 0] > clf = MLPClassifier(max_iter=1000) > clf.fit(X, y) > res = clf.predict([[0, 0], [0, 1], [1, 0], [1, 1]]) > print(res) > ``` The default is set 100 units in the hidden layer, but theoretically, it should work with 2 hidden logistic units (I think that?s the typical textbook/class example). I think what happens is that it gets stuck in local minima depending on the random weight initialization. 
E.g., the following works just fine: from sklearn.neural_network import MLPClassifier X = [[0, 0], [0, 1], [1, 0], [1, 1]] y = [0, 1, 1, 0] clf = MLPClassifier(solver='lbfgs', activation='logistic', alpha=0.0, hidden_layer_sizes=(2,), learning_rate_init=0.1, max_iter=1000, random_state=20) clf.fit(X, y) res = clf.predict([[0, 0], [0, 1], [1, 0], [1, 1]]) print(res) print(clf.loss_) but changing the random seed to 1 leads to: [0 1 1 1] 0.34660921283 For comparison, I used a more vanilla MLP (1 hidden layer with 2 units and logistic activation as well; https://github.com/rasbt/python-machine-learning-book/blob/master/code/ch12/ch12.ipynb), essentially resulting in the same problem: > On Nov 23, 2016, at 6:26 AM, linjia at ruijie.com.cn wrote: > > Yes?you are right @ Raghav R V, thx! > > However, i found the key param is ?hidden_layer_sizes=[2]?, I wonder if I misunderstand the meaning of parameter of hidden_layer_sizes? > > Is it related to the topic : http://stackoverflow.com/questions/36819287/mlp-classifier-of-scikit-neuralnetwork-not-working-for-xor > > > ???: scikit-learn [mailto:scikit-learn-bounces+linjia=ruijie.com.cn at python.org] ?? Raghav R V > ????: 2016?11?23? 19:04 > ???: Scikit-learn user and developer mailing list > ??: Re: [scikit-learn] question about using sklearn.neural_network.MLPClassifier? > > Hi, > > If you keep everything at their default values, it seems to work - > > ```py > from sklearn.neural_network import MLPClassifier > X = [[0, 0], [0, 1], [1, 0], [1, 1]] > y = [0, 1, 1, 0] > clf = MLPClassifier(max_iter=1000) > clf.fit(X, y) > res = clf.predict([[0, 0], [0, 1], [1, 0], [1, 1]]) > print(res) > ``` > > On Wed, Nov 23, 2016 at 10:27 AM, wrote: > Hi everyone > > I try to use sklearn.neural_network.MLPClassifier to test the XOR operation, but I found the result is not satisfied. The following is code, can you tell me if I use the lib incorrectly? > > from sklearn.neural_network import MLPClassifier > X = [[0, 0], [0, 1], [1, 0], [1, 1]] > y = [0, 1, 1, 0] > clf = MLPClassifier(solver='adam', activation='logistic', alpha=1e-3, hidden_layer_sizes=(2,), max_iter=1000) > clf.fit(X, y) > res = clf.predict([[0, 0], [0, 1], [1, 0], [1, 1]]) > print(res) > > > #result is [0 0 0 0], score is 0.5 > > _______________________________________________ > scikit-learn mailing list > scikit-learn at python.org > https://mail.python.org/mailman/listinfo/scikit-learn > > > > > -- > Raghav RV > https://github.com/raghavrv > > _______________________________________________ > scikit-learn mailing list > scikit-learn at python.org > https://mail.python.org/mailman/listinfo/scikit-learn -------------- next part -------------- An HTML attachment was scrubbed... URL: -------------- next part -------------- A non-text attachment was scrubbed... Name: Unknown-2.png Type: image/png Size: 9601 bytes Desc: not available URL: -------------- next part -------------- A non-text attachment was scrubbed... 
Name: Unknown-1.png Type: image/png Size: 10222 bytes Desc: not available URL: From deshpande.jaidev at gmail.com Thu Nov 24 03:00:36 2016 From: deshpande.jaidev at gmail.com (Jaidev Deshpande) Date: Thu, 24 Nov 2016 08:00:36 +0000 Subject: [scikit-learn] Specifying exceptions to ParameterGrid In-Reply-To: <58356FF0.4030209@gmail.com> References: <58356FF0.4030209@gmail.com> Message-ID: On Wed, 23 Nov 2016 at 19:05 Roman Yurchak wrote: > Hi Jaidev, > > well, `param_grid` in GridSearchCV can also be a list of dictionaries, > so you could directly specify the cases you are interested in (instead > of the full grid - exceptions), which might be simpler? > Actually now that I think of it, I don't know if it will be necessarily simpler. What if I have a massive grid and only few exceptions? Enumerating the complement of that small subset would be much more expensive than specifying the exceptions. What do you think? > > On 23/11/16 11:15, Jaidev Deshpande wrote: > > Hi, > > > > Sometimes when using GridSearchCV, I realize that in the grid there are > > certain combinations of hyperparameters that are either incompatible or > > redundant. For example, when using an MLP, if I specify the following > grid: > > > > grid = {'solver': ['sgd', 'adam'], 'learning_rate': ['constant', > > 'invscaling', 'adaptive']} > > > > then it yields the following ParameterGrid: > > > > [{'learning_rate': 'constant', 'solver': 'sgd'}, > > {'learning_rate': 'constant', 'solver': 'adam'}, > > {'learning_rate': 'invscaling', 'solver': 'sgd'}, > > {'learning_rate': 'invscaling', 'solver': 'adam'}, > > {'learning_rate': 'adaptive', 'solver': 'sgd'}, > > {'learning_rate': 'adaptive', 'solver': 'adam'}] > > > > Now, three of these are redundant, since learning_rate is used only for > > the sgd solver. Ideally I'd like to specify these cases upfront, and for > > that I have a simple hack > > ( > https://github.com/jaidevd/jarvis/blob/master/jarvis/cross_validation.py#L38 > ). > > Using that yields a ParameterGrid as follows: > > > > [{'learning_rate': 'constant', 'solver': 'adam'}, > > {'learning_rate': 'invscaling', 'solver': 'adam'}, > > {'learning_rate': 'adaptive', 'solver': 'adam'}] > > > > which is then simply removed from the original ParameterGrid. > > > > I wonder if there's a simpler way of doing this. Would it help if we had > > an additional parameter (something like "grid_exceptions") in > > GridSearchCV, which would remove these dicts from the list of parameters? > > > > Thanks > > > > > > _______________________________________________ > > scikit-learn mailing list > > scikit-learn at python.org > > https://mail.python.org/mailman/listinfo/scikit-learn > > > > _______________________________________________ > scikit-learn mailing list > scikit-learn at python.org > https://mail.python.org/mailman/listinfo/scikit-learn > -------------- next part -------------- An HTML attachment was scrubbed... 
URL: From linjia at ruijie.com.cn Thu Nov 24 20:08:34 2016 From: linjia at ruijie.com.cn (linjia at ruijie.com.cn) Date: Fri, 25 Nov 2016 01:08:34 +0000 Subject: [scikit-learn] =?utf-8?b?562U5aSNOiAgIHF1ZXN0aW9uIGFib3V0IHVzaW5n?= =?utf-8?q?_sklearn=2Eneural=5Fnetwork=2EMLPClassifier=EF=BC=9F?= In-Reply-To: <4E63AF55-B6EA-4505-827F-B1D69D3F458B@gmail.com> References: <265E382B26F78742B972BE038FD292C6A79A86@fzex2.ruijie.com.cn> <265E382B26F78742B972BE038FD292C6A79DA5@fzex2.ruijie.com.cn> <4E63AF55-B6EA-4505-827F-B1D69D3F458B@gmail.com> Message-ID: <265E382B26F78742B972BE038FD292C6A7A21F@fzex2.ruijie.com.cn> @ Sebastian Raschka thanks for your analyzing , here is another question, when I use neural network lib routine, can I save the trained network for use at the next time? Just like the following: Foo1.py ? Clf.fit(x,y) Result_network = clf.save() ? Foo2.py ? Clf = Load(result_network) Res = Clf.predict(newsample) ? So I needn?t fit the train-set everytime ???: scikit-learn [mailto:scikit-learn-bounces+linjia=ruijie.com.cn at python.org] ?? Sebastian Raschka ????: 2016?11?24? 3:06 ???: Scikit-learn user and developer mailing list ??: Re: [scikit-learn] question about using sklearn.neural_network.MLPClassifier? If you keep everything at their default values, it seems to work - ```py from sklearn.neural_network import MLPClassifier X = [[0, 0], [0, 1], [1, 0], [1, 1]] y = [0, 1, 1, 0] clf = MLPClassifier(max_iter=1000) clf.fit(X, y) res = clf.predict([[0, 0], [0, 1], [1, 0], [1, 1]]) print(res) ``` The default is set 100 units in the hidden layer, but theoretically, it should work with 2 hidden logistic units (I think that?s the typical textbook/class example). I think what happens is that it gets stuck in local minima depending on the random weight initialization. E.g., the following works just fine: from sklearn.neural_network import MLPClassifier X = [[0, 0], [0, 1], [1, 0], [1, 1]] y = [0, 1, 1, 0] clf = MLPClassifier(solver='lbfgs', activation='logistic', alpha=0.0, hidden_layer_sizes=(2,), learning_rate_init=0.1, max_iter=1000, random_state=20) clf.fit(X, y) res = clf.predict([[0, 0], [0, 1], [1, 0], [1, 1]]) print(res) print(clf.loss_) but changing the random seed to 1 leads to: [0 1 1 1] 0.34660921283 For comparison, I used a more vanilla MLP (1 hidden layer with 2 units and logistic activation as well; https://github.com/rasbt/python-machine-learning-book/blob/master/code/ch12/ch12.ipynb), essentially resulting in the same problem: [cid:image001.png at 01D246FB.965B30E0][cid:image002.png at 01D246FB.965B30E0] On Nov 23, 2016, at 6:26 AM, linjia at ruijie.com.cn wrote: Yes?you are right @ Raghav R V, thx! However, i found the key param is ?hidden_layer_sizes=[2]?, I wonder if I misunderstand the meaning of parameter of hidden_layer_sizes? Is it related to the topic : http://stackoverflow.com/questions/36819287/mlp-classifier-of-scikit-neuralnetwork-not-working-for-xor ???: scikit-learn [mailto:scikit-learn-bounces+linjia=ruijie.com.cn at python.org] ?? Raghav R V ????: 2016?11?23? 19:04 ???: Scikit-learn user and developer mailing list ??: Re: [scikit-learn] question about using sklearn.neural_network.MLPClassifier? 
Hi, If you keep everything at their default values, it seems to work - ```py from sklearn.neural_network import MLPClassifier X = [[0, 0], [0, 1], [1, 0], [1, 1]] y = [0, 1, 1, 0] clf = MLPClassifier(max_iter=1000) clf.fit(X, y) res = clf.predict([[0, 0], [0, 1], [1, 0], [1, 1]]) print(res) ``` On Wed, Nov 23, 2016 at 10:27 AM, > wrote: Hi everyone I try to use sklearn.neural_network.MLPClassifier to test the XOR operation, but I found the result is not satisfied. The following is code, can you tell me if I use the lib incorrectly? from sklearn.neural_network import MLPClassifier X = [[0, 0], [0, 1], [1, 0], [1, 1]] y = [0, 1, 1, 0] clf = MLPClassifier(solver='adam', activation='logistic', alpha=1e-3, hidden_layer_sizes=(2,), max_iter=1000) clf.fit(X, y) res = clf.predict([[0, 0], [0, 1], [1, 0], [1, 1]]) print(res) #result is [0 0 0 0], score is 0.5 _______________________________________________ scikit-learn mailing list scikit-learn at python.org https://mail.python.org/mailman/listinfo/scikit-learn -- Raghav RV https://github.com/raghavrv _______________________________________________ scikit-learn mailing list scikit-learn at python.org https://mail.python.org/mailman/listinfo/scikit-learn -------------- next part -------------- An HTML attachment was scrubbed... URL: -------------- next part -------------- A non-text attachment was scrubbed... Name: image001.png Type: image/png Size: 12189 bytes Desc: image001.png URL: -------------- next part -------------- A non-text attachment was scrubbed... Name: image002.png Type: image/png Size: 12879 bytes Desc: image002.png URL: From se.raschka at gmail.com Thu Nov 24 21:51:20 2016 From: se.raschka at gmail.com (Sebastian Raschka) Date: Thu, 24 Nov 2016 21:51:20 -0500 Subject: [scikit-learn] =?utf-8?b?562U5aSNOiAgIHF1ZXN0aW9uIGFib3V0IHVz?= =?utf-8?q?ing_sklearn=2Eneural=5Fnetwork=2EMLPClassifier=EF=BC=9F?= In-Reply-To: <265E382B26F78742B972BE038FD292C6A7A21F@fzex2.ruijie.com.cn> References: <265E382B26F78742B972BE038FD292C6A79A86@fzex2.ruijie.com.cn> <265E382B26F78742B972BE038FD292C6A79DA5@fzex2.ruijie.com.cn> <4E63AF55-B6EA-4505-827F-B1D69D3F458B@gmail.com> <265E382B26F78742B972BE038FD292C6A7A21F@fzex2.ruijie.com.cn> Message-ID: <06B1DFA4-19BE-4815-8204-2787749CA81C@gmail.com> > here is another question, when I use neural network lib routine, can I save the trained network for use at the next time? Maybe have a look at the model persistence section at http://scikit-learn.org/stable/modules/model_persistence.html or http://cmry.github.io/notes/serialize Cheers, Sebastian > On Nov 24, 2016, at 8:08 PM, linjia at ruijie.com.cn wrote: > > @ Sebastian Raschka > thanks for your analyzing , > here is another question, when I use neural network lib routine, can I save the trained network for use at the next time? > Just like the following: > > Foo1.py > ? > Clf.fit(x,y) > Result_network = clf.save() > ? > > Foo2.py > ? > Clf = Load(result_network) > Res = Clf.predict(newsample) > ? > > So I needn?t fit the train-set everytime > ???: scikit-learn [mailto:scikit-learn-bounces+linjia=ruijie.com.cn at python.org] ?? Sebastian Raschka > ????: 2016?11?24? 3:06 > ???: Scikit-learn user and developer mailing list > ??: Re: [scikit-learn] question about using sklearn.neural_network.MLPClassifier? 
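A minimal sketch of the persistence route from the docs linked above, using joblib pickling (imported from sklearn.externals in the 0.18 era; the standalone joblib package works the same way). The clf.save()/Load() calls in the quoted pseudocode are hypothetical and do not exist in scikit-learn:

```py
from sklearn.externals import joblib
from sklearn.neural_network import MLPClassifier

X = [[0, 0], [0, 1], [1, 0], [1, 1]]
y = [0, 1, 1, 0]

# "Foo1.py": fit once and persist the trained estimator to disk.
clf = MLPClassifier(max_iter=1000).fit(X, y)
joblib.dump(clf, 'mlp_xor.pkl')

# "Foo2.py": reload the fitted model later without re-training.
clf_loaded = joblib.load('mlp_xor.pkl')
print(clf_loaded.predict([[0, 1], [1, 1]]))
```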
> > If you keep everything at their default values, it seems to work - > > ```py > from sklearn.neural_network import MLPClassifier > X = [[0, 0], [0, 1], [1, 0], [1, 1]] > y = [0, 1, 1, 0] > clf = MLPClassifier(max_iter=1000) > clf.fit(X, y) > res = clf.predict([[0, 0], [0, 1], [1, 0], [1, 1]]) > print(res) > ``` > > The default is set 100 units in the hidden layer, but theoretically, it should work with 2 hidden logistic units (I think that?s the typical textbook/class example). I think what happens is that it gets stuck in local minima depending on the random weight initialization. E.g., the following works just fine: > > from sklearn.neural_network import MLPClassifier > X = [[0, 0], [0, 1], [1, 0], [1, 1]] > y = [0, 1, 1, 0] > clf = MLPClassifier(solver='lbfgs', > activation='logistic', > alpha=0.0, > hidden_layer_sizes=(2,), > learning_rate_init=0.1, > max_iter=1000, > random_state=20) > clf.fit(X, y) > res = clf.predict([[0, 0], [0, 1], [1, 0], [1, 1]]) > print(res) > print(clf.loss_) > > > but changing the random seed to 1 leads to: > > [0 1 1 1] > 0.34660921283 > > For comparison, I used a more vanilla MLP (1 hidden layer with 2 units and logistic activation as well; https://github.com/rasbt/python-machine-learning-book/blob/master/code/ch12/ch12.ipynb), essentially resulting in the same problem: > > > > > > > > On Nov 23, 2016, at 6:26 AM, linjia at ruijie.com.cn wrote: > > Yes?you are right @ Raghav R V, thx! > > However, i found the key param is ?hidden_layer_sizes=[2]?, I wonder if I misunderstand the meaning of parameter of hidden_layer_sizes? > > Is it related to the topic : http://stackoverflow.com/questions/36819287/mlp-classifier-of-scikit-neuralnetwork-not-working-for-xor > > > ???: scikit-learn [mailto:scikit-learn-bounces+linjia=ruijie.com.cn at python.org] ?? Raghav R V > ????: 2016?11?23? 19:04 > ???: Scikit-learn user and developer mailing list > ??: Re: [scikit-learn] question about using sklearn.neural_network.MLPClassifier? > > Hi, > > If you keep everything at their default values, it seems to work - > > ```py > from sklearn.neural_network import MLPClassifier > X = [[0, 0], [0, 1], [1, 0], [1, 1]] > y = [0, 1, 1, 0] > clf = MLPClassifier(max_iter=1000) > clf.fit(X, y) > res = clf.predict([[0, 0], [0, 1], [1, 0], [1, 1]]) > print(res) > ``` > > On Wed, Nov 23, 2016 at 10:27 AM, wrote: > Hi everyone > > I try to use sklearn.neural_network.MLPClassifier to test the XOR operation, but I found the result is not satisfied. The following is code, can you tell me if I use the lib incorrectly? 
> > from sklearn.neural_network import MLPClassifier > X = [[0, 0], [0, 1], [1, 0], [1, 1]] > y = [0, 1, 1, 0] > clf = MLPClassifier(solver='adam', activation='logistic', alpha=1e-3, hidden_layer_sizes=(2,), max_iter=1000) > clf.fit(X, y) > res = clf.predict([[0, 0], [0, 1], [1, 0], [1, 1]]) > print(res) > > > #result is [0 0 0 0], score is 0.5 > > _______________________________________________ > scikit-learn mailing list > scikit-learn at python.org > https://mail.python.org/mailman/listinfo/scikit-learn > > > > > -- > Raghav RV > https://github.com/raghavrv > > _______________________________________________ > scikit-learn mailing list > scikit-learn at python.org > https://mail.python.org/mailman/listinfo/scikit-learn > > _______________________________________________ > scikit-learn mailing list > scikit-learn at python.org > https://mail.python.org/mailman/listinfo/scikit-learn From linjia at ruijie.com.cn Thu Nov 24 21:56:04 2016 From: linjia at ruijie.com.cn (linjia at ruijie.com.cn) Date: Fri, 25 Nov 2016 02:56:04 +0000 Subject: [scikit-learn] =?utf-8?b?562U5aSNOiAg562U5aSNOiAgIHF1ZXN0aW9uIGFi?= =?utf-8?q?out_using_sklearn=2Eneural=5Fnetwork=2EMLPClassifier=EF=BC=9F?= In-Reply-To: <06B1DFA4-19BE-4815-8204-2787749CA81C@gmail.com> References: <265E382B26F78742B972BE038FD292C6A79A86@fzex2.ruijie.com.cn> <265E382B26F78742B972BE038FD292C6A79DA5@fzex2.ruijie.com.cn> <4E63AF55-B6EA-4505-827F-B1D69D3F458B@gmail.com> <265E382B26F78742B972BE038FD292C6A7A21F@fzex2.ruijie.com.cn> <06B1DFA4-19BE-4815-8204-2787749CA81C@gmail.com> Message-ID: <265E382B26F78742B972BE038FD292C6A7A26F@fzex2.ruijie.com.cn> Yes, it is what I need, Thanks you very much! -----????----- ???: scikit-learn [mailto:scikit-learn-bounces+linjia=ruijie.com.cn at python.org] ?? Sebastian Raschka ????: 2016?11?25? 10:51 ???: Scikit-learn user and developer mailing list ??: Re: [scikit-learn] ??: question about using sklearn.neural_network.MLPClassifier? > here is another question, when I use neural network lib routine, can I save the trained network for use at the next time? Maybe have a look at the model persistence section at http://scikit-learn.org/stable/modules/model_persistence.html or http://cmry.github.io/notes/serialize Cheers, Sebastian > On Nov 24, 2016, at 8:08 PM, linjia at ruijie.com.cn wrote: > > @ Sebastian Raschka > thanks for your analyzing , > here is another question, when I use neural network lib routine, can I save the trained network for use at the next time? > Just like the following: > > Foo1.py > ? > Clf.fit(x,y) > Result_network = clf.save() > ? > > Foo2.py > ? > Clf = Load(result_network) > Res = Clf.predict(newsample) > ? > > So I needn?t fit the train-set everytime > ???: scikit-learn > [mailto:scikit-learn-bounces+linjia=ruijie.com.cn at python.org] ?? > Sebastian Raschka > ????: 2016?11?24? 3:06 > ???: Scikit-learn user and developer mailing list > ??: Re: [scikit-learn] question about using > sklearn.neural_network.MLPClassifier? > > If you keep everything at their default values, it seems to work - > > ```py > from sklearn.neural_network import MLPClassifier X = [[0, 0], [0, 1], > [1, 0], [1, 1]] y = [0, 1, 1, 0] clf = MLPClassifier(max_iter=1000) > clf.fit(X, y) res = clf.predict([[0, 0], [0, 1], [1, 0], [1, 1]]) > print(res) > ``` > > The default is set 100 units in the hidden layer, but theoretically, it should work with 2 hidden logistic units (I think that?s the typical textbook/class example). 
I think what happens is that it gets stuck in local minima depending on the random weight initialization. E.g., the following works just fine: > > from sklearn.neural_network import MLPClassifier X = [[0, 0], [0, 1], > [1, 0], [1, 1]] y = [0, 1, 1, 0] clf = MLPClassifier(solver='lbfgs', > activation='logistic', > alpha=0.0, > hidden_layer_sizes=(2,), > learning_rate_init=0.1, > max_iter=1000, > random_state=20) > clf.fit(X, y) > res = clf.predict([[0, 0], [0, 1], [1, 0], [1, 1]]) > print(res) > print(clf.loss_) > > > but changing the random seed to 1 leads to: > > [0 1 1 1] > 0.34660921283 > > For comparison, I used a more vanilla MLP (1 hidden layer with 2 units and logistic activation as well; https://github.com/rasbt/python-machine-learning-book/blob/master/code/ch12/ch12.ipynb), essentially resulting in the same problem: > > > > > > > > On Nov 23, 2016, at 6:26 AM, linjia at ruijie.com.cn wrote: > > Yes?you are right @ Raghav R V, thx! > > However, i found the key param is ?hidden_layer_sizes=[2]?, I wonder if I misunderstand the meaning of parameter of hidden_layer_sizes? > > Is it related to the topic : > http://stackoverflow.com/questions/36819287/mlp-classifier-of-scikit-n > euralnetwork-not-working-for-xor > > > ???: scikit-learn > [mailto:scikit-learn-bounces+linjia=ruijie.com.cn at python.org] ?? > Raghav R V > ????: 2016?11?23? 19:04 > ???: Scikit-learn user and developer mailing list > ??: Re: [scikit-learn] question about using > sklearn.neural_network.MLPClassifier? > > Hi, > > If you keep everything at their default values, it seems to work - > > ```py > from sklearn.neural_network import MLPClassifier X = [[0, 0], [0, 1], > [1, 0], [1, 1]] y = [0, 1, 1, 0] clf = MLPClassifier(max_iter=1000) > clf.fit(X, y) res = clf.predict([[0, 0], [0, 1], [1, 0], [1, 1]]) > print(res) > ``` > > On Wed, Nov 23, 2016 at 10:27 AM, wrote: > Hi everyone > > I try to use sklearn.neural_network.MLPClassifier to test the XOR operation, but I found the result is not satisfied. The following is code, can you tell me if I use the lib incorrectly? 
> > from sklearn.neural_network import MLPClassifier X = [[0, 0], [0, 1], > [1, 0], [1, 1]] y = [0, 1, 1, 0] clf = MLPClassifier(solver='adam', > activation='logistic', alpha=1e-3, hidden_layer_sizes=(2,), > max_iter=1000) clf.fit(X, y) res = clf.predict([[0, 0], [0, 1], [1, > 0], [1, 1]]) > print(res) > > > #result is [0 0 0 0], score is 0.5 > > _______________________________________________ > scikit-learn mailing list > scikit-learn at python.org > https://mail.python.org/mailman/listinfo/scikit-learn > > > > > -- > Raghav RV > https://github.com/raghavrv > > _______________________________________________ > scikit-learn mailing list > scikit-learn at python.org > https://mail.python.org/mailman/listinfo/scikit-learn > > _______________________________________________ > scikit-learn mailing list > scikit-learn at python.org > https://mail.python.org/mailman/listinfo/scikit-learn _______________________________________________ scikit-learn mailing list scikit-learn at python.org https://mail.python.org/mailman/listinfo/scikit-learn From linjia at ruijie.com.cn Fri Nov 25 06:38:40 2016 From: linjia at ruijie.com.cn (linjia at ruijie.com.cn) Date: Fri, 25 Nov 2016 11:38:40 +0000 Subject: [scikit-learn] =?utf-8?b?562U5aSNOiAg562U5aSNOiAgIHF1ZXN0aW9uIGFi?= =?utf-8?q?out_using_sklearn=2Eneural=5Fnetwork=2EMLPClassifier=EF=BC=9F?= In-Reply-To: <06B1DFA4-19BE-4815-8204-2787749CA81C@gmail.com> References: <265E382B26F78742B972BE038FD292C6A79A86@fzex2.ruijie.com.cn> <265E382B26F78742B972BE038FD292C6A79DA5@fzex2.ruijie.com.cn> <4E63AF55-B6EA-4505-827F-B1D69D3F458B@gmail.com> <265E382B26F78742B972BE038FD292C6A7A21F@fzex2.ruijie.com.cn> <06B1DFA4-19BE-4815-8204-2787749CA81C@gmail.com> Message-ID: <265E382B26F78742B972BE038FD292C6A7BD20@fzex2.ruijie.com.cn> Hello everyone, I use ' IsolationForest' to pick up the outlier data today and I notice there is a ' contamination ' parameter in IsolationForest function, and its default value is 0.1 = 10% So is there a way to pick the outlier without assigning the proportion of outliers in the data set? For example, in dataset [2,3,2,4,2,3,1,2,3,1,2, 999, 2,3,2,1,2,3], we can easily pick the '999' as an outlier entry out of the set according to the consciousness And I read some paper about outlier detect recently, many of them need number of outlier and distance as input parameter in advance, is there algorithm more intelligently ? -----????----- ???: scikit-learn [mailto:scikit-learn-bounces+linjia=ruijie.com.cn at python.org] ?? Sebastian Raschka ????: 2016?11?25? 10:51 ???: Scikit-learn user and developer mailing list ??: Re: [scikit-learn] ??: question about using sklearn.neural_network.MLPClassifier? > here is another question, when I use neural network lib routine, can I save the trained network for use at the next time? Maybe have a look at the model persistence section at http://scikit-learn.org/stable/modules/model_persistence.html or http://cmry.github.io/notes/serialize Cheers, Sebastian > On Nov 24, 2016, at 8:08 PM, linjia at ruijie.com.cn wrote: > > @ Sebastian Raschka > thanks for your analyzing , > here is another question, when I use neural network lib routine, can I save the trained network for use at the next time? > Just like the following: > > Foo1.py > ? > Clf.fit(x,y) > Result_network = clf.save() > ? > > Foo2.py > ? > Clf = Load(result_network) > Res = Clf.predict(newsample) > ? > > So I needn?t fit the train-set everytime > ???: scikit-learn > [mailto:scikit-learn-bounces+linjia=ruijie.com.cn at python.org] ?? 
> Sebastian Raschka > ????: 2016?11?24? 3:06 > ???: Scikit-learn user and developer mailing list > ??: Re: [scikit-learn] question about using > sklearn.neural_network.MLPClassifier? > > If you keep everything at their default values, it seems to work - > > ```py > from sklearn.neural_network import MLPClassifier X = [[0, 0], [0, 1], > [1, 0], [1, 1]] y = [0, 1, 1, 0] clf = MLPClassifier(max_iter=1000) > clf.fit(X, y) res = clf.predict([[0, 0], [0, 1], [1, 0], [1, 1]]) > print(res) > ``` > > The default is set 100 units in the hidden layer, but theoretically, it should work with 2 hidden logistic units (I think that?s the typical textbook/class example). I think what happens is that it gets stuck in local minima depending on the random weight initialization. E.g., the following works just fine: > > from sklearn.neural_network import MLPClassifier X = [[0, 0], [0, 1], > [1, 0], [1, 1]] y = [0, 1, 1, 0] clf = MLPClassifier(solver='lbfgs', > activation='logistic', > alpha=0.0, > hidden_layer_sizes=(2,), > learning_rate_init=0.1, > max_iter=1000, > random_state=20) > clf.fit(X, y) > res = clf.predict([[0, 0], [0, 1], [1, 0], [1, 1]]) > print(res) > print(clf.loss_) > > > but changing the random seed to 1 leads to: > > [0 1 1 1] > 0.34660921283 > > For comparison, I used a more vanilla MLP (1 hidden layer with 2 units and logistic activation as well; https://github.com/rasbt/python-machine-learning-book/blob/master/code/ch12/ch12.ipynb), essentially resulting in the same problem: > > > > > > > > On Nov 23, 2016, at 6:26 AM, linjia at ruijie.com.cn wrote: > > Yes?you are right @ Raghav R V, thx! > > However, i found the key param is ?hidden_layer_sizes=[2]?, I wonder if I misunderstand the meaning of parameter of hidden_layer_sizes? > > Is it related to the topic : > http://stackoverflow.com/questions/36819287/mlp-classifier-of-scikit-n > euralnetwork-not-working-for-xor > > > ???: scikit-learn > [mailto:scikit-learn-bounces+linjia=ruijie.com.cn at python.org] ?? > Raghav R V > ????: 2016?11?23? 19:04 > ???: Scikit-learn user and developer mailing list > ??: Re: [scikit-learn] question about using > sklearn.neural_network.MLPClassifier? > > Hi, > > If you keep everything at their default values, it seems to work - > > ```py > from sklearn.neural_network import MLPClassifier X = [[0, 0], [0, 1], > [1, 0], [1, 1]] y = [0, 1, 1, 0] clf = MLPClassifier(max_iter=1000) > clf.fit(X, y) res = clf.predict([[0, 0], [0, 1], [1, 0], [1, 1]]) > print(res) > ``` > > On Wed, Nov 23, 2016 at 10:27 AM, wrote: > Hi everyone > > I try to use sklearn.neural_network.MLPClassifier to test the XOR operation, but I found the result is not satisfied. The following is code, can you tell me if I use the lib incorrectly? 
> > from sklearn.neural_network import MLPClassifier X = [[0, 0], [0, 1], > [1, 0], [1, 1]] y = [0, 1, 1, 0] clf = MLPClassifier(solver='adam', > activation='logistic', alpha=1e-3, hidden_layer_sizes=(2,), > max_iter=1000) clf.fit(X, y) res = clf.predict([[0, 0], [0, 1], [1, > 0], [1, 1]]) > print(res) > > > #result is [0 0 0 0], score is 0.5 > > _______________________________________________ > scikit-learn mailing list > scikit-learn at python.org > https://mail.python.org/mailman/listinfo/scikit-learn > > > > > -- > Raghav RV > https://github.com/raghavrv > > _______________________________________________ > scikit-learn mailing list > scikit-learn at python.org > https://mail.python.org/mailman/listinfo/scikit-learn > > _______________________________________________ > scikit-learn mailing list > scikit-learn at python.org > https://mail.python.org/mailman/listinfo/scikit-learn _______________________________________________ scikit-learn mailing list scikit-learn at python.org https://mail.python.org/mailman/listinfo/scikit-learn From rth.yurchak at gmail.com Fri Nov 25 09:52:12 2016 From: rth.yurchak at gmail.com (Roman Yurchak) Date: Fri, 25 Nov 2016 15:52:12 +0100 Subject: [scikit-learn] Specifying exceptions to ParameterGrid In-Reply-To: References: <58356FF0.4030209@gmail.com> Message-ID: <5838501C.8080400@gmail.com> On 24/11/16 09:00, Jaidev Deshpande wrote: > > well, `param_grid` in GridSearchCV can also be a list of dictionaries, > so you could directly specify the cases you are interested in (instead > of the full grid - exceptions), which might be simpler? > > > Actually now that I think of it, I don't know if it will be necessarily > simpler. What if I have a massive grid and only few exceptions? > Enumerating the complement of that small subset would be much more > expensive than specifying the exceptions. The solution indicated by Raghav is most concise if that works for you. Otherwise, in general, if you want to define the parameters as the full grid with a few exceptions, without changing the GirdSearchCV API, you could always try something like, ``` from sklearn.model_selection import GridSearchCV, ParameterGrid from sklearn.neural_network import MLPClassifier grid_full = {'solver': ['sgd', 'adam'], 'learning_rate': ['constant', 'invscaling', 'adaptive']} def exception_handler(args): # custom function shaping the domain of valid parameters if args['solver'] == 'adam' and args['learning_rate'] != 'constant': return False else: return True def wrap_strings(args): # all values of dicts provided to GridSearchCV must be lists return {key: [val] for key, val in args.items()} grid_tmp = filter(exception_handler, ParameterGrid(grid_full)) grid = [wrap_strings(el) for el in grid_tmp] gs = GridSearchCV(MLPClassifier(random_state=42), param_grid=grid) ``` That's quite similar to what you were suggesting in the original post. 
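For reference, the list-of-dicts form of `param_grid` mentioned earlier in this thread can express the same constraint directly, without a filter step. A rough sketch, with grid values that simply mirror the MLPClassifier example above:

```py
from sklearn.model_selection import GridSearchCV
from sklearn.neural_network import MLPClassifier

# Each dict describes one valid sub-grid; together they replace
# "full grid minus exceptions" by enumerating only the allowed regions.
param_grid = [
    {'solver': ['adam'], 'learning_rate': ['constant']},
    {'solver': ['sgd'],
     'learning_rate': ['constant', 'invscaling', 'adaptive']},
]

gs = GridSearchCV(MLPClassifier(random_state=42), param_grid=param_grid)
```

This avoids both the filter function and the single-value wrapping, at the cost of writing out each allowed region by hand, which is only practical when the number of exceptions is small.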
From se.raschka at gmail.com Fri Nov 25 13:57:43 2016 From: se.raschka at gmail.com (Sebastian Raschka) Date: Fri, 25 Nov 2016 13:57:43 -0500 Subject: [scikit-learn] =?utf-8?b?562U5aSNOiAg562U5aSNOiAgIHF1ZXN0aW9u?= =?utf-8?q?_about_using_sklearn=2Eneural=5Fnetwork=2EMLPClassifier?= =?utf-8?b?77yf?= In-Reply-To: <265E382B26F78742B972BE038FD292C6A7BD20@fzex2.ruijie.com.cn> References: <265E382B26F78742B972BE038FD292C6A79A86@fzex2.ruijie.com.cn> <265E382B26F78742B972BE038FD292C6A79DA5@fzex2.ruijie.com.cn> <4E63AF55-B6EA-4505-827F-B1D69D3F458B@gmail.com> <265E382B26F78742B972BE038FD292C6A7A21F@fzex2.ruijie.com.cn> <06B1DFA4-19BE-4815-8204-2787749CA81C@gmail.com> <265E382B26F78742B972BE038FD292C6A7BD20@fzex2.ruijie.com.cn> Message-ID: <3822A948-0FF5-488E-936B-6DF92EDE38EC@gmail.com> > many of them need number of outlier and distance as input parameter in advance, is there algorithm more intelligently ? With ?intelligently? you mean ?more automatic? (fewer hyperparameters to define manually)? In my opinion, ?outlier? is a highly context-specific definition, thus, it?s really up to you to decide what to count as an outlier or not for your application. E.g., a simple non-parametric approach would be to say that point P is an outlier if P > Q3 + 1.5 * IQR, or P < Q1 - 1.5 * IQR where Q1 and Q3 are the first and third quartile of the dataset, respectively, and IQR = interquartile range (Q3-Q1). Similarly you could use thresholds based on variance or standard deviation, etc. so that you don?t need to specify the number of outliers if that?s not what you want > On Nov 25, 2016, at 6:38 AM, linjia at ruijie.com.cn wrote: > > Hello everyone, > I use ' IsolationForest' to pick up the outlier data today and I notice there is a ' contamination ' parameter in IsolationForest function, and its default value is 0.1 = 10% > So is there a way to pick the outlier without assigning the proportion of outliers in the data set? > For example, in dataset [2,3,2,4,2,3,1,2,3,1,2, 999, 2,3,2,1,2,3], we can easily pick the '999' as an outlier entry out of the set according to the consciousness > And I read some paper about outlier detect recently, many of them need number of outlier and distance as input parameter in advance, is there algorithm more intelligently ? > > > > > > -----????----- > ???: scikit-learn [mailto:scikit-learn-bounces+linjia=ruijie.com.cn at python.org] ?? Sebastian Raschka > ????: 2016?11?25? 10:51 > ???: Scikit-learn user and developer mailing list > ??: Re: [scikit-learn] ??: question about using sklearn.neural_network.MLPClassifier? > >> here is another question, when I use neural network lib routine, can I save the trained network for use at the next time? > > > Maybe have a look at the model persistence section at http://scikit-learn.org/stable/modules/model_persistence.html or http://cmry.github.io/notes/serialize > > Cheers, > Sebastian > > >> On Nov 24, 2016, at 8:08 PM, linjia at ruijie.com.cn wrote: >> >> @ Sebastian Raschka >> thanks for your analyzing , >> here is another question, when I use neural network lib routine, can I save the trained network for use at the next time? >> Just like the following: >> >> Foo1.py >> ? >> Clf.fit(x,y) >> Result_network = clf.save() >> ? >> >> Foo2.py >> ? >> Clf = Load(result_network) >> Res = Clf.predict(newsample) >> ? >> >> So I needn?t fit the train-set everytime >> ???: scikit-learn >> [mailto:scikit-learn-bounces+linjia=ruijie.com.cn at python.org] ?? >> Sebastian Raschka >> ????: 2016?11?24? 
3:06 >> ???: Scikit-learn user and developer mailing list >> ??: Re: [scikit-learn] question about using >> sklearn.neural_network.MLPClassifier? >> >> If you keep everything at their default values, it seems to work - >> >> ```py >> from sklearn.neural_network import MLPClassifier X = [[0, 0], [0, 1], >> [1, 0], [1, 1]] y = [0, 1, 1, 0] clf = MLPClassifier(max_iter=1000) >> clf.fit(X, y) res = clf.predict([[0, 0], [0, 1], [1, 0], [1, 1]]) >> print(res) >> ``` >> >> The default is set 100 units in the hidden layer, but theoretically, it should work with 2 hidden logistic units (I think that?s the typical textbook/class example). I think what happens is that it gets stuck in local minima depending on the random weight initialization. E.g., the following works just fine: >> >> from sklearn.neural_network import MLPClassifier X = [[0, 0], [0, 1], >> [1, 0], [1, 1]] y = [0, 1, 1, 0] clf = MLPClassifier(solver='lbfgs', >> activation='logistic', >> alpha=0.0, >> hidden_layer_sizes=(2,), >> learning_rate_init=0.1, >> max_iter=1000, >> random_state=20) >> clf.fit(X, y) >> res = clf.predict([[0, 0], [0, 1], [1, 0], [1, 1]]) >> print(res) >> print(clf.loss_) >> >> >> but changing the random seed to 1 leads to: >> >> [0 1 1 1] >> 0.34660921283 >> >> For comparison, I used a more vanilla MLP (1 hidden layer with 2 units and logistic activation as well; https://github.com/rasbt/python-machine-learning-book/blob/master/code/ch12/ch12.ipynb), essentially resulting in the same problem: >> >> >> >> >> >> >> >> On Nov 23, 2016, at 6:26 AM, linjia at ruijie.com.cn wrote: >> >> Yes?you are right @ Raghav R V, thx! >> >> However, i found the key param is ?hidden_layer_sizes=[2]?, I wonder if I misunderstand the meaning of parameter of hidden_layer_sizes? >> >> Is it related to the topic : >> http://stackoverflow.com/questions/36819287/mlp-classifier-of-scikit-n >> euralnetwork-not-working-for-xor >> >> >> ???: scikit-learn >> [mailto:scikit-learn-bounces+linjia=ruijie.com.cn at python.org] ?? >> Raghav R V >> ????: 2016?11?23? 19:04 >> ???: Scikit-learn user and developer mailing list >> ??: Re: [scikit-learn] question about using >> sklearn.neural_network.MLPClassifier? >> >> Hi, >> >> If you keep everything at their default values, it seems to work - >> >> ```py >> from sklearn.neural_network import MLPClassifier X = [[0, 0], [0, 1], >> [1, 0], [1, 1]] y = [0, 1, 1, 0] clf = MLPClassifier(max_iter=1000) >> clf.fit(X, y) res = clf.predict([[0, 0], [0, 1], [1, 0], [1, 1]]) >> print(res) >> ``` >> >> On Wed, Nov 23, 2016 at 10:27 AM, wrote: >> Hi everyone >> >> I try to use sklearn.neural_network.MLPClassifier to test the XOR operation, but I found the result is not satisfied. The following is code, can you tell me if I use the lib incorrectly? 
>>
>> from sklearn.neural_network import MLPClassifier
>> X = [[0, 0], [0, 1], [1, 0], [1, 1]]
>> y = [0, 1, 1, 0]
>> clf = MLPClassifier(solver='adam', activation='logistic', alpha=1e-3, hidden_layer_sizes=(2,), max_iter=1000)
>> clf.fit(X, y)
>> res = clf.predict([[0, 0], [0, 1], [1, 0], [1, 1]])
>> print(res)
>>
>>
>> #result is [0 0 0 0], score is 0.5
>>
>> _______________________________________________
>> scikit-learn mailing list
>> scikit-learn at python.org
>> https://mail.python.org/mailman/listinfo/scikit-learn
>>
>>
>>
>>
>> --
>> Raghav RV
>> https://github.com/raghavrv
>>
>> _______________________________________________
>> scikit-learn mailing list
>> scikit-learn at python.org
>> https://mail.python.org/mailman/listinfo/scikit-learn
>>
>> _______________________________________________
>> scikit-learn mailing list
>> scikit-learn at python.org
>> https://mail.python.org/mailman/listinfo/scikit-learn
>
> _______________________________________________
> scikit-learn mailing list
> scikit-learn at python.org
> https://mail.python.org/mailman/listinfo/scikit-learn
> _______________________________________________
> scikit-learn mailing list
> scikit-learn at python.org
> https://mail.python.org/mailman/listinfo/scikit-learn

From tommaso.costanzo01 at gmail.com  Fri Nov 25 14:34:48 2016
From: tommaso.costanzo01 at gmail.com (Tommaso Costanzo)
Date: Fri, 25 Nov 2016 14:34:48 -0500
Subject: [scikit-learn] Bayesian Gaussian Mixture
Message-ID: 

Hi,

I am facing some problems with the "BayesianGaussianMixture" function, but I do not know if it is because of my poor knowledge of this type of statistics or if it is something related to the algorithm. I have a set of data of around 1000 to 4000 observations (every feature is a spectrum of around 200 points), so in the end I have n_samples = ~1000 and n_features = ~20. The good thing is that I am getting the same results as KMeans; however, "predict_proba" only returns values of 0 or 1.

I have written a small function, reported below, that simulates my problem with random data. The first half of the array has points with a positive slope while the second half has a negative slope, so they cross in the middle. What I have seen is that for a small number of features I obtain reasonable probabilities, but as the number of features increases (say 50) the probabilities become only 0 or 1.
Can someone help me interpret this result?

Here is the code I wrote with the generated random numbers; I'll generally run it with ncomponent=2 and nfeatures=5, 10, 50, or 100. I am not sure it will work in every case; it is not very thoroughly tested. I have also attached it as a file!
########################################################################## import numpy as np from sklearn.mixture import GaussianMixture, BayesianGaussianMixture import matplotlib.pyplot as plt def test_bgm(ncomponent, nfeatures): temp = np.random.randn(500,nfeatures) temp = temp + np.arange(-1,1, 2.0/nfeatures) temp1 = np.random.randn(400,nfeatures) temp1 = temp1 + np.arange(1,-1, (-2.0/nfeatures)) X = np.vstack((temp, temp1)) bgm = BayesianGaussianMixture(ncomponent,degrees_of_freedom_prior=nfeatures*2).fit(X) bgm_proba = bgm.predict_proba(X) bgm_labels = bgm.predict(X) plt.figure(-1) plt.imshow(bgm_labels.reshape(30,-1), origin='lower', interpolatio='none') plt.colorbar() for i in np.arange(0,ncomponent): plt.figure(i) plt.imshow(bgm_proba[:,i].reshape(30,-1), origin='lower', interpolatio='none') plt.colorbar() plt.show() ############################################################################## Thank you in advance Tommaso -- Please do NOT send Microsoft Office Attachments: http://www.gnu.org/philosophy/no-word-attachments.html -------------- next part -------------- An HTML attachment was scrubbed... URL: -------------- next part -------------- A non-text attachment was scrubbed... Name: GaussianTest.py Type: text/x-python Size: 844 bytes Desc: not available URL: From jmschreiber91 at gmail.com Fri Nov 25 21:32:20 2016 From: jmschreiber91 at gmail.com (Jacob Schreiber) Date: Fri, 25 Nov 2016 18:32:20 -0800 Subject: [scikit-learn] Bayesian Gaussian Mixture In-Reply-To: References: Message-ID: Typically this means that the model is so confident in its predictions it does not believe it possible for the sample to come from the other component. Do you get the same results with a regular GaussianMixture? On Fri, Nov 25, 2016 at 11:34 AM, Tommaso Costanzo < tommaso.costanzo01 at gmail.com> wrote: > Hi, > > I am facing some problem with the "BayesianGaussianMixture" function, but > I do not know if it is because of my poor knowledge on this type of > statistics or if it is something related to the algorithm. I have set of > data of around 1000 to 4000 observation (every feature is a spectrum of > around 200 point) so in the end I have n_samples = ~1000 and n_features = > ~20. The good things is that I am getting the same results of KMeans > however the "predict_proba" has value only of 0 or 1. > > I have wrote a small function to simulate my problem with random data that > is reported below. The first 1/2 of the array has the point with a positive > slope while the second 1/2 has a negative slope, so the cross in the > middle. What I have seen is that for a small number of features I obtain > good probability, but if the number of features increases (say 50) than the > probability become only 0 or 1. > Can someone help me in interpret this result? > > Here is the code I wrote with the generated random number, I'll generally > run it with ncomponent=2 and nfeatures=5 or 10 or 50 or 100. I am not sure > if it will work in every case is not very highly tested. I have also > attached as a file! 
> > ########################################################################## > import numpy as np > > from sklearn.mixture import GaussianMixture, BayesianGaussianMixture > > import matplotlib.pyplot as plt > > > > def test_bgm(ncomponent, nfeatures): > > temp = np.random.randn(500,nfeatures) > > temp = temp + np.arange(-1,1, 2.0/nfeatures) > > temp1 = np.random.randn(400,nfeatures) > > temp1 = temp1 + np.arange(1,-1, (-2.0/nfeatures)) > > X = np.vstack((temp, temp1)) > > > > bgm = BayesianGaussianMixture(ncomponent,degrees_of_freedom_ > prior=nfeatures*2).fit(X) > bgm_proba = bgm.predict_proba(X) > > bgm_labels = bgm.predict(X) > > > > plt.figure(-1) > > plt.imshow(bgm_labels.reshape(30,-1), origin='lower', > interpolatio='none') > plt.colorbar() > > > > for i in np.arange(0,ncomponent): > > plt.figure(i) > > plt.imshow(bgm_proba[:,i].reshape(30,-1), origin='lower', > interpolatio='none') > plt.colorbar() > > > > plt.show() > ############################################################ > ################## > > Thank you in advance > Tommaso > > > -- > Please do NOT send Microsoft Office Attachments: > http://www.gnu.org/philosophy/no-word-attachments.html > > _______________________________________________ > scikit-learn mailing list > scikit-learn at python.org > https://mail.python.org/mailman/listinfo/scikit-learn > > -------------- next part -------------- An HTML attachment was scrubbed... URL: From deshpande.jaidev at gmail.com Sat Nov 26 02:04:46 2016 From: deshpande.jaidev at gmail.com (Jaidev Deshpande) Date: Sat, 26 Nov 2016 07:04:46 +0000 Subject: [scikit-learn] Specifying exceptions to ParameterGrid In-Reply-To: <5838501C.8080400@gmail.com> References: <58356FF0.4030209@gmail.com> <5838501C.8080400@gmail.com> Message-ID: On Fri, 25 Nov 2016 at 20:24 Roman Yurchak wrote: > On 24/11/16 09:00, Jaidev Deshpande wrote: > > > > well, `param_grid` in GridSearchCV can also be a list of > dictionaries, > > so you could directly specify the cases you are interested in > (instead > > of the full grid - exceptions), which might be simpler? > > > > > > Actually now that I think of it, I don't know if it will be necessarily > > simpler. What if I have a massive grid and only few exceptions? > > Enumerating the complement of that small subset would be much more > > expensive than specifying the exceptions. > The solution indicated by Raghav is most concise if that works for you. > > Otherwise, in general, if you want to define the parameters as the full > grid with a few exceptions, without changing the GirdSearchCV API, you > could always try something like, > > ``` > from sklearn.model_selection import GridSearchCV, ParameterGrid > from sklearn.neural_network import MLPClassifier > > grid_full = {'solver': ['sgd', 'adam'], > 'learning_rate': ['constant', 'invscaling', 'adaptive']} > > def exception_handler(args): > # custom function shaping the domain of valid parameters > if args['solver'] == 'adam' and args['learning_rate'] != 'constant': > return False > else: > return True > > def wrap_strings(args): > # all values of dicts provided to GridSearchCV must be lists > return {key: [val] for key, val in args.items()} > > grid_tmp = filter(exception_handler, ParameterGrid(grid_full)) > grid = [wrap_strings(el) for el in grid_tmp] > > gs = GridSearchCV(MLPClassifier(random_state=42), > param_grid=grid) > ``` > That's quite similar to what you were suggesting in the original post. > Yes, also a lot more concise I guess. 
This way I just have to keep writing an exception handler instead of subclassing. Thanks! > _______________________________________________ > scikit-learn mailing list > scikit-learn at python.org > https://mail.python.org/mailman/listinfo/scikit-learn > -------------- next part -------------- An HTML attachment was scrubbed... URL: From tommaso.costanzo01 at gmail.com Sun Nov 27 11:47:38 2016 From: tommaso.costanzo01 at gmail.com (Tommaso Costanzo) Date: Sun, 27 Nov 2016 11:47:38 -0500 Subject: [scikit-learn] Bayesian Gaussian Mixture In-Reply-To: References: Message-ID: Hi Jacob, I have just changed my code from BayesianGaussianMixture to GaussianMixture, and the results is the same. I attached here the picture of the first component when I runned the code with 5, 10, and 50 nfeatures and 2 components. In my short test function I expect to have point that they can be in one component as well as another has visible for small number of nfeatures, but 0 1 for nfeatures >50 does not sounds correct. Seems that is just related to the size of the model and in particular to the number of features. With the BayesianGaussianMixture I have seen that it is sligthly better to increase the degree of freedoms to 2*nfeatures instead of the default nfeatures. However, this does not change the result when the nfeatures are 50 or more. Thank you in advance Tommaso 2016-11-25 21:32 GMT-05:00 Jacob Schreiber : > Typically this means that the model is so confident in its predictions it > does not believe it possible for the sample to come from the other > component. Do you get the same results with a regular GaussianMixture? > > On Fri, Nov 25, 2016 at 11:34 AM, Tommaso Costanzo < > tommaso.costanzo01 at gmail.com> wrote: > >> Hi, >> >> I am facing some problem with the "BayesianGaussianMixture" function, but >> I do not know if it is because of my poor knowledge on this type of >> statistics or if it is something related to the algorithm. I have set of >> data of around 1000 to 4000 observation (every feature is a spectrum of >> around 200 point) so in the end I have n_samples = ~1000 and n_features = >> ~20. The good things is that I am getting the same results of KMeans >> however the "predict_proba" has value only of 0 or 1. >> >> I have wrote a small function to simulate my problem with random data >> that is reported below. The first 1/2 of the array has the point with a >> positive slope while the second 1/2 has a negative slope, so the cross in >> the middle. What I have seen is that for a small number of features I >> obtain good probability, but if the number of features increases (say 50) >> than the probability become only 0 or 1. >> Can someone help me in interpret this result? >> >> Here is the code I wrote with the generated random number, I'll generally >> run it with ncomponent=2 and nfeatures=5 or 10 or 50 or 100. I am not sure >> if it will work in every case is not very highly tested. I have also >> attached as a file! 
>> >> ############################################################ >> ############## >> import numpy as np >> >> from sklearn.mixture import GaussianMixture, >> BayesianGaussianMixture >> import matplotlib.pyplot as plt >> >> >> >> def test_bgm(ncomponent, nfeatures): >> >> temp = np.random.randn(500,nfeatures) >> >> temp = temp + np.arange(-1,1, 2.0/nfeatures) >> >> temp1 = np.random.randn(400,nfeatures) >> >> temp1 = temp1 + np.arange(1,-1, (-2.0/nfeatures)) >> >> X = np.vstack((temp, temp1)) >> >> >> >> bgm = BayesianGaussianMixture(ncomponent,degrees_of_freedom_prior=nfeatures*2).fit(X) >> >> bgm_proba = bgm.predict_proba(X) >> >> bgm_labels = bgm.predict(X) >> >> >> >> plt.figure(-1) >> >> plt.imshow(bgm_labels.reshape(30,-1), origin='lower', >> interpolatio='none') >> plt.colorbar() >> >> >> >> for i in np.arange(0,ncomponent): >> >> plt.figure(i) >> >> plt.imshow(bgm_proba[:,i].reshape(30,-1), origin='lower', >> interpolatio='none') >> plt.colorbar() >> >> >> >> plt.show() >> ############################################################ >> ################## >> >> Thank you in advance >> Tommaso >> >> >> -- >> Please do NOT send Microsoft Office Attachments: >> http://www.gnu.org/philosophy/no-word-attachments.html >> >> _______________________________________________ >> scikit-learn mailing list >> scikit-learn at python.org >> https://mail.python.org/mailman/listinfo/scikit-learn >> >> > > _______________________________________________ > scikit-learn mailing list > scikit-learn at python.org > https://mail.python.org/mailman/listinfo/scikit-learn > > -- Please do NOT send Microsoft Office Attachments: http://www.gnu.org/philosophy/no-word-attachments.html -------------- next part -------------- An HTML attachment was scrubbed... URL: -------------- next part -------------- A non-text attachment was scrubbed... Name: N_Features-5.png Type: image/png Size: 23257 bytes Desc: not available URL: -------------- next part -------------- A non-text attachment was scrubbed... Name: N_Features-10.png Type: image/png Size: 21773 bytes Desc: not available URL: -------------- next part -------------- A non-text attachment was scrubbed... Name: N_Features-50.png Type: image/png Size: 18618 bytes Desc: not available URL: From a.suchaneck at gmail.com Mon Nov 28 10:24:02 2016 From: a.suchaneck at gmail.com (Anton Suchaneck) Date: Mon, 28 Nov 2016 16:24:02 +0100 Subject: [scikit-learn] How to not recalculate transformer in a Pipeline? Message-ID: Hello! I use a 2-step Pipeline with an expensive transformer and a classification afterwards. On this I do GridSearchCV of the classifcation parameters. Now, theoretically GridSearchCV could know that I'm not touching any parameters of the transformer and avoid re-doing work by keeping the transformed X, right?! Currently, GridSearchCV will do a clean re-run of all Pipeline steps? Can you recommend the easiest way for me to use GridSearchCV+Pipeline while avoiding recomputation of all transformer steps whose parameters are not in the GridSearch? I realize this may be tricky, but any pointers to realize this most conveniently and compatible with sklearn would be highly appreciated! (The scoring has to be done on the initial data, so I cannot just manually transform beforehand.) Regards, Anton PS: If that all makes sense, is that a useful feature to include in sklearn? -------------- next part -------------- An HTML attachment was scrubbed... 
URL: From t3kcit at gmail.com Mon Nov 28 11:39:59 2016 From: t3kcit at gmail.com (Andreas Mueller) Date: Mon, 28 Nov 2016 11:39:59 -0500 Subject: [scikit-learn] How to not recalculate transformer in a Pipeline? In-Reply-To: References: Message-ID: Hey Anton. Yes, that would be great to have. There is no solution implemented in scikit-learn right now, but there are at least two ways that I know of. This (ancient and probably now defunct) pr: https://github.com/scikit-learn/scikit-learn/pull/3951 And using dask: http://matthewrocklin.com/blog/work/2016/07/12/dask-learn-part-1 Andy On 11/28/2016 10:24 AM, Anton Suchaneck wrote: > Hello! > > I use a 2-step Pipeline with an expensive transformer and a > classification afterwards. On this I do GridSearchCV of the > classifcation parameters. > > Now, theoretically GridSearchCV could know that I'm not touching any > parameters of the transformer and avoid re-doing work by keeping the > transformed X, right?! > Currently, GridSearchCV will do a clean re-run of all Pipeline steps? > > Can you recommend the easiest way for me to use GridSearchCV+Pipeline > while avoiding recomputation of all transformer steps whose parameters > are not in the GridSearch? I realize this may be tricky, but any > pointers to realize this most conveniently and compatible with sklearn > would be highly appreciated! > > (The scoring has to be done on the initial data, so I cannot just > manually transform beforehand.) > > Regards, > Anton > > PS: If that all makes sense, is that a useful feature to include in > sklearn? > > > _______________________________________________ > scikit-learn mailing list > scikit-learn at python.org > https://mail.python.org/mailman/listinfo/scikit-learn -------------- next part -------------- An HTML attachment was scrubbed... URL: From gael.varoquaux at normalesup.org Mon Nov 28 10:46:35 2016 From: gael.varoquaux at normalesup.org (Gael Varoquaux) Date: Mon, 28 Nov 2016 16:46:35 +0100 Subject: [scikit-learn] How to not recalculate transformer in a Pipeline? In-Reply-To: References: Message-ID: <20161128154635.GD1767895@phare.normalesup.org> I use joblib.Memory for this purpose. I think that including a meta-transformer that embeds a joblib.Memory would be a good addition to scikit-learn. From t3kcit at gmail.com Mon Nov 28 11:56:29 2016 From: t3kcit at gmail.com (Andreas Mueller) Date: Mon, 28 Nov 2016 11:56:29 -0500 Subject: [scikit-learn] Bayesian Gaussian Mixture In-Reply-To: References: Message-ID: <3bd52b79-f822-886e-e29c-bbe7487f42fd@gmail.com> Hi Tommaso. So what's the issue? The distributions are very distinct, so there is no confusion. The higher the dimensionality, the further apart the points are (compare the distance between (-1, 1) and (1, -1) to the one between (-1, -.5, 0, .5, 1) and (1, .5, 0, -.5, -1). I'm not sure what you mean by "the cross in the middle". You create two fixed points, one at np.arange(-1,1, 2.0/nfeatures) and one at np.arange(1,-1, (-2.0/nfeatures)). In high dimensions, these points are very far apart. Then you add standard normal noise to it. So this data is two perfect Gaussians. In low dimensions, they are "close together" so there is some confusion, in high dimensions, they are "far apart" so there is less confusion. Hth, Andy On 11/27/2016 11:47 AM, Tommaso Costanzo wrote: > Hi Jacob, > > I have just changed my code from BayesianGaussianMixture to > GaussianMixture, and the results is the same. 
I attached here the > picture of the first component when I runned the code with 5, 10, and > 50 nfeatures and 2 components. In my short test function I expect to > have point that they can be in one component as well as another has > visible for small number of nfeatures, but 0 1 for nfeatures >50 does > not sounds correct. Seems that is just related to the size of the > model and in particular to the number of features. With the > BayesianGaussianMixture I have seen that it is sligthly better to > increase the degree of freedoms to 2*nfeatures instead of the default > nfeatures. However, this does not change the result when the nfeatures > are 50 or more. > > Thank you in advance > Tommaso > > 2016-11-25 21:32 GMT-05:00 Jacob Schreiber >: > > Typically this means that the model is so confident in its > predictions it does not believe it possible for the sample to come > from the other component. Do you get the same results with a > regular GaussianMixture? > > On Fri, Nov 25, 2016 at 11:34 AM, Tommaso Costanzo > > wrote: > > Hi, > > I am facing some problem with the "BayesianGaussianMixture" > function, but I do not know if it is because of my poor > knowledge on this type of statistics or if it is something > related to the algorithm. I have set of data of around 1000 to > 4000 observation (every feature is a spectrum of around 200 > point) so in the end I have n_samples = ~1000 and n_features = > ~20. The good things is that I am getting the same results of > KMeans however the "predict_proba" has value only of 0 or 1. > > I have wrote a small function to simulate my problem with > random data that is reported below. The first 1/2 of the array > has the point with a positive slope while the second 1/2 has a > negative slope, so the cross in the middle. What I have seen > is that for a small number of features I obtain good > probability, but if the number of features increases (say 50) > than the probability become only 0 or 1. > Can someone help me in interpret this result? > > Here is the code I wrote with the generated random number, > I'll generally run it with ncomponent=2 and nfeatures=5 or 10 > or 50 or 100. I am not sure if it will work in every case is > not very highly tested. I have also attached as a file! 
> > ########################################################################## > import numpy as np > from sklearn.mixture import GaussianMixture, > BayesianGaussianMixture > import matplotlib.pyplot as plt > > def test_bgm(ncomponent, nfeatures): > temp = np.random.randn(500,nfeatures) > temp = temp + np.arange(-1,1, 2.0/nfeatures) > temp1 = np.random.randn(400,nfeatures) > temp1 = temp1 + np.arange(1,-1, (-2.0/nfeatures)) > X = np.vstack((temp, temp1)) > > bgm = > BayesianGaussianMixture(ncomponent,degrees_of_freedom_prior=nfeatures*2).fit(X) > > bgm_proba = bgm.predict_proba(X) > bgm_labels = bgm.predict(X) > > plt.figure(-1) > plt.imshow(bgm_labels.reshape(30,-1), origin='lower', > interpolatio='none') > plt.colorbar() > > for i in np.arange(0,ncomponent): > plt.figure(i) > plt.imshow(bgm_proba[:,i].reshape(30,-1), > origin='lower', interpolatio='none') > plt.colorbar() > > plt.show() > ############################################################################## > > Thank you in advance > Tommaso > > > -- > Please do NOT send Microsoft Office Attachments: > http://www.gnu.org/philosophy/no-word-attachments.html > > > _______________________________________________ > scikit-learn mailing list > scikit-learn at python.org > https://mail.python.org/mailman/listinfo/scikit-learn > > > > > _______________________________________________ > scikit-learn mailing list > scikit-learn at python.org > https://mail.python.org/mailman/listinfo/scikit-learn > > > > > > -- > Please do NOT send Microsoft Office Attachments: > http://www.gnu.org/philosophy/no-word-attachments.html > > > _______________________________________________ > scikit-learn mailing list > scikit-learn at python.org > https://mail.python.org/mailman/listinfo/scikit-learn -------------- next part -------------- An HTML attachment was scrubbed... URL: From t3kcit at gmail.com Mon Nov 28 12:07:49 2016 From: t3kcit at gmail.com (Andreas Mueller) Date: Mon, 28 Nov 2016 12:07:49 -0500 Subject: [scikit-learn] How to not recalculate transformer in a Pipeline? In-Reply-To: <20161128154635.GD1767895@phare.normalesup.org> References: <20161128154635.GD1767895@phare.normalesup.org> Message-ID: On 11/28/2016 10:46 AM, Gael Varoquaux wrote: > I use joblib.Memory for this purpose. I think that including a > meta-transformer that embeds a joblib.Memory would be a good addition to > scikit-learn. To cache the result of "transform"? You still have to call "fit" multiple times, right? Or would you cache the return of "fit" as well as "transform"? Caching "fit" with joblib seems non-trivial. From gael.varoquaux at normalesup.org Mon Nov 28 12:15:26 2016 From: gael.varoquaux at normalesup.org (Gael Varoquaux) Date: Mon, 28 Nov 2016 18:15:26 +0100 Subject: [scikit-learn] How to not recalculate transformer in a Pipeline? In-Reply-To: References: <20161128154635.GD1767895@phare.normalesup.org> Message-ID: <20161128171526.GI1767895@phare.normalesup.org> > Or would you cache the return of "fit" as well as "transform"? Caching fit rather than transform. Fit is usually the costly step. > Caching "fit" with joblib seems non-trivial. Why? Caching a function that takes the estimator and X and y should do it. The transformer would clone the estimator on fit, to avoid side-effects that would trigger recomputes. It's a pattern that I use often, I've just never coded a good transformer for it. On my usecases, it works very well, provided that everything is nicely seeded. Also, the persistence across sessions is a real time saver. 
From t3kcit at gmail.com Mon Nov 28 13:46:08 2016 From: t3kcit at gmail.com (Andreas Mueller) Date: Mon, 28 Nov 2016 13:46:08 -0500 Subject: [scikit-learn] How to not recalculate transformer in a Pipeline? In-Reply-To: <20161128171526.GI1767895@phare.normalesup.org> References: <20161128154635.GD1767895@phare.normalesup.org> <20161128171526.GI1767895@phare.normalesup.org> Message-ID: On 11/28/2016 12:15 PM, Gael Varoquaux wrote: >> Or would you cache the return of "fit" as well as "transform"? > Caching fit rather than transform. Fit is usually the costly step. > >> Caching "fit" with joblib seems non-trivial. > Why? Caching a function that takes the estimator and X and y should do > it. The transformer would clone the estimator on fit, to avoid > side-effects that would trigger recomputes. I guess so. You'd handle parameters using an estimator_params dict in init and pass that to the caching function? > > It's a pattern that I use often, I've just never coded a good transformer > for it. > > On my usecases, it works very well, provided that everything is nicely > seeded. Also, the persistence across sessions is a real time saver. Yeah for sure :) From gael.varoquaux at normalesup.org Mon Nov 28 13:51:21 2016 From: gael.varoquaux at normalesup.org (Gael Varoquaux) Date: Mon, 28 Nov 2016 19:51:21 +0100 Subject: [scikit-learn] How to not recalculate transformer in a Pipeline? In-Reply-To: References: <20161128154635.GD1767895@phare.normalesup.org> <20161128171526.GI1767895@phare.normalesup.org> Message-ID: <20161128185121.GA2031543@phare.normalesup.org> On Mon, Nov 28, 2016 at 01:46:08PM -0500, Andreas Mueller wrote: > I guess so. You'd handle parameters using an estimator_params dict in init > and pass that to the caching function? I'd try to set on the estimator, before passing them to the function, as we do in standard scikit-learn, and joblib is clever enough to take that in account when given the estimator as a function of the function that is memoized. G From gael.varoquaux at normalesup.org Mon Nov 28 17:17:22 2016 From: gael.varoquaux at normalesup.org (Gael Varoquaux) Date: Mon, 28 Nov 2016 23:17:22 +0100 Subject: [scikit-learn] How to not recalculate transformer in a Pipeline? In-Reply-To: <20161128185121.GA2031543@phare.normalesup.org> References: <20161128154635.GD1767895@phare.normalesup.org> <20161128171526.GI1767895@phare.normalesup.org> <20161128185121.GA2031543@phare.normalesup.org> Message-ID: <20161128221722.GG2031543@phare.normalesup.org> Actually, thinking a bit about this, the inconvenience with the pattern that I lay out below is that it adds an extra indirection in the parameter setting. One way to avoid this would be to have a subclass of the pipeline that includes memoizing. It would call a memoized version of fit. I think that it would be quite handy :). Should I open an issue on that? G On Mon, Nov 28, 2016 at 07:51:21PM +0100, Gael Varoquaux wrote: > On Mon, Nov 28, 2016 at 01:46:08PM -0500, Andreas Mueller wrote: > > I guess so. You'd handle parameters using an estimator_params dict in init > > and pass that to the caching function? > I'd try to set on the estimator, before passing them to the function, as we > do in standard scikit-learn, and joblib is clever enough to take that in > account when given the estimator as a function of the function that is > memoized. 
> G > _______________________________________________ > scikit-learn mailing list > scikit-learn at python.org > https://mail.python.org/mailman/listinfo/scikit-learn -- Gael Varoquaux Researcher, INRIA Parietal NeuroSpin/CEA Saclay , Bat 145, 91191 Gif-sur-Yvette France Phone: ++ 33-1-69-08-79-68 http://gael-varoquaux.info http://twitter.com/GaelVaroquaux From joel.nothman at gmail.com Mon Nov 28 18:13:00 2016 From: joel.nothman at gmail.com (Joel Nothman) Date: Tue, 29 Nov 2016 10:13:00 +1100 Subject: [scikit-learn] How to not recalculate transformer in a Pipeline? In-Reply-To: <20161128221722.GG2031543@phare.normalesup.org> References: <20161128154635.GD1767895@phare.normalesup.org> <20161128171526.GI1767895@phare.normalesup.org> <20161128185121.GA2031543@phare.normalesup.org> <20161128221722.GG2031543@phare.normalesup.org> Message-ID: A few brief points of history: - We have had PRs #3951 and #2086 that build memoising into Pipeline in one way or another. - Andy and I have previously discussed alternative ways to set parameters to avoid indirection issues created by wrappers. This can be achieved by setting the parameter space on the estimator itself, or by indicating parameters to *SearchCV shallowly with respect to an estimator instance, rather than using an indirected path. See #5082 . - The indirection is in parameter setting as well as in retrieving model attributes. My remember branch gets around both indirections in creating a remember_transform wrapper, but it does so by hacking clone (as per #5080 ), and doing some other magic. On 29 November 2016 at 09:17, Gael Varoquaux wrote: > Actually, thinking a bit about this, the inconvenience with the pattern > that I lay out below is that it adds an extra indirection in the > parameter setting. One way to avoid this would be to have a subclass of > the pipeline that includes memoizing. It would call a memoized version of > fit. > > I think that it would be quite handy :). > > Should I open an issue on that? > > G > > On Mon, Nov 28, 2016 at 07:51:21PM +0100, Gael Varoquaux wrote: > > On Mon, Nov 28, 2016 at 01:46:08PM -0500, Andreas Mueller wrote: > > > I guess so. You'd handle parameters using an estimator_params dict in > init > > > and pass that to the caching function? > > > I'd try to set on the estimator, before passing them to the function, as > we > > do in standard scikit-learn, and joblib is clever enough to take that in > > account when given the estimator as a function of the function that is > > memoized. > > > G > > _______________________________________________ > > scikit-learn mailing list > > scikit-learn at python.org > > https://mail.python.org/mailman/listinfo/scikit-learn > > -- > Gael Varoquaux > Researcher, INRIA Parietal > NeuroSpin/CEA Saclay , Bat 145, 91191 Gif-sur-Yvette France > Phone: ++ 33-1-69-08-79-68 > http://gael-varoquaux.info http://twitter.com/GaelVaroquaux > _______________________________________________ > scikit-learn mailing list > scikit-learn at python.org > https://mail.python.org/mailman/listinfo/scikit-learn > -------------- next part -------------- An HTML attachment was scrubbed... URL: From dani.homola at gmail.com Mon Nov 28 18:30:42 2016 From: dani.homola at gmail.com (Daniel Homola) Date: Mon, 28 Nov 2016 23:30:42 +0000 Subject: [scikit-learn] Problem with nested cross-validation example? 
Message-ID: Dear all, I was wondering if the following example code is valid: http://scikit-learn.org/stable/auto_examples/model_ selection/plot_nested_cross_validation_iris.html My understanding is, that the point of nested cross-validation is to prevent any data leakage from the inner grid-search/param optimization CV loop into the outer model evaluation CV loop. This could be achieved if the outer CV loop's test data is completely separated from the inner loop's CV, as shown here: https://mlr-org.github.io/mlr-tutorial/release/html/img/ nested_resampling.png The code in the above example however doesn't seem to achieve this in any way. Am I missing something here? Thanks a lot, dh -------------- next part -------------- An HTML attachment was scrubbed... URL: From joel.nothman at gmail.com Mon Nov 28 19:06:43 2016 From: joel.nothman at gmail.com (Joel Nothman) Date: Tue, 29 Nov 2016 11:06:43 +1100 Subject: [scikit-learn] Problem with nested cross-validation example? In-Reply-To: References: Message-ID: Briefly: clf = GridSearchCV (estimator=svr, param_grid=p_grid, cv=inner_cv)nested_score = cross_val_score (clf, X=X_iris, y=y_iris, cv=outer_cv) Each train/test split in cross_val_score holds out test data. GridSearchCV then splits each train set into (inner-)train and validation sets. There is no leakage of test set knowledge from the outer loop into the grid search optimisation; no leakage of validation set knowledge into the SVR optimisation. The outer test data are reused as training data, but within each split are only used to measure generalisation error. Is that clear? On 29 November 2016 at 10:30, Daniel Homola wrote: > Dear all, > > > I was wondering if the following example code is valid: > > http://scikit-learn.org/stable/auto_examples/model_selection > /plot_nested_cross_validation_iris.html > > My understanding is, that the point of nested cross-validation is to > prevent any data leakage from the inner grid-search/param optimization CV > loop into the outer model evaluation CV loop. This could be achieved if the > outer CV loop's test data is completely separated from the inner loop's CV, > as shown here: > > https://mlr-org.github.io/mlr-tutorial/release/html/img/nest > ed_resampling.png > > > The code in the above example however doesn't seem to achieve this in any > way. > > > Am I missing something here? > > > Thanks a lot, > > dh > > _______________________________________________ > scikit-learn mailing list > scikit-learn at python.org > https://mail.python.org/mailman/listinfo/scikit-learn > > -------------- next part -------------- An HTML attachment was scrubbed... URL: From joel.nothman at gmail.com Mon Nov 28 19:07:15 2016 From: joel.nothman at gmail.com (Joel Nothman) Date: Tue, 29 Nov 2016 11:07:15 +1100 Subject: [scikit-learn] Problem with nested cross-validation example? In-Reply-To: References: Message-ID: If that clarifies, please offer changes to the example (as a pull request) that make this clearer. On 29 November 2016 at 11:06, Joel Nothman wrote: > Briefly: > > clf = GridSearchCV (estimator=svr, param_grid=p_grid, cv=inner_cv)nested_score = cross_val_score (clf, X=X_iris, y=y_iris, cv=outer_cv) > > > Each train/test split in cross_val_score holds out test data. GridSearchCV > then splits each train set into (inner-)train and validation sets. There is > no leakage of test set knowledge from the outer loop into the grid search > optimisation; no leakage of validation set knowledge into the SVR > optimisation. 
The outer test data are reused as training data, but within > each split are only used to measure generalisation error. > > Is that clear? > > On 29 November 2016 at 10:30, Daniel Homola wrote: > >> Dear all, >> >> >> I was wondering if the following example code is valid: >> >> http://scikit-learn.org/stable/auto_examples/model_selection >> /plot_nested_cross_validation_iris.html >> >> My understanding is, that the point of nested cross-validation is to >> prevent any data leakage from the inner grid-search/param optimization CV >> loop into the outer model evaluation CV loop. This could be achieved if the >> outer CV loop's test data is completely separated from the inner loop's CV, >> as shown here: >> >> https://mlr-org.github.io/mlr-tutorial/release/html/img/nest >> ed_resampling.png >> >> >> The code in the above example however doesn't seem to achieve this in any >> way. >> >> >> Am I missing something here? >> >> >> Thanks a lot, >> >> dh >> >> _______________________________________________ >> scikit-learn mailing list >> scikit-learn at python.org >> https://mail.python.org/mailman/listinfo/scikit-learn >> >> > -------------- next part -------------- An HTML attachment was scrubbed... URL: From se.raschka at gmail.com Mon Nov 28 20:52:24 2016 From: se.raschka at gmail.com (Sebastian Raschka) Date: Mon, 28 Nov 2016 20:52:24 -0500 Subject: [scikit-learn] Problem with nested cross-validation example? In-Reply-To: References: Message-ID: <110F77EA-E55B-4425-9E80-2B29A3997C3E@gmail.com> On first glance, the image shown in the image and the code example seem to do/show the same thing? Maybe it would be worth adding an explanatory figure like this to the docs to clarify? > On Nov 28, 2016, at 7:07 PM, Joel Nothman wrote: > > If that clarifies, please offer changes to the example (as a pull request) that make this clearer. > > On 29 November 2016 at 11:06, Joel Nothman wrote: > Briefly: > > clf = GridSearchCV(estimator=svr, param_grid=p_grid, cv=inner_cv) > nested_score = cross_val_score(clf, X=X_iris, y=y_iris, cv=outer_cv) > > Each train/test split in cross_val_score holds out test data. GridSearchCV then splits each train set into (inner-)train and validation sets. There is no leakage of test set knowledge from the outer loop into the grid search optimisation; no leakage of validation set knowledge into the SVR optimisation. The outer test data are reused as training data, but within each split are only used to measure generalisation error. > > Is that clear? > > On 29 November 2016 at 10:30, Daniel Homola wrote: > Dear all, > > I was wondering if the following example code is valid: > http://scikit-learn.org/stable/auto_examples/model_selection/plot_nested_cross_validation_iris.html > > My understanding is, that the point of nested cross-validation is to prevent any data leakage from the inner grid-search/param optimization CV loop into the outer model evaluation CV loop. This could be achieved if the outer CV loop's test data is completely separated from the inner loop's CV, as shown here: > https://mlr-org.github.io/mlr-tutorial/release/html/img/nested_resampling.png > > The code in the above example however doesn't seem to achieve this in any way. > > Am I missing something here? 
> > Thanks a lot, > dh > > _______________________________________________ > scikit-learn mailing list > scikit-learn at python.org > https://mail.python.org/mailman/listinfo/scikit-learn > > > > _______________________________________________ > scikit-learn mailing list > scikit-learn at python.org > https://mail.python.org/mailman/listinfo/scikit-learn From gael.varoquaux at normalesup.org Tue Nov 29 02:11:36 2016 From: gael.varoquaux at normalesup.org (Gael Varoquaux) Date: Tue, 29 Nov 2016 08:11:36 +0100 Subject: [scikit-learn] How to not recalculate transformer in a Pipeline? In-Reply-To: References: <20161128154635.GD1767895@phare.normalesup.org> <20161128171526.GI1767895@phare.normalesup.org> <20161128185121.GA2031543@phare.normalesup.org> <20161128221722.GG2031543@phare.normalesup.org> Message-ID: <20161129071136.GL2031543@phare.normalesup.org> On Tue, Nov 29, 2016 at 10:13:00AM +1100, Joel Nothman wrote: > - We have had PRs #3951 > and #2086 > that build > memoising into Pipeline in one way or another. Sorry, I had in mind that this was discussed, but I hadn't realized that they were PRs. I think that 3951 is a good start. I would have comments on it, but maybe I should make them in the PR. > - Andy and I have previously discussed alternative ways to set > parameters to avoid indirection issues created by wrappers. I feel that these approaches are much more invasive. The nice thing about a memoized pipeline is that is a a fairly local change. I'll comment on 3951 in terms of this specific realization, but we can discuss here if we want to take it further. Ga?l From joel.nothman at gmail.com Tue Nov 29 03:01:30 2016 From: joel.nothman at gmail.com (Joel Nothman) Date: Tue, 29 Nov 2016 19:01:30 +1100 Subject: [scikit-learn] How to not recalculate transformer in a Pipeline? In-Reply-To: <20161129071136.GL2031543@phare.normalesup.org> References: <20161128154635.GD1767895@phare.normalesup.org> <20161128171526.GI1767895@phare.normalesup.org> <20161128185121.GA2031543@phare.normalesup.org> <20161128221722.GG2031543@phare.normalesup.org> <20161129071136.GL2031543@phare.normalesup.org> Message-ID: But that the issue of model memoising isn't limited to pipeline. On 29 November 2016 at 18:11, Gael Varoquaux wrote: > On Tue, Nov 29, 2016 at 10:13:00AM +1100, Joel Nothman wrote: > > - We have had PRs #3951 > > and #2086 > > that build > > memoising into Pipeline in one way or another. > > Sorry, I had in mind that this was discussed, but I hadn't realized that > they were PRs. I think that 3951 is a good start. I would have comments > on it, but maybe I should make them in the PR. > > > - Andy and I have previously discussed alternative ways to set > > parameters to avoid indirection issues created by wrappers. > > I feel that these approaches are much more invasive. The nice thing about > a memoized pipeline is that is a a fairly local change. > > I'll comment on 3951 in terms of this specific realization, but we can > discuss here if we want to take it further. > > Ga?l > _______________________________________________ > scikit-learn mailing list > scikit-learn at python.org > https://mail.python.org/mailman/listinfo/scikit-learn > -------------- next part -------------- An HTML attachment was scrubbed... URL: From albertthomas88 at gmail.com Tue Nov 29 04:04:49 2016 From: albertthomas88 at gmail.com (Albert Thomas) Date: Tue, 29 Nov 2016 09:04:49 +0000 Subject: [scikit-learn] Problem with nested cross-validation example? 
In-Reply-To: <110F77EA-E55B-4425-9E80-2B29A3997C3E@gmail.com> References: <110F77EA-E55B-4425-9E80-2B29A3997C3E@gmail.com> Message-ID: When I was reading Sebastian's blog posts on Cross Validation a few weeks ago I also found the example of Nested cross validation on scikit-learn. At first like Daniel I thought the example was not doing what it should be doing. But after a few minutes I finally realized that it was correct. So I am for a bit more clarification. Albert On Tue, 29 Nov 2016 at 02:53, Sebastian Raschka wrote: > On first glance, the image shown in the image and the code example seem to > do/show the same thing? Maybe it would be worth adding an explanatory > figure like this to the docs to clarify? > > > On Nov 28, 2016, at 7:07 PM, Joel Nothman > wrote: > > > > If that clarifies, please offer changes to the example (as a pull > request) that make this clearer. > > > > On 29 November 2016 at 11:06, Joel Nothman > wrote: > > Briefly: > > > > clf = GridSearchCV(estimator=svr, param_grid=p_grid, cv=inner_cv) > > nested_score = cross_val_score(clf, X=X_iris, y=y_iris, cv=outer_cv) > > > > Each train/test split in cross_val_score holds out test data. > GridSearchCV then splits each train set into (inner-)train and validation > sets. There is no leakage of test set knowledge from the outer loop into > the grid search optimisation; no leakage of validation set knowledge into > the SVR optimisation. The outer test data are reused as training data, but > within each split are only used to measure generalisation error. > > > > Is that clear? > > > > On 29 November 2016 at 10:30, Daniel Homola > wrote: > > Dear all, > > > > I was wondering if the following example code is valid: > > > http://scikit-learn.org/stable/auto_examples/model_selection/plot_nested_cross_validation_iris.html > > > > My understanding is, that the point of nested cross-validation is to > prevent any data leakage from the inner grid-search/param optimization CV > loop into the outer model evaluation CV loop. This could be achieved if the > outer CV loop's test data is completely separated from the inner loop's CV, > as shown here: > > > https://mlr-org.github.io/mlr-tutorial/release/html/img/nested_resampling.png > > > > The code in the above example however doesn't seem to achieve this in > any way. > > > > Am I missing something here? > > > > Thanks a lot, > > dh > > > > _______________________________________________ > > scikit-learn mailing list > > scikit-learn at python.org > > https://mail.python.org/mailman/listinfo/scikit-learn > > > > > > > > _______________________________________________ > > scikit-learn mailing list > > scikit-learn at python.org > > https://mail.python.org/mailman/listinfo/scikit-learn > > _______________________________________________ > scikit-learn mailing list > scikit-learn at python.org > https://mail.python.org/mailman/listinfo/scikit-learn > -------------- next part -------------- An HTML attachment was scrubbed... URL: From joel.nothman at gmail.com Tue Nov 29 04:50:47 2016 From: joel.nothman at gmail.com (Joel Nothman) Date: Tue, 29 Nov 2016 20:50:47 +1100 Subject: [scikit-learn] Problem with nested cross-validation example? In-Reply-To: References: <110F77EA-E55B-4425-9E80-2B29A3997C3E@gmail.com> Message-ID: This makes me a little sad. Do Albert and Daniel think the explicit reference from blurb to code proposed at https://github.com/scikit-learn/scikit-learn/pull/7949 is a sufficient remedy? Otherwise could you please propose another clarifying change? Thanks. 
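One form such a clarification could take is sketched below: the example's two central lines with comments naming the two loops explicitly. This is only an illustrative sketch, not the wording of the PR; it assumes the svr, p_grid, inner_cv, outer_cv, X_iris and y_iris objects defined in plot_nested_cross_validation_iris.py.

# The two lines from the nested CV example, annotated.
# Assumes svr, p_grid, inner_cv, outer_cv, X_iris, y_iris as in the example.
from sklearn.model_selection import GridSearchCV, cross_val_score

# Inner loop: GridSearchCV keeps its own splitter (inner_cv) and uses it
# only to choose hyperparameters on (inner-)train/validation splits.
clf = GridSearchCV(estimator=svr, param_grid=p_grid, cv=inner_cv)

# Outer loop: the cv passed to cross_val_score (outer_cv) controls the
# train/test splits used to estimate generalisation error. It does not
# replace clf's inner_cv; the whole grid search is re-fit inside each
# outer training fold, so the outer test fold never influences the
# hyperparameter choice.
nested_score = cross_val_score(clf, X=X_iris, y=y_iris, cv=outer_cv)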
On 29 November 2016 at 20:04, Albert Thomas wrote: > When I was reading Sebastian's blog posts on Cross Validation a few weeks > ago I also found the example of Nested cross validation on scikit-learn. At > first like Daniel I thought the example was not doing what it should be > doing. But after a few minutes I finally realized that it was correct. So I > am for a bit more clarification. > > Albert > > On Tue, 29 Nov 2016 at 02:53, Sebastian Raschka > wrote: > >> On first glance, the image shown in the image and the code example seem >> to do/show the same thing? Maybe it would be worth adding an explanatory >> figure like this to the docs to clarify? >> >> > On Nov 28, 2016, at 7:07 PM, Joel Nothman >> wrote: >> > >> > If that clarifies, please offer changes to the example (as a pull >> request) that make this clearer. >> > >> > On 29 November 2016 at 11:06, Joel Nothman >> wrote: >> > Briefly: >> > >> > clf = GridSearchCV(estimator=svr, param_grid=p_grid, cv=inner_cv) >> > nested_score = cross_val_score(clf, X=X_iris, y=y_iris, cv=outer_cv) >> > >> > Each train/test split in cross_val_score holds out test data. >> GridSearchCV then splits each train set into (inner-)train and validation >> sets. There is no leakage of test set knowledge from the outer loop into >> the grid search optimisation; no leakage of validation set knowledge into >> the SVR optimisation. The outer test data are reused as training data, but >> within each split are only used to measure generalisation error. >> > >> > Is that clear? >> > >> > On 29 November 2016 at 10:30, Daniel Homola >> wrote: >> > Dear all, >> > >> > I was wondering if the following example code is valid: >> > http://scikit-learn.org/stable/auto_examples/model_ >> selection/plot_nested_cross_validation_iris.html >> > >> > My understanding is, that the point of nested cross-validation is to >> prevent any data leakage from the inner grid-search/param optimization CV >> loop into the outer model evaluation CV loop. This could be achieved if the >> outer CV loop's test data is completely separated from the inner loop's CV, >> as shown here: >> > https://mlr-org.github.io/mlr-tutorial/release/html/img/ >> nested_resampling.png >> > >> > The code in the above example however doesn't seem to achieve this in >> any way. >> > >> > Am I missing something here? >> > >> > Thanks a lot, >> > dh >> > >> > _______________________________________________ >> > scikit-learn mailing list >> > scikit-learn at python.org >> > https://mail.python.org/mailman/listinfo/scikit-learn >> > >> > >> > >> > _______________________________________________ >> > scikit-learn mailing list >> > scikit-learn at python.org >> > https://mail.python.org/mailman/listinfo/scikit-learn >> >> _______________________________________________ >> scikit-learn mailing list >> scikit-learn at python.org >> https://mail.python.org/mailman/listinfo/scikit-learn >> > > _______________________________________________ > scikit-learn mailing list > scikit-learn at python.org > https://mail.python.org/mailman/listinfo/scikit-learn > > -------------- next part -------------- An HTML attachment was scrubbed... URL: From daniel.homola11 at imperial.ac.uk Tue Nov 29 04:53:44 2016 From: daniel.homola11 at imperial.ac.uk (Daniel Homola) Date: Tue, 29 Nov 2016 09:53:44 +0000 Subject: [scikit-learn] Problem with nested cross-validation example? In-Reply-To: References: <110F77EA-E55B-4425-9E80-2B29A3997C3E@gmail.com> Message-ID: Hi Joel, Unfortunately, the link says "artifact not found". 
Whatever that means.. On 29/11/16 09:50, Joel Nothman wrote: > This makes me a little sad. Do Albert and Daniel think the explicit > reference from blurb to code proposed at > https://github.com/scikit-learn/scikit-learn/pull/7949 is a sufficient > remedy? Otherwise could you please propose another clarifying change? > Thanks. > > On 29 November 2016 at 20:04, Albert Thomas > wrote: > > When I was reading Sebastian's blog posts on Cross Validation a > few weeks ago I also found the example of Nested cross validation > on scikit-learn. At first like Daniel I thought the example was > not doing what it should be doing. But after a few minutes I > finally realized that it was correct. So I am for a bit more > clarification. > > Albert > > On Tue, 29 Nov 2016 at 02:53, Sebastian Raschka > > wrote: > > On first glance, the image shown in the image and the code > example seem to do/show the same thing? Maybe it would be > worth adding an explanatory figure like this to the docs to > clarify? > > > On Nov 28, 2016, at 7:07 PM, Joel Nothman > > wrote: > > > > If that clarifies, please offer changes to the example (as a > pull request) that make this clearer. > > > > On 29 November 2016 at 11:06, Joel Nothman > > wrote: > > Briefly: > > > > clf = GridSearchCV(estimator=svr, param_grid=p_grid, > cv=inner_cv) > > nested_score = cross_val_score(clf, X=X_iris, y=y_iris, > cv=outer_cv) > > > > Each train/test split in cross_val_score holds out test > data. GridSearchCV then splits each train set into > (inner-)train and validation sets. There is no leakage of test > set knowledge from the outer loop into the grid search > optimisation; no leakage of validation set knowledge into the > SVR optimisation. The outer test data are reused as training > data, but within each split are only used to measure > generalisation error. > > > > Is that clear? > > > > On 29 November 2016 at 10:30, Daniel Homola > > wrote: > > Dear all, > > > > I was wondering if the following example code is valid: > > > http://scikit-learn.org/stable/auto_examples/model_selection/plot_nested_cross_validation_iris.html > > > > > My understanding is, that the point of nested > cross-validation is to prevent any data leakage from the inner > grid-search/param optimization CV loop into the outer model > evaluation CV loop. This could be achieved if the outer CV > loop's test data is completely separated from the inner loop's > CV, as shown here: > > > https://mlr-org.github.io/mlr-tutorial/release/html/img/nested_resampling.png > > > > > The code in the above example however doesn't seem to > achieve this in any way. > > > > Am I missing something here? 
> > > > Thanks a lot, > > dh > > > > _______________________________________________ > > scikit-learn mailing list > > scikit-learn at python.org > > https://mail.python.org/mailman/listinfo/scikit-learn > > > > > > > > > _______________________________________________ > > scikit-learn mailing list > > scikit-learn at python.org > > https://mail.python.org/mailman/listinfo/scikit-learn > > > _______________________________________________ > scikit-learn mailing list > scikit-learn at python.org > https://mail.python.org/mailman/listinfo/scikit-learn > > > > _______________________________________________ > scikit-learn mailing list > scikit-learn at python.org > https://mail.python.org/mailman/listinfo/scikit-learn > > > > > > _______________________________________________ > scikit-learn mailing list > scikit-learn at python.org > https://mail.python.org/mailman/listinfo/scikit-learn -------------- next part -------------- An HTML attachment was scrubbed... URL: From daniel.homola11 at imperial.ac.uk Tue Nov 29 04:51:59 2016 From: daniel.homola11 at imperial.ac.uk (Daniel Homola) Date: Tue, 29 Nov 2016 09:51:59 +0000 Subject: [scikit-learn] Problem with nested cross-validation example? In-Reply-To: References: Message-ID: Hi Joel, Thanks a lot for the answer. "Each train/test split in cross_val_score holds out test data. GridSearchCV then splits each train set into (inner-)train and validation sets. " I know this is what nested CV supposed to do but the code is doing an excellent job at obscuring this. I'll try and add some clarification in as comments later today. Cheers, d On 29/11/16 00:07, Joel Nothman wrote: > If that clarifies, please offer changes to the example (as a pull > request) that make this clearer. > > On 29 November 2016 at 11:06, Joel Nothman > wrote: > > Briefly: > > clf = GridSearchCV > (estimator=svr, param_grid=p_grid, cv=inner_cv) > nested_score = cross_val_score > (clf, X=X_iris, y=y_iris, cv=outer_cv) > > > Each train/test split in cross_val_score holds out test data. > GridSearchCV then splits each train set into (inner-)train and > validation sets. There is no leakage of test set knowledge from > the outer loop into the grid search optimisation; no leakage of > validation set knowledge into the SVR optimisation. The outer test > data are reused as training data, but within each split are only > used to measure generalisation error. > > Is that clear? > > On 29 November 2016 at 10:30, Daniel Homola > wrote: > > Dear all, > > > I was wondering if the following example code is valid: > > http://scikit-learn.org/stable/auto_examples/model_selection/plot_nested_cross_validation_iris.html > > > My understanding is, that the point of nested cross-validation > is to prevent any data leakage from the > inner grid-search/param optimization CV loop into the > outer model evaluation CV loop. This could be achieved if the > outer CV loop's test data is completely separated from the > inner loop's CV, as shown here: > > https://mlr-org.github.io/mlr-tutorial/release/html/img/nested_resampling.png > > > > The code in the above example however doesn't seem to achieve > this in any way. > > > Am I missing something here? 
> > > Thanks a lot, > > dh > > > _______________________________________________ > scikit-learn mailing list > scikit-learn at python.org > https://mail.python.org/mailman/listinfo/scikit-learn > > > > > > > _______________________________________________ > scikit-learn mailing list > scikit-learn at python.org > https://mail.python.org/mailman/listinfo/scikit-learn -------------- next part -------------- An HTML attachment was scrubbed... URL: From albertthomas88 at gmail.com Tue Nov 29 05:42:21 2016 From: albertthomas88 at gmail.com (Albert Thomas) Date: Tue, 29 Nov 2016 10:42:21 +0000 Subject: [scikit-learn] Problem with nested cross-validation example? In-Reply-To: References: Message-ID: I also get "artifact not found". And I agree with Daniel. Once you decompose what the code is doing you realize that it does the job. The simplicity of the code to perform nested cross validation using scikit learn objects is impressive but I guess it also makes it less obvious. So making the example clearer by explaining what the code does or by adding a few comments can be useful for others. Albert On Tue, 29 Nov 2016 at 11:19, Daniel Homola wrote: > Hi Joel, > > Thanks a lot for the answer. > > "Each train/test split in cross_val_score holds out test data. > GridSearchCV then splits each train set into (inner-)train and validation > sets. " > > I know this is what nested CV supposed to do but the code is doing an > excellent job at obscuring this. I'll try and add some clarification in as > comments later today. > > Cheers, > > d > > > On 29/11/16 00:07, Joel Nothman wrote: > > If that clarifies, please offer changes to the example (as a pull request) > that make this clearer. > > On 29 November 2016 at 11:06, Joel Nothman wrote: > > Briefly: > > clf = GridSearchCV (estimator=svr, param_grid=p_grid, cv=inner_cv)nested_score = cross_val_score (clf, X=X_iris, y=y_iris, cv=outer_cv) > > > Each train/test split in cross_val_score holds out test data. GridSearchCV > then splits each train set into (inner-)train and validation sets. There is > no leakage of test set knowledge from the outer loop into the grid search > optimisation; no leakage of validation set knowledge into the SVR > optimisation. The outer test data are reused as training data, but within > each split are only used to measure generalisation error. > > Is that clear? > > On 29 November 2016 at 10:30, Daniel Homola wrote: > > Dear all, > > > I was wondering if the following example code is valid: > > > http://scikit-learn.org/stable/auto_examples/model_selection/plot_nested_cross_validation_iris.html > > My understanding is, that the point of nested cross-validation is to > prevent any data leakage from the inner grid-search/param optimization CV > loop into the outer model evaluation CV loop. This could be achieved if the > outer CV loop's test data is completely separated from the inner loop's CV, > as shown here: > > > https://mlr-org.github.io/mlr-tutorial/release/html/img/nested_resampling.png > > > The code in the above example however doesn't seem to achieve this in any > way. > > > Am I missing something here? 
> > > Thanks a lot, > > dh > > _______________________________________________ > scikit-learn mailing list > scikit-learn at python.org > https://mail.python.org/mailman/listinfo/scikit-learn > > > > > > _______________________________________________ > scikit-learn mailing listscikit-learn at python.orghttps://mail.python.org/mailman/listinfo/scikit-learn > > > _______________________________________________ > scikit-learn mailing list > scikit-learn at python.org > https://mail.python.org/mailman/listinfo/scikit-learn > -------------- next part -------------- An HTML attachment was scrubbed... URL: From joel.nothman at gmail.com Tue Nov 29 05:48:39 2016 From: joel.nothman at gmail.com (Joel Nothman) Date: Tue, 29 Nov 2016 21:48:39 +1100 Subject: [scikit-learn] Problem with nested cross-validation example? In-Reply-To: References: Message-ID: Wait an hour for the docs to build and you won't get artifact not found :) If you'd looked at the PR diff, you'd see I've modified the description to refer directly to GridSearchCV and cross_val_score: In the inner loop (here executed by GridSearchCV), the score is > approximately maximized by fitting a model to each training set, and then > directly maximized in selecting (hyper)parameters over the validation set. > In the outer loop (here in cross_val_score), ... Further comments in the code are welcome. On 29 November 2016 at 21:42, Albert Thomas wrote: > I also get "artifact not found". And I agree with Daniel. > > Once you decompose what the code is doing you realize that it does the > job. The simplicity of the code to perform nested cross validation using > scikit learn objects is impressive but I guess it also makes it less > obvious. So making the example clearer by explaining what the code does or > by adding a few comments can be useful for others. > > Albert > > On Tue, 29 Nov 2016 at 11:19, Daniel Homola uk> wrote: > >> Hi Joel, >> >> Thanks a lot for the answer. >> >> "Each train/test split in cross_val_score holds out test data. >> GridSearchCV then splits each train set into (inner-)train and validation >> sets. " >> >> I know this is what nested CV supposed to do but the code is doing an >> excellent job at obscuring this. I'll try and add some clarification in as >> comments later today. >> >> Cheers, >> >> d >> >> >> On 29/11/16 00:07, Joel Nothman wrote: >> >> If that clarifies, please offer changes to the example (as a pull >> request) that make this clearer. >> >> On 29 November 2016 at 11:06, Joel Nothman >> wrote: >> >> Briefly: >> >> clf = GridSearchCV (estimator=svr, param_grid=p_grid, cv=inner_cv)nested_score = cross_val_score (clf, X=X_iris, y=y_iris, cv=outer_cv) >> >> >> Each train/test split in cross_val_score holds out test data. >> GridSearchCV then splits each train set into (inner-)train and validation >> sets. There is no leakage of test set knowledge from the outer loop into >> the grid search optimisation; no leakage of validation set knowledge into >> the SVR optimisation. The outer test data are reused as training data, but >> within each split are only used to measure generalisation error. >> >> Is that clear? 
>> >> On 29 November 2016 at 10:30, Daniel Homola >> wrote: >> >> Dear all, >> >> >> I was wondering if the following example code is valid: >> >> http://scikit-learn.org/stable/auto_examples/model_ >> selection/plot_nested_cross_validation_iris.html >> >> My understanding is, that the point of nested cross-validation is to >> prevent any data leakage from the inner grid-search/param optimization CV >> loop into the outer model evaluation CV loop. This could be achieved if the >> outer CV loop's test data is completely separated from the inner loop's CV, >> as shown here: >> >> https://mlr-org.github.io/mlr-tutorial/release/html/img/ >> nested_resampling.png >> >> >> The code in the above example however doesn't seem to achieve this in any >> way. >> >> >> Am I missing something here? >> >> >> Thanks a lot, >> >> dh >> >> _______________________________________________ >> scikit-learn mailing list >> scikit-learn at python.org >> https://mail.python.org/mailman/listinfo/scikit-learn >> >> >> >> >> >> _______________________________________________ >> scikit-learn mailing listscikit-learn at python.orghttps://mail.python.org/mailman/listinfo/scikit-learn >> >> >> _______________________________________________ >> scikit-learn mailing list >> scikit-learn at python.org >> https://mail.python.org/mailman/listinfo/scikit-learn >> > > _______________________________________________ > scikit-learn mailing list > scikit-learn at python.org > https://mail.python.org/mailman/listinfo/scikit-learn > > -------------- next part -------------- An HTML attachment was scrubbed... URL: From daniel.homola11 at imperial.ac.uk Tue Nov 29 06:01:04 2016 From: daniel.homola11 at imperial.ac.uk (Daniel Homola) Date: Tue, 29 Nov 2016 11:01:04 +0000 Subject: [scikit-learn] Problem with nested cross-validation example? In-Reply-To: References: Message-ID: Sorry, should've done that. Thanks for the PR. To me it isn't the actual concept of nested CV that needs more detailed explanation but the implementation in scikit-learn. I think it's not obvious at all for a newcomer (heck, I've been using it for years on and off and even I got confused) that the clf GridSearch object will carry it's inner CV object into the cross_val_score function, which has it's own outer CV object. Unless you know that in scikit-learn the CV object of an estimator is *NOT* overloaded with the cross_val_score function's cv parameter, but rather it will result in a nested CV, you simply cannot work out why this example works.. This is the confusing bit I think.. Do you want me to add comments that highlight this issue? On 29/11/16 10:48, Joel Nothman wrote: > Wait an hour for the docs to build and you won't get artifact not > found :) > > If you'd looked at the PR diff, you'd see I've modified the > description to refer directly to GridSearchCV and cross_val_score: > > In the inner loop (here executed by |GridSearchCV|), the score is > approximately maximized by fitting a model to each training set, > and then directly maximized in selecting (hyper)parameters over > the validation set. In the outer loop (here in |cross_val_score|), ... > > > Further comments in the code are welcome. > > On 29 November 2016 at 21:42, Albert Thomas > wrote: > > I also get "artifact not found". And I agree with Daniel. > > Once you decompose what the code is doing you realize that it does > the job. The simplicity of the code to perform nested cross > validation using scikit learn objects is impressive but I guess it > also makes it less obvious. 
So making the example clearer by > explaining what the code does or by adding a few comments can be > useful for others. > > Albert > > On Tue, 29 Nov 2016 at 11:19, Daniel Homola > > wrote: > > Hi Joel, > > Thanks a lot for the answer. > > "Each train/test split in cross_val_score holds out test data. > GridSearchCV then splits each train set into (inner-)train and > validation sets. " > > I know this is what nested CV supposed to do but the code is > doing an excellent job at obscuring this. I'll try and add > some clarification in as comments later today. > > Cheers, > > d > > > On 29/11/16 00:07, Joel Nothman wrote: >> If that clarifies, please offer changes to the example (as a >> pull request) that make this clearer. >> >> On 29 November 2016 at 11:06, Joel Nothman >> > wrote: >> >> Briefly: >> >> clf = GridSearchCV >> (estimator=svr, param_grid=p_grid, cv=inner_cv) >> nested_score = cross_val_score >> (clf, X=X_iris, y=y_iris, cv=outer_cv) >> >> >> Each train/test split in cross_val_score holds out test >> data. GridSearchCV then splits each train set into >> (inner-)train and validation sets. There is no leakage of >> test set knowledge from the outer loop into the grid >> search optimisation; no leakage of validation set >> knowledge into the SVR optimisation. The outer test data >> are reused as training data, but within each split are >> only used to measure generalisation error. >> >> Is that clear? >> >> On 29 November 2016 at 10:30, Daniel Homola >> > wrote: >> >> Dear all, >> >> >> I was wondering if the following example code is valid: >> >> http://scikit-learn.org/stable/auto_examples/model_selection/plot_nested_cross_validation_iris.html >> >> >> My understanding is, that the point of nested >> cross-validation is to prevent any data leakage from >> the inner grid-search/param optimization CV loop into >> the outer model evaluation CV loop. This could be >> achieved if the outer CV loop's test data is >> completely separated from the inner loop's CV, as >> shown here: >> >> https://mlr-org.github.io/mlr-tutorial/release/html/img/nested_resampling.png >> >> >> >> The code in the above example however doesn't seem to >> achieve this in any way. >> >> >> Am I missing something here? >> >> >> Thanks a lot, >> >> dh >> >> >> _______________________________________________ >> scikit-learn mailing list >> scikit-learn at python.org >> https://mail.python.org/mailman/listinfo/scikit-learn >> >> >> >> >> >> >> _______________________________________________ >> scikit-learn mailing list >> scikit-learn at python.org >> https://mail.python.org/mailman/listinfo/scikit-learn >> > _______________________________________________ scikit-learn > mailing list scikit-learn at python.org > > https://mail.python.org/mailman/listinfo/scikit-learn > > > _______________________________________________ scikit-learn > mailing list scikit-learn at python.org > > https://mail.python.org/mailman/listinfo/scikit-learn > > > _______________________________________________ > scikit-learn mailing list > scikit-learn at python.org > https://mail.python.org/mailman/listinfo/scikit-learn -------------- next part -------------- An HTML attachment was scrubbed... URL: From joel.nothman at gmail.com Tue Nov 29 06:12:28 2016 From: joel.nothman at gmail.com (Joel Nothman) Date: Tue, 29 Nov 2016 22:12:28 +1100 Subject: [scikit-learn] Problem with nested cross-validation example? In-Reply-To: References: Message-ID: Offer whatever patches you think will help. 
On 29 November 2016 at 22:01, Daniel Homola wrote: > Sorry, should've done that. > > Thanks for the PR. To me it isn't the actual concept of nested CV that > needs more detailed explanation but the implementation in scikit-learn. > > I think it's not obvious at all for a newcomer (heck, I've been using it > for years on and off and even I got confused) that the clf GridSearch > object will carry it's inner CV object into the cross_val_score function, > which has it's own outer CV object. Unless you know that in scikit-learn > the CV object of an estimator is *NOT* overloaded with the > cross_val_score function's cv parameter, but rather it will result in a > nested CV, you simply cannot work out why this example works.. This is the > confusing bit I think.. Do you want me to add comments that highlight this > issue? > > > On 29/11/16 10:48, Joel Nothman wrote: > > Wait an hour for the docs to build and you won't get artifact not found :) > > If you'd looked at the PR diff, you'd see I've modified the description to > refer directly to GridSearchCV and cross_val_score: > > In the inner loop (here executed by GridSearchCV), the score is >> approximately maximized by fitting a model to each training set, and then >> directly maximized in selecting (hyper)parameters over the validation set. >> In the outer loop (here in cross_val_score), ... > > > Further comments in the code are welcome. > > On 29 November 2016 at 21:42, Albert Thomas > wrote: > >> I also get "artifact not found". And I agree with Daniel. >> >> Once you decompose what the code is doing you realize that it does the >> job. The simplicity of the code to perform nested cross validation using >> scikit learn objects is impressive but I guess it also makes it less >> obvious. So making the example clearer by explaining what the code does or >> by adding a few comments can be useful for others. >> >> Albert >> >> On Tue, 29 Nov 2016 at 11:19, Daniel Homola < >> daniel.homola11 at imperial.ac.uk> wrote: >> >>> Hi Joel, >>> >>> Thanks a lot for the answer. >>> >>> "Each train/test split in cross_val_score holds out test data. >>> GridSearchCV then splits each train set into (inner-)train and validation >>> sets. " >>> >>> I know this is what nested CV supposed to do but the code is doing an >>> excellent job at obscuring this. I'll try and add some clarification in as >>> comments later today. >>> >>> Cheers, >>> >>> d >>> >>> >>> On 29/11/16 00:07, Joel Nothman wrote: >>> >>> If that clarifies, please offer changes to the example (as a pull >>> request) that make this clearer. >>> >>> On 29 November 2016 at 11:06, Joel Nothman >>> wrote: >>> >>> Briefly: >>> >>> clf = GridSearchCV (estimator=svr, param_grid=p_grid, cv=inner_cv)nested_score = cross_val_score (clf, X=X_iris, y=y_iris, cv=outer_cv) >>> >>> >>> Each train/test split in cross_val_score holds out test data. >>> GridSearchCV then splits each train set into (inner-)train and validation >>> sets. There is no leakage of test set knowledge from the outer loop into >>> the grid search optimisation; no leakage of validation set knowledge into >>> the SVR optimisation. The outer test data are reused as training data, but >>> within each split are only used to measure generalisation error. >>> >>> Is that clear? 
>>> >>> On 29 November 2016 at 10:30, Daniel Homola >>> wrote: >>> >>> Dear all, >>> >>> >>> I was wondering if the following example code is valid: >>> >>> http://scikit-learn.org/stable/auto_examples/model_selection >>> /plot_nested_cross_validation_iris.html >>> >>> My understanding is, that the point of nested cross-validation is to >>> prevent any data leakage from the inner grid-search/param optimization CV >>> loop into the outer model evaluation CV loop. This could be achieved if the >>> outer CV loop's test data is completely separated from the inner loop's CV, >>> as shown here: >>> >>> https://mlr-org.github.io/mlr-tutorial/release/html/img/nest >>> ed_resampling.png >>> >>> >>> The code in the above example however doesn't seem to achieve this in >>> any way. >>> >>> >>> Am I missing something here? >>> >>> >>> Thanks a lot, >>> >>> dh >>> >>> _______________________________________________ >>> scikit-learn mailing list >>> scikit-learn at python.org >>> https://mail.python.org/mailman/listinfo/scikit-learn >>> >>> >>> >>> >>> >>> _______________________________________________ >>> scikit-learn mailing listscikit-learn at python.orghttps://mail.python.org/mailman/listinfo/scikit-learn >>> >>> _______________________________________________ scikit-learn mailing >>> list scikit-learn at python.org https://mail.python.org/mailma >>> n/listinfo/scikit-learn >> >> _______________________________________________ scikit-learn mailing >> list scikit-learn at python.org https://mail.python.org/mailma >> n/listinfo/scikit-learn > > _______________________________________________ > scikit-learn mailing listscikit-learn at python.orghttps://mail.python.org/mailman/listinfo/scikit-learn > > > _______________________________________________ > scikit-learn mailing list > scikit-learn at python.org > https://mail.python.org/mailman/listinfo/scikit-learn > > -------------- next part -------------- An HTML attachment was scrubbed... URL: From se.raschka at gmail.com Tue Nov 29 09:10:56 2016 From: se.raschka at gmail.com (Sebastian Raschka) Date: Tue, 29 Nov 2016 09:10:56 -0500 Subject: [scikit-learn] Problem with nested cross-validation example? In-Reply-To: References: Message-ID: I have an ipynb where I did the nested CV more ?manually? in sklearn 0.17 vs sklearn 0.18 ? I intended to add it as an appendix to a blog article (model eval part 4), which I had no chance to write, yet. Maybe the sklearn 0.17 part is a bit more obvious (although way less elegant) than the sklearn 0.18 version and is helpful in some sort to see what?s going on: https://github.com/rasbt/pattern_classification/blob/master/data_viz/model-evaluation-articles/nested_cv_code.ipynb (haven?t had a chance to add comments yet, though). Btw. does anyone have a good (research article) reference for nested CV? I see people often referrering to Dietterich [1], who mentions 5x2 CV. However, I think his 5x2 CV approach is different from the ?nested cross-validation? that is commonly used since the 5x2 example is just 2-fold CV repeated 5 times (10 estimates). Maybe Sudhir & Simon [2] would be a better reference? However, they seem to only hold out 1 test sample in the outer fold? Does anyone know of a nice empirical study on nested CV (sth. like Ron Kohavi's for k-fold CV)? [1] Dietterich, Thomas G. 1998. ?Approximate Statistical Tests for Comparing Supervised Classification Learning Algorithms.? Neural Computation 10 (7). MIT Press 238 Main St., Suite 500, Cambridge, MA 02142-1046 USA journals-info at mit.edu: 1895?1923. 
doi:10.1162/089976698300017197. [2] Varma, Sudhir, and Richard Simon. 2006. ?Bias in Error Estimation When Using Cross-Validation for Model Selection.? BMC Bioinformatics 7: 91. doi:10.1186/1471-2105-7-91. > On Nov 29, 2016, at 6:12 AM, Joel Nothman wrote: > > Offer whatever patches you think will help. > > On 29 November 2016 at 22:01, Daniel Homola wrote: > Sorry, should've done that. > Thanks for the PR. To me it isn't the actual concept of nested CV that needs more detailed explanation but the implementation in scikit-learn. > I think it's not obvious at all for a newcomer (heck, I've been using it for years on and off and even I got confused) that the clf GridSearch object will carry it's inner CV object into the cross_val_score function, which has it's own outer CV object. Unless you know that in scikit-learn the CV object of an estimator is NOT overloaded with the cross_val_score function's cv parameter, but rather it will result in a nested CV, you simply cannot work out why this example works.. This is the confusing bit I think.. Do you want me to add comments that highlight this issue? > > > On 29/11/16 10:48, Joel Nothman wrote: >> Wait an hour for the docs to build and you won't get artifact not found :) >> >> If you'd looked at the PR diff, you'd see I've modified the description to refer directly to GridSearchCV and cross_val_score: >> >> In the inner loop (here executed by GridSearchCV), the score is approximately maximized by fitting a model to each training set, and then directly maximized in selecting (hyper)parameters over the validation set. In the outer loop (here in cross_val_score), ... >> >> Further comments in the code are welcome. >> >> On 29 November 2016 at 21:42, Albert Thomas wrote: >> I also get "artifact not found". And I agree with Daniel. >> >> Once you decompose what the code is doing you realize that it does the job. The simplicity of the code to perform nested cross validation using scikit learn objects is impressive but I guess it also makes it less obvious. So making the example clearer by explaining what the code does or by adding a few comments can be useful for others. >> >> Albert >> >> On Tue, 29 Nov 2016 at 11:19, Daniel Homola wrote: >> Hi Joel, >> >> Thanks a lot for the answer. >> "Each train/test split in cross_val_score holds out test data. GridSearchCV then splits each train set into (inner-)train and validation sets. " >> >> I know this is what nested CV supposed to do but the code is doing an excellent job at obscuring this. I'll try and add some clarification in as comments later today. >> >> Cheers, >> >> d >> >> On 29/11/16 00:07, Joel Nothman wrote: >>> If that clarifies, please offer changes to the example (as a pull request) that make this clearer. >>> >>> On 29 November 2016 at 11:06, Joel Nothman wrote: >>> Briefly: >>> >>> clf = GridSearchCV(estimator=svr, param_grid=p_grid, cv=inner_cv) >>> nested_score = cross_val_score(clf, X=X_iris, y=y_iris, cv=outer_cv) >>> >>> Each train/test split in cross_val_score holds out test data. GridSearchCV then splits each train set into (inner-)train and validation sets. There is no leakage of test set knowledge from the outer loop into the grid search optimisation; no leakage of validation set knowledge into the SVR optimisation. The outer test data are reused as training data, but within each split are only used to measure generalisation error. >>> >>> Is that clear? 
>>> >>> On 29 November 2016 at 10:30, Daniel Homola wrote:
>>> Dear all,
>>>
>>> I was wondering if the following example code is valid:
>>> http://scikit-learn.org/stable/auto_examples/model_selection/plot_nested_cross_validation_iris.html
>>>
>>> My understanding is, that the point of nested cross-validation is to
>>> prevent any data leakage from the inner grid-search/param optimization CV
>>> loop into the outer model evaluation CV loop. This could be achieved if the
>>> outer CV loop's test data is completely separated from the inner loop's CV,
>>> as shown here:
>>> https://mlr-org.github.io/mlr-tutorial/release/html/img/nested_resampling.png
>>>
>>> The code in the above example however doesn't seem to achieve this in
>>> any way.
>>>
>>> Am I missing something here?
>>>
>>> Thanks a lot,
>>> dh
>>>
>>> _______________________________________________
>>> scikit-learn mailing list
>>> scikit-learn at python.org
>>> https://mail.python.org/mailman/listinfo/scikit-learn
>>>
>> _______________________________________________
>> scikit-learn mailing list
>> scikit-learn at python.org
>> https://mail.python.org/mailman/listinfo/scikit-learn
>
> _______________________________________________
> scikit-learn mailing list
> scikit-learn at python.org
> https://mail.python.org/mailman/listinfo/scikit-learn

From alfonso82 at kaist.ac.kr Tue Nov 29 10:06:09 2016
From: alfonso82 at kaist.ac.kr (ALVARENGA GAMERO ALFONSO ABRAHAM)
Date: Wed, 30 Nov 2016 00:06:09 +0900 (KST)
Subject: [scikit-learn] Bugs in Tree.py
Message-ID: <583d9d003fe7_@_imoxion.com>

sklearn/tree/tree.py

With the new 0.18 version, it is possible to add percentage (float) values for "min_samples_split":

#.. versionchanged:: 0.18
#Added float values for percentages.

However, a value of 1 will make the program raise a ValueError (lines 195-199), since 1 is an integer and does not satisfy the condition of being bigger than or equal to 2. It is quite easy to solve by hand (if not 2 <= self.min_samples_split and self.min_samples_split != 1: in line 196), but I'm pretty sure there has to be a clever way to solve it. I might go back to that later, as there might be more bugs like this one with the new options in version 0.18.

Thank you!
-------------- next part --------------
An HTML attachment was scrubbed...
URL: From nfliu at uw.edu Tue Nov 29 13:44:52 2016
From: nfliu at uw.edu (Nelson Liu)
Date: Tue, 29 Nov 2016 18:44:52 +0000
Subject: [scikit-learn] Bugs in Tree.py
In-Reply-To: <583d9d003fe7_@_imoxion.com>
References: <583d9d003fe7_@_imoxion.com>
Message-ID: Hi,
I think this is working as the docs say; 1 is an integer and is thus treated as a raw number of samples. If you wanted a percentage value of 100%, you'd have to pass in the float 1.0.
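To make the int/float distinction concrete, here is a small sketch; the dataset and parameter values are made up for illustration and are not from this thread.

# Sketch: how min_samples_split is interpreted in scikit-learn 0.18.
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)  # 150 samples

# int: an absolute count; a node must hold at least 10 samples to be split.
DecisionTreeClassifier(min_samples_split=10).fit(X, y)

# float: a fraction of n_samples; here ceil(0.1 * 150) = 15 samples.
DecisionTreeClassifier(min_samples_split=0.1).fit(X, y)

# "100%" has to be spelled as the float 1.0; the int 1 is rejected because
# a split always needs at least 2 samples.
DecisionTreeClassifier(min_samples_split=1.0).fit(X, y)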
I recall a related issue being raised here: https://github.com/scikit-learn/scikit-learn/issues/7603 Also, I don't see how lines 195-199 in tree.py would issue a value error...could you recheck the line numbers? Nelson Liu On Tue, Nov 29, 2016 at 7:15 AM ALVARENGA GAMERO ALFONSO ABRAHAM < alfonso82 at kaist.ac.kr> wrote: > sklearn/tree/tree.py > > > > With the new 0.18 version, it is possible to add percentages values for > "min_samples_split" > > > > #.. versionchanged:: 0.18 > > #Added float values for percentages. > > > > How ever, a value of 1 will make the program to issue an ValueError (lines > 195-199), since 1 is an Integer and does not hold the condition of being > bigge ror equal than 2. It is quite easy to solve by hand (if not 2 <= > self.min_samples_split and self.min_samples_split != 1: in line 196), but > I'm pretty sure there has to be a clever way to solve it. I might go back > to that later, as there might be more bugs as this one with the new options > in version 0.18. > > > > Thank you! > _______________________________________________ > scikit-learn mailing list > scikit-learn at python.org > https://mail.python.org/mailman/listinfo/scikit-learn > -------------- next part -------------- An HTML attachment was scrubbed... URL: From joel.nothman at gmail.com Tue Nov 29 14:24:26 2016 From: joel.nothman at gmail.com (Joel Nothman) Date: Wed, 30 Nov 2016 06:24:26 +1100 Subject: [scikit-learn] Bugs in Tree.py In-Reply-To: References: <583d9d003fe7_@_imoxion.com> Message-ID: "percentages" should be "fractions" or "proportions". On 30 November 2016 at 05:44, Nelson Liu wrote: > Hi, > I think this is working as the docs say; 1 is an integer and is thus > treated as a raw number of samples. If you wanted a percentage value of > 100%, you'd have to pass in the float 1.0. I recall a related issue being > raised here: https://github.com/scikit-learn/scikit-learn/issues/7603 > > Also, I don't see how lines 195-199 in tree.py would issue a value > error...could you recheck the line numbers? > > Nelson Liu > > On Tue, Nov 29, 2016 at 7:15 AM ALVARENGA GAMERO ALFONSO ABRAHAM < > alfonso82 at kaist.ac.kr> wrote: > >> sklearn/tree/tree.py >> >> >> >> With the new 0.18 version, it is possible to add percentages values for >> "min_samples_split" >> >> >> >> #.. versionchanged:: 0.18 >> >> #Added float values for percentages. >> >> >> >> How ever, a value of 1 will make the program to issue an ValueError >> (lines 195-199), since 1 is an Integer and does not hold the condition of >> being bigge ror equal than 2. It is quite easy to solve by hand (if not 2 >> <= self.min_samples_split and self.min_samples_split != 1: in line 196), >> but I'm pretty sure there has to be a clever way to solve it. I might go >> back to that later, as there might be more bugs as this one with the new >> options in version 0.18. >> >> >> >> Thank you! >> _______________________________________________ >> scikit-learn mailing list >> scikit-learn at python.org >> https://mail.python.org/mailman/listinfo/scikit-learn >> > > _______________________________________________ > scikit-learn mailing list > scikit-learn at python.org > https://mail.python.org/mailman/listinfo/scikit-learn > > -------------- next part -------------- An HTML attachment was scrubbed... 
URL: From tommaso.costanzo01 at gmail.com Wed Nov 30 12:17:15 2016 From: tommaso.costanzo01 at gmail.com (Tommaso Costanzo) Date: Wed, 30 Nov 2016 12:17:15 -0500 Subject: [scikit-learn] Bayesian Gaussian Mixture In-Reply-To: <3bd52b79-f822-886e-e29c-bbe7487f42fd@gmail.com> References: <3bd52b79-f822-886e-e29c-bbe7487f42fd@gmail.com> Message-ID: Dear Andreas, thank you so much for your answser now I can see my mistake. What I am trying to do is convince myself that the fact that when I analyze my data I am getting probability of only 0 and 1 is it because the data are well separated so I was trying to make some synthetic data where there is a probabioity different from 0 or 1, but I did it in the wrong way. Does it sounds correct if I make 300 samples with random number centered at 0 and STD 1 and other 300 centered at 0.5 and then adding some samples in between these two gaussian distributions (say in between 0.15 and 0.35)? In this case I think that I should expect probability different from 0 or 1 in the two components (when using 2 components). Thank you in advance Tommaso On Nov 28, 2016 11:58 AM, "Andreas Mueller" wrote: > Hi Tommaso. > So what's the issue? The distributions are very distinct, so there is no > confusion. > The higher the dimensionality, the further apart the points are (compare > the distance between (-1, 1) and (1, -1) to the one between (-1, -.5, 0, > .5, 1) and (1, .5, 0, -.5, -1). > I'm not sure what you mean by "the cross in the middle". > You create two fixed points, one at np.arange(-1,1, 2.0/nfeatures) and one > at np.arange(1,-1, (-2.0/nfeatures)). In high dimensions, these points are > very far apart. > Then you add standard normal noise to it. So this data is two perfect > Gaussians. In low dimensions, they are "close together" so there is some > confusion, > in high dimensions, they are "far apart" so there is less confusion. > > Hth, > Andy > > On 11/27/2016 11:47 AM, Tommaso Costanzo wrote: > > Hi Jacob, > > I have just changed my code from BayesianGaussianMixture to > GaussianMixture, and the results is the same. I attached here the picture > of the first component when I runned the code with 5, 10, and 50 nfeatures > and 2 components. In my short test function I expect to have point that > they can be in one component as well as another has visible for small > number of nfeatures, but 0 1 for nfeatures >50 does not sounds correct. > Seems that is just related to the size of the model and in particular to > the number of features. With the BayesianGaussianMixture I have seen that > it is sligthly better to increase the degree of freedoms to 2*nfeatures > instead of the default nfeatures. However, this does not change the result > when the nfeatures are 50 or more. > > Thank you in advance > Tommaso > > 2016-11-25 21:32 GMT-05:00 Jacob Schreiber : > >> Typically this means that the model is so confident in its predictions it >> does not believe it possible for the sample to come from the other >> component. Do you get the same results with a regular GaussianMixture? >> >> On Fri, Nov 25, 2016 at 11:34 AM, Tommaso Costanzo < >> tommaso.costanzo01 at gmail.com> wrote: >> >>> Hi, >>> >>> I am facing some problem with the "BayesianGaussianMixture" function, >>> but I do not know if it is because of my poor knowledge on this type of >>> statistics or if it is something related to the algorithm. 
I have set of >>> data of around 1000 to 4000 observation (every feature is a spectrum of >>> around 200 point) so in the end I have n_samples = ~1000 and n_features = >>> ~20. The good things is that I am getting the same results of KMeans >>> however the "predict_proba" has value only of 0 or 1. >>> >>> I have wrote a small function to simulate my problem with random data >>> that is reported below. The first 1/2 of the array has the point with a >>> positive slope while the second 1/2 has a negative slope, so the cross in >>> the middle. What I have seen is that for a small number of features I >>> obtain good probability, but if the number of features increases (say 50) >>> than the probability become only 0 or 1. >>> Can someone help me in interpret this result? >>> >>> Here is the code I wrote with the generated random number, I'll >>> generally run it with ncomponent=2 and nfeatures=5 or 10 or 50 or 100. I am >>> not sure if it will work in every case is not very highly tested. I have >>> also attached as a file! >>> >>> ############################################################ >>> ############## >>> import numpy as np >>> >>> from sklearn.mixture import GaussianMixture, >>> BayesianGaussianMixture >>> import matplotlib.pyplot as plt >>> >>> >>> >>> def test_bgm(ncomponent, nfeatures): >>> >>> temp = np.random.randn(500,nfeatures) >>> >>> temp = temp + np.arange(-1,1, 2.0/nfeatures) >>> >>> temp1 = np.random.randn(400,nfeatures) >>> >>> temp1 = temp1 + np.arange(1,-1, (-2.0/nfeatures)) >>> >>> X = np.vstack((temp, temp1)) >>> >>> >>> >>> bgm = BayesianGaussianMixture(ncomponent,degrees_of_freedom_prior=nfeatures*2).fit(X) >>> >>> bgm_proba = bgm.predict_proba(X) >>> >>> bgm_labels = bgm.predict(X) >>> >>> >>> >>> plt.figure(-1) >>> >>> plt.imshow(bgm_labels.reshape(30,-1), origin='lower', >>> interpolatio='none') >>> plt.colorbar() >>> >>> >>> >>> for i in np.arange(0,ncomponent): >>> >>> plt.figure(i) >>> >>> plt.imshow(bgm_proba[:,i].reshape(30,-1), origin='lower', >>> interpolatio='none') >>> plt.colorbar() >>> >>> >>> >>> plt.show() >>> ############################################################ >>> ################## >>> >>> Thank you in advance >>> Tommaso >>> >>> >>> -- >>> Please do NOT send Microsoft Office Attachments: >>> http://www.gnu.org/philosophy/no-word-attachments.html >>> >>> _______________________________________________ >>> scikit-learn mailing list >>> scikit-learn at python.org >>> https://mail.python.org/mailman/listinfo/scikit-learn >>> >>> >> >> _______________________________________________ >> scikit-learn mailing list >> scikit-learn at python.org >> https://mail.python.org/mailman/listinfo/scikit-learn >> >> > > > -- > Please do NOT send Microsoft Office Attachments: > http://www.gnu.org/philosophy/no-word-attachments.html > > > _______________________________________________ > scikit-learn mailing listscikit-learn at python.orghttps://mail.python.org/mailman/listinfo/scikit-learn > > > > _______________________________________________ > scikit-learn mailing list > scikit-learn at python.org > https://mail.python.org/mailman/listinfo/scikit-learn > > -------------- next part -------------- An HTML attachment was scrubbed... 
URL: From t3kcit at gmail.com Wed Nov 30 15:50:39 2016 From: t3kcit at gmail.com (Andreas Mueller) Date: Wed, 30 Nov 2016 15:50:39 -0500 Subject: [scikit-learn] Bayesian Gaussian Mixture In-Reply-To: References: <3bd52b79-f822-886e-e29c-bbe7487f42fd@gmail.com> Message-ID: <65a64e5d-fe65-46f0-05c7-23d664ccfce2@gmail.com> There are plenty of examples and plots on the scikit-learn website. On 11/30/2016 12:17 PM, Tommaso Costanzo wrote: > > Dear Andreas, > > thank you so much for your answser now I can see my mistake. What I am > trying to do is convince myself that the fact that when I analyze my > data I am getting probability of only 0 and 1 is it because the data > are well separated so I was trying to make some synthetic data where > there is a probabioity different from 0 or 1, but I did it in the > wrong way. Does it sounds correct if I make 300 samples with random > number centered at 0 and STD 1 and other 300 centered at 0.5 and then > adding some samples in between these two gaussian distributions (say > in between 0.15 and 0.35)? In this case I think that I should expect > probability different from 0 or 1 in the two components (when using 2 > components). > > Thank you in advance > Tommaso > > On Nov 28, 2016 11:58 AM, "Andreas Mueller" > wrote: > > Hi Tommaso. > So what's the issue? The distributions are very distinct, so there > is no confusion. > The higher the dimensionality, the further apart the points are > (compare the distance between (-1, 1) and (1, -1) to the one > between (-1, -.5, 0, .5, 1) and (1, .5, 0, -.5, -1). > I'm not sure what you mean by "the cross in the middle". > You create two fixed points, one at np.arange(-1,1, 2.0/nfeatures) > and one at np.arange(1,-1, (-2.0/nfeatures)). In high dimensions, > these points are very far apart. > Then you add standard normal noise to it. So this data is two > perfect Gaussians. In low dimensions, they are "close together" so > there is some confusion, > in high dimensions, they are "far apart" so there is less confusion. > > Hth, > Andy > > On 11/27/2016 11:47 AM, Tommaso Costanzo wrote: >> Hi Jacob, >> >> I have just changed my code from BayesianGaussianMixture to >> GaussianMixture, and the results is the same. I attached here the >> picture of the first component when I runned the code with 5, 10, >> and 50 nfeatures and 2 components. In my short test function I >> expect to have point that they can be in one component as well as >> another has visible for small number of nfeatures, but 0 1 for >> nfeatures >50 does not sounds correct. Seems that is just >> related to the size of the model and in particular to the number >> of features. With the BayesianGaussianMixture I have seen that it >> is sligthly better to increase the degree of freedoms to >> 2*nfeatures instead of the default nfeatures. However, this does >> not change the result when the nfeatures are 50 or more. >> >> Thank you in advance >> Tommaso >> >> 2016-11-25 21:32 GMT-05:00 Jacob Schreiber >> >: >> >> Typically this means that the model is so confident in its >> predictions it does not believe it possible for the sample to >> come from the other component. Do you get the same results >> with a regular GaussianMixture? >> >> On Fri, Nov 25, 2016 at 11:34 AM, Tommaso Costanzo >> > > wrote: >> >> Hi, >> >> I am facing some problem with the >> "BayesianGaussianMixture" function, but I do not know if >> it is because of my poor knowledge on this type of >> statistics or if it is something related to the >> algorithm. 
I have set of data of around 1000 to 4000 >> observation (every feature is a spectrum of around 200 >> point) so in the end I have n_samples = ~1000 and >> n_features = ~20. The good things is that I am getting >> the same results of KMeans however the "predict_proba" >> has value only of 0 or 1. >> >> I have wrote a small function to simulate my problem with >> random data that is reported below. The first 1/2 of the >> array has the point with a positive slope while the >> second 1/2 has a negative slope, so the cross in the >> middle. What I have seen is that for a small number of >> features I obtain good probability, but if the number of >> features increases (say 50) than the probability become >> only 0 or 1. >> Can someone help me in interpret this result? >> >> Here is the code I wrote with the generated random >> number, I'll generally run it with ncomponent=2 and >> nfeatures=5 or 10 or 50 or 100. I am not sure if it will >> work in every case is not very highly tested. I have also >> attached as a file! >> >> ########################################################################## >> import numpy as np >> from sklearn.mixture import GaussianMixture, >> BayesianGaussianMixture >> import matplotlib.pyplot as plt >> >> def test_bgm(ncomponent, nfeatures): >> temp = np.random.randn(500,nfeatures) >> temp = temp + np.arange(-1,1, 2.0/nfeatures) >> temp1 = np.random.randn(400,nfeatures) >> temp1 = temp1 + np.arange(1,-1, (-2.0/nfeatures)) >> X = np.vstack((temp, temp1)) >> >> bgm = >> BayesianGaussianMixture(ncomponent,degrees_of_freedom_prior=nfeatures*2).fit(X) >> >> bgm_proba = bgm.predict_proba(X) >> bgm_labels = bgm.predict(X) >> >> plt.figure(-1) >> plt.imshow(bgm_labels.reshape(30,-1), origin='lower', >> interpolatio='none') >> plt.colorbar() >> >> for i in np.arange(0,ncomponent): >> plt.figure(i) >> plt.imshow(bgm_proba[:,i].reshape(30,-1), origin='lower', >> interpolatio='none') >> plt.colorbar() >> >> plt.show() >> ############################################################################## >> >> Thank you in advance >> Tommaso >> >> >> -- >> Please do NOT send Microsoft Office Attachments: >> http://www.gnu.org/philosophy/no-word-attachments.html >> >> >> _______________________________________________ >> scikit-learn mailing list >> scikit-learn at python.org >> https://mail.python.org/mailman/listinfo/scikit-learn >> >> >> >> >> _______________________________________________ >> scikit-learn mailing list >> scikit-learn at python.org >> https://mail.python.org/mailman/listinfo/scikit-learn >> >> >> >> >> >> -- >> Please do NOT send Microsoft Office Attachments: >> http://www.gnu.org/philosophy/no-word-attachments.html >> >> >> >> _______________________________________________ >> scikit-learn mailing list >> scikit-learn at python.org >> https://mail.python.org/mailman/listinfo/scikit-learn >> > _______________________________________________ scikit-learn > mailing list scikit-learn at python.org > > https://mail.python.org/mailman/listinfo/scikit-learn > > > _______________________________________________ > scikit-learn mailing list > scikit-learn at python.org > https://mail.python.org/mailman/listinfo/scikit-learn -------------- next part -------------- An HTML attachment was scrubbed... URL: From t3kcit at gmail.com Wed Nov 30 15:49:19 2016 From: t3kcit at gmail.com (Andreas Mueller) Date: Wed, 30 Nov 2016 15:49:19 -0500 Subject: [scikit-learn] Problem with nested cross-validation example? 
In-Reply-To: References: Message-ID: On 11/29/2016 09:10 AM, Sebastian Raschka wrote:
> I have an ipynb where I did the nested CV more "manually" in sklearn 0.17 vs sklearn 0.18 - I intended to add it as an appendix to a blog article (model eval part 4), which I had no chance to write, yet. Maybe the sklearn 0.17 part is a bit more obvious (although way less elegant) than the sklearn 0.18 version and is helpful in some sort to see what's going on: https://github.com/rasbt/pattern_classification/blob/master/data_viz/model-evaluation-articles/nested_cv_code.ipynb (haven't had a chance to add comments yet, though).
I also got a manual vs sklearn here ;)
https://github.com/amueller/introduction_to_ml_with_python/blob/master/05-model-evaluation-and-improvement.ipynb
Though the explanation is behind a paywall scattered over dead trees :-/
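For readers who would rather not dig through the notebooks linked above, a rough sketch of the "manual" formulation follows. The estimator, parameter grid and fold counts are placeholders, and the sketch is only meant to mirror what cross_val_score around a GridSearchCV does in 0.18, not to reproduce either notebook.

# Rough sketch of nested CV with the outer loop written out by hand
# (scikit-learn 0.18 API); estimator and grid are placeholders.
import numpy as np
from sklearn.datasets import load_iris
from sklearn.model_selection import GridSearchCV, KFold
from sklearn.svm import SVC

X, y = load_iris(return_X_y=True)
p_grid = {"C": [1, 10, 100], "gamma": [0.01, 0.1]}

inner_cv = KFold(n_splits=4, shuffle=True, random_state=0)
outer_cv = KFold(n_splits=4, shuffle=True, random_state=1)

outer_scores = []
for train_idx, test_idx in outer_cv.split(X):
    # Inner loop: hyperparameter search confined to the outer training fold.
    gs = GridSearchCV(SVC(kernel="rbf"), param_grid=p_grid, cv=inner_cv)
    gs.fit(X[train_idx], y[train_idx])
    # Outer loop: score the refit best model on the held-out outer test fold.
    outer_scores.append(gs.score(X[test_idx], y[test_idx]))

# Comparable to: cross_val_score(GridSearchCV(...), X, y, cv=outer_cv).mean()
print(np.mean(outer_scores))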