From s.atasever at gmail.com Mon Jul 3 10:09:44 2017
From: s.atasever at gmail.com (Sema Atasever)
Date: Mon, 3 Jul 2017 17:09:44 +0300
Subject: [scikit-learn] Construct the microclusters using a CF-Tree
In-Reply-To: <75468a69-ba3a-ca8a-7b1c-b477f7d6f08e@gmail.com>
References: <75468a69-ba3a-ca8a-7b1c-b477f7d6f08e@gmail.com>
Message-ID: 

Dear Roman,

When I try the code with the original data (*data.dat*) as you suggested, I get the following error: *Memory Error* --> (*error.png*). How can I overcome this problem? Thank you so much in advance.

data.dat

On Fri, Jun 30, 2017 at 5:42 PM, Roman Yurchak wrote:
> Hello Sema,
>
> On 30/06/17 17:14, Sema Atasever wrote:
>
>> I want to cluster them using the Birch clustering algorithm.
>> Does this method have a 'precomputed' option?
>
> No it doesn't, see
> http://scikit-learn.org/stable/modules/generated/sklearn.cluster.Birch.html
> so you would need to provide it with the original features matrix (not the
> precomputed distance matrix). Since your dataset is fairly small, there is
> no reason to precompute it anyway.
>
>> I need to train an SVM on the centroids of the microclusters, so
>> *how can I get the centroids of the microclusters?*
>
> By "microclusters" do you mean sub-clusters? If you are interested in the
> leaves subclusters, see the Birch.subcluster_centers_ parameter.
>
> Otherwise, if you want all the centroids in the hierarchy of subclusters,
> you can browse the hierarchical tree via the Birch.root_ attribute and
> then look at _CFSubcluster.centroid_ for each subcluster.
>
> Hope this helps,
> --
> Roman
> _______________________________________________
> scikit-learn mailing list
> scikit-learn at python.org
> https://mail.python.org/mailman/listinfo/scikit-learn

-------------- next part --------------
An HTML attachment was scrubbed...
URL: 

-------------- next part --------------
A non-text attachment was scrubbed...
Name: error.png
Type: image/png
Size: 74377 bytes
Desc: not available
URL: 

From betatim at gmail.com Mon Jul 3 10:11:40 2017
From: betatim at gmail.com (Tim Head)
Date: Mon, 03 Jul 2017 14:11:40 +0000
Subject: [scikit-learn] Scikit-learn workshop and sprint at EuroScipy 2017 in Erlangen
In-Reply-To: References: Message-ID: 

Hey,

On Wed, Jun 28, 2017 at 9:42 AM Olivier Grisel wrote:
> Do you have any suggestions? The workshop duration is 90 min.

Looks like a good setup. Two thoughts: should we construct an example that uses a pipeline, to illustrate the point that you should put your whole pipeline into your grid search/CV? Start with intro-to-scikit-learn slides, then a live demo, and if there is time left, the what's new. 90 minutes isn't very long :-/

T

-------------- next part --------------
An HTML attachment was scrubbed...
URL: 

From rth.yurchak at gmail.com Mon Jul 3 16:46:03 2017
From: rth.yurchak at gmail.com (Roman Yurchak)
Date: Mon, 3 Jul 2017 23:46:03 +0300
Subject: [scikit-learn] Construct the microclusters using a CF-Tree
In-Reply-To: References: <75468a69-ba3a-ca8a-7b1c-b477f7d6f08e@gmail.com> Message-ID: 

Hello Sema,

as far as I can tell, in your dataset you have n_samples=65909, n_features=539. Clustering high-dimensional data is problematic for a number of reasons, see
https://en.wikipedia.org/wiki/Clustering_high-dimensional_data#Problems
Besides, the BIRCH implementation doesn't scale well for n_features >> 50 (see for instance the discussion in the second part of
https://github.com/scikit-learn/scikit-learn/pull/8808#issuecomment-300776216).

As a workaround for the memory error, you could try using the out-of-core version of Birch (using `partial_fit` on chunks of the dataset, instead of `fit`), but in any case it might also be better to reduce dimensionality beforehand (e.g. with PCA), if that's acceptable.
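A minimal sketch of that out-of-core route, combining IncrementalPCA with Birch.partial_fit (the chunk size, the number of components, and the random stand-in data are illustrative choices, not taken from this thread):

```python
import numpy as np
from sklearn.cluster import Birch
from sklearn.decomposition import IncrementalPCA

rng = np.random.RandomState(0)
X = rng.rand(3000, 100)              # stand-in for the real (65909, 539) data

ipca = IncrementalPCA(n_components=20)
brc = Birch(n_clusters=None, threshold=0.5)  # threshold likely needs tuning

chunk = 500
# Pass 1: fit the dimensionality reduction incrementally, chunk by chunk.
for i in range(0, X.shape[0], chunk):
    ipca.partial_fit(X[i:i + chunk])

# Pass 2: feed the reduced chunks to Birch out-of-core.
for i in range(0, X.shape[0], chunk):
    brc.partial_fit(ipca.transform(X[i:i + chunk]))

centroids = brc.subcluster_centers_   # shape (n_subclusters, 20)
```

Only one chunk is ever held in memory at a time, which is the point of the workaround.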
Also the threshold parameter may need to be increased: since in your dataset it looks like the Euclidean distances are more in the 1-10 range? -- Roman On 03/07/17 17:09, Sema Atasever wrote: > Dear Roman, > > When I try the code with the original data (*data.dat*) as you > suggested, I get the following error : *Memory Error* --> (*error.png*), > how can i overcome this problem, thank you so much in advance. > ? > data.dat > > ? > > On Fri, Jun 30, 2017 at 5:42 PM, Roman Yurchak > wrote: > > Hello Sema, > > On 30/06/17 17:14, Sema Atasever wrote: > > I want to cluster them using Birch clustering algorithm. > Does this method have 'precomputed' option. > > > No it doesn't, see > http://scikit-learn.org/stable/modules/generated/sklearn.cluster.Birch.html > > so you would need to provide it with the original features matrix > (not the precomputed distance matrix). Since your dataset is fairly > small, there is no reason in precomputing it anyway. > > I needed train an SVM on the centroids of the microclusters so > *How can i get the centroids of the microclusters?* > > > By "microclusters" do you mean sub-clusters? If you are interested > in the leaves subclusters see the Birch.subcluster_centers_ parameter. > > Otherwise if you want all the centroids in the hierarchy of > subclusters, you can browse the hierarchical tree via the > Birch.root_ attribute then look at _CFSubcluster.centroid_ for each > subcluster. 
> > Hope this helps, > -- > Roman > _______________________________________________ > scikit-learn mailing list > scikit-learn at python.org > https://mail.python.org/mailman/listinfo/scikit-learn > > > From goix.nicolas at gmail.com Wed Jul 5 05:06:49 2017 From: goix.nicolas at gmail.com (Nicolas Goix) Date: Wed, 5 Jul 2017 11:06:49 +0200 Subject: [scikit-learn] Machine learning for PU data In-Reply-To: References: Message-ID: Hello, As mentioned by Roman, you can try the one-class scikit-learn algorithms such as OneClassSVM, IsolationForest, LocalOutlierFactor (with the private predict method) or EllipticEnvelope. Hope this helps Nicolas On Fri, Jun 30, 2017 at 3:39 PM, Roman Yurchak wrote: > Hello Ruchika, > > I don't think that scikit-learn currently has algorithms that can train > with positive and unlabeled class labels only. However, you could try one > of the following compatible wrappers, > - http://nktmemo.github.io/jekyll/update/2015/11/07/pu_classif > ication.html > - https://github.com/scikit-learn/scikit-learn/pull/371 > > (haven't tried them myself). > > Also, you could try one class SVM as suggested here > https://stackoverflow.com/questions/25700724/binary-semi- > supervised-classification-with-positive-only-and-unlabeled-data-set > > -- > Roman > > > > > On 30/06/17 16:06, Ruchika Nayyar wrote: > >> Hi All, >> >> I am a scikit-learn user and have a question for the community, if >> anyone has applied any available machine learning algorithms in the >> scikit-learn package for data with positive and unlabeled class only? If >> so would you share some insight with me. I understand this could be a >> broader topic but I am new to analyzing PU data and hence can use some >> help. 
>> Thanks,
>> Ruchika
>>
>> _______________________________________________
>> scikit-learn mailing list
>> scikit-learn at python.org
>> https://mail.python.org/mailman/listinfo/scikit-learn
>
> _______________________________________________
> scikit-learn mailing list
> scikit-learn at python.org
> https://mail.python.org/mailman/listinfo/scikit-learn

-------------- next part --------------
An HTML attachment was scrubbed...
URL: 

From s.atasever at gmail.com Wed Jul 5 06:27:58 2017
From: s.atasever at gmail.com (Sema Atasever)
Date: Wed, 5 Jul 2017 13:27:58 +0300
Subject: [scikit-learn] Construct the microclusters using a CF-Tree
In-Reply-To: References: <75468a69-ba3a-ca8a-7b1c-b477f7d6f08e@gmail.com> Message-ID: 

Hi Roman,

I reduced my original data set with feature selection; it now has n_samples=10467, n_features=23. I tried clustering with the Birch algorithm and this time it worked. I obtained 35 clusters for the reduced dataset in the attachment (data2.dat).

How can I know which cluster member best represents each cluster? For example, Cluster 0 has 5 members, which are rows 1, 2, 3, 28 and 29 in the data set. Which cluster member (1, 2, 3, 28 or 29) best represents Cluster 0?

In the Birch code I use this line:

*centroids = brc.subcluster_centers_*

How do I interpret the output of this line? Thank you so much for your help.

*Birch Code:*

from sklearn.cluster import Birch
from io import StringIO
import numpy as np

X = np.loadtxt(open("C:\data2.dat", "rb"), delimiter=",")
brc = Birch(branching_factor=50, n_clusters=None, threshold=0.5,
            compute_labels=True, copy=True)
brc.fit(X)
centroids = brc.subcluster_centers_
labels = brc.subcluster_labels_
brc.predict(X)

print("\n brc.predict(X)")
print(brc.predict(X))
print("\n centroids")
print(centroids)
print("\n labels")
print(labels)

On Mon, Jul 3, 2017 at 11:46 PM, Roman Yurchak wrote:
> Hello Sema,
>
> as far as I can tell, in your dataset you have n_samples=65909,
> n_features=539.
Clustering high dimensional data is problematic for a > number of reasons, https://en.wikipedia.org/wiki/ > Clustering_high-dimensional_data#Problems > > besides the BIRCH implementation doesn't scale well for n_features >> 50 > (see for instance the discussion in the second part of > https://github.com/scikit-learn/scikit-learn/pull/8808#issue > comment-300776216 also in ). > > As a workaround for the memory error, you could try using the out-of-core > version of Birch (using `partial_fit` on chunks of the dataset, instead of > `fit`) but in any case it might also be better to reduce dimensionality > beforehand (e.g. with PCA), if that's acceptable. Also the threshold > parameter may need to be increased: since in your dataset it looks like the > Euclidean distances are more in the 1-10 range? > > -- > Roman > > > On 03/07/17 17:09, Sema Atasever wrote: > >> Dear Roman, >> >> When I try the code with the original data (*data.dat*) as you >> suggested, I get the following error : *Memory Error* --> (*error.png*), >> how can i overcome this problem, thank you so much in advance. >> ? >> data.dat >> > k/view?usp=drive_web> >> ? >> >> On Fri, Jun 30, 2017 at 5:42 PM, Roman Yurchak > > wrote: >> >> Hello Sema, >> >> On 30/06/17 17:14, Sema Atasever wrote: >> >> I want to cluster them using Birch clustering algorithm. >> Does this method have 'precomputed' option. >> >> >> No it doesn't, see >> http://scikit-learn.org/stable/modules/generated/sklearn. >> cluster.Birch.html >> > cluster.Birch.html> >> so you would need to provide it with the original features matrix >> (not the precomputed distance matrix). Since your dataset is fairly >> small, there is no reason in precomputing it anyway. >> >> I needed train an SVM on the centroids of the microclusters so >> *How can i get the centroids of the microclusters?* >> >> >> By "microclusters" do you mean sub-clusters? If you are interested >> in the leaves subclusters see the Birch.subcluster_centers_ parameter. 
>> >> Otherwise if you want all the centroids in the hierarchy of
>> subclusters, you can browse the hierarchical tree via the
>> Birch.root_ attribute then look at _CFSubcluster.centroid_ for each
>> subcluster.
>>
>> Hope this helps,
>> --
>> Roman
>> _______________________________________________
>> scikit-learn mailing list
>> scikit-learn at python.org
>> https://mail.python.org/mailman/listinfo/scikit-learn

-------------- next part --------------
An HTML attachment was scrubbed...
URL: 

-------------- next part --------------
A non-text attachment was scrubbed...
Name: screen_shot.png
Type: image/png
Size: 103493 bytes
Desc: not available
URL: 

-------------- next part --------------
A non-text attachment was scrubbed...
Name: data2.dat
Type: application/octet-stream
Size: 18776 bytes
Desc: not available
URL: 

From axelbreuer at yahoo.com Thu Jul 6 05:48:23 2017
From: axelbreuer at yahoo.com (axel breuer)
Date: Thu, 6 Jul 2017 09:48:23 +0000 (UTC)
Subject: [scikit-learn] Typo in online documentation on Matrix Factorization
References: <1582496995.5548742.1499334503648.ref@mail.yahoo.com>
Message-ID: <1582496995.5548742.1499334503648@mail.yahoo.com>

Hi,

First of all, I would like to warmly thank the scikit-learn developer community for providing us with such a high quality ML library: it has really become an amazing piece of scientific software.

I have a comment concerning the online documentation on Matrix Factorization Problems. (I use this mailing list because I could not find, in your online howto, the best channel for reporting documentation issues. Apologies if this email is considered spam on this mailing list!)

On the webpage "2.5. Decomposing signals in components (matrix factorization problems)" of the scikit-learn 0.18.2 documentation, we can read one formula at "2.5.1.5. Sparse principal components analysis" [formula image scrubbed], but a bit further, at "2.5.3.2. Generic dictionary learning", we can read another [formula image scrubbed]. The notations are obviously inconsistent, as U and V have been interchanged somehow.

Two extra (less important) corrections could probably improve the clarity for the reader even further:
1. Sticking to a single upper bound limit (either n_components or n_atoms)
2. Specifying whether V_k are columns or rows (maybe using a notation à la Matlab/Numpy: V_{:,k} or V_{k,:})

Kind regards,

Axel BREUER

-------------- next part --------------
An HTML attachment was scrubbed...
URL: 

-------------- next part --------------
A non-text attachment was scrubbed...
Name: blob.jpg
Type: image/png
Size: 12836 bytes
Desc: not available
URL: 

-------------- next part --------------
A non-text attachment was scrubbed...
Name: blob.jpg
Type: image/png
Size: 16303 bytes
Desc: not available
URL: 

From olivier.grisel at ensta.org Thu Jul 6 09:11:55 2017
From: olivier.grisel at ensta.org (Olivier Grisel)
Date: Thu, 6 Jul 2017 15:11:55 +0200
Subject: [scikit-learn] Typo in online documentation on Matrix Factorization
In-Reply-To: References: <1582496995.5548742.1499334503648.ref@mail.yahoo.com> <1582496995.5548742.1499334503648@mail.yahoo.com>
Message-ID: 

2017-07-06 15:10 GMT+02:00 Olivier Grisel :
> (and just make sure that the "components" is a synonym for "dictionary
> atoms" in the literature).

Actually I meant: and just make sure that our documentation states explicitly that "components" is a synonym for "dictionary atoms" in the literature.
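Relatedly, the U/V shape convention under discussion can be checked directly against the API (a quick sketch; the matrix sizes are arbitrary):

```python
import numpy as np
from sklearn.decomposition import DictionaryLearning

X = np.random.RandomState(0).rand(20, 8)   # (n_samples, n_features)

dl = DictionaryLearning(n_components=5, max_iter=10, random_state=0).fit(X)
U = dl.transform(X)    # the code / activations
V = dl.components_     # the dictionary / components

print(U.shape)  # (20, 5) -> (n_samples, n_components)
print(V.shape)  # (5, 8)  -> (n_components, n_features)
```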
-- Olivier
http://twitter.com/ogrisel - http://github.com/ogrisel

From olivier.grisel at ensta.org Thu Jul 6 09:10:26 2017
From: olivier.grisel at ensta.org (Olivier Grisel)
Date: Thu, 6 Jul 2017 15:10:26 +0200
Subject: [scikit-learn] Typo in online documentation on Matrix Factorization
In-Reply-To: <1582496995.5548742.1499334503648@mail.yahoo.com>
References: <1582496995.5548742.1499334503648.ref@mail.yahoo.com> <1582496995.5548742.1499334503648@mail.yahoo.com>
Message-ID: 

I think the documentation is correct. U, a.k.a. "the code" or "the activations", has shape (n_samples, n_components), and V, a.k.a. "the dictionary" or "the components", has shape (n_components, n_features) in both cases.

We could use n_components uniformly instead of n_atoms for consistency's sake (and just make sure that "components" is a synonym for "dictionary atoms" in the literature).

I think V_k is fine because the dimension with size n_components is the first dimension of V.

If you spot issues or other things that are unclear or incomplete in the doc, please feel free to open an issue on github. You can also directly submit a pull request if you are familiar with git. The website is built from the docs that live in the "doc/" subfolder of the repo.

-- Olivier

-------------- next part --------------
An HTML attachment was scrubbed...
URL: 

From greina at eng.ucsd.edu Thu Jul 6 12:05:38 2017
From: greina at eng.ucsd.edu (G Reina)
Date: Thu, 6 Jul 2017 09:05:38 -0700
Subject: [scikit-learn] Replacing the Boston Housing Prices dataset
Message-ID: 

I'd like to request that the "Boston Housing Prices" dataset in sklearn (sklearn.datasets.load_boston) be replaced with the "Ames Housing Prices" dataset (https://ww2.amstat.org/publications/jse/v19n3/decock.pdf). I am willing to submit the code change if the developers agree.

The Boston dataset has the feature "Bk is the proportion of blacks in town". It is an incredibly racist "feature" to include in any dataset.
I think it is beneath us as data scientists.

I submit that the Ames dataset is a viable alternative for learning regression. The author has shown that the dataset is a more robust replacement for Boston. Ames is a 2011 regression dataset on housing prices and has more than 5 times as many training examples, with over 7 times as many features (none of which are morally questionable).

I welcome the community's thoughts on the matter.

Thanks.
-Tony

Here's an article I wrote on the Boston dataset:
https://www.linkedin.com/pulse/hidden-racism-data-science-g-anthony-reina?trk=v-feed&lipi=urn%3Ali%3Apage%3Ad_flagship3_feed%3Bmu67f2GSzj5xHMpSD6M00A%3D%3D

-------------- next part --------------
An HTML attachment was scrubbed...
URL: 

From hershil at gmail.com Thu Jul 6 12:25:28 2017
From: hershil at gmail.com (Vikas Kumar)
Date: Thu, 6 Jul 2017 21:55:28 +0530
Subject: [scikit-learn] Which algorithm is used in sklearn SGDClassifier when modified huber loss is used?
In-Reply-To: References: Message-ID: 

The documentation says:

    The loss function to be used. Defaults to 'hinge', which gives a linear SVM. The 'log' loss gives logistic regression, a probabilistic classifier. 'modified_huber' is another smooth loss that brings tolerance to outliers as well as probability estimates.

When we use the 'modified_huber' loss function, which classification algorithm is used? Is it SVM? If so, how come it is able to give probability estimates, which is something it can't do with hinge loss?

Regards,
Vikas

-------------- next part --------------
An HTML attachment was scrubbed...
URL: 

From t3kcit at gmail.com Thu Jul 6 12:31:15 2017
From: t3kcit at gmail.com (Andreas Mueller)
Date: Thu, 6 Jul 2017 12:31:15 -0400
Subject: [scikit-learn] Replacing the Boston Housing Prices dataset
In-Reply-To: References: Message-ID: 

Hi Tony.

I don't think it's a good idea to remove the dataset, given how many tutorials and examples rely on it.
I also don't think it's a good idea to ignore racial discrimination, which I guess this feature is trying to capture. I was recently asked to remove an excerpt from a dataset from my slide, as it was "too racist". It was randomly sampled data from the adult census dataset. Unfortunately, economics in the US are not color blind (yet), and the reality is racist. I haven't done an in-depth analysis on whether this feature is actually informative, but I don't think your analysis is conclusive. Including ethnicity in data actually allows us to ensure "fairness" in certain decision making processes. Without collecting this data, it would be impossible to ensure automatic decisions are not influenced by past human biases. Arguably that's not what the authors of this dataset are doing. Check out http://www.fatml.org/ for more on fairness in machine learning and data science. Cheers, Andy On 07/06/2017 12:05 PM, G Reina wrote: > I'd like to request that the "Boston Housing Prices" dataset in > sklearn (sklearn.datasets.load_boston) be replaced with the "Ames > Housing Prices" dataset > (https://ww2.amstat.org/publications/jse/v19n3/decock.pdf). I am > willing to submit the code change if the developers agree. > > The Boston dataset has the feature "Bk is the proportion of blacks in > town". It is an incredibly racist "feature" to include in any dataset. > I think is beneath us as data scientists. > > I submit that the Ames dataset is a viable alternative for learning > regression. The author has shown that the dataset is a more robust > replacement for Boston. Ames is a 2011 regression dataset on housing > prices and has more than 5 times the amount of training examples with > over 7 times as many features (none of which are morally questionable). > > I welcome the community's thoughts on the matter. > > Thanks. 
> -Tony > > Here's an article I wrote on the Boston dataset: > https://www.linkedin.com/pulse/hidden-racism-data-science-g-anthony-reina?trk=v-feed&lipi=urn%3Ali%3Apage%3Ad_flagship3_feed%3Bmu67f2GSzj5xHMpSD6M00A%3D%3D > > > > _______________________________________________ > scikit-learn mailing list > scikit-learn at python.org > https://mail.python.org/mailman/listinfo/scikit-learn -------------- next part -------------- An HTML attachment was scrubbed... URL: From shane.grigsby at colorado.edu Thu Jul 6 12:32:57 2017 From: shane.grigsby at colorado.edu (Shane Grigsby) Date: Thu, 6 Jul 2017 10:32:57 -0600 Subject: [scikit-learn] Agglomerative Clustering without knowing number of clusters In-Reply-To: References: Message-ID: <20170706163257.zgvwnoih5zjb73io@MacBook-Pro-3.local> This sounds like it may be a problem more amenable to either DBSCAN or OPTICS. Both algorithms don't require a priori knowledge of the number of clusters, and both let you specify a minimum point membership threshold for cluster membership. The OPTICS algorithm will also produce a dendrogram that you can cut for sub clusters if need be. DBSCAN is part of the stable release and has been for some time; OPTICS is pending as a pull request, but it's stable and you can try it if you like: https://github.com/scikit-learn/scikit-learn/pull/1984 Cheers, Shane On 06/30, Ariani A wrote: >I want to perform agglomerative clustering, but I have no idea of number of >clusters before hand. But I want that every cluster has at least 40 data >points in it. How can I apply this to sklearn.agglomerative clustering? >Should I use dendrogram and cut it somehow? I have no idea how to relate >dendrogram to this and cutting it out. Any help will be appreciated! 
>_______________________________________________ >scikit-learn mailing list >scikit-learn at python.org >https://mail.python.org/mailman/listinfo/scikit-learn -- *PhD candidate & Research Assistant* *Cooperative Institute for Research in Environmental Sciences (CIRES)* *University of Colorado at Boulder* From b.noushin7 at gmail.com Thu Jul 6 12:39:05 2017 From: b.noushin7 at gmail.com (Ariani A) Date: Thu, 6 Jul 2017 12:39:05 -0400 Subject: [scikit-learn] Agglomerative Clustering without knowing number of clusters In-Reply-To: <20170706163257.zgvwnoih5zjb73io@MacBook-Pro-3.local> References: <20170706163257.zgvwnoih5zjb73io@MacBook-Pro-3.local> Message-ID: Dear Shane, Thanks for your time. But I have to implement it by agglomerative clustering and cut it when each cluster has at least 40 data points. But I am not sure how to do cut it. I was guessing maybe it can be done by cutting the dandrogram? Is it correct? If so, I do not know how to apply it. Could you give me a point? Best, Ariani On Thu, Jul 6, 2017 at 12:32 PM, Shane Grigsby wrote: > This sounds like it may be a problem more amenable to either DBSCAN or > OPTICS. Both algorithms don't require a priori knowledge of the number of > clusters, and both let you specify a minimum point membership threshold for > cluster membership. The OPTICS algorithm will also produce a dendrogram > that you can cut for sub clusters if need be. > > DBSCAN is part of the stable release and has been for some time; OPTICS is > pending as a pull request, but it's stable and you can try it if you like: > > https://github.com/scikit-learn/scikit-learn/pull/1984 > > Cheers, > Shane > > > On 06/30, Ariani A wrote: > >> I want to perform agglomerative clustering, but I have no idea of number >> of >> clusters before hand. But I want that every cluster has at least 40 data >> points in it. How can I apply this to sklearn.agglomerative clustering? >> Should I use dendrogram and cut it somehow? 
I have no idea how to relate >> dendrogram to this and cutting it out. Any help will be appreciated! >> > > _______________________________________________ >> scikit-learn mailing list >> scikit-learn at python.org >> https://mail.python.org/mailman/listinfo/scikit-learn >> > > > -- > *PhD candidate & Research Assistant* > *Cooperative Institute for Research in Environmental Sciences (CIRES)* > *University of Colorado at Boulder* > _______________________________________________ > scikit-learn mailing list > scikit-learn at python.org > https://mail.python.org/mailman/listinfo/scikit-learn > -------------- next part -------------- An HTML attachment was scrubbed... URL: From greina at eng.ucsd.edu Thu Jul 6 12:41:19 2017 From: greina at eng.ucsd.edu (G Reina) Date: Thu, 6 Jul 2017 09:41:19 -0700 Subject: [scikit-learn] Replacing the Boston Housing Prices dataset In-Reply-To: References: Message-ID: Wow. I completely disagree. The fact that too many tutorials and examples rely on it is not a reason to keep the dataset. New tutorials are written all the time. And, as sklearn evolves some of the existing tutorials will need to be updated anyway to keep up with the changes. Including "ethnicity" is completely illegal in making business decisions in the United States. For example, credit scoring systems bend over backward to expunge even proxy features that could be highly correlated with race (for example, they can't include neighborhood, but can include entire counties). Let's leave the studying of racism to actual scientists who study racism. Not to toy datasets that we use to teach our students about a completely unrelated matter like regression. -Tony On Thu, Jul 6, 2017 at 9:31 AM, Andreas Mueller wrote: > Hi Tony. > > I don't think it's a good idea to remove the dataset, given how many > tutorials and examples rely on it. > I also don't think it's a good idea to ignore racial discrimination, which > I guess this feature is trying to capture. 
> > I was recently asked to remove an excerpt from a dataset from my slide, as > it was "too racist". It was randomly sampled > data from the adult census dataset. Unfortunately, economics in the US are > not color blind (yet), and the reality is racist. > I haven't done an in-depth analysis on whether this feature is actually > informative, but I don't think your analysis is conclusive. > > Including ethnicity in data actually allows us to ensure "fairness" in > certain decision making processes. > Without collecting this data, it would be impossible to ensure automatic > decisions are not influenced > by past human biases. Arguably that's not what the authors of this dataset > are doing. > > Check out http://www.fatml.org/ for more on fairness in machine learning > and data science. > > Cheers, > Andy > > > > On 07/06/2017 12:05 PM, G Reina wrote: > > I'd like to request that the "Boston Housing Prices" dataset in sklearn > (sklearn.datasets.load_boston) be replaced with the "Ames Housing Prices" > dataset (https://ww2.amstat.org/publications/jse/v19n3/decock.pdf). I am > willing to submit the code change if the developers agree. > > The Boston dataset has the feature "Bk is the proportion of blacks in > town". It is an incredibly racist "feature" to include in any dataset. I > think is beneath us as data scientists. > > I submit that the Ames dataset is a viable alternative for learning > regression. The author has shown that the dataset is a more robust > replacement for Boston. Ames is a 2011 regression dataset on housing prices > and has more than 5 times the amount of training examples with over 7 times > as many features (none of which are morally questionable). > > I welcome the community's thoughts on the matter. > > Thanks. 
> -Tony > > Here's an article I wrote on the Boston dataset: > https://www.linkedin.com/pulse/hidden-racism-data- > science-g-anthony-reina?trk=v-feed&lipi=urn%3Ali%3Apage%3Ad_ > flagship3_feed%3Bmu67f2GSzj5xHMpSD6M00A%3D%3D > > > > _______________________________________________ > scikit-learn mailing listscikit-learn at python.orghttps://mail.python.org/mailman/listinfo/scikit-learn > > > > _______________________________________________ > scikit-learn mailing list > scikit-learn at python.org > https://mail.python.org/mailman/listinfo/scikit-learn > > -------------- next part -------------- An HTML attachment was scrubbed... URL: From andrewholmes82 at icloud.com Thu Jul 6 12:19:49 2017 From: andrewholmes82 at icloud.com (Andrew Holmes) Date: Thu, 06 Jul 2017 17:19:49 +0100 Subject: [scikit-learn] Replacing the Boston Housing Prices dataset In-Reply-To: References: Message-ID: But how do social scientists do research into racism without including ethnicity as a feature in the data? Best wishes Andrew Public Profile > On 6 Jul 2017, at 17:05, G Reina wrote: > > I'd like to request that the "Boston Housing Prices" dataset in sklearn (sklearn.datasets.load_boston) be replaced with the "Ames Housing Prices" dataset (https://ww2.amstat.org/publications/jse/v19n3/decock.pdf ). I am willing to submit the code change if the developers agree. > > The Boston dataset has the feature "Bk is the proportion of blacks in town". It is an incredibly racist "feature" to include in any dataset. I think is beneath us as data scientists. > > I submit that the Ames dataset is a viable alternative for learning regression. The author has shown that the dataset is a more robust replacement for Boston. Ames is a 2011 regression dataset on housing prices and has more than 5 times the amount of training examples with over 7 times as many features (none of which are morally questionable). > > I welcome the community's thoughts on the matter. > > Thanks. 
> -Tony > > Here's an article I wrote on the Boston dataset: > https://www.linkedin.com/pulse/hidden-racism-data-science-g-anthony-reina?trk=v-feed&lipi=urn%3Ali%3Apage%3Ad_flagship3_feed%3Bmu67f2GSzj5xHMpSD6M00A%3D%3D > > _______________________________________________ > scikit-learn mailing list > scikit-learn at python.org > https://mail.python.org/mailman/listinfo/scikit-learn -------------- next part -------------- An HTML attachment was scrubbed... URL: From jeffrey.m.allard at gmail.com Thu Jul 6 13:38:02 2017 From: jeffrey.m.allard at gmail.com (jma) Date: Thu, 6 Jul 2017 13:38:02 -0400 Subject: [scikit-learn] Replacing the Boston Housing Prices dataset In-Reply-To: References: Message-ID: I work in the financial services industry and build machine learning models for marketing applications. We put an enormous effort (multiple layers of oversight and governance) into ensuring that our models are free of bias against protected classes etc. Having data describing race and ethnicity (among others) is extremely important to validate this is indeed the case. Without it, you have no such assurance. On 07/06/2017 12:19 PM, Andrew Holmes wrote: > But how do social scientists do research into racism without including > ethnicity as a feature in the data? > > Best wishes > Andrew > > Public Profile > > >> On 6 Jul 2017, at 17:05, G Reina > > wrote: >> >> I'd like to request that the "Boston Housing Prices" dataset in >> sklearn (sklearn.datasets.load_boston) be replaced with the "Ames >> Housing Prices" dataset >> (https://ww2.amstat.org/publications/jse/v19n3/decock.pdf). I am >> willing to submit the code change if the developers agree. >> >> The Boston dataset has the feature "Bk is the proportion of blacks in >> town". It is an incredibly racist "feature" to include in any >> dataset. I think is beneath us as data scientists. >> >> I submit that the Ames dataset is a viable alternative for learning >> regression. 
The author has shown that the dataset is a more robust >> replacement for Boston. Ames is a 2011 regression dataset on housing >> prices and has more than 5 times the amount of training examples with >> over 7 times as many features (none of which are morally questionable). >> >> I welcome the community's thoughts on the matter. >> >> Thanks. >> -Tony >> >> Here's an article I wrote on the Boston dataset: >> https://www.linkedin.com/pulse/hidden-racism-data-science-g-anthony-reina?trk=v-feed&lipi=urn%3Ali%3Apage%3Ad_flagship3_feed%3Bmu67f2GSzj5xHMpSD6M00A%3D%3D >> >> _______________________________________________ >> scikit-learn mailing list >> scikit-learn at python.org >> https://mail.python.org/mailman/listinfo/scikit-learn > > > > _______________________________________________ > scikit-learn mailing list > scikit-learn at python.org > https://mail.python.org/mailman/listinfo/scikit-learn -------------- next part -------------- An HTML attachment was scrubbed... URL: From t3kcit at gmail.com Thu Jul 6 14:09:10 2017 From: t3kcit at gmail.com (Andreas Mueller) Date: Thu, 6 Jul 2017 14:09:10 -0400 Subject: [scikit-learn] Replacing the Boston Housing Prices dataset In-Reply-To: References: Message-ID: <132fe6c2-a62f-fc72-0a95-c9fac7c440b3@gmail.com> On 07/06/2017 12:41 PM, G Reina wrote: > > The fact that too many tutorials and examples rely on it is not a > reason to keep the dataset. New tutorials are written all the time. > And, as sklearn evolves some of the existing tutorials will need to be > updated anyway to keep up with the changes. No, we try to avoid that as much as possible. Old examples should work for as long as possible, and we actively avoid breaking API unnecessarily. It's one of the core principles of scikit-learn development. And new tutorials can use any dataset they choose. We are working on including an openml fetcher, which allows using more datasets more easily. 
From sean.violante at gmail.com Thu Jul 6 15:08:33 2017 From: sean.violante at gmail.com (Sean Violante) Date: Thu, 6 Jul 2017 21:08:33 +0200 Subject: [scikit-learn] Replacing the Boston Housing Prices dataset In-Reply-To: References: Message-ID: G Reina you make a bizarre argument. You argue that you should not even check racism as a possible factor in house prices? But then you yourself check whether it's relevant. Then you say "but I'd argue that it's more due to the location (near water, near businesses, near restaurants, near parks and recreation) than to the ethnic makeup" Which was basically what the original authors wanted to show too, Harrison, D. and Rubinfeld, D.L. `Hedonic prices and the demand for clean air', J. Environ. Economics & Management, vol.5, 81-102, 1978. but unless you measure ethnic make-up you cannot show that it is not a confounder. The term "white flight" refers to affluent white families moving to the suburbs. And clearly a question is whether/how much was racism or avoiding air pollution. On 6 Jul 2017 6:10 pm, "G Reina" wrote: > I'd like to request that the "Boston Housing Prices" dataset in sklearn > (sklearn.datasets.load_boston) be replaced with the "Ames Housing Prices" > dataset (https://ww2.amstat.org/publications/jse/v19n3/decock.pdf). I am > willing to submit the code change if the developers agree. > > The Boston dataset has the feature "Bk is the proportion of blacks in > town". It is an incredibly racist "feature" to include in any dataset. I > think is beneath us as data scientists. > > I submit that the Ames dataset is a viable alternative for learning > regression. The author has shown that the dataset is a more robust > replacement for Boston. Ames is a 2011 regression dataset on housing prices > and has more than 5 times the amount of training examples with over 7 times > as many features (none of which are morally questionable). > > I welcome the community's thoughts on the matter. > > Thanks. 
> -Tony > > Here's an article I wrote on the Boston dataset: > https://www.linkedin.com/pulse/hidden-racism-data- > science-g-anthony-reina?trk=v-feed&lipi=urn%3Ali%3Apage%3Ad_ > flagship3_feed%3Bmu67f2GSzj5xHMpSD6M00A%3D%3D > > > _______________________________________________ > scikit-learn mailing list > scikit-learn at python.org > https://mail.python.org/mailman/listinfo/scikit-learn > > -------------- next part -------------- An HTML attachment was scrubbed... URL: From jtcunni at gmail.com Thu Jul 6 15:50:42 2017 From: jtcunni at gmail.com (jt cunni) Date: Thu, 6 Jul 2017 14:50:42 -0500 Subject: [scikit-learn] Moving average transformer In-Reply-To: References: Message-ID: First off, I have never contributed to anything before so please have patience with me. I am a data scientist and I have been doing some feature engineering on one of my datasets. In my code, I have a pipeline of several transformers and an estimator. I use my pipeline and randomizedsearchcv to tune my hyper-parameters and my transformer settings. Pretty standard stuff. One thing I was doing was creating a feature that was a moving average of another feature. In a basic example, imagine I want to predict if a team is going to win a baseball game. I create a feature that is the moving average of the last N games of runs scored per game (this is the window size of the moving average). Not knowing what the best window size for the moving average would be, I created a custom transformer that could be put in a pipeline to find the window size that provides the most lift. Is there any interest for this type of contribution? If so, what unittests or anything else do I need to provide? Thanks, Jeremy -------------- next part -------------- An HTML attachment was scrubbed... 
URL: From rth.yurchak at gmail.com Thu Jul 6 15:59:48 2017 From: rth.yurchak at gmail.com (Roman Yurchak) Date: Thu, 6 Jul 2017 22:59:48 +0300 Subject: [scikit-learn] Construct the microclusters using a CF-Tree In-Reply-To: References: <75468a69-ba3a-ca8a-7b1c-b477f7d6f08e@gmail.com> Message-ID: Hello Sema, On 05/07/17 13:27, Sema Atasever wrote: > How can i know which cluster member represents best each cluster? You could try to pick the one that's closest to the cluster centroid.. > In the birch code i use this code line: *centroids = > brc.subcluster_centers_* > How do I interpret this line of code output? It is supposed to give you the centroid of each leaf node (computed in https://github.com/scikit-learn/scikit-learn/blob/ab93d65/sklearn/cluster/birch.py#L472). I would just recompute the centroid from the labels, though, with X[brc.labels_==k, :].mean(axis=0) for k in np.unique(brc.labels_) to be sure of the results... -- Roman From jmschreiber91 at gmail.com Thu Jul 6 16:03:41 2017 From: jmschreiber91 at gmail.com (Jacob Schreiber) Date: Thu, 6 Jul 2017 13:03:41 -0700 Subject: [scikit-learn] Moving average transformer In-Reply-To: References: Message-ID: Hi Jeremy! Thanks for your offer to contribute. We're always looking for people to add good ideas to the package. Time series data can be tricky to handle appropriately, and so I think we generally try to pass it off to more specialized packages that focus on that. Andreas may have a more detailed perspective on this though. Jacob On Thu, Jul 6, 2017 at 12:50 PM, jt cunni wrote: > First off, I have never contributed to anything before so please have > patience with me. I am a data scientist and I have been working with doing > some feature engineering on one of my datasets. In my code, I have a > pipeline of several transformers and an estimator. I use my pipeline > and randomizedsearchcv to tune my hyper-parameters and my transformer > settings. Pretty standard stuff. 
One thing I was doing was creating a > feature that was a moving average of another feature. In a basic example, > imagine I want to predict if a team is going to win a baseball game. I > create a feature that is the moving average of the last N games of runs > scored per game (this is the window size of the moving average). Not > knowing what the best window size for the moving average, I created a > custom transformer that could be put in a pipeline to find the window size > that provides the most lift. Is there any interest for this type of > contribution? If so, what unittests or anything else do I need to provide? > > > > Thanks, > > Jeremy > > _______________________________________________ > scikit-learn mailing list > scikit-learn at python.org > https://mail.python.org/mailman/listinfo/scikit-learn > > -------------- next part -------------- An HTML attachment was scrubbed... URL: From jmschreiber91 at gmail.com Thu Jul 6 16:34:51 2017 From: jmschreiber91 at gmail.com (Jacob Schreiber) Date: Thu, 6 Jul 2017 13:34:51 -0700 Subject: [scikit-learn] Replacing the Boston Housing Prices dataset In-Reply-To: References: Message-ID: Hi Tony As others have pointed out, I think that you may be misunderstanding the purpose of that "feature." We are in agreement that discrimination against protected classes is not OK, and that even outside complying with the law one should avoid discrimination, in model building or elsewhere. However, I disagree that one does this by eliminating from all datasets any feature that may allude to these protected classes. As Andreas pointed out, there is a growing effort to ensure that machine learning models are fair and benefit the common good (such as FATML, DSSG, etc..), and from my understanding the general consensus isn't necessarily that simply eliminating the feature is sufficient. 
I think we are in agreement that naively learning a model over a feature set containing questionable features and calling it a day is not okay, but as others have pointed out, having these features present and handling them appropriately can help guard against the model implicitly learning unfair biases (even if they are not explicitly exposed to the feature). I would welcome the addition of the Ames dataset to the ones supported by sklearn, but I'm not convinced that the Boston dataset should be removed. As Andreas pointed out, there is a benefit to having canonical examples present so that beginners can easily follow along with the many tutorials that have been written using them. As Sean points out, the paper itself is trying to pull out the connection between house price and clean air in the presence of possible confounding variables. In a more general sense, saying that a feature shouldn't be there because a simple linear regression is unaffected by the results is a bit odd because it is very common for datasets to include irrelevant features, and handling them appropriately is important. In addition, one could argue that having this type of issue arise in a toy dataset has a benefit because it exposes these types of issues to those learning data science earlier on and allows them to keep these issues in mind in the future when the data is more serious. It is important for us all to keep issues of fairness in mind when it comes to data science. I'm glad that you're speaking out in favor of fairness and trying to bring attention to it. Jacob On Thu, Jul 6, 2017 at 12:08 PM, Sean Violante wrote: > G Reina > you make a bizarre argument. You argue that you should not even check > racism as a possible factor in house prices? 
> > But then you yourself check whether its relevant > Then you say > > "but I'd argue that it's more due to the location (near water, near > businesses, near restaurants, near parks and recreation) than to the ethnic > makeup" > > Which was basically what the original authors wanted to show too, > > Harrison, D. and Rubinfeld, D.L. `Hedonic prices and the demand for clean > air', J. Environ. Economics & Management, vol.5, 81-102, 1978. > > but unless you measure ethnic make-up you cannot show that it is not a > confounder. > > The term "white flight" refers to affluent white families moving to the > suburbs.. And clearly a question is whether/how much was racism or avoiding > air pollution. > > > > > > On 6 Jul 2017 6:10 pm, "G Reina" wrote: > >> I'd like to request that the "Boston Housing Prices" dataset in sklearn >> (sklearn.datasets.load_boston) be replaced with the "Ames Housing Prices" >> dataset (https://ww2.amstat.org/publications/jse/v19n3/decock.pdf). I am >> willing to submit the code change if the developers agree. >> >> The Boston dataset has the feature "Bk is the proportion of blacks in >> town". It is an incredibly racist "feature" to include in any dataset. I >> think is beneath us as data scientists. >> >> I submit that the Ames dataset is a viable alternative for learning >> regression. The author has shown that the dataset is a more robust >> replacement for Boston. Ames is a 2011 regression dataset on housing prices >> and has more than 5 times the amount of training examples with over 7 times >> as many features (none of which are morally questionable). >> >> I welcome the community's thoughts on the matter. >> >> Thanks. 
>> -Tony >> >> Here's an article I wrote on the Boston dataset: >> https://www.linkedin.com/pulse/hidden-racism-data-science-g- >> anthony-reina?trk=v-feed&lipi=urn%3Ali%3Apage%3Ad_flagship3_ >> feed%3Bmu67f2GSzj5xHMpSD6M00A%3D%3D >> >> >> _______________________________________________ >> scikit-learn mailing list >> scikit-learn at python.org >> https://mail.python.org/mailman/listinfo/scikit-learn >> >> > _______________________________________________ > scikit-learn mailing list > scikit-learn at python.org > https://mail.python.org/mailman/listinfo/scikit-learn > > -------------- next part -------------- An HTML attachment was scrubbed... URL: From joel.nothman at gmail.com Thu Jul 6 18:33:42 2017 From: joel.nothman at gmail.com (Joel Nothman) Date: Fri, 7 Jul 2017 08:33:42 +1000 Subject: [scikit-learn] Moving average transformer In-Reply-To: References: Message-ID: I agree that this is best handled with a custom transformer, for the reasons cited by Jacob, but also because it sounds like this transformer does not gather statistics from the training data, and so can be implemented with FunctionTransformer On 7 Jul 2017 6:10 am, "Jacob Schreiber" wrote: Hi Jeremy! Thanks for your offer to contribute. We're always looking for people to add good ideas to the package. Time series data can be tricky to handle appropriately, and so I think we generally try to pass it off to more specialized packages that focus on that. Andreas may have a more detailed perspective on this though. Jacob On Thu, Jul 6, 2017 at 12:50 PM, jt cunni wrote: > First off, I have never contributed to anything before so please have > patience with me. I am a data scientist and I have been working with doing > some feature engineering on one of my datasets. In my code, I have a > pipeline of several transformers and an estimator. I use my pipeline > and randomizedsearchcv to tune my hyper-parameters and my transformer > settings. Pretty standard stuff. 
One thing I was doing was creating a > feature that was a moving average of another feature. In a basic example, > imagine I want to predict if a team is going to win a baseball game. I > create a feature that is the moving average of the last N games of runs > scored per game (this is the window size of the moving average). Not > knowing what the best window size for the moving average, I created a > custom transformer that could be put in a pipeline to find the window size > that provides the most lift. Is there any interest for this type of > contribution? If so, what unittests or anything else do I need to provide? > > > > Thanks, > > Jeremy > > _______________________________________________ > scikit-learn mailing list > scikit-learn at python.org > https://mail.python.org/mailman/listinfo/scikit-learn > > _______________________________________________ scikit-learn mailing list scikit-learn at python.org https://mail.python.org/mailman/listinfo/scikit-learn -------------- next part -------------- An HTML attachment was scrubbed... URL: From jni.soma at gmail.com Thu Jul 6 19:36:41 2017 From: jni.soma at gmail.com (Juan Nunez-Iglesias) Date: Fri, 7 Jul 2017 09:36:41 +1000 Subject: [scikit-learn] Replacing the Boston Housing Prices dataset In-Reply-To: References: Message-ID: For what it's worth: I'm sympathetic to the argument that you can't fix the problem if you don't measure it, but I agree with Tony that "many tutorials use it" is an extremely weak argument. We removed Lena from scikit-image because it was the right thing to do. I very much doubt that Boston house prices is in more widespread use than Lena was in image processing. You can argue about whether or not it's morally right or wrong to include the dataset. I see merit to both arguments. But "too many tutorials use it" is very similar in flavour to "the economy of the South would collapse without slavery." 
Regarding fair uses of the feature, I would hope that all sklearn tutorials using the dataset mention such uses. The potential for abuse and misinterpretation is enormous. On 7 Jul 2017, 6:36 AM +1000, Jacob Schreiber , wrote: > Hi Tony > > As others have pointed out, I think that you may be misunderstanding the purpose of that "feature." We are in agreement that discrimination against protected classes is not OK, and that even outside complying with the law one should avoid discrimination, in model building or elsewhere. However, I disagree that one does this by eliminating from all datasets any feature that may allude to these protected classes. As Andreas pointed out, there is a growing effort to ensure that machine learning models are fair and benefit the common good (such as FATML, DSSG, etc..), and from my understanding the general consensus isn't necessarily that simply eliminating the feature is sufficient. I think we are in agreement that naively learning a model over a feature set containing questionable features and calling it a day is not okay, but as others have pointed out, having these features present and handling them appropriately can help guard against the model implicitly learning unfair biases (even if they are not explicitly exposed to the feature). > > I would welcome the addition of the Ames dataset to the ones supported by sklearn, but I'm not convinced that the Boston dataset should be removed. As Andreas pointed out, there is a benefit to having canonical examples present so that beginners can easily follow along with the many tutorials that have been written using them. As Sean points out, the paper itself is trying to pull out the connection between house price and clean air in the presence of possible confounding variables. 
In a more general sense, saying that a feature shouldn't be there because a simple linear regression is unaffected by the results is a bit odd because it is very common for datasets to include irrelevant features, and handling them appropriately is important. In addition, one could argue that having this type of issue arise in a toy dataset has a benefit because it exposes these types of issues to those learning data science earlier on and allows them to keep these issues in mind in the future when the data is more serious. > > It is important for us all to keep issues of fairness in mind when it comes to data science. I'm glad that you're speaking out in favor of fairness and trying to bring attention to it. > > Jacob > > > On Thu, Jul 6, 2017 at 12:08 PM, Sean Violante wrote: > > > G Reina > > > you make a bizarre argument. You argue that you should not even check racism as a possible factor in house prices? > > > > > > But then you yourself check whether its relevant > > > Then you say > > > > > > "but I'd argue that it's more due to the location (near water, near businesses, near restaurants, near parks and recreation) than to the ethnic makeup" > > > > > > Which was basically what the original authors wanted to show too, > > > > > > Harrison, D. and Rubinfeld, D.L. `Hedonic prices and the demand for clean air', J. Environ. Economics & Management, vol.5, 81-102, 1978. > > > > > > but unless you measure ethnic make-up you cannot show that it is not a confounder. > > > > > > The term "white flight" refers to affluent white families moving to the suburbs.. And clearly a question is whether/how much was racism or avoiding air pollution. > > > > > > > > > > > > > > > > > > > On 6 Jul 2017 6:10 pm, "G Reina" wrote: > > > > > I'd like to request that the "Boston Housing Prices" dataset in sklearn (sklearn.datasets.load_boston) be replaced with the "Ames Housing Prices" dataset (https://ww2.amstat.org/publications/jse/v19n3/decock.pdf). 
I am willing to submit the code change if the developers agree. > > > > > > > > > > The Boston dataset has the feature "Bk is the proportion of blacks in town". It is an incredibly racist "feature" to include in any dataset. I think is beneath us as data scientists. > > > > > > > > > > I submit that the Ames dataset is a viable alternative for learning regression. The author has shown that the dataset is a more robust replacement for Boston. Ames is a 2011 regression dataset on housing prices and has more than 5 times the amount of training examples with over 7 times as many features (none of which are morally questionable). > > > > > > > > > > I welcome the community's thoughts on the matter. > > > > > > > > > > Thanks. > > > > > -Tony > > > > > > > > > > Here's an article I wrote on the Boston dataset: > > > > > https://www.linkedin.com/pulse/hidden-racism-data-science-g-anthony-reina?trk=v-feed&lipi=urn%3Ali%3Apage%3Ad_flagship3_feed%3Bmu67f2GSzj5xHMpSD6M00A%3D%3D > > > > > > > > > > > > > > > _______________________________________________ > > > > > scikit-learn mailing list > > > > > scikit-learn at python.org > > > > > https://mail.python.org/mailman/listinfo/scikit-learn > > > > > > > > > > > _______________________________________________ > > > scikit-learn mailing list > > > scikit-learn at python.org > > > https://mail.python.org/mailman/listinfo/scikit-learn > > > > > _______________________________________________ > scikit-learn mailing list > scikit-learn at python.org > https://mail.python.org/mailman/listinfo/scikit-learn -------------- next part -------------- An HTML attachment was scrubbed... URL: From se.raschka at gmail.com Thu Jul 6 20:39:13 2017 From: se.raschka at gmail.com (Sebastian Raschka) Date: Thu, 6 Jul 2017 20:39:13 -0400 Subject: [scikit-learn] Replacing the Boston Housing Prices dataset In-Reply-To: References: Message-ID: <61B34F59-142E-4851-9B27-7DC2A0C2DAF8@gmail.com> I think there can be some middle ground. 
I.e., adding a new, simple dataset to demonstrate regression (maybe auto-mpg, wine quality, or sth like that) and use that for the scikit-learn examples in the main documentation etc. but leave the Boston dataset in the code base for now. Whether it's a weak argument or not, it would be quite destructive to remove the dataset altogether in the next version or so, not only because old tutorials use it but many unit tests in many different projects depend on it. I think it might be better to phase it out by having a good alternative first, and I am sure that the scikit-learn maintainers wouldn't have anything against it if someone would update the examples/tutorials with the use of different datasets Best, Sebastian > On Jul 6, 2017, at 7:36 PM, Juan Nunez-Iglesias wrote: > > For what it's worth: I'm sympathetic to the argument that you can't fix the problem if you don't measure it, but I agree with Tony that "many tutorials use it" is an extremely weak argument. We removed Lena from scikit-image because it was the right thing to do. I very much doubt that Boston house prices is in more widespread use than Lena was in image processing. > > You can argue about whether or not it's morally right or wrong to include the dataset. I see merit to both arguments. But "too many tutorials use it" is very similar in flavour to "the economy of the South would collapse without slavery." > > Regarding fair uses of the feature, I would hope that all sklearn tutorials using the dataset mention such uses. The potential for abuse and misinterpretation is enormous. > > On 7 Jul 2017, 6:36 AM +1000, Jacob Schreiber , wrote: >> Hi Tony >> >> As others have pointed out, I think that you may be misunderstanding the purpose of that "feature." We are in agreement that discrimination against protected classes is not OK, and that even outside complying with the law one should avoid discrimination, in model building or elsewhere. 
However, I disagree that one does this by eliminating from all datasets any feature that may allude to these protected classes. As Andreas pointed out, there is a growing effort to ensure that machine learning models are fair and benefit the common good (such as FATML, DSSG, etc..), and from my understanding the general consensus isn't necessarily that simply eliminating the feature is sufficient. I think we are in agreement that naively learning a model over a feature set containing questionable features and calling it a day is not okay, but as others have pointed out, having these features present and handling them appropriately can help guard against the model implicitly learning unfair biases (even if they are not explicitly exposed to the feature). >> >> I would welcome the addition of the Ames dataset to the ones supported by sklearn, but I'm not convinced that the Boston dataset should be removed. As Andreas pointed out, there is a benefit to having canonical examples present so that beginners can easily follow along with the many tutorials that have been written using them. As Sean points out, the paper itself is trying to pull out the connection between house price and clean air in the presence of possible confounding variables. In a more general sense, saying that a feature shouldn't be there because a simple linear regression is unaffected by the results is a bit odd because it is very common for datasets to include irrelevant features, and handling them appropriately is important. In addition, one could argue that having this type of issue arise in a toy dataset has a benefit because it exposes these types of issues to those learning data science earlier on and allows them to keep these issues in mind in the future when the data is more serious. >> >> It is important for us all to keep issues of fairness in mind when it comes to data science. I'm glad that you're speaking out in favor of fairness and trying to bring attention to it. 
>> >> Jacob >> >> On Thu, Jul 6, 2017 at 12:08 PM, Sean Violante wrote: >> G Reina >> you make a bizarre argument. You argue that you should not even check racism as a possible factor in house prices? >> >> But then you yourself check whether its relevant >> Then you say >> >> "but I'd argue that it's more due to the location (near water, near businesses, near restaurants, near parks and recreation) than to the ethnic makeup" >> >> Which was basically what the original authors wanted to show too, >> >> Harrison, D. and Rubinfeld, D.L. `Hedonic prices and the demand for clean air', J. Environ. Economics & Management, vol.5, 81-102, 1978. >> >> but unless you measure ethnic make-up you cannot show that it is not a confounder. >> >> The term "white flight" refers to affluent white families moving to the suburbs.. And clearly a question is whether/how much was racism or avoiding air pollution. >> >> >> >> >> >> On 6 Jul 2017 6:10 pm, "G Reina" wrote: >> I'd like to request that the "Boston Housing Prices" dataset in sklearn (sklearn.datasets.load_boston) be replaced with the "Ames Housing Prices" dataset (https://ww2.amstat.org/publications/jse/v19n3/decock.pdf). I am willing to submit the code change if the developers agree. >> >> The Boston dataset has the feature "Bk is the proportion of blacks in town". It is an incredibly racist "feature" to include in any dataset. I think is beneath us as data scientists. >> >> I submit that the Ames dataset is a viable alternative for learning regression. The author has shown that the dataset is a more robust replacement for Boston. Ames is a 2011 regression dataset on housing prices and has more than 5 times the amount of training examples with over 7 times as many features (none of which are morally questionable). >> >> I welcome the community's thoughts on the matter. >> >> Thanks. 
>> -Tony >> >> Here's an article I wrote on the Boston dataset: >> https://www.linkedin.com/pulse/hidden-racism-data-science-g-anthony-reina?trk=v-feed&lipi=urn%3Ali%3Apage%3Ad_flagship3_feed%3Bmu67f2GSzj5xHMpSD6M00A%3D%3D >> >> >> _______________________________________________ >> scikit-learn mailing list >> scikit-learn at python.org >> https://mail.python.org/mailman/listinfo/scikit-learn >> >> >> _______________________________________________ >> scikit-learn mailing list >> scikit-learn at python.org >> https://mail.python.org/mailman/listinfo/scikit-learn >> >> >> _______________________________________________ >> scikit-learn mailing list >> scikit-learn at python.org >> https://mail.python.org/mailman/listinfo/scikit-learn > _______________________________________________ > scikit-learn mailing list > scikit-learn at python.org > https://mail.python.org/mailman/listinfo/scikit-learn From ross at cgl.ucsf.edu Thu Jul 6 21:00:49 2017 From: ross at cgl.ucsf.edu (Bill Ross) Date: Thu, 6 Jul 2017 18:00:49 -0700 Subject: [scikit-learn] Replacing the Boston Housing Prices dataset In-Reply-To: <61B34F59-142E-4851-9B27-7DC2A0C2DAF8@gmail.com> References: <61B34F59-142E-4851-9B27-7DC2A0C2DAF8@gmail.com> Message-ID: <32b9ea32-b5dc-dfbe-04ca-36e8db30160e@cgl.ucsf.edu> Unless the data concretely promotes discrimination, it seems discriminatory to exclude it. Bill On 7/6/17 5:39 PM, Sebastian Raschka wrote: > I think there can be some middle ground. I.e., adding a new, simple dataset to demonstrate regression (maybe autmpg, wine quality, or sth like that) and use that for the scikit-learn examples in the main documentation etc but leave the boston dataset in the code base for now. Whether it's a weak argument or not, it would be quite destructive to remove the dataset altogether in the next version or so, not only because old tutorials use it but many unit tests in many different projects depend on it. 
I think it might be better to phase it out by having a good alternative first, and I am sure that the scikit-learn maintainers wouldn't have anything against it if someone would update the examples/tutorials with the use of different datasets > > Best, > Sebastian > >> On Jul 6, 2017, at 7:36 PM, Juan Nunez-Iglesias wrote: >> >> For what it's worth: I'm sympathetic to the argument that you can't fix the problem if you don't measure it, but I agree with Tony that "many tutorials use it" is an extremely weak argument. We removed Lena from scikit-image because it was the right thing to do. I very much doubt that Boston house prices is in more widespread use than Lena was in image processing. >> >> You can argue about whether or not it's morally right or wrong to include the dataset. I see merit to both arguments. But "too many tutorials use it" is very similar in flavour to "the economy of the South would collapse without slavery." >> >> Regarding fair uses of the feature, I would hope that all sklearn tutorials using the dataset mention such uses. The potential for abuse and misinterpretation is enormous. >> >> On 7 Jul 2017, 6:36 AM +1000, Jacob Schreiber , wrote: >>> Hi Tony >>> >>> As others have pointed out, I think that you may be misunderstanding the purpose of that "feature." We are in agreement that discrimination against protected classes is not OK, and that even outside complying with the law one should avoid discrimination, in model building or elsewhere. However, I disagree that one does this by eliminating from all datasets any feature that may allude to these protected classes. As Andreas pointed out, there is a growing effort to ensure that machine learning models are fair and benefit the common good (such as FATML, DSSG, etc..), and from my understanding the general consensus isn't necessarily that simply eliminating the feature is sufficient. 
I think we are in agreement that naively learning a model over a feature set containing questionable features and calling it a day is not okay, but as others have pointed out, having these features present and handling them appropriately can help guard against the model implicitly learning unfair biases (even if they are not explicitly exposed to the feature). >>> I would welcome the addition of the Ames dataset to the ones supported by sklearn, but I'm not convinced that the Boston dataset should be removed. As Andreas pointed out, there is a benefit to having canonical examples present so that beginners can easily follow along with the many tutorials that have been written using them. As Sean points out, the paper itself is trying to pull out the connection between house price and clean air in the presence of possible confounding variables. In a more general sense, saying that a feature shouldn't be there because a simple linear regression is unaffected by the results is a bit odd because it is very common for datasets to include irrelevant features, and handling them appropriately is important. In addition, one could argue that having this type of issue arise in a toy dataset has a benefit because it exposes these types of issues to those learning data science earlier on and allows them to keep these issues in mind in the future when the data is more serious. >>> It is important for us all to keep issues of fairness in mind when it comes to data science. I'm glad that you're speaking out in favor of fairness and trying to bring attention to it. >>> >>> Jacob >>> >>> On Thu, Jul 6, 2017 at 12:08 PM, Sean Violante wrote: >>> G Reina >>> you make a bizarre argument. You argue that you should not even check racism as a possible factor in house prices? 
>>> >>> But then you yourself check whether its relevant >>> Then you say >>> >>> "but I'd argue that it's more due to the location (near water, near businesses, near restaurants, near parks and recreation) than to the ethnic makeup" >>> >>> Which was basically what the original authors wanted to show too, >>> >>> Harrison, D. and Rubinfeld, D.L. `Hedonic prices and the demand for clean air', J. Environ. Economics & Management, vol.5, 81-102, 1978. >>> >>> but unless you measure ethnic make-up you cannot show that it is not a confounder. >>> >>> The term "white flight" refers to affluent white families moving to the suburbs.. And clearly a question is whether/how much was racism or avoiding air pollution. >>> >>> >>> >>> >>> >>> On 6 Jul 2017 6:10 pm, "G Reina" wrote: >>> I'd like to request that the "Boston Housing Prices" dataset in sklearn (sklearn.datasets.load_boston) be replaced with the "Ames Housing Prices" dataset (https://ww2.amstat.org/publications/jse/v19n3/decock.pdf). I am willing to submit the code change if the developers agree. >>> >>> The Boston dataset has the feature "Bk is the proportion of blacks in town". It is an incredibly racist "feature" to include in any dataset. I think is beneath us as data scientists. >>> >>> I submit that the Ames dataset is a viable alternative for learning regression. The author has shown that the dataset is a more robust replacement for Boston. Ames is a 2011 regression dataset on housing prices and has more than 5 times the amount of training examples with over 7 times as many features (none of which are morally questionable). >>> >>> I welcome the community's thoughts on the matter. >>> >>> Thanks. 
>>> -Tony >>> >>> Here's an article I wrote on the Boston dataset: >>> https://www.linkedin.com/pulse/hidden-racism-data-science-g-anthony-reina?trk=v-feed&lipi=urn%3Ali%3Apage%3Ad_flagship3_feed%3Bmu67f2GSzj5xHMpSD6M00A%3D%3D >>> >>> >>> _______________________________________________ >>> scikit-learn mailing list >>> scikit-learn at python.org >>> https://mail.python.org/mailman/listinfo/scikit-learn >>> >>> >>> _______________________________________________ >>> scikit-learn mailing list >>> scikit-learn at python.org >>> https://mail.python.org/mailman/listinfo/scikit-learn >>> >>> >>> _______________________________________________ >>> scikit-learn mailing list >>> scikit-learn at python.org >>> https://mail.python.org/mailman/listinfo/scikit-learn >> _______________________________________________ >> scikit-learn mailing list >> scikit-learn at python.org >> https://mail.python.org/mailman/listinfo/scikit-learn > _______________________________________________ > scikit-learn mailing list > scikit-learn at python.org > https://mail.python.org/mailman/listinfo/scikit-learn From gael.varoquaux at normalesup.org Fri Jul 7 01:35:05 2017 From: gael.varoquaux at normalesup.org (Gael Varoquaux) Date: Fri, 7 Jul 2017 07:35:05 +0200 Subject: [scikit-learn] Replacing the Boston Housing Prices dataset In-Reply-To: References: Message-ID: <20170707053505.GR2257694@phare.normalesup.org> Many people gave great points in this thread, in particular Jacob's well written email. Andy's point about tutorials is an important one. I don't resonate at all with Juan's message. Breaking people's code, even if it is the notes that they use to give a lecture, is a real cost for them. The cost varies on a case to case basis. But there are still books printed out there that demo image processing on Lena, and these will be out for decades. 
More importantly, the replacement of Lena used in scipy (the raccoon) does not make it possible to demonstrate denoising properly (Lena has smooth regions with details in the middle: the eyes), or segmentation. In effect, it has made the examples for the ecosystem less convincing. Of course, by definition, refusing to change anything implies that unfortunate situations, such as discriminatory biases, cannot be fixed. This is why changes should be considered on a case-to-case basis. The problem that we are facing here is that a dataset about society, the Boston housing dataset, can reveal discrimination. However, this is true of all data about society. The classic adult data (extracted from the American census) easily reveals income discrimination. I teach statistics with an IQ dataset where it is easy to show a male vs female IQ difference. This difference disappears after controlling for education (and the purpose of my course is to teach people to control for confounding effects). Data about society reveals its inequalities. Not working on such data is hiding problems, not fixing them. It is true that misuse of such data can attempt to establish inequalities as facts of life and get them accepted. When discussing these issues, we need to educate people about how to run and interpret analyses. No, the Boston data will not go. No, it is not a good thing to pretend that social problems do not exist. Gaël On Fri, Jul 07, 2017 at 09:36:41AM +1000, Juan Nunez-Iglesias wrote: > For what it's worth: I'm sympathetic to the argument that you can't fix the > problem if you don't measure it, but I agree with Tony that "many tutorials use > it" is an extremely weak argument. We removed Lena from scikit-image because it > was the right thing to do. I very much doubt that Boston house prices is in > more widespread use than Lena was in image processing. > You can argue about whether or not it's morally right or wrong to include the > dataset. I see merit to both arguments. 
But "too many tutorials use it" is very > similar in flavour to "the economy of the South would collapse without > slavery." > Regarding fair uses of the feature, I would hope that all sklearn tutorials > using the dataset mention such uses. The potential for abuse and > misinterpretation is enormous. > On 7 Jul 2017, 6:36 AM +1000, Jacob Schreiber , wrote: > Hi Tony > As others have pointed out, I think that you may be misunderstanding the > purpose of that "feature." We are in agreement that discrimination against > protected classes is not OK, and that even outside complying with the law > one should avoid discrimination, in model building or elsewhere. However, I > disagree that one does this by eliminating from all datasets any feature > that may allude to these protected classes. As Andreas pointed out, there > is a growing effort to ensure that machine learning models are fair and > benefit the common good (such as FATML, DSSG, etc..), and from my > understanding the general consensus isn't necessarily that simply > eliminating the feature is sufficient. I think we are in agreement that > naively learning a model over a feature set containing questionable > features and calling it a day is not okay, but as others have pointed out, > having these features present and handling them appropriately can help > guard against the model implicitly learning unfair biases (even if they are > not explicitly exposed to the feature). > I would welcome the addition of the Ames dataset to the ones supported by > sklearn, but I'm not convinced that the Boston dataset should be removed. > As Andreas pointed out, there is a benefit to having canonical examples > present so that beginners can easily follow along with the many tutorials > that have been written using them. As Sean points out, the paper itself is > trying to pull out the connection between house price and clean air in the > presence of possible confounding variables. 
In a more general sense, saying > that a feature shouldn't be there because a simple linear regression is > unaffected by the results is a bit odd because it is very common for > datasets to include irrelevant features, and handling them appropriately is > important. In addition, one could argue that having this type of issue > arise in a toy dataset has a benefit because it exposes these types of > issues to those learning data science earlier on and allows them to keep > these issues in mind in the future when the data is more serious. > It is important for us all to keep issues of fairness in mind when it comes > to data science. I'm glad that you're speaking out in favor of fairness and > trying to bring attention to it. > Jacob > On Thu, Jul 6, 2017 at 12:08 PM, Sean Violante > wrote: > G Reina > you make a bizarre argument. You argue that you should not even check > racism as a possible factor in house prices? > But then you yourself check whether its relevant > Then you say > "but I'd argue that it's more due to the location (near water, near > businesses, near restaurants, near parks and recreation) than to the > ethnic makeup" > Which was basically what the original authors wanted to show too, > Harrison, D. and Rubinfeld, D.L. `Hedonic prices and the demand for > clean air', J. Environ. Economics & Management, vol.5, 81-102, 1978. > but unless you measure ethnic make-up you cannot show that it is not a > confounder. > The term "white flight" refers to affluent white families moving to the > suburbs.. And clearly a question is whether/how much was racism or > avoiding air pollution. > On 6 Jul 2017 6:10 pm, "G Reina" wrote: > I'd like to request that the "Boston Housing Prices" dataset in > sklearn (sklearn.datasets.load_boston) be replaced with the "Ames > Housing Prices" dataset (https://ww2.amstat.org/publications/jse/ > v19n3/decock.pdf). I am willing to submit the code change if the > developers agree. 
> The Boston dataset has the feature "Bk is the proportion of blacks > in town". It is an incredibly racist "feature" to include in any > dataset. I think is beneath us as data scientists. > I submit that the Ames dataset is a viable alternative for learning > regression. The author has shown that the dataset is a more robust > replacement for Boston. Ames is a 2011 regression dataset on > housing prices and has more than 5 times the amount of training > examples with over 7 times as many features (none of which are > morally questionable). > I welcome the community's thoughts on the matter. > Thanks. > -Tony > Here's an article I wrote on the Boston dataset: > https://www.linkedin.com/pulse/hidden-racism-data-science-g- > anthony-reina?trk=v-feed&lipi=urn%3Ali%3Apage%3Ad_flagship3_ > feed%3Bmu67f2GSzj5xHMpSD6M00A%3D%3D > _______________________________________________ > scikit-learn mailing list > scikit-learn at python.org > https://mail.python.org/mailman/listinfo/scikit-learn > _______________________________________________ > scikit-learn mailing list > scikit-learn at python.org > https://mail.python.org/mailman/listinfo/scikit-learn > _______________________________________________ > scikit-learn mailing list > scikit-learn at python.org > https://mail.python.org/mailman/listinfo/scikit-learn > _______________________________________________ > scikit-learn mailing list > scikit-learn at python.org > https://mail.python.org/mailman/listinfo/scikit-learn -- Gael Varoquaux Researcher, INRIA Parietal NeuroSpin/CEA Saclay , Bat 145, 91191 Gif-sur-Yvette France Phone: ++ 33-1-69-08-79-68 http://gael-varoquaux.info http://twitter.com/GaelVaroquaux From olivier.grisel at ensta.org Fri Jul 7 09:24:32 2017 From: olivier.grisel at ensta.org (Olivier Grisel) Date: Fri, 7 Jul 2017 15:24:32 +0200 Subject: [scikit-learn] Which algorithm is used in sklearn SGDClassifier when modified huber loss is used? 
In-Reply-To: References: Message-ID: The name of the algorithm / model would be "L2-penalized linear model with modified Huber loss trained with Stochastic Gradient Descent". SVM is traditionally used to describe models that use the hinge loss only (or sometimes the squared hinge loss too). Only the log loss leads to a probabilistic linear binary classifier in scikit-learn. -- Olivier From b.noushin7 at gmail.com Fri Jul 7 12:18:34 2017 From: b.noushin7 at gmail.com (Ariani A) Date: Fri, 7 Jul 2017 12:18:34 -0400 Subject: [scikit-learn] Help with NLP Message-ID: Dear all, I need urgent help with NLP; do you happen to know anyone who knows nltk or NLP modules? Has anybody read this paper? "Template-Based Information Extraction without the Templates." I am looking forward to hearing from you soon! Best, -Ariani -------------- next part -------------- An HTML attachment was scrubbed... URL: From noflaco at gmail.com Fri Jul 7 12:23:16 2017 From: noflaco at gmail.com (Carlton Banks) Date: Fri, 7 Jul 2017 18:23:16 +0200 Subject: [scikit-learn] Help with NLP In-Reply-To: References: Message-ID: <1694DCBE-443C-4EB0-B2F5-2A0FCC67D5FB@gmail.com> NLP as in Natural language processing? > Den 7. jul. 2017 kl. 18.18 skrev Ariani A : > > Dear all, > I need an urgent help with NLP, do you happen to know anyone who knows nltk or NLP modules? Have anybody of you read this paper? > "Template-Based Information Extraction without the Templates." > I am looking forward to hearirng from you soon! > Best, > -Ariani > _______________________________________________ > scikit-learn mailing list > scikit-learn at python.org > https://mail.python.org/mailman/listinfo/scikit-learn -------------- next part -------------- An HTML attachment was scrubbed... 
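[Editor's note: the model Olivier describes above can be written down in a few lines. This is an illustrative sketch, not part of the original email; the toy dataset and parameters are arbitrary.]

```python
# Illustrative sketch (not from the original email): an L2-penalized linear
# classifier with the modified Huber loss, trained with stochastic gradient
# descent, as described in the reply above. The toy dataset is arbitrary.
from sklearn.datasets import make_classification
from sklearn.linear_model import SGDClassifier

# A small synthetic binary classification problem, purely for illustration.
X, y = make_classification(n_samples=200, n_features=20, random_state=0)

clf = SGDClassifier(loss="modified_huber", penalty="l2", random_state=0)
clf.fit(X, y)

# The fitted object is a plain linear model: one weight per feature.
print(clf.coef_.shape)  # → (1, 20) for a binary problem
```

As Olivier notes, calling this an "SVM" would be a misnomer, since that term conventionally implies the hinge loss.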
URL: From b.noushin7 at gmail.com Fri Jul 7 12:24:22 2017 From: b.noushin7 at gmail.com (Ariani A) Date: Fri, 7 Jul 2017 12:24:22 -0400 Subject: [scikit-learn] Help with NLP In-Reply-To: <1694DCBE-443C-4EB0-B2F5-2A0FCC67D5FB@gmail.com> References: <1694DCBE-443C-4EB0-B2F5-2A0FCC67D5FB@gmail.com> Message-ID: Yes, it is. Regards On Fri, Jul 7, 2017 at 12:23 PM, Carlton Banks wrote: > NLP as is Natural language processing? > > Den 7. jul. 2017 kl. 18.18 skrev Ariani A : > > Dear all, > I need an urgent help with NLP, do you happen to know anyone who knows > nltk or NLP modules? Have anybody of you read this paper? > "Template-Based Information Extraction without the Templates." > I am looking forward to hearirng from you soon! > Best, > -Ariani > _______________________________________________ > scikit-learn mailing list > scikit-learn at python.org > https://mail.python.org/mailman/listinfo/scikit-learn > > > > _______________________________________________ > scikit-learn mailing list > scikit-learn at python.org > https://mail.python.org/mailman/listinfo/scikit-learn > > -------------- next part -------------- An HTML attachment was scrubbed... URL: From noflaco at gmail.com Fri Jul 7 12:50:58 2017 From: noflaco at gmail.com (Carlton Banks) Date: Fri, 7 Jul 2017 18:50:58 +0200 Subject: [scikit-learn] Help with NLP In-Reply-To: References: <1694DCBE-443C-4EB0-B2F5-2A0FCC67D5FB@gmail.com> Message-ID: I am still not sure I quite understand. What aspect of NLP are you involved in? Speech recognition? > Den 7. jul. 2017 kl. 18.24 skrev Ariani A : > > Yes , it is. > regards > > On Fri, Jul 7, 2017 at 12:23 PM, Carlton Banks > wrote: > NLP as is Natural language processing? > >> Den 7. jul. 2017 kl. 18.18 skrev Ariani A >: >> >> Dear all, >> I need an urgent help with NLP, do you happen to know anyone who knows nltk or NLP modules? Have anybody of you read this paper? >> "Template-Based Information Extraction without the Templates." 
>> I am looking forward to hearirng from you soon! >> Best, >> -Ariani >> _______________________________________________ >> scikit-learn mailing list >> scikit-learn at python.org >> https://mail.python.org/mailman/listinfo/scikit-learn > > > _______________________________________________ > scikit-learn mailing list > scikit-learn at python.org > https://mail.python.org/mailman/listinfo/scikit-learn > > > _______________________________________________ > scikit-learn mailing list > scikit-learn at python.org > https://mail.python.org/mailman/listinfo/scikit-learn -------------- next part -------------- An HTML attachment was scrubbed... URL: From jmschreiber91 at gmail.com Fri Jul 7 12:52:15 2017 From: jmschreiber91 at gmail.com (Jacob Schreiber) Date: Fri, 07 Jul 2017 16:52:15 +0000 Subject: [scikit-learn] Help with NLP In-Reply-To: References: <1694DCBE-443C-4EB0-B2F5-2A0FCC67D5FB@gmail.com> Message-ID: The scikit-learn mailing list is probably not the best place to be asking for help with another module. On Fri, Jul 7, 2017 at 9:28 AM Ariani A wrote: > Yes , it is. > regards > > On Fri, Jul 7, 2017 at 12:23 PM, Carlton Banks wrote: > >> NLP as is Natural language processing? >> >> Den 7. jul. 2017 kl. 18.18 skrev Ariani A : >> >> Dear all, >> I need an urgent help with NLP, do you happen to know anyone who knows >> nltk or NLP modules? Have anybody of you read this paper? >> "Template-Based Information Extraction without the Templates." >> I am looking forward to hearirng from you soon! 
>> Best, >> -Ariani >> _______________________________________________ >> scikit-learn mailing list >> scikit-learn at python.org >> https://mail.python.org/mailman/listinfo/scikit-learn >> >> >> >> _______________________________________________ >> scikit-learn mailing list >> scikit-learn at python.org >> https://mail.python.org/mailman/listinfo/scikit-learn >> >> > _______________________________________________ > scikit-learn mailing list > scikit-learn at python.org > https://mail.python.org/mailman/listinfo/scikit-learn > -------------- next part -------------- An HTML attachment was scrubbed... URL: From b.noushin7 at gmail.com Fri Jul 7 13:13:28 2017 From: b.noushin7 at gmail.com (Ariani A) Date: Fri, 7 Jul 2017 13:13:28 -0400 Subject: [scikit-learn] Help with NLP In-Reply-To: References: <1694DCBE-443C-4EB0-B2F5-2A0FCC67D5FB@gmail.com> Message-ID: Dear Jacob, I know, but I am just asking to get help! @Carlton, I want to do text processing; can I email you directly so as not to bother the others? Best, -Ariani On Fri, Jul 7, 2017 at 12:52 PM, Jacob Schreiber wrote: > The scikit-learn mailing list is probably not the best place to be asking > for help with another module. > > On Fri, Jul 7, 2017 at 9:28 AM Ariani A wrote: > >> Yes , it is. >> regards >> >> On Fri, Jul 7, 2017 at 12:23 PM, Carlton Banks wrote: >> >>> NLP as is Natural language processing? >>> >>> Den 7. jul. 2017 kl. 18.18 skrev Ariani A : >>> >>> Dear all, >>> I need an urgent help with NLP, do you happen to know anyone who knows >>> nltk or NLP modules? Have anybody of you read this paper? >>> "Template-Based Information Extraction without the Templates." >>> I am looking forward to hearirng from you soon! 
>>> Best, >>> -Ariani >>> _______________________________________________ >>> scikit-learn mailing list >>> scikit-learn at python.org >>> https://mail.python.org/mailman/listinfo/scikit-learn >>> >>> >>> >>> _______________________________________________ >>> scikit-learn mailing list >>> scikit-learn at python.org >>> https://mail.python.org/mailman/listinfo/scikit-learn >>> >>> >> _______________________________________________ >> scikit-learn mailing list >> scikit-learn at python.org >> https://mail.python.org/mailman/listinfo/scikit-learn >> > > _______________________________________________ > scikit-learn mailing list > scikit-learn at python.org > https://mail.python.org/mailman/listinfo/scikit-learn > > -------------- next part -------------- An HTML attachment was scrubbed... URL: From olivier.grisel at ensta.org Fri Jul 7 14:51:26 2017 From: olivier.grisel at ensta.org (Olivier Grisel) Date: Fri, 7 Jul 2017 20:51:26 +0200 Subject: [scikit-learn] Help with NLP In-Reply-To: References: <1694DCBE-443C-4EB0-B2F5-2A0FCC67D5FB@gmail.com> Message-ID: Please use this mailing list only for questions targeted at scikit-learn. Otherwise, you would be better off asking a specific question on an NLP or data science community platform such as: https://datascience.stackexchange.com/questions/tagged/nlp or, if you have a programming-related question about NLTK or a related library: https://stackoverflow.com/questions/tagged/nlp Also, in any case, I would advise you not to ask for "urgent help". Instead, ask specific questions. Otherwise you are unlikely to ever get a useful answer to your question. If you do not know where to start, read tutorials and introductory books on NLTK or NLP in general instead. 
-- Olivier From jni.soma at gmail.com Fri Jul 7 22:44:38 2017 From: jni.soma at gmail.com (Juan Nunez-Iglesias) Date: Sat, 8 Jul 2017 12:44:38 +1000 Subject: [scikit-learn] Replacing the Boston Housing Prices dataset In-Reply-To: <20170707053505.GR2257694@phare.normalesup.org> References: <20170707053505.GR2257694@phare.normalesup.org> Message-ID: <79f8556c-d6ea-4bed-8f5a-3f4d1a5bda3e@Spark> Just to clarify a couple of things about my position. First, thanks Gaël for a thoughtful response. I fully respect your decision to keep the Boston dataset, and I agree that it can be a useful "teaching moment." (As I suggested in my earlier post.) With regards to breaking tutorials, however, I totally disagree. The whole value of tutorials is that they teach general principles, not analysis of specific datasets. Changing a tutorial dataset is thus different from changing an API. This isn't the right forum for a discussion about the ethics of the Lena image, so I won't go into that, but to suggest that it is a uniquely effective picture, the natural image equivalent of a standard test pattern, is ludicrous. Maybe the replacement wasn't as good, but that is a criticism of the choice of replacement, not of the decision to replace it. There clearly exist millions or billions of images with similarly good teaching characteristics. Finally, yes, removing and deprecating datasets incurs (and inflicts) a real cost, but cost should be at best a minor consideration when dealing with ethical questions. History, and daily life, are replete with unethical decisions made under the excuse that it would cost too much to do what's right. Ultimately the costs are usually found to have been exaggerated. With regards to this dataset, I cede the argument to maintainers, contributors, and users of the dataset, but I will point out that none of the existing tutorials in the library mention this feature, let alone address the ethics of it. 
The DESCR field mentions it entirely nonchalantly, like it is a natural thing to want to measure if one wants to predict house prices. I think I would certainly have a WTF moment, at least, if I was a black student reading through that description. Juan. On 7 Jul 2017, 3:36 PM +1000, Gael Varoquaux , wrote: > Many people gave great points in this thread, in particular Jacob's well > written email. > > Andy's point about tutorials is an important one. I don't resonate at > all with Juan's message. Breaking people's code, even if it is the notes > that they use to give a lecture, is a real cost for them. The cost varies > on a case to case basis. But there are still books printed out there > that demo image processing on Lena, and these will be out for decades. > More importantly, the replacement of Lena used in scipy (the raccoon) > does not allow to demonstrate denoising properly (Lena has smooth regions > with details in the middle: the eyes), or segmentation. In effect, it has > made the examples for the ecosystem less convincing. > > > Of course, by definition, refusing to change anything implies that > unfortunate situations, such as discriminatory biases, cannot be fixed. > This is why changes should be considered on a case-to-case basis. > > The problem that we are facing here is that a dataset about society, the > Boston housing dataset, can reveal discrimination. However, this is true > of every data about society. The classic adult data (extracted from the > American census) easily reveals income discrimination. I teach statistics > with an IQ dataset where it is easy to show a male vs female IQ > difference. This difference disappears after controlling for education > (and the purpose of my course is to teach people to control for > confounding effects). > > Data about society reveals its inequalities. Not working on such data is > hiding problems, not fixing them. 
It is true that misuse of such data can > attempt to establish inequalities as facts of life and get them accepted. > When discussing these issues, we need to educate people about how to run > and interpret analyses. > > > No the Boston data will not go. No it is not a good thing to pretend that > social problems do not exist. > > > Ga?l > > On Fri, Jul 07, 2017 at 09:36:41AM +1000, Juan Nunez-Iglesias wrote: > > For what it's worth: I'm sympathetic to the argument that you can't fix the > > problem if you don't measure it, but I agree with Tony that "many tutorials use > > it" is an extremely weak argument. We removed Lena from scikit-image because it > > was the right thing to do. I very much doubt that Boston house prices is in > > more widespread use than Lena was in image processing. > > > You can argue about whether or not it's morally right or wrong to include the > > dataset. I see merit to both arguments. But "too many tutorials use it" is very > > similar in flavour to "the economy of the South would collapse without > > slavery." > > > Regarding fair uses of the feature, I would hope that all sklearn tutorials > > using the dataset mention such uses. The potential for abuse and > > misinterpretation is enormous. > > > On 7 Jul 2017, 6:36 AM +1000, Jacob Schreiber , wrote: > > > Hi Tony > > > As others have pointed out, I think that you may be misunderstanding the > > purpose of that "feature." We are in agreement that discrimination against > > protected classes is not OK, and that even outside complying with the law > > one should avoid discrimination, in model building or elsewhere. However, I > > disagree that one does this by eliminating from all datasets any feature > > that may allude to these protected classes. 
As Andreas pointed out, there > > is a growing effort to ensure that machine learning models are fair and > > benefit the common good (such as FATML, DSSG, etc..), and from my > > understanding the general consensus isn't necessarily that simply > > eliminating the feature is sufficient. I think we are in agreement that > > naively learning a model over a feature set containing questionable > > features and calling it a day is not okay, but as others have pointed out, > > having these features present and handling them appropriately can help > > guard against the model implicitly learning unfair biases (even if they are > > not explicitly exposed to the feature). > > > I would welcome the addition of the Ames dataset to the ones supported by > > sklearn, but I'm not convinced that the Boston dataset should be removed. > > As Andreas pointed out, there is a benefit to having canonical examples > > present so that beginners can easily follow along with the many tutorials > > that have been written using them. As Sean points out, the paper itself is > > trying to pull out the connection between house price and clean air in the > > presence of possible confounding variables. In a more general sense, saying > > that a feature shouldn't be there because a simple linear regression is > > unaffected by the results is a bit odd because it is very common for > > datasets to include irrelevant features, and handling them appropriately is > > important. In addition, one could argue that having this type of issue > > arise in a toy dataset has a benefit because it exposes these types of > > issues to those learning data science earlier on and allows them to keep > > these issues in mind in the future when the data is more serious. > > > It is important for us all to keep issues of fairness in mind when it comes > > to data science. I'm glad that you're speaking out in favor of fairness and > > trying to bring attention to it. 
> > > Jacob > > > On Thu, Jul 6, 2017 at 12:08 PM, Sean Violante > wrote: > > > G Reina > > you make a bizarre argument. You argue that you should not even check > > racism as a possible factor in house prices? > > > But then you yourself check whether its relevant > > Then you say > > > "but I'd argue that it's more due to the location (near water, near > > businesses, near restaurants, near parks and recreation) than to the > > ethnic makeup" > > > Which was basically what the original authors wanted to show too, > > > Harrison, D. and Rubinfeld, D.L. `Hedonic prices and the demand for > > clean air', J. Environ. Economics & Management, vol.5, 81-102, 1978. > > > but unless you measure ethnic make-up you cannot show that it is not a > > confounder. > > > The term "white flight" refers to affluent white families moving to the > > suburbs.. And clearly a question is whether/how much was racism or > > avoiding air pollution. > > > > > > > On 6 Jul 2017 6:10 pm, "G Reina" wrote: > > > I'd like to request that the "Boston Housing Prices" dataset in > > sklearn (sklearn.datasets.load_boston) be replaced with the "Ames > > Housing Prices" dataset (https://ww2.amstat.org/publications/jse/ > > v19n3/decock.pdf). I am willing to submit the code change if the > > developers agree. > > > The Boston dataset has the feature "Bk is the proportion of blacks > > in town". It is an incredibly racist "feature" to include in any > > dataset. I think is beneath us as data scientists. > > > I submit that the Ames dataset is a viable alternative for learning > > regression. The author has shown that the dataset is a more robust > > replacement for Boston. Ames is a 2011 regression dataset on > > housing prices and has more than 5 times the amount of training > > examples with over 7 times as many features (none of which are > > morally questionable). > > > I welcome the community's thoughts on the matter. > > > Thanks. 
> > -Tony > > > Here's an article I wrote on the Boston dataset: > > https://www.linkedin.com/pulse/hidden-racism-data-science-g- > > anthony-reina?trk=v-feed&lipi=urn%3Ali%3Apage%3Ad_flagship3_ > > feed%3Bmu67f2GSzj5xHMpSD6M00A%3D%3D > > > > _______________________________________________ > > scikit-learn mailing list > > scikit-learn at python.org > > https://mail.python.org/mailman/listinfo/scikit-learn > > > > > _______________________________________________ > > scikit-learn mailing list > > scikit-learn at python.org > > https://mail.python.org/mailman/listinfo/scikit-learn > > > > > _______________________________________________ > > scikit-learn mailing list > > scikit-learn at python.org > > https://mail.python.org/mailman/listinfo/scikit-learn > > > > _______________________________________________ > > scikit-learn mailing list > > scikit-learn at python.org > > https://mail.python.org/mailman/listinfo/scikit-learn > > > -- > Gael Varoquaux > Researcher, INRIA Parietal > NeuroSpin/CEA Saclay , Bat 145, 91191 Gif-sur-Yvette France > Phone: ++ 33-1-69-08-79-68 > http://gael-varoquaux.info http://twitter.com/GaelVaroquaux > _______________________________________________ > scikit-learn mailing list > scikit-learn at python.org > https://mail.python.org/mailman/listinfo/scikit-learn -------------- next part -------------- An HTML attachment was scrubbed... URL: From jmschreiber91 at gmail.com Sat Jul 8 00:26:43 2017 From: jmschreiber91 at gmail.com (Jacob Schreiber) Date: Fri, 7 Jul 2017 21:26:43 -0700 Subject: [scikit-learn] Replacing the Boston Housing Prices dataset In-Reply-To: <79f8556c-d6ea-4bed-8f5a-3f4d1a5bda3e@Spark> References: <20170707053505.GR2257694@phare.normalesup.org> <79f8556c-d6ea-4bed-8f5a-3f4d1a5bda3e@Spark> Message-ID: We would welcome a pull request amending the documentation to include a neutral discussion of the issues you've brought up. 
Optimally, it would include many of the points brought up in this discussion as to why it was ultimately kept despite the issues being raised. On Fri, Jul 7, 2017 at 7:44 PM, Juan Nunez-Iglesias wrote: > Just to clarify a couple of things about my position. > > First, thanks Ga?l for a thoughtful response. I fully respect your > decision to keep the Boston dataset, and I agree that it can be a useful > "teaching moment." (As I suggested in my earlier post.) > > With regards to breaking tutorials, however, I totally disagree. The whole > value of tutorials is that they teach general principles, not analysis of > specific datasets. Changing a tutorial dataset is thus different from > changing an API. This isn't the right forum for a discussion about the > ethics of the Lena image, so I won't go into that, but to suggest that it > is a uniquely effective picture, the natural image equivalent of a standard > test pattern, is ludicrous. Maybe the replacement wasn't as good, but that > is a criticism of the choice of replacement, not of the decision to replace > it. There clearly exist millions or billions of images with similarly good > teaching characteristics. > > Finally, yes, removing and deprecating datasets incurs (and inflicts) a > real cost, but cost should be at best a minor consideration when dealing > with ethical questions. History, and daily life, are replete with unethical > decisions made under the excuse that it would cost too much to do what's > right. Ultimately the costs are usually found to have been exaggerated. > > With regards to this dataset, I cede the argument to maintainers, > contributors, and users of the dataset, but I will point out that none of > the existing tutorials > > in the library mention this feature, let alone addresses the ethics of it. > The DESCR field mentions it entirely nonchalantly, like it is a natural > thing to want to measure if one wants to predict house prices. 
I think I > would certainly have a WTF moment, at least, if I was a black student > reading through that description. > > Juan. > > On 7 Jul 2017, 3:36 PM +1000, Gael Varoquaux < > gael.varoquaux at normalesup.org>, wrote: > > Many people gave great points in this thread, in particular Jacob's well > written email. > > Andy's point about tutorials is an important one. I don't resonate at > all with Juan's message. Breaking people's code, even if it is the notes > that they use to give a lecture, is a real cost for them. The cost varies > on a case to case basis. But there are still books printed out there > that demo image processing on Lena, and these will be out for decades. > More importantly, the replacement of Lena used in scipy (the raccoon) > does not allow to demonstrate denoising properly (Lena has smooth regions > with details in the middle: the eyes), or segmentation. In effect, it has > made the examples for the ecosystem less convincing. > > > Of course, by definition, refusing to change anything implies that > unfortunate situations, such as discriminatory biases, cannot be fixed. > This is why changes should be considered on a case-to-case basis. > > The problem that we are facing here is that a dataset about society, the > Boston housing dataset, can reveal discrimination. However, this is true > of every data about society. The classic adult data (extracted from the > American census) easily reveals income discrimination. I teach statistics > with an IQ dataset where it is easy to show a male vs female IQ > difference. This difference disappears after controlling for education > (and the purpose of my course is to teach people to control for > confounding effects). > > Data about society reveals its inequalities. Not working on such data is > hiding problems, not fixing them. It is true that misuse of such data can > attempt to establish inequalities as facts of life and get them accepted. 
> When discussing these issues, we need to educate people about how to run > and interpret analyses. > > > No, the Boston data will not go. No, it is not a good thing to pretend that > social problems do not exist. > > > Gaël > > On Fri, Jul 07, 2017 at 09:36:41AM +1000, Juan Nunez-Iglesias wrote: > > For what it's worth: I'm sympathetic to the argument that you can't fix the > problem if you don't measure it, but I agree with Tony that "many > tutorials use > it" is an extremely weak argument. We removed Lena from scikit-image > because it > was the right thing to do. I very much doubt that Boston house prices is in > more widespread use than Lena was in image processing. > > > You can argue about whether or not it's morally right or wrong to include > the > dataset. I see merit to both arguments. But "too many tutorials use it" is > very > similar in flavour to "the economy of the South would collapse without > slavery." > > > Regarding fair uses of the feature, I would hope that all sklearn tutorials > using the dataset mention such uses. The potential for abuse and > misinterpretation is enormous. > > > On 7 Jul 2017, 6:36 AM +1000, Jacob Schreiber , > wrote: > > > Hi Tony > > > As others have pointed out, I think that you may be misunderstanding the > purpose of that "feature." We are in agreement that discrimination against > protected classes is not OK, and that even outside complying with the law > one should avoid discrimination, in model building or elsewhere. However, I > disagree that one does this by eliminating from all datasets any feature > that may allude to these protected classes. As Andreas pointed out, there > is a growing effort to ensure that machine learning models are fair and > benefit the common good (such as FATML, DSSG, etc.), and from my > understanding the general consensus isn't necessarily that simply > eliminating the feature is sufficient.
I think we are in agreement that > naively learning a model over a feature set containing questionable > features and calling it a day is not okay, but as others have pointed out, > having these features present and handling them appropriately can help > guard against the model implicitly learning unfair biases (even if they are > not explicitly exposed to the feature). > > > I would welcome the addition of the Ames dataset to the ones supported by > sklearn, but I'm not convinced that the Boston dataset should be removed. > As Andreas pointed out, there is a benefit to having canonical examples > present so that beginners can easily follow along with the many tutorials > that have been written using them. As Sean points out, the paper itself is > trying to pull out the connection between house price and clean air in the > presence of possible confounding variables. In a more general sense, saying > that a feature shouldn't be there because a simple linear regression is > unaffected by the results is a bit odd because it is very common for > datasets to include irrelevant features, and handling them appropriately is > important. In addition, one could argue that having this type of issue > arise in a toy dataset has a benefit because it exposes these types of > issues to those learning data science earlier on and allows them to keep > these issues in mind in the future when the data is more serious. > > > It is important for us all to keep issues of fairness in mind when it comes > to data science. I'm glad that you're speaking out in favor of fairness and > trying to bring attention to it. > > > Jacob > > > On Thu, Jul 6, 2017 at 12:08 PM, Sean Violante wrote: > > > G Reina > you make a bizarre argument. You argue that you should not even check > racism as a possible factor in house prices? 
> > > But then you yourself check whether its relevant > Then you say > > > "but I'd argue that it's more due to the location (near water, near > businesses, near restaurants, near parks and recreation) than to the > ethnic makeup" > > > Which was basically what the original authors wanted to show too, > > > Harrison, D. and Rubinfeld, D.L. `Hedonic prices and the demand for > clean air', J. Environ. Economics & Management, vol.5, 81-102, 1978. > > > but unless you measure ethnic make-up you cannot show that it is not a > confounder. > > > The term "white flight" refers to affluent white families moving to the > suburbs.. And clearly a question is whether/how much was racism or > avoiding air pollution. > > > > > > > On 6 Jul 2017 6:10 pm, "G Reina" wrote: > > > I'd like to request that the "Boston Housing Prices" dataset in > sklearn (sklearn.datasets.load_boston) be replaced with the "Ames > Housing Prices" dataset (https://ww2.amstat.org/publications/jse/ > v19n3/decock.pdf). I am willing to submit the code change if the > developers agree. > > > The Boston dataset has the feature "Bk is the proportion of blacks > in town". It is an incredibly racist "feature" to include in any > dataset. I think is beneath us as data scientists. > > > I submit that the Ames dataset is a viable alternative for learning > regression. The author has shown that the dataset is a more robust > replacement for Boston. Ames is a 2011 regression dataset on > housing prices and has more than 5 times the amount of training > examples with over 7 times as many features (none of which are > morally questionable). > > > I welcome the community's thoughts on the matter. > > > Thanks. 
> -Tony > > > Here's an article I wrote on the Boston dataset: > https://www.linkedin.com/pulse/hidden-racism-data-science-g- > anthony-reina?trk=v-feed&lipi=urn%3Ali%3Apage%3Ad_flagship3_ > feed%3Bmu67f2GSzj5xHMpSD6M00A%3D%3D > > > > _______________________________________________ > scikit-learn mailing list > scikit-learn at python.org > https://mail.python.org/mailman/listinfo/scikit-learn > > > > > _______________________________________________ > scikit-learn mailing list > scikit-learn at python.org > https://mail.python.org/mailman/listinfo/scikit-learn > > > > > _______________________________________________ > scikit-learn mailing list > scikit-learn at python.org > https://mail.python.org/mailman/listinfo/scikit-learn > > > > _______________________________________________ > scikit-learn mailing list > scikit-learn at python.org > https://mail.python.org/mailman/listinfo/scikit-learn > > > > -- > Gael Varoquaux > Researcher, INRIA Parietal > NeuroSpin/CEA Saclay , Bat 145, 91191 Gif-sur-Yvette France > Phone: ++ 33-1-69-08-79-68 <+33%201%2069%2008%2079%2068> > http://gael-varoquaux.info http://twitter.com/GaelVaroquaux > _______________________________________________ > scikit-learn mailing list > scikit-learn at python.org > https://mail.python.org/mailman/listinfo/scikit-learn > > > _______________________________________________ > scikit-learn mailing list > scikit-learn at python.org > https://mail.python.org/mailman/listinfo/scikit-learn > > -------------- next part -------------- An HTML attachment was scrubbed... URL: From mathieu at mblondel.org Sat Jul 8 02:22:31 2017 From: mathieu at mblondel.org (Mathieu Blondel) Date: Sat, 8 Jul 2017 15:22:31 +0900 Subject: [scikit-learn] Fwd: Feedback on scikit-learn.org In-Reply-To: References: Message-ID: Someone had this to say about eigenfaces. 
---------- Forwarded message ---------- From: Frances Liu Date: Sat, Jul 8, 2017 at 6:57 AM Subject: Feedback on scikit-learn.org To: mathieu at mblondel.org Hi Mathieu, I found your email on your personal website, which is linked at the top of the authors list for scikit-learn.org. I just want to submit a small complaint -- the page for dimensionality reduction: http://scikit-learn.org/stable/modules/decomposition.html#decompositions uses faces as examples. The generated faces are wayyyyyyy too scary. Considering that minors and people with health conditions may visit the website, could you use some less horrifying examples please? Thank you! Best, Frances -------------- next part -------------- An HTML attachment was scrubbed... URL: From gael.varoquaux at normalesup.org Sat Jul 8 03:12:53 2017 From: gael.varoquaux at normalesup.org (Gael Varoquaux) Date: Sat, 8 Jul 2017 09:12:53 +0200 Subject: [scikit-learn] Fwd: Feedback on scikit-learn.org In-Reply-To: References: Message-ID: <20170708071253.GP2257694@phare.normalesup.org> The way that I would think about such a question is that a website like that of the New York Times, or of Le Monde, has pictures that are much more scary. We are probably on the safe end. Cheers, Gaël On Sat, Jul 08, 2017 at 03:22:31PM +0900, Mathieu Blondel wrote: > Someone had this to say about eigenfaces. > ---------- Forwarded message ---------- > From: Frances Liu > Date: Sat, Jul 8, 2017 at 6:57 AM > Subject: Feedback on scikit-learn.org > To: mathieu at mblondel.org > Hi Mathieu, > I found your email on your personal website, which is linked at the top of the > authors list for scikit-learn.org. I just want to submit a small complaint -- > the page for dimensionality reduction: http://scikit-learn.org/stable/modules/decomposition.html#decompositions > uses faces as examples. The generated faces are wayyyyyyy too scary.
Considering > that minors and people with health conditions may visit the website, could you > use some less horrifying examples please? > Thank you! > Best, > Frances > _______________________________________________ > scikit-learn mailing list > scikit-learn at python.org > https://mail.python.org/mailman/listinfo/scikit-learn -- Gael Varoquaux Researcher, INRIA Parietal NeuroSpin/CEA Saclay , Bat 145, 91191 Gif-sur-Yvette France Phone: ++ 33-1-69-08-79-68 http://gael-varoquaux.info http://twitter.com/GaelVaroquaux From g.lemaitre58 at gmail.com Sat Jul 8 04:50:57 2017 From: g.lemaitre58 at gmail.com (Guillaume Lemaitre) Date: Sat, 08 Jul 2017 10:50:57 +0200 Subject: [scikit-learn] Fwd: Feedback on scikit-learn.org In-Reply-To: <20170708071253.GP2257694@phare.normalesup.org> References: <20170708071253.GP2257694@phare.normalesup.org> Message-ID: <20170708085057.4870225.70142.35245@gmail.com> In the same line, we should stop publishing faces generated by GANs. They are even worse :-) Guillaume?Lemaitre? INRIA?Saclay?Ile-de-France?/?Equipe?PARIETAL guillaume.lemaitre at inria.fr?-?https://glemaitre.github.io/ From warren.weckesser at gmail.com Sat Jul 8 05:13:54 2017 From: warren.weckesser at gmail.com (Warren Weckesser) Date: Sat, 8 Jul 2017 05:13:54 -0400 Subject: [scikit-learn] Fwd: Feedback on scikit-learn.org In-Reply-To: References: Message-ID: Obligatory meme: https://imgur.com/a/BLimp Warren On Sat, Jul 8, 2017 at 2:22 AM, Mathieu Blondel wrote: > Someone had this to say about eigenfaces. > > ---------- Forwarded message ---------- > From: Frances Liu > Date: Sat, Jul 8, 2017 at 6:57 AM > Subject: Feedback on scikit-learn.org > To: mathieu at mblondel.org > > > Hi Mathieu, > > I found your email on your personal website, which is linked at the top of > the authors list for scikit-learn.org. 
I just want to submit a small > complaint -- the page for dimensionality reduction: http://scikit-learn.org/stable/modules/decomposition.html#decompositions > uses faces as examples. The generated faces are wayyyyyyy too scary. > Considering that minors and people with health conditions may visit the > website, could you use some less horrifying examples please? > > Thank you! > > Best, > Frances > > > _______________________________________________ > scikit-learn mailing list > scikit-learn at python.org > https://mail.python.org/mailman/listinfo/scikit-learn > > -------------- next part -------------- An HTML attachment was scrubbed... URL: From valia.rodriguez at gmail.com Sat Jul 8 07:00:56 2017 From: valia.rodriguez at gmail.com (Valia Rodriguez) Date: Sat, 8 Jul 2017 12:00:56 +0100 Subject: [scikit-learn] Replacing the Boston Housing Prices dataset In-Reply-To: References: <20170707053505.GR2257694@phare.normalesup.org> <79f8556c-d6ea-4bed-8f5a-3f4d1a5bda3e@Spark> Message-ID: Hello everybody, I just subscribed to this list to let you know what I think about this topic, as the black woman I am. My husband, who is on this list, told me about the discussion going on, and I wanted to share my thoughts with all of you: There is nothing wrong or racist in counting how many black people there are in a given population, just as it is not racist to count how many Asian or white people there are. First, in many epidemiologic, demographic and sociologic studies we need to take into account -- and do counts on the basis of -- ethnicity, skin color or race, depending on where in the world we are doing the study and on the population we are counting. There is no other way to address these topics if you do not count how many blacks, whites, Asians and so on. Any teaching should simulate real conditions, so a dataset including this is fine. It is valid to count on the basis of skin color because if we don't, how can we then study the distribution of wealth, or even racism itself?
Second: there is nothing wrong with the word 'black'. That word should not raise a flag. I am black, and it is fine for me and for any other person like me to be called black, because we are -- depending on the context, of course. Just as there is nothing wrong with being white and being part of a count of 'number of whites' for a specific study. It would be very bad, however, if the dataset said 'number of coloured people' to refer to black people; that would be very racist. Valia On Sat, Jul 8, 2017 at 10:31 AM, Matthew Brett wrote: > > Forwarded conversation > Subject: [scikit-learn] Replacing the Boston Housing Prices dataset > ------------------------ > > From: G Reina > Date: Thu, Jul 6, 2017 at 5:05 PM > To: scikit-learn at python.org > > > I'd like to request that the "Boston Housing Prices" dataset in sklearn > (sklearn.datasets.load_boston) be replaced with the "Ames Housing Prices" > dataset (https://ww2.amstat.org/publications/jse/v19n3/decock.pdf). I am > willing to submit the code change if the developers agree. > > The Boston dataset has the feature "Bk is the proportion of blacks in town". > It is an incredibly racist "feature" to include in any dataset. I think it is > beneath us as data scientists. > > I submit that the Ames dataset is a viable alternative for learning > regression. The author has shown that the dataset is a more robust > replacement for Boston. Ames is a 2011 regression dataset on housing prices > and has more than 5 times the number of training examples with over 7 times > as many features (none of which are morally questionable). > > I welcome the community's thoughts on the matter. > > Thanks.
> -Tony > > Here's an article I wrote on the Boston dataset: > https://www.linkedin.com/pulse/hidden-racism-data-science-g-anthony-reina?trk=v-feed&lipi=urn%3Ali%3Apage%3Ad_flagship3_feed%3Bmu67f2GSzj5xHMpSD6M00A%3D%3D > > > _______________________________________________ > scikit-learn mailing list > scikit-learn at python.org > https://mail.python.org/mailman/listinfo/scikit-learn > > > ---------- > From: Andreas Mueller > Date: Thu, Jul 6, 2017 at 5:31 PM > To: scikit-learn at python.org > > > Hi Tony. > > I don't think it's a good idea to remove the dataset, given how many > tutorials and examples rely on it. > I also don't think it's a good idea to ignore racial discrimination, which I > guess this feature is trying to capture. > > I was recently asked to remove an excerpt from a dataset from my slide, as > it was "too racist". It was randomly sampled > data from the adult census dataset. Unfortunately, economics in the US are > not color blind (yet), and the reality is racist. > I haven't done an in-depth analysis on whether this feature is actually > informative, but I don't think your analysis is conclusive. > > Including ethnicity in data actually allows us to ensure "fairness" in > certain decision making processes. > Without collecting this data, it would be impossible to ensure automatic > decisions are not influenced > by past human biases. Arguably that's not what the authors of this dataset > are doing. > > Check out http://www.fatml.org/ for more on fairness in machine learning and > data science. 
> > Cheers, > Andy > > _______________________________________________ > scikit-learn mailing list > scikit-learn at python.org > https://mail.python.org/mailman/listinfo/scikit-learn > > > > _______________________________________________ > scikit-learn mailing list > scikit-learn at python.org > https://mail.python.org/mailman/listinfo/scikit-learn > > > ---------- > From: G Reina > Date: Thu, Jul 6, 2017 at 5:41 PM > To: Scikit-learn user and developer mailing list > > > Wow. I completely disagree. > > The fact that too many tutorials and examples rely on it is not a reason to > keep the dataset. New tutorials are written all the time. And, as sklearn > evolves some of the existing tutorials will need to be updated anyway to > keep up with the changes. > > Including "ethnicity" is completely illegal in making business decisions in > the United States. For example, credit scoring systems bend over backward to > expunge even proxy features that could be highly correlated with race (for > example, they can't include neighborhood, but can include entire counties). > > Let's leave the studying of racism to actual scientists who study racism. > Not to toy datasets that we use to teach our students about a completely > unrelated matter like regression. > > -Tony > > > _______________________________________________ > scikit-learn mailing list > scikit-learn at python.org > https://mail.python.org/mailman/listinfo/scikit-learn > > > ---------- > From: Andrew Holmes > Date: Thu, Jul 6, 2017 at 5:19 PM > To: Scikit-learn user and developer mailing list > > > But how do social scientists do research into racism without including > ethnicity as a feature in the data? 
> > Best wishes > Andrew > > Public Profile > > > _______________________________________________ > scikit-learn mailing list > scikit-learn at python.org > https://mail.python.org/mailman/listinfo/scikit-learn > > > > _______________________________________________ > scikit-learn mailing list > scikit-learn at python.org > https://mail.python.org/mailman/listinfo/scikit-learn > > > ---------- > From: jma > Date: Thu, Jul 6, 2017 at 6:38 PM > To: scikit-learn at python.org > > > I work in the financial services industry and build machine learning models > for marketing applications. We put an enormous effort (multiple layers of > oversight and governance) into ensuring that our models are free of bias > against protected classes etc. Having data describing race and ethnicity > (among others) is extremely important to validate this is indeed the case. > Without it, you have no such assurance. > > > _______________________________________________ > scikit-learn mailing list > scikit-learn at python.org > https://mail.python.org/mailman/listinfo/scikit-learn > > > ---------- > From: Andreas Mueller > Date: Thu, Jul 6, 2017 at 7:09 PM > To: scikit-learn at python.org > > > > > On 07/06/2017 12:41 PM, G Reina wrote: >> >> >> The fact that too many tutorials and examples rely on it is not a reason >> to keep the dataset. New tutorials are written all the time. And, as sklearn >> evolves some of the existing tutorials will need to be updated anyway to >> keep up with the changes. > > No, we try to avoid that as much as possible. > Old examples should work for as long as possible, and we actively avoid > breaking API unnecessarily. It's one of the core principles of scikit-learn > development. > > And new tutorials can use any dataset they choose. We are working on > including an openml fetcher, which allows using more datasets more easily. 
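The validation jma describes -- checking that decisions are not biased against a protected class, which is only possible if class membership is recorded -- can be made concrete with a simple selection-rate screen. Below is a minimal pure-Python sketch of the "four-fifths" disparate-impact ratio used as a first-pass check in US employment law; the group labels and approval numbers are invented for illustration:

```python
from collections import defaultdict

def selection_rates(decisions):
    """Per-group approval rate from (group, approved) pairs."""
    approved = defaultdict(int)
    total = defaultdict(int)
    for group, ok in decisions:
        total[group] += 1
        approved[group] += bool(ok)
    return {g: approved[g] / total[g] for g in total}

def disparate_impact(decisions, protected, reference):
    """Ratio of the protected group's approval rate to the reference
    group's; values below 0.8 fail the common 'four-fifths' screen."""
    rates = selection_rates(decisions)
    return rates[protected] / rates[reference]

# Hypothetical decisions: 50% approval for group "x", 90% for group "y".
decisions = [("x", i < 5) for i in range(10)] + [("y", i < 9) for i in range(10)]
ratio = disparate_impact(decisions, protected="x", reference="y")
print(round(ratio, 3))  # 0.556 -- well below the 0.8 threshold
```

Note that without the group column in `decisions`, neither rate can be computed, which is exactly the point being made about dropping such features from datasets.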
> > ---------- > From: Sean Violante > Date: Thu, Jul 6, 2017 at 8:08 PM > To: Scikit-learn user and developer mailing list > > > G Reina > you make a bizarre argument. You argue that you should not even check racism > as a possible factor in house prices? > > But then you yourself check whether its relevant > Then you say > > "but I'd argue that it's more due to the location (near water, near > businesses, near restaurants, near parks and recreation) than to the ethnic > makeup" > > Which was basically what the original authors wanted to show too, > > Harrison, D. and Rubinfeld, D.L. `Hedonic prices and the demand for clean > air', J. Environ. Economics & Management, vol.5, 81-102, 1978. > > but unless you measure ethnic make-up you cannot show that it is not a > confounder. > > The term "white flight" refers to affluent white families moving to the > suburbs.. And clearly a question is whether/how much was racism or avoiding > air pollution. > > > > > >> _______________________________________________ >> scikit-learn mailing list >> scikit-learn at python.org >> https://mail.python.org/mailman/listinfo/scikit-learn >> > > _______________________________________________ > scikit-learn mailing list > scikit-learn at python.org > https://mail.python.org/mailman/listinfo/scikit-learn > > > ---------- > From: Jacob Schreiber > Date: Thu, Jul 6, 2017 at 9:34 PM > To: Scikit-learn user and developer mailing list > > > Hi Tony > > As others have pointed out, I think that you may be misunderstanding the > purpose of that "feature." We are in agreement that discrimination against > protected classes is not OK, and that even outside complying with the law > one should avoid discrimination, in model building or elsewhere. However, I > disagree that one does this by eliminating from all datasets any feature > that may allude to these protected classes. 
As Andreas pointed out, there is > a growing effort to ensure that machine learning models are fair and benefit > the common good (such as FATML, DSSG, etc..), and from my understanding the > general consensus isn't necessarily that simply eliminating the feature is > sufficient. I think we are in agreement that naively learning a model over a > feature set containing questionable features and calling it a day is not > okay, but as others have pointed out, having these features present and > handling them appropriately can help guard against the model implicitly > learning unfair biases (even if they are not explicitly exposed to the > feature). > > I would welcome the addition of the Ames dataset to the ones supported by > sklearn, but I'm not convinced that the Boston dataset should be removed. As > Andreas pointed out, there is a benefit to having canonical examples present > so that beginners can easily follow along with the many tutorials that have > been written using them. As Sean points out, the paper itself is trying to > pull out the connection between house price and clean air in the presence of > possible confounding variables. In a more general sense, saying that a > feature shouldn't be there because a simple linear regression is unaffected > by the results is a bit odd because it is very common for datasets to > include irrelevant features, and handling them appropriately is important. > In addition, one could argue that having this type of issue arise in a toy > dataset has a benefit because it exposes these types of issues to those > learning data science earlier on and allows them to keep these issues in > mind in the future when the data is more serious. > > It is important for us all to keep issues of fairness in mind when it comes > to data science. I'm glad that you're speaking out in favor of fairness and > trying to bring attention to it. 
> > Jacob > > _______________________________________________ > scikit-learn mailing list > scikit-learn at python.org > https://mail.python.org/mailman/listinfo/scikit-learn > > > ---------- > From: Juan Nunez-Iglesias > Date: Fri, Jul 7, 2017 at 12:36 AM > To: Scikit-learn user and developer mailing list > > > For what it's worth: I'm sympathetic to the argument that you can't fix the > problem if you don't measure it, but I agree with Tony that "many tutorials > use it" is an extremely weak argument. We removed Lena from scikit-image > because it was the right thing to do. I very much doubt that Boston house > prices is in more widespread use than Lena was in image processing. > > You can argue about whether or not it's morally right or wrong to include > the dataset. I see merit to both arguments. But "too many tutorials use it" > is very similar in flavour to "the economy of the South would collapse > without slavery." > > Regarding fair uses of the feature, I would hope that all sklearn tutorials > using the dataset mention such uses. The potential for abuse and > misinterpretation is enormous. > > _______________________________________________ > scikit-learn mailing list > scikit-learn at python.org > https://mail.python.org/mailman/listinfo/scikit-learn > > > ---------- > From: Sebastian Raschka > Date: Fri, Jul 7, 2017 at 1:39 AM > To: Scikit-learn user and developer mailing list > > > I think there can be some middle ground. I.e., adding a new, simple dataset > to demonstrate regression (maybe autmpg, wine quality, or sth like that) and > use that for the scikit-learn examples in the main documentation etc but > leave the boston dataset in the code base for now. Whether it's a weak > argument or not, it would be quite destructive to remove the dataset > altogether in the next version or so, not only because old tutorials use it > but many unit tests in many different projects depend on it. 
I think it > might be better to phase it out by having a good alternative first, and I am > sure that the scikit-learn maintainers wouldn't have anything against it if > someone would update the examples/tutorials with the use of different > datasets > > Best, > Sebastian > > ---------- > From: Bill Ross > Date: Fri, Jul 7, 2017 at 2:00 AM > To: scikit-learn at python.org > > > Unless the data concretely promotes discrimination, it seems discriminatory > to exclude it. > > Bill > > On 7/6/17 5:39 PM, Sebastian Raschka wrote: >> >> I think there can be some middle ground. I.e., adding a new, simple >> dataset to demonstrate regression (maybe autmpg, wine quality, or sth like >> that) and use that for the scikit-learn examples in the main documentation >> etc but leave the boston dataset in the code base for now. Whether it's a >> weak argument or not, it would be quite destructive to remove the dataset >> altogether in the next version or so, not only because old tutorials use it >> but many unit tests in many different projects depend on it. I think it >> might be better to phase it out by having a good alternative first, and I am >> sure that the scikit-learn maintainers wouldn't have anything against it if >> someone would update the examples/tutorials with the use of different >> datasets >> >> Best, >> Sebastian >> >>> On Jul 6, 2017, at 7:36 PM, Juan Nunez-Iglesias >>> wrote: >>> >>> For what it's worth: I'm sympathetic to the argument that you can't fix >>> the problem if you don't measure it, but I agree with Tony that "many >>> tutorials use it" is an extremely weak argument. We removed Lena from >>> scikit-image because it was the right thing to do. I very much doubt that >>> Boston house prices is in more widespread use than Lena was in image >>> processing. >>> >>> You can argue about whether or not it's morally right or wrong to include >>> the dataset. I see merit to both arguments. 
But "too many tutorials use it" >>> is very similar in flavour to "the economy of the South would collapse >>> without slavery." >>> >>> Regarding fair uses of the feature, I would hope that all sklearn >>> tutorials using the dataset mention such uses. The potential for abuse and >>> misinterpretation is enormous. >>> >>> On 7 Jul 2017, 6:36 AM +1000, Jacob Schreiber , >>> wrote: >>>> >>>> Hi Tony >>>> >>>> As others have pointed out, I think that you may be misunderstanding the >>>> purpose of that "feature." We are in agreement that discrimination against >>>> protected classes is not OK, and that even outside complying with the law >>>> one should avoid discrimination, in model building or elsewhere. However, I >>>> disagree that one does this by eliminating from all datasets any feature >>>> that may allude to these protected classes. As Andreas pointed out, there is >>>> a growing effort to ensure that machine learning models are fair and benefit >>>> the common good (such as FATML, DSSG, etc..), and from my understanding the >>>> general consensus isn't necessarily that simply eliminating the feature is >>>> sufficient. I think we are in agreement that naively learning a model over a >>>> feature set containing questionable features and calling it a day is not >>>> okay, but as others have pointed out, having these features present and >>>> handling them appropriately can help guard against the model implicitly >>>> learning unfair ! >> >> biases (e >> ven if they are not explicitly exposed to the feature). >>>> >>>> I would welcome the addition of the Ames dataset to the ones supported >>>> by sklearn, but I'm not convinced that the Boston dataset should be removed. >>>> As Andreas pointed out, there is a benefit to having canonical examples >>>> present so that beginners can easily follow along with the many tutorials >>>> that have been written using them. 
As Sean points out, the paper itself is >>>> trying to pull out the connection between house price and clean air in the >>>> presence of possible confounding variables. In a more general sense, saying >>>> that a feature shouldn't be there because a simple linear regression is >>>> unaffected by the results is a bit odd because it is very common for >>>> datasets to include irrelevant features, and handling them appropriately is >>>> important. In addition, one could argue that having this type of issue arise >>>> in a toy dataset has a benefit because it exposes these types of issues to >>>> those learning data science earlier on and allows them to keep these issues >>>> in mind in the future! > > > ---------- > From: Gael Varoquaux > Date: Fri, Jul 7, 2017 at 6:35 AM > To: Scikit-learn user and developer mailing list > > > Many people gave great points in this thread, in particular Jacob's well > written email. > > Andy's point about tutorials is an important one. I don't resonate at > all with Juan's message. Breaking people's code, even if it is the notes > that they use to give a lecture, is a real cost for them. The cost varies > on a case to case basis. But there are still books printed out there > that demo image processing on Lena, and these will be out for decades. > More importantly, the replacement of Lena used in scipy (the raccoon) > does not allow to demonstrate denoising properly (Lena has smooth regions > with details in the middle: the eyes), or segmentation. In effect, it has > made the examples for the ecosystem less convincing. > > > Of course, by definition, refusing to change anything implies that > unfortunate situations, such as discriminatory biases, cannot be fixed. > This is why changes should be considered on a case-to-case basis. > > The problem that we are facing here is that a dataset about society, the > Boston housing dataset, can reveal discrimination. However, this is true > of every data about society. 
The classic adult data (extracted from the > American census) easily reveals income discrimination. I teach statistics > with an IQ dataset where it is easy to show a male vs female IQ > difference. This difference disappears after controlling for education > (and the purpose of my course is to teach people to control for > confounding effects). > > Data about society reveals its inequalities. Not working on such data is > hiding problems, not fixing them. It is true that misuse of such data can > attempt to establish inequalities as facts of life and get them accepted. > When discussing these issues, we need to educate people about how to run > and interpret analyses. > > > No, the Boston data will not go. No, it is not a good thing to pretend that > social problems do not exist. > > > Gaël > -- > Gael Varoquaux > Researcher, INRIA Parietal > NeuroSpin/CEA Saclay , Bat 145, 91191 Gif-sur-Yvette France > Phone: ++ 33-1-69-08-79-68 > http://gael-varoquaux.info http://twitter.com/GaelVaroquaux > > ---------- > From: Juan Nunez-Iglesias > Date: Sat, Jul 8, 2017 at 3:44 AM > To: Scikit-learn user and developer mailing list > > > Just to clarify a couple of things about my position. > > First, thanks Gaël for a thoughtful response. I fully respect your decision > to keep the Boston dataset, and I agree that it can be a useful "teaching > moment." (As I suggested in my earlier post.) > > With regards to breaking tutorials, however, I totally disagree. The whole > value of tutorials is that they teach general principles, not analysis of > specific datasets. Changing a tutorial dataset is thus different from > changing an API. This isn't the right forum for a discussion about the > ethics of the Lena image, so I won't go into that, but to suggest that it is > a uniquely effective picture, the natural image equivalent of a standard > test pattern, is ludicrous.
Maybe the replacement wasn't as good, but that > is a criticism of the choice of replacement, not of the decision to replace > it. There clearly exist millions or billions of images with similarly good > teaching characteristics. > > Finally, yes, removing and deprecating datasets incurs (and inflicts) a real > cost, but cost should be at best a minor consideration when dealing with > ethical questions. History, and daily life, are replete with unethical > decisions made under the excuse that it would cost too much to do what's > right. Ultimately the costs are usually found to have been exaggerated. > > With regards to this dataset, I cede the argument to maintainers, > contributors, and users of the dataset, but I will point out that none of > the existing tutorials in the library mention this feature, let alone > address the ethics of it. The DESCR field mentions it entirely > nonchalantly, as if it were a natural thing to want to measure if one wants to > predict house prices. I think I would certainly have a WTF moment, at least, > if I were a black student reading through that description. > > Juan. > > _______________________________________________ > scikit-learn mailing list > scikit-learn at python.org > https://mail.python.org/mailman/listinfo/scikit-learn > > > ---------- > From: Jacob Schreiber > Date: Sat, Jul 8, 2017 at 5:26 AM > To: Scikit-learn user and developer mailing list > > > We would welcome a pull request amending the documentation to include a > neutral discussion of the issues you've brought up. Optimally, it would > include many of the points brought up in this discussion as to why it was > ultimately kept despite the issues being raised. > > > _______________________________________________ > scikit-learn mailing list > scikit-learn at python.org > https://mail.python.org/mailman/listinfo/scikit-learn > > > -- Valia Rodriguez, MD PhD Neurophysiology Lecturer.
School of Life and Health Sciences, Aston University Professor of Clinical Neurophysiology, Cuban Neuroscience Center From ross at cgl.ucsf.edu Sun Jul 9 20:13:47 2017 From: ross at cgl.ucsf.edu (Bill Ross) Date: Sun, 9 Jul 2017 17:13:47 -0700 Subject: [scikit-learn] Replacing the Boston Housing Prices dataset In-Reply-To: <32b9ea32-b5dc-dfbe-04ca-36e8db30160e@cgl.ucsf.edu> References: <61B34F59-142E-4851-9B27-7DC2A0C2DAF8@gmail.com> <32b9ea32-b5dc-dfbe-04ca-36e8db30160e@cgl.ucsf.edu> Message-ID: <65531062-c9d6-ce7f-b712-0a1abd3cd935@cgl.ucsf.edu> Possibly of interest: Race and ethnicity Imputation from Disease history with Deep LEarning https://github.com/jisungk/riddle Bill On 7/6/17 6:00 PM, Bill Ross wrote: > Unless the data concretely promotes discrimination, it seems > discriminatory to exclude it. > > Bill > > On 7/6/17 5:39 PM, Sebastian Raschka wrote: >> I think there can be some middle ground, i.e., adding a new, simple >> dataset to demonstrate regression (maybe auto-mpg, wine quality, or something >> like that), using that for the scikit-learn examples in the main >> documentation etc., but leaving the Boston dataset in the code base for >> now. Whether it's a weak argument or not, it would be quite >> destructive to remove the dataset altogether in the next version or >> so, not only because old tutorials use it but also because many unit tests in many >> different projects depend on it. I think it might be better to phase >> it out by having a good alternative first, and I am sure that the >> scikit-learn maintainers wouldn't have anything against it if someone >> updated the examples/tutorials to use different datasets. >> >> Best, >> Sebastian >> >>> On Jul 6, 2017, at 7:36 PM, Juan Nunez-Iglesias >>> wrote: >>> >>> For what it's worth: I'm sympathetic to the argument that you can't >>> fix the problem if you don't measure it, but I agree with Tony that >>> "many tutorials use it" is an extremely weak argument.
We removed >>> Lena from scikit-image because it was the right thing to do. I very >>> much doubt that the Boston house prices dataset is in more widespread use than >>> Lena was in image processing. >>> >>> You can argue about whether or not it's morally right or wrong to >>> include the dataset. I see merit to both arguments. But "too many >>> tutorials use it" is very similar in flavour to "the economy of the >>> South would collapse without slavery." >>> >>> Regarding fair uses of the feature, I would hope that all sklearn >>> tutorials using the dataset mention such uses. The potential for >>> abuse and misinterpretation is enormous. >>> >>> On 7 Jul 2017, 6:36 AM +1000, Jacob Schreiber >>> , wrote: >>>> Hi Tony >>>> >>>> As others have pointed out, I think that you may be >>>> misunderstanding the purpose of that "feature." We are in agreement >>>> that discrimination against protected classes is not OK, and that >>>> even outside complying with the law one should avoid >>>> discrimination, in model building or elsewhere. However, I disagree >>>> that one does this by eliminating from all datasets any feature >>>> that may allude to these protected classes. As Andreas pointed out, >>>> there is a growing effort to ensure that machine learning models >>>> are fair and benefit the common good (such as FATML, DSSG, etc.), >>>> and from my understanding the general consensus isn't necessarily >>>> that simply eliminating the feature is sufficient. I think we are >>>> in agreement that naively learning a model over a feature set >>>> containing questionable features and calling it a day is not okay, >>>> but as others have pointed out, having these features present and >>>> handling them appropriately can help guard against the model >>>> implicitly learning unfair biases (even if they are not explicitly exposed to the feature).
>>>> I would welcome the addition of the Ames dataset to the ones >>>> supported by sklearn, but I'm not convinced that the Boston dataset >>>> should be removed. As Andreas pointed out, there is a benefit to >>>> having canonical examples present so that beginners can easily >>>> follow along with the many tutorials that have been written using >>>> them. As Sean points out, the paper itself is trying to pull out >>>> the connection between house price and clean air in the presence of >>>> possible confounding variables. In a more general sense, saying >>>> that a feature shouldn't be there because a simple linear >>>> regression is unaffected by the results is a bit odd, because it is >>>> very common for datasets to include irrelevant features, and >>>> handling them appropriately is important. In addition, one could >>>> argue that having this type of issue arise in a toy dataset has a >>>> benefit, because it exposes these types of issues to those learning >>>> data science earlier on and allows them to keep these issues in >>>> mind in the future, when the data is more serious. >>>> It is important for us all to keep issues of fairness in mind when >>>> it comes to data science. I'm glad that you're speaking out in >>>> favor of fairness and trying to bring attention to it. >>>> >>>> Jacob >>>> >>>> On Thu, Jul 6, 2017 at 12:08 PM, Sean Violante >>>> wrote: >>>> G Reina, >>>> you make a bizarre argument. You argue that you should not even >>>> check racism as a possible factor in house prices? >>>> >>>> But then you yourself check whether it's relevant. >>>> Then you say: >>>> >>>> "but I'd argue that it's more due to the location (near water, near >>>> businesses, near restaurants, near parks and recreation) than to >>>> the ethnic makeup" >>>> >>>> Which was basically what the original authors wanted to show too: >>>> >>>> Harrison, D. and Rubinfeld, D.L. `Hedonic prices and the demand for >>>> clean air', J. Environ.
Economics & Management, vol. 5, 81-102, 1978. >>>> >>>> But unless you measure ethnic make-up, you cannot show that it is >>>> not a confounder. >>>> >>>> The term "white flight" refers to affluent white families moving to >>>> the suburbs. And clearly a question is whether, and how much, this was racism >>>> or avoiding air pollution. >>>> >>>> >>>> On 6 Jul 2017 6:10 pm, "G Reina" wrote: >>>> I'd like to request that the "Boston Housing Prices" dataset in >>>> sklearn (sklearn.datasets.load_boston) be replaced with the "Ames >>>> Housing Prices" dataset >>>> (https://ww2.amstat.org/publications/jse/v19n3/decock.pdf). I am >>>> willing to submit the code change if the developers agree. >>>> >>>> The Boston dataset has the feature "Bk is the proportion of blacks >>>> in town". It is an incredibly racist "feature" to include in any >>>> dataset. I think it is beneath us as data scientists. >>>> >>>> I submit that the Ames dataset is a viable alternative for learning >>>> regression. The author has shown that the dataset is a more robust >>>> replacement for Boston. Ames is a 2011 regression dataset on >>>> housing prices and has more than 5 times the number of training >>>> examples, with over 7 times as many features (none of which are >>>> morally questionable). >>>> >>>> I welcome the community's thoughts on the matter. >>>> >>>> Thanks.
>>>> -Tony >>>> >>>> Here's an article I wrote on the Boston dataset: >>>> https://www.linkedin.com/pulse/hidden-racism-data-science-g-anthony-reina?trk=v-feed&lipi=urn%3Ali%3Apage%3Ad_flagship3_feed%3Bmu67f2GSzj5xHMpSD6M00A%3D%3D > _______________________________________________ > scikit-learn mailing list > scikit-learn at python.org > https://mail.python.org/mailman/listinfo/scikit-learn -------------- next part -------------- An HTML attachment was scrubbed...
URL: From ross at cgl.ucsf.edu Sun Jul 9 20:53:02 2017 From: ross at cgl.ucsf.edu (Bill Ross) Date: Sun, 9 Jul 2017 17:53:02 -0700 Subject: [scikit-learn] Replacing the Boston Housing Prices dataset In-Reply-To: <65531062-c9d6-ce7f-b712-0a1abd3cd935@cgl.ucsf.edu> References: <61B34F59-142E-4851-9B27-7DC2A0C2DAF8@gmail.com> <32b9ea32-b5dc-dfbe-04ca-36e8db30160e@cgl.ucsf.edu> <65531062-c9d6-ce7f-b712-0a1abd3cd935@cgl.ucsf.edu> Message-ID: <87933999-964e-0684-2c67-0ec105748250@cgl.ucsf.edu> And more to the point the discussion on Reddit: https://www.reddit.com/r/MachineLearning/comments/6m8tp0/p_deep_learning_for_estimating_race_and_ethnicity/ Bill On 7/9/17 5:13 PM, Bill Ross wrote: > Possibly of interest: > > Race and ethnicity Imputation from Disease history with Deep LEarning > > https://github.com/jisungk/riddle > > Bill _______________________________________________ scikit-learn mailing list scikit-learn at python.org https://mail.python.org/mailman/listinfo/scikit-learn -------------- next part -------------- An HTML attachment was scrubbed...
URL: From grhanceylan at gmail.com Mon Jul 10 10:58:52 2017 From: grhanceylan at gmail.com (Gürhan Ceylan) Date: Mon, 10 Jul 2017 17:58:52 +0300 Subject: [scikit-learn] Contribution Message-ID: Hi everyone, I am wondering: how can I use external optimization algorithms with scikit-learn (for instance, with the neural network models), instead of the built-in algorithms (Stochastic Gradient Descent, Adam, or L-BFGS)? Furthermore, I want to introduce a new unconstrained optimization algorithm to scikit-learn; the implementation of the algorithm and the related paper can be found here. I couldn't find any explanation about this situation. Do you have a defined procedure for making this kind of contribution? If this is not the case, how should I start to make such a proposal/contribution? Kind regards, Gürhan C. -------------- next part -------------- An HTML attachment was scrubbed... URL: From jmschreiber91 at gmail.com Mon Jul 10 12:01:39 2017 From: jmschreiber91 at gmail.com (Jacob Schreiber) Date: Mon, 10 Jul 2017 09:01:39 -0700 Subject: [scikit-learn] Contribution In-Reply-To: References: Message-ID: Howdy This question and the one right after it in the FAQ are probably relevant re: inclusion of new algorithms: http://scikit-learn.org/stable/faq.html#what-are-the-inclusion-criteria-for-new-algorithms. The gist is that we only include well-established algorithms, and there is no end to those. I think it is unlikely that a PR will get merged with a cutting-edge new algorithm, as the scope of scikit-learn isn't necessarily "the latest" so much as "the classics." You may also consider writing a scikit-contrib package that basically creates what you're interested in, in scikit-learn format, but external to the project. We'd be more than happy to link to it. If the algorithm becomes a smashing success over time, we'd reconsider adding it to the main code base. As to your first question, you should check out how the current optimizers are written for the algorithm you're interested in.
I don't think there's a plug-and-play way to drop in your own optimizer like many deep learning packages support, unfortunately. You'd probably have to modify the code directly to support your own. Let me know if you have any other questions. Jacob On Mon, Jul 10, 2017 at 7:58 AM, Gürhan Ceylan wrote: > Hi everyone, > > I am wondering: how can I use external optimization algorithms with > scikit-learn, instead of the built-in algorithms (Stochastic Gradient > Descent, Adam, or L-BFGS)? [...] > > Kind regards, > > Gürhan C. _______________________________________________ scikit-learn mailing list scikit-learn at python.org https://mail.python.org/mailman/listinfo/scikit-learn -------------- next part -------------- An HTML attachment was scrubbed... URL: From vaggi.federico at gmail.com Mon Jul 10 12:10:09 2017 From: vaggi.federico at gmail.com (federico vaggi) Date: Mon, 10 Jul 2017 16:10:09 +0000 Subject: [scikit-learn] Contribution In-Reply-To: References: Message-ID: Hey Gurhan, sklearn doesn't really neatly separate optimizers from the models they optimize at the level of the API (except in a few cases). In order to make the package friendlier to new users, each model has excellent optimizer defaults that you can use, and only in a few cases does it make sense to tweak the optimization routines (for example, SAGA if you have a very large dataset when doing logistic regression).
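To make that concrete: in scikit-learn the optimizer is chosen with a string argument on the estimator rather than passed in as an object, so swapping solvers looks like this (a minimal sketch using the public scikit-learn API, on a toy problem; the printed numbers are just training accuracy):

```python
# In scikit-learn the optimizer is a string flag on the estimator,
# not a pluggable object: you pick one of the built-in solver names.
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

X, y = make_classification(n_samples=500, n_features=20, random_state=0)

for solver in ("lbfgs", "liblinear", "sag", "saga"):
    clf = LogisticRegression(solver=solver, max_iter=1000)
    clf.fit(X, y)
    print(solver, round(clf.score(X, y), 3))
```

Passing a name outside the supported list raises a ValueError, which is why dropping in an external optimizer means modifying the estimator's code rather than its parameters.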
There is a fantastic library called lightning where the optimization routines are first-class citizens: http://contrib.scikit-learn.org/lightning/ - you can take a look there. However, lightning focuses on convex optimization, so most algorithms have provable convergence rates. Good luck! On Mon, 10 Jul 2017 at 09:05 Jacob Schreiber wrote: > Howdy > > This question and the one right after it in the FAQ are probably relevant re: > inclusion of new algorithms: > http://scikit-learn.org/stable/faq.html#what-are-the-inclusion-criteria-for-new-algorithms. [...] _______________________________________________ scikit-learn mailing list scikit-learn at python.org https://mail.python.org/mailman/listinfo/scikit-learn -------------- next part -------------- An HTML attachment was scrubbed... URL: From uri at goren4u.com Mon Jul 10 13:32:42 2017 From: uri at goren4u.com (Uri Goren) Date: Mon, 10 Jul 2017 20:32:42 +0300 Subject: [scikit-learn] Contribution In-Reply-To: References: Message-ID: Hi, I'd like to implement the Markov clustering algorithm. Any objections? On Jul 10, 2017 7:10 PM, "federico vaggi" wrote: Hey Gurhan, sklearn doesn't really neatly separate optimizers from the models they optimize at the level of the API (except in a few cases). [...]
_______________________________________________ scikit-learn mailing list scikit-learn at python.org https://mail.python.org/mailman/listinfo/scikit-learn -------------- next part -------------- An HTML attachment was scrubbed... URL: From zephyr14 at gmail.com Mon Jul 10 13:37:12 2017 From: zephyr14 at gmail.com (Vlad Niculae) Date: Mon, 10 Jul 2017 13:37:12 -0400 Subject: [scikit-learn] Contribution In-Reply-To: References: Message-ID: <20170710173712.bt6nigii5icmihgl@vladn-desktop> On Mon, Jul 10, 2017 at 04:10:09PM +0000, federico vaggi wrote: > There is a fantastic library called lightning where the optimization > routines are first-class citizens: > http://contrib.scikit-learn.org/lightning/ - you can take a look there. > However, lightning focuses on convex optimization, so most algorithms have > provable convergence rates. Hi, I fully agree that lightning is fantastic :) but it might not be what Gürhan wants. It's true that lightning's API is designed around optimizers rather than around models. So where in scikit-learn we usually have, e.g., LogisticRegression(solver='sag'), in lightning you would have SAGClassifier(loss='log') to achieve something close. But neither library has the OO-style separation between freeform models and optimizers that you might find in deep learning frameworks. So, for instance, it's relatively easy to add a new loss function to the lightning SAGClassifier, but you would still only be able to use it with a linear model.
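To illustrate the kind of separation in question, here is a hypothetical sketch (not the actual API of scikit-learn, lightning, or any deep learning framework) in which the model is just a loss function and the optimizer is a swappable argument, delegated to scipy.optimize:

```python
# Hypothetical sketch of "optimizer as a first-class citizen":
# the model is a plain loss function, and any scipy.optimize method
# name can be plugged in without touching the model code.
import numpy as np
from scipy.optimize import minimize

def logistic_loss(w, X, y):
    # y in {-1, +1}; mean log(1 + exp(-y * Xw)) plus a small L2 term
    z = y * (X @ w)
    return np.mean(np.logaddexp(0.0, -z)) + 0.5 * 1e-3 * np.dot(w, w)

def fit(X, y, method="L-BFGS-B"):
    # `method` is the swappable optimizer (any scipy.optimize method name)
    w0 = np.zeros(X.shape[1])
    return minimize(logistic_loss, w0, args=(X, y), method=method).x

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 5))
true_w = np.array([1.0, -2.0, 0.5, 0.0, 3.0])
y = np.sign(X @ true_w + 0.1 * rng.normal(size=200))

w_lbfgs = fit(X, y)                    # quasi-Newton solver
w_powell = fit(X, y, method="Powell")  # derivative-free solver, same model code
```

Neither library exposes this style for the efficiency reasons discussed in this thread: committing to linear models lets the solvers exploit sparsity and closed-form gradients.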
This is by design in both scikit-learn and lightning, at least at the moment: by making these kinds of assumptions about the models, implementations can be much more efficient in terms of computation and storage, especially when sparse data is involved. Yours, Vlad > > Good luck! > > On Mon, 10 Jul 2017 at 09:05 Jacob Schreiber > wrote: > > > Howdy > > > > This question and the one right after in the FAQ are probably relevant re: > > inclusion of new algorithms: > > http://scikit-learn.org/stable/faq.html#what-are-the-inclusion-criteria-for-new-algorithms. > > The gist is that we only include well established algorithms, and there are > > no end to those. I think it is unlikely that a PR will get merged with a > > cutting edge new algorithm, as the scope of scikit-learn isn't necessary > > "the latest" as opposed to "the classics." You may also consider writing a > > scikit-contrib package that basically creates what you're interested in in > > scikit-learn format, but external to the project. We'd be more than happy > > to link to it. If the algorithm becomes a smashing success over time, we'd > > reconsider adding it to the main code base. > > > > As to your first question, you should check out how the current optimizers > > are written for the algorithm you're interested in. I don't think there's a > > plug and play way to drop in your own optimizer like many deep learning > > packages support, unfortunately. You'd probably have to modify the code > > directly to support your own. > > > > Let me know if you have any other questions. > > > > Jacob > > > > On Mon, Jul 10, 2017 at 7:58 AM, G?rhan Ceylan > > wrote: > > > >> Hi everyone, > >> > >> I am wondering, How can I use external optimization algorithms with scikit-learn, > >> for instance neural network > >> > >> , instead of defined algorithms ( Stochastic Gradient Descent, Adam, or > >> L-BFGS). 
> >> > >> Furthermore, I want to introduce a new unconstrained optimization > >> algorithm to scikit-learn, implementation of the algorithm and related paper > >> can be found here . > >> > >> I couldn't find any explanation > >> , about the > >> situation. Do you have defined procedure to make such kind of > >> contributions? If this is not the case, How should I start to make such a > >> proposal/contribution ? > >> > >> > >> Kind regards, > >> > >> G?rhan C. > >> > >> > >> _______________________________________________ > >> scikit-learn mailing list > >> scikit-learn at python.org > >> https://mail.python.org/mailman/listinfo/scikit-learn > >> > >> > > _______________________________________________ > > scikit-learn mailing list > > scikit-learn at python.org > > https://mail.python.org/mailman/listinfo/scikit-learn > > > _______________________________________________ > scikit-learn mailing list > scikit-learn at python.org > https://mail.python.org/mailman/listinfo/scikit-learn From uri at goren4u.com Mon Jul 10 17:40:19 2017 From: uri at goren4u.com (Uri Goren) Date: Tue, 11 Jul 2017 00:40:19 +0300 Subject: [scikit-learn] Contribution - Markov Clustering Message-ID: Hi, I've been advised to contact you before working on an implementation of a new feature. I am thinking of implementing the Markov clustering and add it to sklearn.cluster module. See: https://micans.org/mcl/ https://gist.github.com/urigoren/1f76567f3af56ed8c33f076537768a60 Do you know if anyone else has started working on it ? Would you advise against it for some reason ? Thank you, Uri -------------- next part -------------- An HTML attachment was scrubbed... 
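For readers unfamiliar with the method Uri proposes, here is a bare-bones numpy sketch of the MCL iteration (expansion then inflation on a column-stochastic matrix); it is illustrative only, not the reference implementation from micans.org, and the helper name `mcl` is hypothetical:

```python
import numpy as np

def mcl(A, expansion=2, inflation=2.0, max_iter=100, tol=1e-6):
    """Toy Markov clustering: alternate expansion and inflation steps."""
    M = A.astype(float) + np.eye(len(A))       # add self-loops
    M /= M.sum(axis=0, keepdims=True)          # make columns stochastic
    for _ in range(max_iter):
        prev = M.copy()
        M = np.linalg.matrix_power(M, expansion)  # expansion: spread flow
        M **= inflation                           # inflation: favor strong flow
        M /= M.sum(axis=0, keepdims=True)
        if np.abs(M - prev).max() < tol:
            break
    clusters = []                  # nonzero rows are attractors; their
    for row in M:                  # support gives the cluster members
        members = frozenset(np.flatnonzero(row > 1e-8))
        if members and members not in clusters:
            clusters.append(members)
    return clusters

# two disjoint 3-cliques should come out as two clusters
A = np.zeros((6, 6))
A[:3, :3] = 1
A[3:, 3:] = 1
np.fill_diagonal(A, 0)
print(mcl(A))
```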
URL: From alexandre.gramfort at telecom-paristech.fr Mon Jul 10 23:00:45 2017 From: alexandre.gramfort at telecom-paristech.fr (Alexandre Gramfort) Date: Tue, 11 Jul 2017 05:00:45 +0200 Subject: [scikit-learn] Contribution - Markov Clustering In-Reply-To: References: Message-ID: hi, did you have a look at: http://scikit-learn.org/stable/faq.html#what-are-the-inclusion-criteria-for-new-algorithms Alex -------------- next part -------------- An HTML attachment was scrubbed... URL: From uri at goren4u.com Tue Jul 11 00:36:30 2017 From: uri at goren4u.com (Uri Goren) Date: Tue, 11 Jul 2017 07:36:30 +0300 Subject: [scikit-learn] Contribution - Markov Clustering In-Reply-To: References: Message-ID: I have. The only criterion that I am unsure about is the number of citations. In the literature, Markov clustering is usually compared to affinity propagation, which also has a similar number of citations. I have attached my implementation in my GitHub account for you to review. Do I have your approval to make it a pull request? On Jul 11, 2017 6:00 AM, "Alexandre Gramfort" < alexandre.gramfort at telecom-paristech.fr> wrote: > hi, > > did you have a look at : > > http://scikit-learn.org/stable/faq.html#what-are-the- > inclusion-criteria-for-new-algorithms > > Alex > > _______________________________________________ > scikit-learn mailing list > scikit-learn at python.org > https://mail.python.org/mailman/listinfo/scikit-learn > > -------------- next part -------------- An HTML attachment was scrubbed... URL: From jmschreiber91 at gmail.com Tue Jul 11 12:03:34 2017 From: jmschreiber91 at gmail.com (Jacob Schreiber) Date: Tue, 11 Jul 2017 09:03:34 -0700 Subject: [scikit-learn] Contribution - Markov Clustering In-Reply-To: References: Message-ID: You don't need our permission to submit a PR, go ahead! We welcome PRs. On Mon, Jul 10, 2017 at 9:36 PM, Uri Goren wrote: > I have. > The only criterion that I am unsure about is the number of citations.
> > In the literature Markov clustering is usually compared to affinity > prolongation, which also has a similar number of citations. > > I have attached my implementation in my github account for you to review. > > Do I have your approval to make it a pull request? > > > > > On Jul 11, 2017 6:00 AM, "Alexandre Gramfort" paristech.fr> wrote: > >> hi, >> >> did you have a look at : >> >> http://scikit-learn.org/stable/faq.html#what-are-the-inclusi >> on-criteria-for-new-algorithms >> >> Alex >> >> _______________________________________________ >> scikit-learn mailing list >> scikit-learn at python.org >> https://mail.python.org/mailman/listinfo/scikit-learn >> >> > _______________________________________________ > scikit-learn mailing list > scikit-learn at python.org > https://mail.python.org/mailman/listinfo/scikit-learn > > -------------- next part -------------- An HTML attachment was scrubbed... URL: From b.noushin7 at gmail.com Tue Jul 11 12:42:15 2017 From: b.noushin7 at gmail.com (Ariani A) Date: Tue, 11 Jul 2017 12:42:15 -0400 Subject: [scikit-learn] Agglomerative clustering problem Message-ID: Hi all, I want to perform agglomerative clustering, but I have no idea of number of clusters before hand. But I want that every cluster has at least 40 data points in it. How can I apply this to sklearn.agglomerative clustering? Should I use dendrogram and cut it somehow? I have no idea how to relate dendrogram to this and cutting it out. Any help will be appreciated! I have to use agglomerative clustering! Thanks, -Ariani -------------- next part -------------- An HTML attachment was scrubbed... URL: From uri at goren4u.com Tue Jul 11 13:54:12 2017 From: uri at goren4u.com (Uri Goren) Date: Tue, 11 Jul 2017 20:54:12 +0300 Subject: [scikit-learn] Agglomerative clustering problem In-Reply-To: References: Message-ID: Take a look at scipy's fcluster function. If M is a matrix of all of your feature vectors, this code snippet should work. 
You need to figure out what metric and algorithm work for you

from sklearn.metrics import pairwise_distances
from scipy.spatial.distance import squareform
from scipy.cluster import hierarchy
X = pairwise_distances(M, metric=metric)
Z = hierarchy.linkage(squareform(X), method=algo)
C = hierarchy.fcluster(Z, threshold, criterion="distance")

Best, Uri Goren On Tue, Jul 11, 2017 at 7:42 PM, Ariani A wrote: > Hi all, > I want to perform agglomerative clustering, but I have no idea of number > of clusters before hand. But I want that every cluster has at least 40 > data points in it. How can I apply this to sklearn.agglomerative clustering? > Should I use dendrogram and cut it somehow? I have no idea how to relate > dendrogram to this and cutting it out. Any help will be appreciated! > I have to use agglomerative clustering! > Thanks, > -Ariani > > _______________________________________________ > scikit-learn mailing list > scikit-learn at python.org > https://mail.python.org/mailman/listinfo/scikit-learn > > -- *Uri Goren, Software innovator* *Phone: +972-507-649-650* *EMail: uri at goren4u.com * *Linkedin: il.linkedin.com/in/ugoren/ * -------------- next part -------------- An HTML attachment was scrubbed... URL: From b.noushin7 at gmail.com Tue Jul 11 14:22:43 2017 From: b.noushin7 at gmail.com (Ariani A) Date: Tue, 11 Jul 2017 14:22:43 -0400 Subject: [scikit-learn] Agglomerative clustering problem In-Reply-To: References: Message-ID: Dear Uri, Thanks. I just have a pairwise distance matrix and I want to implement it so that each cluster has at least 40 data points (in agglomerative clustering). Does it work? Thanks, -Ariani On Tue, Jul 11, 2017 at 1:54 PM, Uri Goren wrote: > Take a look at scipy's fcluster function. > If M is a matrix of all of your feature vectors, this code snippet should > work.
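Ariani's minimum-cluster-size constraint is not a built-in option of either fcluster or scikit-learn's agglomerative clustering; one workaround, sketched here on synthetic data, is to try successively coarser cuts of the dendrogram until the smallest cluster is large enough:

```python
import numpy as np
from scipy.cluster import hierarchy
from scipy.spatial.distance import squareform
from sklearn.metrics import pairwise_distances

rng = np.random.RandomState(0)
pts = np.vstack([rng.randn(60, 3), rng.randn(60, 3) + 5])

D = pairwise_distances(pts)      # stand-in for Ariani's precomputed matrix
D = (D + D.T) / 2.0              # guard against tiny float asymmetry
np.fill_diagonal(D, 0.0)         # squareform wants an exact zero diagonal

Z = hierarchy.linkage(squareform(D), method='average')

# cut at the largest cluster count whose smallest cluster has >= 40 points
labels = None
for k in range(len(pts) // 40, 0, -1):
    cand = hierarchy.fcluster(Z, t=k, criterion='maxclust')
    if np.bincount(cand)[1:].min() >= 40:
        labels = cand
        break
print(np.bincount(labels)[1:])   # resulting cluster sizes
```

The loop always terminates: at k=1 every point falls into one cluster, which trivially satisfies the size floor.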
> > You need to figure out what metric and algorithm work for you > > from sklearn.metrics import pairwise_distance > from scipy.cluster import hierarchy > X = pairwise_distance(M, metric=metric) > Z = hierarchy.linkage(X, algo, metric=metric) > C = hierarchy.fcluster(Z,threshold, criterion="distance") > > Best, > Uri Goren > > On Tue, Jul 11, 2017 at 7:42 PM, Ariani A wrote: > >> Hi all, >> I want to perform agglomerative clustering, but I have no idea of number >> of clusters before hand. But I want that every cluster has at least 40 >> data points in it. How can I apply this to sklearn.agglomerative clusteri >> ng? >> Should I use dendrogram and cut it somehow? I have no idea how to relate >> dendrogram to this and cutting it out. Any help will be appreciated! >> I have to use agglomerative clustering! >> Thanks, >> -Ariani >> >> _______________________________________________ >> scikit-learn mailing list >> scikit-learn at python.org >> https://mail.python.org/mailman/listinfo/scikit-learn >> >> > > > -- > > > *Uri Goren,Software innovator* > > *Phone: +972-507-649-650* > > *EMail: uri at goren4u.com * > *Linkedin: il.linkedin.com/in/ugoren/ * > > _______________________________________________ > scikit-learn mailing list > scikit-learn at python.org > https://mail.python.org/mailman/listinfo/scikit-learn > > -------------- next part -------------- An HTML attachment was scrubbed... URL: From olivier.grisel at ensta.org Tue Jul 11 17:04:14 2017 From: olivier.grisel at ensta.org (Olivier Grisel) Date: Tue, 11 Jul 2017 23:04:14 +0200 Subject: [scikit-learn] Contribution - Markov Clustering In-Reply-To: References: Message-ID: If this is the first time you contribute, please make sure to carefully read the contributors guide till the end: http://scikit-learn.org/stable/developers/contributing.html In particular, make sure to follow the estimators API conventions for your PR to get a chance to be reviewed. 
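The estimator conventions Olivier points to can be illustrated with a deliberately trivial clusterer (hypothetical class and toy logic; the authoritative checklist is in the contributing guide):

```python
import numpy as np
from sklearn.base import BaseEstimator, ClusterMixin
from sklearn.utils import check_array

class ToyClusterer(BaseEstimator, ClusterMixin):
    """Illustrative only: bins samples into `n_clusters` groups by norm."""

    def __init__(self, n_clusters=2):
        # convention: __init__ only stores hyper-parameters, unvalidated
        self.n_clusters = n_clusters

    def fit(self, X, y=None):
        X = check_array(X)
        norms = np.linalg.norm(X, axis=1)
        # convention: attributes learned in fit get a trailing underscore
        edges = np.linspace(norms.min(), norms.max(),
                            self.n_clusters + 1)[1:-1]
        self.labels_ = np.digitize(norms, edges)
        return self  # convention: fit returns self
```

Following these conventions is what lets an estimator work inside `GridSearchCV`, `Pipeline`, and `clone`.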
In particular the gist you linked to is not compatible with the scikit-learn estimators API. Personally I have never heard of Markov clustering, so it's hard for me to assess whether it should be included in the project or not. It would really help if you could demonstrate its performance on a publicly available dataset where it does significantly better than all the other clustering algorithms already implemented in scikit-learn (both in terms of training speed and in terms of cluster quality / stability, although this latter point is very domain dependent). As a side note, if this is the first time you contribute to the project, it's probably best to have a look at how other pull requests are being reviewed (by reading the comment threads of other PRs) and maybe start with a small pull request to fix a small bug (with a non-regression test) or tackle some documentation issues. Adding new estimators takes a lot of effort to review (we need tests, docs, updated examples) and assumes some familiarity with the existing code base. -- Olivier From uri at goren4u.com Wed Jul 12 00:47:57 2017 From: uri at goren4u.com (Uri Goren) Date: Wed, 12 Jul 2017 07:47:57 +0300 Subject: [scikit-learn] Contribution - Markov Clustering In-Reply-To: References: Message-ID: I've added this PR, and I addressed in the comments some of your concerns (publications, comparison to affinity propagation, etc.): https://github.com/scikit-learn/scikit-learn/pull/9329 I'd love for you to review it, since this is my first PR in the scikit-learn repository. On Wed, Jul 12, 2017 at 12:04 AM, Olivier Grisel wrote: > If this is the first time you contribute, please make sure to > carefully read the contributors guide till the end: > > http://scikit-learn.org/stable/developers/contributing.html > > In particular, make sure to follow the estimators API conventions for > your PR to get a chance to be reviewed. In particular the gist you > linked to is not compatible with the scikit-learn estimators API.
> > Personally I have never heard of Markov clustering, so it's hard for > me to assess whether it should be included in the project or not. It > would really help if you could demonstrate its performance on a > publicly available dataset where it does significantly better than all > the other clustering algorithms already implemented in scikit-learn > (both in terms of training speed and in terms of cluster quality / > stability, although this latter point is very domain dependent). > > As a side note, if this is the first time you contribute to the > project, it's probably best to have a look at how other pull requests > are being reviewed (by reading the comment threads of other PRs) and > maybe start with a small pull request to fix a small bug (with a > non-regression test) or tackle some documentation issues. Adding new > estimators takes a lot of effort to review (we need tests, docs, > updated examples) and assumes some familiarity with the existing code > base. > > -- > Olivier > _______________________________________________ > scikit-learn mailing list > scikit-learn at python.org > https://mail.python.org/mailman/listinfo/scikit-learn > -- *Uri Goren, Software innovator* *Phone: +972-507-649-650* *EMail: uri at goren4u.com * *Linkedin: il.linkedin.com/in/ugoren/ * -------------- next part -------------- An HTML attachment was scrubbed... URL: From b.noushin7 at gmail.com Thu Jul 13 15:42:33 2017 From: b.noushin7 at gmail.com (Ariani A) Date: Thu, 13 Jul 2017 15:42:33 -0400 Subject: [scikit-learn] Agglomerative Clustering without knowing number of clusters In-Reply-To: <20170706163257.zgvwnoih5zjb73io@MacBook-Pro-3.local> References: <20170706163257.zgvwnoih5zjb73io@MacBook-Pro-3.local> Message-ID: Dear Shane, Thanks for your answer. Does DBSCAN work with a distance matrix? I have a distance matrix (a symmetric matrix which contains pairwise distances). Can you help me? I did not find the DBSCAN code in that link.
Best, -Ariani On Thu, Jul 6, 2017 at 12:32 PM, Shane Grigsby wrote: > This sounds like it may be a problem more amenable to either DBSCAN or > OPTICS. Both algorithms don't require a priori knowledge of the number of > clusters, and both let you specify a minimum point membership threshold for > cluster membership. The OPTICS algorithm will also produce a dendrogram > that you can cut for sub clusters if need be. > > DBSCAN is part of the stable release and has been for some time; OPTICS is > pending as a pull request, but it's stable and you can try it if you like: > > https://github.com/scikit-learn/scikit-learn/pull/1984 > > Cheers, > Shane > > > On 06/30, Ariani A wrote: > >> I want to perform agglomerative clustering, but I have no idea of number >> of >> clusters before hand. But I want that every cluster has at least 40 data >> points in it. How can I apply this to sklearn.agglomerative clustering? >> Should I use dendrogram and cut it somehow? I have no idea how to relate >> dendrogram to this and cutting it out. Any help will be appreciated! >> > > _______________________________________________ >> scikit-learn mailing list >> scikit-learn at python.org >> https://mail.python.org/mailman/listinfo/scikit-learn >> > > > -- > *PhD candidate & Research Assistant* > *Cooperative Institute for Research in Environmental Sciences (CIRES)* > *University of Colorado at Boulder* > _______________________________________________ > scikit-learn mailing list > scikit-learn at python.org > https://mail.python.org/mailman/listinfo/scikit-learn > -------------- next part -------------- An HTML attachment was scrubbed... 
URL: From shane.grigsby at colorado.edu Thu Jul 13 17:38:37 2017 From: shane.grigsby at colorado.edu (Shane Grigsby) Date: Thu, 13 Jul 2017 16:38:37 -0500 Subject: [scikit-learn] Agglomerative Clustering without knowing number of clusters In-Reply-To: References: <20170706163257.zgvwnoih5zjb73io@MacBook-Pro-3.local> Message-ID: <20170713213837.fx7ubmlgzcjex6uv@MacBook-Pro-3.local> Hi Ariani, Yes, you can use a distance matrix-- I think that what you want is metric='precomputed', and then X would be your N by N distance matrix. Hope that helps, ~Shane On 07/13, Ariani A wrote: >Dear Shane, >Thanks for your answer. >Does DBSCAN works with distance matrix/? I have a distance matrix >(symmetric matrix which contains pairwise distances). Can you help me? I >did not find DBSCAN code in that link. >Best, >-Ariani > >On Thu, Jul 6, 2017 at 12:32 PM, Shane Grigsby >wrote: > >> This sounds like it may be a problem more amenable to either DBSCAN or >> OPTICS. Both algorithms don't require a priori knowledge of the number of >> clusters, and both let you specify a minimum point membership threshold for >> cluster membership. The OPTICS algorithm will also produce a dendrogram >> that you can cut for sub clusters if need be. >> >> DBSCAN is part of the stable release and has been for some time; OPTICS is >> pending as a pull request, but it's stable and you can try it if you like: >> >> https://github.com/scikit-learn/scikit-learn/pull/1984 >> >> Cheers, >> Shane >> >> >> On 06/30, Ariani A wrote: >> >>> I want to perform agglomerative clustering, but I have no idea of number >>> of >>> clusters before hand. But I want that every cluster has at least 40 data >>> points in it. How can I apply this to sklearn.agglomerative clustering? >>> Should I use dendrogram and cut it somehow? I have no idea how to relate >>> dendrogram to this and cutting it out. Any help will be appreciated! 
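Shane's metric='precomputed' suggestion might look like this in practice (synthetic data and illustrative parameter values):

```python
import numpy as np
from sklearn.cluster import DBSCAN
from sklearn.metrics import pairwise_distances

rng = np.random.RandomState(0)
# two well-separated blobs; afterwards only their distance matrix is used
pts = np.vstack([rng.randn(30, 2), rng.randn(30, 2) + 10])
D = pairwise_distances(pts)               # symmetric N x N distance matrix

db = DBSCAN(eps=2.0, min_samples=5, metric='precomputed').fit(D)
print(sorted(set(db.labels_)))            # cluster ids; -1 would mark noise
```

Note that `eps` is now expressed in the units of the precomputed distances, so it has to be tuned against the actual matrix.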
>>> >> >> _______________________________________________ >>> scikit-learn mailing list >>> scikit-learn at python.org >>> https://mail.python.org/mailman/listinfo/scikit-learn >>> >> >> >> -- >> *PhD candidate & Research Assistant* >> *Cooperative Institute for Research in Environmental Sciences (CIRES)* >> *University of Colorado at Boulder* >> _______________________________________________ >> scikit-learn mailing list >> scikit-learn at python.org >> https://mail.python.org/mailman/listinfo/scikit-learn >> >_______________________________________________ >scikit-learn mailing list >scikit-learn at python.org >https://mail.python.org/mailman/listinfo/scikit-learn -- *PhD candidate & Research Assistant* *Cooperative Institute for Research in Environmental Sciences (CIRES)* *University of Colorado at Boulder* From b.noushin7 at gmail.com Thu Jul 13 19:03:32 2017 From: b.noushin7 at gmail.com (Ariani A) Date: Thu, 13 Jul 2017 19:03:32 -0400 Subject: [scikit-learn] Agglomerative Clustering without knowing number of clusters In-Reply-To: <20170713213837.fx7ubmlgzcjex6uv@MacBook-Pro-3.local> References: <20170706163257.zgvwnoih5zjb73io@MacBook-Pro-3.local> <20170713213837.fx7ubmlgzcjex6uv@MacBook-Pro-3.local> Message-ID: Dear Shane, Thanks for your prompt answer. Do you mean that for DBSCAN there is no need to feed other parameters? Do I just call the function or I have to manipulate the code? P.S. I was not able to find the DBSCAN code on github. Looking forward to hearing from you. Best, -Noushin On Thu, Jul 13, 2017 at 5:38 PM, Shane Grigsby wrote: > Hi Ariani, > Yes, you can use a distance matrix-- I think that what you want is > metric='precomputed', and then X would be your N by N distance matrix. > Hope that helps, > ~Shane > > > On 07/13, Ariani A wrote: > >> Dear Shane, >> Thanks for your answer. >> Does DBSCAN works with distance matrix/? I have a distance matrix >> (symmetric matrix which contains pairwise distances). Can you help me? 
I >> did not find DBSCAN code in that link. >> Best, >> -Ariani >> >> On Thu, Jul 6, 2017 at 12:32 PM, Shane Grigsby < >> shane.grigsby at colorado.edu> >> wrote: >> >> This sounds like it may be a problem more amenable to either DBSCAN or >>> OPTICS. Both algorithms don't require a priori knowledge of the number of >>> clusters, and both let you specify a minimum point membership threshold >>> for >>> cluster membership. The OPTICS algorithm will also produce a dendrogram >>> that you can cut for sub clusters if need be. >>> >>> DBSCAN is part of the stable release and has been for some time; OPTICS >>> is >>> pending as a pull request, but it's stable and you can try it if you >>> like: >>> >>> https://github.com/scikit-learn/scikit-learn/pull/1984 >>> >>> Cheers, >>> Shane >>> >>> >>> On 06/30, Ariani A wrote: >>> >>> I want to perform agglomerative clustering, but I have no idea of number >>>> of >>>> clusters before hand. But I want that every cluster has at least 40 data >>>> points in it. How can I apply this to sklearn.agglomerative clustering? >>>> Should I use dendrogram and cut it somehow? I have no idea how to relate >>>> dendrogram to this and cutting it out. Any help will be appreciated! 
>>>> >>>> >>> _______________________________________________ >>> >>>> scikit-learn mailing list >>>> scikit-learn at python.org >>>> https://mail.python.org/mailman/listinfo/scikit-learn >>>> >>>> >>> >>> -- >>> *PhD candidate & Research Assistant* >>> *Cooperative Institute for Research in Environmental Sciences (CIRES)* >>> *University of Colorado at Boulder* >>> _______________________________________________ >>> scikit-learn mailing list >>> scikit-learn at python.org >>> https://mail.python.org/mailman/listinfo/scikit-learn >>> >>> > _______________________________________________ >> scikit-learn mailing list >> scikit-learn at python.org >> https://mail.python.org/mailman/listinfo/scikit-learn >> > > > -- > *PhD candidate & Research Assistant* > *Cooperative Institute for Research in Environmental Sciences (CIRES)* > *University of Colorado at Boulder* > _______________________________________________ > scikit-learn mailing list > scikit-learn at python.org > https://mail.python.org/mailman/listinfo/scikit-learn > -------------- next part -------------- An HTML attachment was scrubbed... URL: From b.noushin7 at gmail.com Thu Jul 13 19:21:41 2017 From: b.noushin7 at gmail.com (Ariani A) Date: Thu, 13 Jul 2017 19:21:41 -0400 Subject: [scikit-learn] Agglomerative Clustering without knowing number of clusters In-Reply-To: References: <20170706163257.zgvwnoih5zjb73io@MacBook-Pro-3.local> <20170713213837.fx7ubmlgzcjex6uv@MacBook-Pro-3.local> Message-ID: Dear Shane, Sorry bothering you! Is the "precomputed" and "distance matrix" you are talking about, are about "DBSCAN" ? Thanks, Best. On Thu, Jul 13, 2017 at 7:03 PM, Ariani A wrote: > Dear Shane, > Thanks for your prompt answer. > Do you mean that for DBSCAN there is no need to feed other parameters? Do > I just call the function or I have to manipulate the code? > P.S. I was not able to find the DBSCAN code on github. > Looking forward to hearing from you. 
> Best, > -Noushin > > On Thu, Jul 13, 2017 at 5:38 PM, Shane Grigsby > wrote: > >> Hi Ariani, >> Yes, you can use a distance matrix-- I think that what you want is >> metric='precomputed', and then X would be your N by N distance matrix. >> Hope that helps, >> ~Shane >> >> >> On 07/13, Ariani A wrote: >> >>> Dear Shane, >>> Thanks for your answer. >>> Does DBSCAN works with distance matrix/? I have a distance matrix >>> (symmetric matrix which contains pairwise distances). Can you help me? I >>> did not find DBSCAN code in that link. >>> Best, >>> -Ariani >>> >>> On Thu, Jul 6, 2017 at 12:32 PM, Shane Grigsby < >>> shane.grigsby at colorado.edu> >>> wrote: >>> >>> This sounds like it may be a problem more amenable to either DBSCAN or >>>> OPTICS. Both algorithms don't require a priori knowledge of the number >>>> of >>>> clusters, and both let you specify a minimum point membership threshold >>>> for >>>> cluster membership. The OPTICS algorithm will also produce a dendrogram >>>> that you can cut for sub clusters if need be. >>>> >>>> DBSCAN is part of the stable release and has been for some time; OPTICS >>>> is >>>> pending as a pull request, but it's stable and you can try it if you >>>> like: >>>> >>>> https://github.com/scikit-learn/scikit-learn/pull/1984 >>>> >>>> Cheers, >>>> Shane >>>> >>>> >>>> On 06/30, Ariani A wrote: >>>> >>>> I want to perform agglomerative clustering, but I have no idea of number >>>>> of >>>>> clusters before hand. But I want that every cluster has at least 40 >>>>> data >>>>> points in it. How can I apply this to sklearn.agglomerative clustering? >>>>> Should I use dendrogram and cut it somehow? I have no idea how to >>>>> relate >>>>> dendrogram to this and cutting it out. Any help will be appreciated! 
>>>>> >>>>> >>>> _______________________________________________ >>>>> scikit-learn mailing list >>>>> scikit-learn at python.org >>>>> https://mail.python.org/mailman/listinfo/scikit-learn >>>> -- >>>> *PhD candidate & Research Assistant* >>>> *Cooperative Institute for Research in Environmental Sciences (CIRES)* >>>> *University of Colorado at Boulder* > > -------------- next part -------------- An HTML attachment was scrubbed... URL: From grhanceylan at gmail.com Fri Jul 14 02:46:21 2017 From: grhanceylan at gmail.com (=?UTF-8?Q?G=C3=BCrhan_Ceylan?=) Date: Fri, 14 Jul 2017 09:46:21 +0300 Subject: [scikit-learn] Contribution In-Reply-To: <20170710173712.bt6nigii5icmihgl@vladn-desktop> References: <20170710173712.bt6nigii5icmihgl@vladn-desktop> Message-ID: @Jacob, I understand your concern about new algorithms: coding, updating, and documenting an unsuccessful algorithm would be wasted effort. Thanks for the tips. @federico, the lightning library is close to what I have in mind, but not the same. I think there should be an easy way to see how optimizers affect learning algorithms. Thanks for the link. @Vlad, thank you for the clarification.
Best, G?rhan 2017-07-10 20:37 GMT+03:00 Vlad Niculae : > On Mon, Jul 10, 2017 at 04:10:09PM +0000, federico vaggi wrote: > > There is a fantastic library called lightning where the optimization > > routines are first class citizens: > > http://contrib.scikit-learn.org/lightning/ - you can take a look there. > > However, lightning focuses on convex optimization, so most algorithms > have > > provable convergence rates. > > Hi, > > I fully agree that lightning is fantastic :) but it might not be what > G?rhan > wants. > > It's true that lightning's api is designed around optimizers rather > than around models. So where in scikit-learn we usually have, e.g., > > LogisticRegression(solver='sag') > > in lightning you would have > > SAGClassifier(loss='log') > > to achieve something close. But neither library has the oo-style > separation between freeform models and optimizers such as you might > find in deep learning frameworks. So, for instance, it's relatively > easy to add a new loss function to the lightning SAGClassifier, but > you would still be able to only use it with a linear model. > > This is by design in both scikit-learn and lightning, at least at the > moment: by making these kinds of assumptions about the models, > implementations can be much more efficient in terms of computation and > storage, especially when sparse data is involved. > > Yours, > Vlad > > > > > Good luck! > > > > On Mon, 10 Jul 2017 at 09:05 Jacob Schreiber > > wrote: > > > > > Howdy > > > > > > This question and the one right after in the FAQ are probably relevant > re: > > > inclusion of new algorithms: > > > http://scikit-learn.org/stable/faq.html#what-are-the- > inclusion-criteria-for-new-algorithms. > > > The gist is that we only include well established algorithms, and > there are > > > no end to those. 
I think it is unlikely that a PR will get merged with > a > > > cutting edge new algorithm, as the scope of scikit-learn isn't > necessary > > > "the latest" as opposed to "the classics." You may also consider > writing a > > > scikit-contrib package that basically creates what you're interested > in in > > > scikit-learn format, but external to the project. We'd be more than > happy > > > to link to it. If the algorithm becomes a smashing success over time, > we'd > > > reconsider adding it to the main code base. > > > > > > As to your first question, you should check out how the current > optimizers > > > are written for the algorithm you're interested in. I don't think > there's a > > > plug and play way to drop in your own optimizer like many deep learning > > > packages support, unfortunately. You'd probably have to modify the code > > > directly to support your own. > > > > > > Let me know if you have any other questions. > > > > > > Jacob > > > > > > On Mon, Jul 10, 2017 at 7:58 AM, G?rhan Ceylan > > > wrote: > > > > > >> Hi everyone, > > >> > > >> I am wondering, How can I use external optimization algorithms with > scikit-learn, > > >> for instance neural network > > >> networks_supervised.html#algorithms> > > >> , instead of defined algorithms ( Stochastic Gradient Descent, Adam, > or > > >> L-BFGS). > > >> > > >> Furthermore, I want to introduce a new unconstrained optimization > > >> algorithm to scikit-learn, implementation of the algorithm and > related paper > > >> can be found here . > > >> > > >> I couldn't find any explanation > > >> , about > the > > >> situation. Do you have defined procedure to make such kind of > > >> contributions? If this is not the case, How should I start to make > such a > > >> proposal/contribution ? > > >> > > >> > > >> Kind regards, > > >> > > >> G?rhan C. 
> > >> > > >> > > >> _______________________________________________ > > >> scikit-learn mailing list > > >> scikit-learn at python.org > > >> https://mail.python.org/mailman/listinfo/scikit-learn > > >> > > >> > > > _______________________________________________ > > > scikit-learn mailing list > > > scikit-learn at python.org > > > https://mail.python.org/mailman/listinfo/scikit-learn > > > > > > _______________________________________________ > > scikit-learn mailing list > > scikit-learn at python.org > > https://mail.python.org/mailman/listinfo/scikit-learn > > _______________________________________________ > scikit-learn mailing list > scikit-learn at python.org > https://mail.python.org/mailman/listinfo/scikit-learn > -------------- next part -------------- An HTML attachment was scrubbed... URL: From seralouk at hotmail.com Fri Jul 14 08:08:14 2017 From: seralouk at hotmail.com (serafeim loukas) Date: Fri, 14 Jul 2017 12:08:14 +0000 Subject: [scikit-learn] Line graph of weighted graph Message-ID: Dear scikit-learn users, I would like to know if there is any function that returns the line graph of a weighted graph. I am aware of the linegraph function (http://igraph.org/python/doc/igraph.GraphBase-class.html#linegraph) but I would like to take the weights into consideration. Thank you, Makis -------------- next part -------------- An HTML attachment was scrubbed... URL: From SebastianFlennerhag at hotmail.com Fri Jul 14 09:49:48 2017 From: SebastianFlennerhag at hotmail.com (Sebastian) Date: Fri, 14 Jul 2017 13:49:48 +0000 Subject: [scikit-learn] Inquiry third-party package affiliation Message-ID: Hi, First off, thanks for a great package! A while ago I needed a package for building general-purpose ensembles combining any set of Scikit-learn transformers and estimators. I couldn't find any so I set out to develop such an extension and recently released the result as ML-Ensemble, http://mlens.readthedocs.io/en/latest/. 
I am contacting you to ask if you would consider adding this library to your reference list of related packages? It would be hugely appreciated to have a small mention on your site as an ensemble wrapper around Scikit-learn. A bit about ML-Ensemble: It is written in Python following the Scikit-learn API; it uses joblib with memmapping to achieve scalable parallelization, and any ensemble estimator can pass as a Scikit-learn estimator. The library is unit tested on Linux, Mac and Windows for Python 2.7, 3.5 and 3.6, and has been downloaded about four thousand times a month since launch. All the best, Sebastian Flennerhag -------------- next part -------------- An HTML attachment was scrubbed... URL: From jmschreiber91 at gmail.com Sat Jul 15 11:16:18 2017 From: jmschreiber91 at gmail.com (Jacob Schreiber) Date: Sat, 15 Jul 2017 08:16:18 -0700 Subject: [scikit-learn] Agglomerative clustering problem In-Reply-To: References: Message-ID: Typically when I think of limiting the number of points in a cluster I think of KD trees. I suppose that wouldn't work? On Tue, Jul 11, 2017 at 11:22 AM, Ariani A wrote: > Dear Uri, > Thanks. I just have a pairwise distance matrix and I want to implement it > so that each cluster has at least 40 data points. (in Agglomerative). > Does it work? > Thanks, > -Ariani > > On Tue, Jul 11, 2017 at 1:54 PM, Uri Goren wrote: >> Take a look at scipy's fcluster function. >> If M is a matrix of all of your feature vectors, this code snippet should >> work.
>> >> You need to figure out what metric and algorithm work for you >> >> from sklearn.metrics import pairwise_distances >> from scipy.cluster import hierarchy >> from scipy.spatial.distance import squareform >> X = pairwise_distances(M, metric=metric) >> # linkage expects a condensed distance matrix, hence squareform >> Z = hierarchy.linkage(squareform(X), algo) >> C = hierarchy.fcluster(Z, threshold, criterion="distance") >> >> Best, >> Uri Goren >> >> On Tue, Jul 11, 2017 at 7:42 PM, Ariani A wrote: >> >>> Hi all, >>> I want to perform agglomerative clustering, but I have no idea of the number >>> of clusters beforehand. But I want every cluster to have at least 40 >>> data points in it. How can I apply this to sklearn.agglomerative >>> clustering? >>> Should I use a dendrogram and cut it somehow? I have no idea how to relate >>> the dendrogram to this and how to cut it. Any help will be appreciated! >>> I have to use agglomerative clustering! >>> Thanks, >>> -Ariani >>> >>> _______________________________________________ >>> scikit-learn mailing list >>> scikit-learn at python.org >>> https://mail.python.org/mailman/listinfo/scikit-learn >>> >>> >> >> >> -- >> >> >> *Uri Goren,Software innovator* >> >> *Phone: +972-507-649-650* >> >> *EMail: uri at goren4u.com * >> *Linkedin: il.linkedin.com/in/ugoren/ * >> >> _______________________________________________ >> scikit-learn mailing list >> scikit-learn at python.org >> https://mail.python.org/mailman/listinfo/scikit-learn >> >> > > _______________________________________________ > scikit-learn mailing list > scikit-learn at python.org > https://mail.python.org/mailman/listinfo/scikit-learn > > -------------- next part -------------- An HTML attachment was scrubbed... URL: From olivier.grisel at ensta.org Mon Jul 17 08:49:51 2017 From: olivier.grisel at ensta.org (Olivier Grisel) Date: Mon, 17 Jul 2017 14:49:51 +0200 Subject: [scikit-learn] scikit-learn 0.19b2 is available for testing Message-ID: The new release is coming and we are seeking feedback from beta testers!
pip install scikit-learn==0.19b2 conda-forge packages should follow in the coming hours / days. Note that many models have changed behaviors and some things have been deprecated, see the full changelog at: http://scikit-learn.org/dev/whats_new.html#version-0-19 As usual please report any regression or other bugs as an issue on github. Thanks to anyone who contributed to the release! -- Olivier http://twitter.com/ogrisel - http://github.com/ogrisel From stuart at stuartreynolds.net Mon Jul 17 12:41:37 2017 From: stuart at stuartreynolds.net (Stuart Reynolds) Date: Mon, 17 Jul 2017 09:41:37 -0700 Subject: [scikit-learn] Max f1 score for soft classifier? Message-ID: Does scikit have a function to find the maximum f1 score (and decision threshold) for a (soft) classifier? - Stuart From gael.varoquaux at normalesup.org Mon Jul 17 15:37:13 2017 From: gael.varoquaux at normalesup.org (Gael Varoquaux) Date: Mon, 17 Jul 2017 21:37:13 +0200 Subject: [scikit-learn] scikit-learn 0.19b2 is available for testing In-Reply-To: References: Message-ID: <20170717193713.GD1845013@phare.normalesup.org> Great job! This will be a great release, with a lot of new features and improvements G On Mon, Jul 17, 2017 at 02:49:51PM +0200, Olivier Grisel wrote: > The new release is coming and we are seeking feedback from beta testers! > pip install scikit-learn==0.19b2 > conda-forge packages should follow in the coming hours / days. > Note that many models have changed behaviors and some things have been > deprecated, see the full changelog at: > http://scikit-learn.org/dev/whats_new.html#version-0-19 > As usual please report any regression or other bugs as an issue on github. > Thanks to anyone who contributed to the release! 
-- Gael Varoquaux Researcher, INRIA Parietal NeuroSpin/CEA Saclay , Bat 145, 91191 Gif-sur-Yvette France Phone: ++ 33-1-69-08-79-68 http://gael-varoquaux.info http://twitter.com/GaelVaroquaux From alexandre.gramfort at telecom-paristech.fr Mon Jul 17 16:08:18 2017 From: alexandre.gramfort at telecom-paristech.fr (Alexandre Gramfort) Date: Mon, 17 Jul 2017 22:08:18 +0200 Subject: [scikit-learn] scikit-learn 0.19b2 is available for testing In-Reply-To: <20170717193713.GD1845013@phare.normalesup.org> References: <20170717193713.GD1845013@phare.normalesup.org> Message-ID: great team work as usual ! congrats everyone Alex From stuart at stuartreynolds.net Mon Jul 17 16:12:53 2017 From: stuart at stuartreynolds.net (Stuart Reynolds) Date: Mon, 17 Jul 2017 13:12:53 -0700 Subject: [scikit-learn] Max f1 score for soft classifier? In-Reply-To: References: Message-ID: And... with that in mind -- are there methods that explicitly try to optimize the f1 score? On Mon, Jul 17, 2017 at 9:41 AM, Stuart Reynolds wrote: > Does scikit have a function to find the maximum f1 score (and decision > threshold) for a (soft) classifier? > > - Stuart From ashimb9 at gmail.com Mon Jul 17 16:14:49 2017 From: ashimb9 at gmail.com (Ashim Bhattarai) Date: Mon, 17 Jul 2017 15:14:49 -0500 Subject: [scikit-learn] PR review request Message-ID: Hi -- I was wondering if somebody could review the pull request at https://github.com/scikit-learn/scikit-learn/pull/9348 in which I have worked on adding euclidean distance calculation in the presence of NaNs. Thanks in advance. Best, Ashim -------------- next part -------------- An HTML attachment was scrubbed... 
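[For context on the PR above: a minimal NumPy sketch of one common convention for a NaN-aware euclidean distance — skip any coordinate where either vector has a NaN, then rescale by the fraction of coordinates kept. This is only an illustration; the implementation in the PR itself may differ.]

```python
import numpy as np

def nan_euclidean(u, v):
    """Euclidean distance ignoring coordinates where either input is NaN.

    The squared distance over the usable coordinates is scaled up by
    n_total / n_usable so vectors with many NaNs remain comparable.
    """
    u = np.asarray(u, dtype=float)
    v = np.asarray(v, dtype=float)
    mask = ~(np.isnan(u) | np.isnan(v))   # coordinates present in both
    if not mask.any():
        return np.nan                     # nothing to compare
    sq = np.sum((u[mask] - v[mask]) ** 2)
    return np.sqrt(u.size / mask.sum() * sq)

print(nan_euclidean([0.0, np.nan, 4.0], [3.0, 1.0, 0.0]))  # sqrt(3/2 * 25) ~ 6.124
```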
URL: From bertrand.thirion at inria.fr Mon Jul 17 16:15:20 2017 From: bertrand.thirion at inria.fr (bthirion) Date: Mon, 17 Jul 2017 22:15:20 +0200 Subject: [scikit-learn] scikit-learn 0.19b2 is available for testing In-Reply-To: References: <20170717193713.GD1845013@phare.normalesup.org> Message-ID: <0b442988-2037-b970-2c73-b9f09d77f221@inria.fr> Great work indeed ! Thx, Bertrand On 17/07/2017 22:08, Alexandre Gramfort wrote: > great team work as usual ! > > congrats everyone > > Alex > _______________________________________________ > scikit-learn mailing list > scikit-learn at python.org > https://mail.python.org/mailman/listinfo/scikit-learn From se.raschka at gmail.com Mon Jul 17 16:19:15 2017 From: se.raschka at gmail.com (Sebastian Raschka) Date: Mon, 17 Jul 2017 16:19:15 -0400 Subject: [scikit-learn] Max f1 score for soft classifier? In-Reply-To: References: Message-ID: <3B9ED23F-C2A5-42D2-AFFA-96F32256A69B@gmail.com> >> Does scikit have a function to find the maximum f1 score (and decision >> threshold) for a (soft) classifier? Hm, I don't think so. F1-score is typically used as evaluation metric; hence, it's something optimized via hyperparameter tuning. There's an interesting publication though, where the authors modified the F1 score so that it's differentiable and can be used as a cost function for optimization/training: Maximum F1-Score Discriminative Training Criterion for Automatic Mispronunciation Detection: http://ieeexplore.ieee.org/stamp/stamp.jsp?arnumber=7055841 Best, Sebastian > On Jul 17, 2017, at 4:12 PM, Stuart Reynolds wrote: > > And... with that in mind -- are there methods that explicitly try to > optimize the f1 score? > > On Mon, Jul 17, 2017 at 9:41 AM, Stuart Reynolds > wrote: >> Does scikit have a function to find the maximum f1 score (and decision >> threshold) for a (soft) classifier? 
>> >> - Stuart > _______________________________________________ > scikit-learn mailing list > scikit-learn at python.org > https://mail.python.org/mailman/listinfo/scikit-learn From joel.nothman at gmail.com Mon Jul 17 19:58:37 2017 From: joel.nothman at gmail.com (Joel Nothman) Date: Tue, 18 Jul 2017 09:58:37 +1000 Subject: [scikit-learn] Max f1 score for soft classifier? In-Reply-To: References: Message-ID: I suppose it would not be hard to build a wrapper that does this, if all we are doing is choosing a threshold. Although a global maximum is not guaranteed without some kind of interpolation over the precision-recall curve. On 18 July 2017 at 02:41, Stuart Reynolds wrote: > Does scikit have a function to find the maximum f1 score (and decision > threshold) for a (soft) classifier? > > - Stuart > _______________________________________________ > scikit-learn mailing list > scikit-learn at python.org > https://mail.python.org/mailman/listinfo/scikit-learn > -------------- next part -------------- An HTML attachment was scrubbed... URL: From stuart at stuartreynolds.net Mon Jul 17 20:06:30 2017 From: stuart at stuartreynolds.net (Stuart Reynolds) Date: Tue, 18 Jul 2017 00:06:30 +0000 Subject: [scikit-learn] Max f1 score for soft classifier? In-Reply-To: References: Message-ID: That was also my thinking. Similarly it's also useful to try and choose a threshold that achieves some tpr or fpr, so that methods can be approximately compared to published results. It's not obvious what to do though when an increment in the threshold results in several changes in classification. On Mon, Jul 17, 2017 at 5:00 PM Joel Nothman wrote: > I suppose it would not be hard to build a wrapper that does this, if all > we are doing is choosing a threshold. Although a global maximum is not > guaranteed without some kind of interpolation over the precision-recall > curve. 
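[The threshold-choosing wrapper Joel describes can be sketched directly from the precision-recall curve. This is an illustration, not an existing scikit-learn API; it only searches the thresholds that precision_recall_curve evaluates.]

```python
import numpy as np
from sklearn.metrics import precision_recall_curve

def best_f1_threshold(y_true, y_score):
    """Return (threshold, f1) maximizing F1 over the candidate thresholds."""
    precision, recall, thresholds = precision_recall_curve(y_true, y_score)
    # The final precision/recall pair (1., 0.) has no threshold; drop it.
    p, r = precision[:-1], recall[:-1]
    f1 = 2 * p * r / np.maximum(p + r, 1e-12)  # guard against 0/0
    best = np.argmax(f1)
    return thresholds[best], f1[best]

y_true = np.array([0, 0, 1, 1, 1])
y_score = np.array([0.1, 0.6, 0.35, 0.8, 0.9])
thr, f1 = best_f1_threshold(y_true, y_score)
print(thr, f1)  # the F1-optimal decision threshold and its F1 score
```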
> > On 18 July 2017 at 02:41, Stuart Reynolds > wrote: > >> Does scikit have a function to find the maximum f1 score (and decision >> threshold) for a (soft) classifier? >> >> - Stuart >> > _______________________________________________ >> scikit-learn mailing list >> scikit-learn at python.org >> https://mail.python.org/mailman/listinfo/scikit-learn >> > _______________________________________________ > scikit-learn mailing list > scikit-learn at python.org > https://mail.python.org/mailman/listinfo/scikit-learn > -------------- next part -------------- An HTML attachment was scrubbed... URL: From t3kcit at gmail.com Tue Jul 18 12:49:42 2017 From: t3kcit at gmail.com (Andreas Mueller) Date: Tue, 18 Jul 2017 12:49:42 -0400 Subject: [scikit-learn] Max f1 score for soft classifier? In-Reply-To: References: Message-ID: <3ab6ae32-8278-62a7-7987-f8308a4805ef@gmail.com> Feature request for a slightly more general solution here: https://github.com/scikit-learn/scikit-learn/issues/8614 On 07/17/2017 04:12 PM, Stuart Reynolds wrote: > And... with that in mind -- are there methods that explicitly try to > optimize the f1 score? > > On Mon, Jul 17, 2017 at 9:41 AM, Stuart Reynolds > wrote: >> Does scikit have a function to find the maximum f1 score (and decision >> threshold) for a (soft) classifier? >> >> - Stuart > _______________________________________________ > scikit-learn mailing list > scikit-learn at python.org > https://mail.python.org/mailman/listinfo/scikit-learn From ruchika.work at gmail.com Thu Jul 20 11:23:12 2017 From: ruchika.work at gmail.com (Ruchika Nayyar) Date: Thu, 20 Jul 2017 11:23:12 -0400 Subject: [scikit-learn] merging the predicted labels with original dataframe Message-ID: Hi Scikit-learn Users, I am analyzing some proxy logs to use Machine learning to classify the events recorded as either "OBSERVED" or "BLOCKED". This is a little snippet of my code: The input file is a csv with tokenized string fields. 
************** # imports assumed by this snippet (added for completeness): import pandas as pd from scipy.sparse import hstack, csr_matrix from sklearn.feature_extraction.text import TfidfVectorizer from sklearn.model_selection import train_test_split as tts from sklearn.preprocessing import StandardScaler from sklearn.neural_network import MLPClassifier # load the file M = pd.read_csv("output100k.csv").fillna('') # define the fields to use min_df = 0.001 max_df = .7 TxtCols = ['request__tokens', 'requestClientApplication__tokens', 'destinationZoneURI__tokens','cs-categories__tokens', 'fileType__tokens', 'requestMethod__tokens','tcp_status1', 'app','tcp_status2','dhost' ] NumCols = ['rt', 'out', 'in', 'time-taken','rt_length', 'dt_length'] # vectorize the fields TfidfModels = [TfidfVectorizer(min_df = min_df, max_df=max_df).fit(M[t]) for t in TxtCols] # define the columns of sparse matrix X = hstack([m.transform(M[n].fillna('')) for m,n in zip(TfidfModels, TxtCols)] + \ [csr_matrix(pd.to_numeric(M[n]).fillna(-1).values).T for n in NumCols]) # target variable Y = M.act.values ## Define train/test parts and scale them X_train, X_test, y_train, y_test = tts(X, Y, test_size=0.2) scaler = StandardScaler(with_mean=False, with_std=True) scaler.fit(X_train) X_train=scaler.transform(X_train) X_test=scaler.transform(X_test) # define the model and train clf = MLPClassifier(activation='logistic', solver='lbfgs').fit(X_train,y_train) # use the model to predict on X_test and convert into a data frame df=pd.DataFrame(clf.predict(X_test)) ** 199845 OBSERVED 199846 OBSERVED [199847 rows x 1 columns]> ** Now at the end I have a DataFrame with 20K entries with just one column "Label", how do I connect it to the main dataframe M, since I want to do some investigations on this outcome? Any help? Thanks, Ruchika -------------- next part -------------- An HTML attachment was scrubbed...
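[A sketch of one way to make the alignment trivial from the start — since X above is a scipy sparse matrix with no pandas index of its own, split the row positions together with the data, then write predictions back by position. Toy data and a dummy classifier stand in for the real ones:]

```python
import numpy as np
import pandas as pd
from sklearn.dummy import DummyClassifier
from sklearn.model_selection import train_test_split

# toy stand-in for the dataframe M in the post above
M = pd.DataFrame({"x": [0.0, 1.0, 2.0, 3.0],
                  "act": ["OBSERVED", "BLOCKED", "OBSERVED", "BLOCKED"]})
X = M[["x"]].values
y = M["act"].values

# split the row positions alongside the data so test rows stay traceable
idx = np.arange(len(M))
X_train, X_test, y_train, y_test, idx_train, idx_test = train_test_split(
    X, y, idx, test_size=0.5, random_state=0)

clf = DummyClassifier(strategy="most_frequent").fit(X_train, y_train)

# write predictions back onto the exact rows of M they came from
M.loc[idx_test, "prediction"] = clf.predict(X_test)
print(M)  # test rows carry a prediction, train rows are NaN
```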
URL: From julio at esbet.es Thu Jul 20 11:37:58 2017 From: julio at esbet.es (Julio Antonio Soto de Vicente) Date: Thu, 20 Jul 2017 17:37:58 +0200 Subject: [scikit-learn] merging the predicted labels with original dataframe In-Reply-To: References: Message-ID: <1FE71505-E709-4D57-98A5-4877CE0168D5@esbet.es> Hi Ruchika, The predictions outputted by all sklearn models are just 1-d Numpy arrays, so it should be trivial to add it to any existing DataFrame: your_df["prediction"] = clf.predict(X_test) -- Julio > El 20 jul 2017, a las 17:23, Ruchika Nayyar escribi?: > > Hi Scikit-learn Users, > > I am analyzing some proxy logs to use Machine learning to classify the events recorded as either "OBSERVED" or "BLOCKED". This is a little snippet of my code: > The input file is a csv with tokenized string fields. > > ************** > # load the file > M = pd.read_csv("output100k.csv").fillna('') > > # define the fields to use > min_df = 0.001 > max_df = .7 > TxtCols = ['request__tokens', 'requestClientApplication__tokens', > 'destinationZoneURI__tokens','cs-categories__tokens', > 'fileType__tokens', 'requestMethod__tokens','tcp_status1', > 'app','tcp_status2','dhost' > ] > NumCols = ['rt', 'out', 'in', 'time-taken','rt_length', 'dt_length'] > > # vectorize the fields > TfidfModels = [TfidfVectorizer(min_df = min_df, max_df=max_df).fit(M[t]) for t in TxtCols] > > # define the columns of sparse matrix > X = hstack([m.transform(M[n].fillna('')) for m,n in zip(TfidfModels, TxtCols)] + \ > [csr_matrix(pd.to_numeric(M[n]).fillna(-1).values).T for n in NumCols]) > > # target variable > Y = M.act.values > > ## Define train/test parts and scale them > X_train, X_test, y_train, y_test = tts(X, Y, test_size=0.2) > scaler = StandardScaler(with_mean=False, with_std=True) > scaler.fit(X_train) > X_train=scaler.transform(X_train) > X_test=scaler.transform(X_test) > > > # define the model and train > clf = MLPClassifier(activation='logistic', solver='lbfgs').fit(X_train,y_train) > # use the 
model to predict on X_test and convert into a data frame > df=pd.DataFrame(clf.predict(X_test)) > > ** > 199845 OBSERVED > 199846 OBSERVED > [199847 rows x 1 columns]> > ** > Now at the end I have a DataFrame with 20K entries with just one column > "Label", how do I connect it to the main dataframe M, since I want to do some > investigations on this outcome? > > Any help? > > Thanks, > Ruchika > > _______________________________________________ > scikit-learn mailing list > scikit-learn at python.org > https://mail.python.org/mailman/listinfo/scikit-learn > > > _______________________________________________ > scikit-learn mailing list > scikit-learn at python.org > https://mail.python.org/mailman/listinfo/scikit-learn > > -------------- next part -------------- An HTML attachment was scrubbed... URL: From ruchika.work at gmail.com Thu Jul 20 12:04:00 2017 From: ruchika.work at gmail.com (Ruchika Nayyar) Date: Thu, 20 Jul 2017 12:04:00 -0400 Subject: [scikit-learn] merging the predicted labels with original dataframe In-Reply-To: <1FE71505-E709-4D57-98A5-4877CE0168D5@esbet.es> References: <1FE71505-E709-4D57-98A5-4877CE0168D5@esbet.es> Message-ID: The original dataset contains both training/testing, I have predictions only on the testing dataset. If I do what you suggest, will it preserve indexing? Thanks, Ruchika On Thu, Jul 20, 2017 at 11:37 AM, Julio Antonio Soto de Vicente < julio at esbet.es> wrote: > Hi Ruchika, > > The predictions outputted by all sklearn models are just 1-d Numpy arrays, > so it should be trivial to add it to any existing DataFrame: > > your_df["prediction"] = clf.predict(X_test) > > -- > Julio > > El 20 jul 2017, a las 17:23, Ruchika Nayyar > escribió: > > Hi Scikit-learn Users, > > I am analyzing some proxy logs to use Machine learning to classify the > events recorded as either "OBSERVED" or "BLOCKED". This is a little snippet > of my code: > The input file is a csv with tokenized string fields.
> > ************** > # load the file > M = pd.read_csv("output100k.csv").fillna('') > > # define the fields to use > min_df = 0.001 > max_df = .7 > TxtCols = ['request__tokens', 'requestClientApplication__tokens', > 'destinationZoneURI__tokens','cs-categories__tokens', > 'fileType__tokens', 'requestMethod__tokens','tcp_status1', > 'app','tcp_status2','dhost' > ] > NumCols = ['rt', 'out', 'in', 'time-taken','rt_length', 'dt_length'] > > # vectorize the fields > TfidfModels = [TfidfVectorizer(min_df = min_df, max_df=max_df).fit(M[t]) > for t in TxtCols] > > # define the columns of sparse matrix > X = hstack([m.transform(M[n].fillna('')) for m,n in zip(TfidfModels, > TxtCols)] + \ > [csr_matrix(pd.to_numeric(M[n]).fillna(-1).values).T for n > in NumCols]) > > # target variable > Y = M.act.values > > ## Define train/test parts and scale them > X_train, X_test, y_train, y_test = tts(X, Y, test_size=0.2) > scaler = StandardScaler(with_mean=False, with_std=True) > scaler.fit(X_train) > X_train=scaler.transform(X_train) > X_test=scaler.transform(X_test) > > > # define the model and train > clf = MLPClassifier(activation='logistic', solver='lbfgs').fit(X_train,y_ > train) > # use the model to predict on X_test and convert into a data frame > df=pd.DataFrame(clf.predict(X_test)) > > ** > > 199845 OBSERVED > 199846 OBSERVED > > [199847 rows x 1 columns]> > > ** > > Now at the end I have a DataFrame with 20K entries with just one column > "Label", how di I connect it to the main dataframe M, since I want to do > some > investigations on this outcome ? > > Any help? 
> > Thanks, > Ruchika > > _______________________________________________ > scikit-learn mailing list > scikit-learn at python.org > https://mail.python.org/mailman/listinfo/scikit-learn > > > _______________________________________________ > scikit-learn mailing list > scikit-learn at python.org > https://mail.python.org/mailman/listinfo/scikit-learn > > -------------- next part -------------- An HTML attachment was scrubbed... URL: From tom.augspurger88 at gmail.com Thu Jul 20 12:19:47 2017 From: tom.augspurger88 at gmail.com (Tom Augspurger) Date: Thu, 20 Jul 2017 11:19:47 -0500 Subject: [scikit-learn] merging the predicted labels with original dataframe In-Reply-To: References: <1FE71505-E709-4D57-98A5-4877CE0168D5@esbet.es> Message-ID: Something like your_df['prediction'] = pd.Series(clf.predict(X_test), index=X_test.index) should handle all the alignment. On Thu, Jul 20, 2017 at 11:04 AM, Ruchika Nayyar wrote: > The original dataset contains both trainng/testing, I have predictions > only on testing dataset. If I do what you suggest > will it preserve indexing? > > Thanks, > Ruchika > > > On Thu, Jul 20, 2017 at 11:37 AM, Julio Antonio Soto de Vicente < > julio at esbet.es> wrote: > >> Hi Ruchika, >> >> The predictions outputted by all sklearn models are just 1-d Numpy >> arrays, so it should be trivial to add it to any existing DataFrame: >> >> your_df["prediction"] = clf.predict(X_test) >> >> -- >> Julio >> >> El 20 jul 2017, a las 17:23, Ruchika Nayyar >> escribi?: >> >> Hi Scikit-learn Users, >> >> I am analyzing some proxy logs to use Machine learning to classify the >> events recorded as either "OBSERVED" or "BLOCKED". This is a little snippet >> of my code: >> The input file is a csv with tokenized string fields. 
>> >> ************** >> # load the file >> M = pd.read_csv("output100k.csv").fillna('') >> >> # define the fields to use >> min_df = 0.001 >> max_df = .7 >> TxtCols = ['request__tokens', 'requestClientApplication__tokens', >> 'destinationZoneURI__tokens','cs-categories__tokens', >> 'fileType__tokens', 'requestMethod__tokens','tcp_status1', >> 'app','tcp_status2','dhost' >> ] >> NumCols = ['rt', 'out', 'in', 'time-taken','rt_length', 'dt_length'] >> >> # vectorize the fields >> TfidfModels = [TfidfVectorizer(min_df = min_df, max_df=max_df).fit(M[t]) >> for t in TxtCols] >> >> # define the columns of sparse matrix >> X = hstack([m.transform(M[n].fillna('')) for m,n in zip(TfidfModels, >> TxtCols)] + \ >> [csr_matrix(pd.to_numeric(M[n]).fillna(-1).values).T for >> n in NumCols]) >> >> # target variable >> Y = M.act.values >> >> ## Define train/test parts and scale them >> X_train, X_test, y_train, y_test = tts(X, Y, test_size=0.2) >> scaler = StandardScaler(with_mean=False, with_std=True) >> scaler.fit(X_train) >> X_train=scaler.transform(X_train) >> X_test=scaler.transform(X_test) >> >> >> # define the model and train >> clf = MLPClassifier(activation='logistic', solver='lbfgs').fit(X_train,y_ >> train) >> # use the model to predict on X_test and convert into a data frame >> df=pd.DataFrame(clf.predict(X_test)) >> >> ** >> >> 199845 OBSERVED >> 199846 OBSERVED >> >> [199847 rows x 1 columns]> >> >> ** >> >> Now at the end I have a DataFrame with 20K entries with just one column >> "Label", how di I connect it to the main dataframe M, since I want to do >> some >> investigations on this outcome ? >> >> Any help? 
>> >> Thanks, >> Ruchika >> >> _______________________________________________ >> scikit-learn mailing list >> scikit-learn at python.org >> https://mail.python.org/mailman/listinfo/scikit-learn >> >> >> _______________________________________________ >> scikit-learn mailing list >> scikit-learn at python.org >> https://mail.python.org/mailman/listinfo/scikit-learn >> >> > > _______________________________________________ > scikit-learn mailing list > scikit-learn at python.org > https://mail.python.org/mailman/listinfo/scikit-learn > > -------------- next part -------------- An HTML attachment was scrubbed... URL: From ruchika.work at gmail.com Thu Jul 20 12:30:24 2017 From: ruchika.work at gmail.com (Ruchika Nayyar) Date: Thu, 20 Jul 2017 12:30:24 -0400 Subject: [scikit-learn] merging the predicted labels with original dataframe In-Reply-To: References: <1FE71505-E709-4D57-98A5-4877CE0168D5@esbet.es> Message-ID: Hi Tom, This was also the first thing that came to my mind, but I thought since your_df is X_train+X_test it may complain that values do not match with the given indices. Thanks, Ruchika On Thu, Jul 20, 2017 at 12:19 PM, Tom Augspurger wrote: > Something like > > your_df['prediction'] = pd.Series(clf.predict(X_test), > index=X_test.index) > > should handle all the alignment. > > On Thu, Jul 20, 2017 at 11:04 AM, Ruchika Nayyar > wrote: > >> The original dataset contains both training/testing, I have predictions >> only on testing dataset. If I do what you suggest >> will it preserve indexing?
>> >> Thanks, >> Ruchika >> >> >> On Thu, Jul 20, 2017 at 11:37 AM, Julio Antonio Soto de Vicente < >> julio at esbet.es> wrote: >> >>> Hi Ruchika, >>> >>> The predictions outputted by all sklearn models are just 1-d Numpy >>> arrays, so it should be trivial to add it to any existing DataFrame: >>> >>> your_df["prediction"] = clf.predict(X_test) >>> >>> -- >>> Julio >>> >>> El 20 jul 2017, a las 17:23, Ruchika Nayyar >>> escribi?: >>> >>> Hi Scikit-learn Users, >>> >>> I am analyzing some proxy logs to use Machine learning to classify the >>> events recorded as either "OBSERVED" or "BLOCKED". This is a little snippet >>> of my code: >>> The input file is a csv with tokenized string fields. >>> >>> ************** >>> # load the file >>> M = pd.read_csv("output100k.csv").fillna('') >>> >>> # define the fields to use >>> min_df = 0.001 >>> max_df = .7 >>> TxtCols = ['request__tokens', 'requestClientApplication__tokens', >>> 'destinationZoneURI__tokens','cs-categories__tokens', >>> 'fileType__tokens', 'requestMethod__tokens','tcp_status1', >>> 'app','tcp_status2','dhost' >>> ] >>> NumCols = ['rt', 'out', 'in', 'time-taken','rt_length', 'dt_length'] >>> >>> # vectorize the fields >>> TfidfModels = [TfidfVectorizer(min_df = min_df, max_df=max_df).fit(M[t]) >>> for t in TxtCols] >>> >>> # define the columns of sparse matrix >>> X = hstack([m.transform(M[n].fillna('')) for m,n in zip(TfidfModels, >>> TxtCols)] + \ >>> [csr_matrix(pd.to_numeric(M[n]).fillna(-1).values).T for >>> n in NumCols]) >>> >>> # target variable >>> Y = M.act.values >>> >>> ## Define train/test parts and scale them >>> X_train, X_test, y_train, y_test = tts(X, Y, test_size=0.2) >>> scaler = StandardScaler(with_mean=False, with_std=True) >>> scaler.fit(X_train) >>> X_train=scaler.transform(X_train) >>> X_test=scaler.transform(X_test) >>> >>> >>> # define the model and train >>> clf = MLPClassifier(activation='logistic', >>> solver='lbfgs').fit(X_train,y_train) >>> # use the model to predict on X_test 
and convert into a data frame >>> df=pd.DataFrame(clf.predict(X_test)) >>> >>> ** >>> >>> 199845 OBSERVED >>> 199846 OBSERVED >>> >>> [199847 rows x 1 columns]> >>> >>> ** >>> >>> Now at the end I have a DataFrame with 20K entries with just one column >>> "Label", how di I connect it to the main dataframe M, since I want to do >>> some >>> investigations on this outcome ? >>> >>> Any help? >>> >>> Thanks, >>> Ruchika >>> >>> _______________________________________________ >>> scikit-learn mailing list >>> scikit-learn at python.org >>> https://mail.python.org/mailman/listinfo/scikit-learn >>> >>> >>> _______________________________________________ >>> scikit-learn mailing list >>> scikit-learn at python.org >>> https://mail.python.org/mailman/listinfo/scikit-learn >>> >>> >> >> _______________________________________________ >> scikit-learn mailing list >> scikit-learn at python.org >> https://mail.python.org/mailman/listinfo/scikit-learn >> >> > > _______________________________________________ > scikit-learn mailing list > scikit-learn at python.org > https://mail.python.org/mailman/listinfo/scikit-learn > > -------------- next part -------------- An HTML attachment was scrubbed... URL: From raga.markely at gmail.com Fri Jul 21 11:11:20 2017 From: raga.markely at gmail.com (Raga Markely) Date: Fri, 21 Jul 2017 11:11:20 -0400 Subject: [scikit-learn] Classifiers for dataset with categorical features Message-ID: Hello, I am wondering if there are some classifiers that perform better for datasets with categorical features (converted into sparse input matrix with pd.get_dummies())? The data for the categorical features are nominal (order doesn't matter, e.g. country, occupation, etc). If you could provide me some references (papers, books, website, etc), that would be great. Thank you very much! Raga -------------- next part -------------- An HTML attachment was scrubbed... 
URL: From jmschreiber91 at gmail.com Fri Jul 21 14:27:50 2017 From: jmschreiber91 at gmail.com (Jacob Schreiber) Date: Fri, 21 Jul 2017 11:27:50 -0700 Subject: [scikit-learn] Classifiers for dataset with categorical features In-Reply-To: References: Message-ID: Traditionally tree based methods are very good when it comes to categorical variables and can handle them appropriately. There is a current WIP PR to add this support to sklearn. I'm not exactly sure what you mean that "perform better" though. Estimators that ignore the categorical aspect of these variables and treat them as discrete will likely perform worse than those that treat them appropriately. On Fri, Jul 21, 2017 at 8:11 AM, Raga Markely wrote: > Hello, > > I am wondering if there are some classifiers that perform better for > datasets with categorical features (converted into sparse input matrix with > pd.get_dummies())? The data for the categorical features are nominal (order > doesn't matter, e.g. country, occupation, etc). > > If you could provide me some references (papers, books, website, etc), > that would be great. > > Thank you very much! > Raga > > > > _______________________________________________ > scikit-learn mailing list > scikit-learn at python.org > https://mail.python.org/mailman/listinfo/scikit-learn > > -------------- next part -------------- An HTML attachment was scrubbed... URL: From raga.markely at gmail.com Fri Jul 21 14:37:25 2017 From: raga.markely at gmail.com (Raga Markely) Date: Fri, 21 Jul 2017 14:37:25 -0400 Subject: [scikit-learn] Classifiers for dataset with categorical features In-Reply-To: References: Message-ID: Thank you, Jacob. Appreciate it. Regarding 'perform better', I was referring to better accuracy, precision, recall, F1 score, etc. Thanks, Raga On Fri, Jul 21, 2017 at 2:27 PM, Jacob Schreiber wrote: > Traditionally tree based methods are very good when it comes to > categorical variables and can handle them appropriately. There is a current > WIP PR to add this support to sklearn. I'm not exactly sure what you mean > that "perform better" though. Estimators that ignore the categorical aspect > of these variables and treat them as discrete will likely perform worse > than those that treat them appropriately. > > On Fri, Jul 21, 2017 at 8:11 AM, Raga Markely > wrote: > >> Hello, >> >> I am wondering if there are some classifiers that perform better for >> datasets with categorical features (converted into sparse input matrix with >> pd.get_dummies())?
The data for the categorical features are nominal (order >> doesn't matter, e.g. country, occupation, etc). >> >> If you could provide me some references (papers, books, website, etc), >> that would be great. >> >> Thank you very much! >> Raga >> >> >> >> _______________________________________________ >> scikit-learn mailing list >> scikit-learn at python.org >> https://mail.python.org/mailman/listinfo/scikit-learn >> >> > > _______________________________________________ > scikit-learn mailing list > scikit-learn at python.org > https://mail.python.org/mailman/listinfo/scikit-learn > > -------------- next part -------------- An HTML attachment was scrubbed... URL: From se.raschka at gmail.com Fri Jul 21 14:52:11 2017 From: se.raschka at gmail.com (Sebastian Raschka) Date: Fri, 21 Jul 2017 14:52:11 -0400 Subject: [scikit-learn] Classifiers for dataset with categorical features In-Reply-To: References: Message-ID: <8434D3C4-503B-4D6B-A4FB-ABB684B7DD71@gmail.com> Just to throw some additional ideas in here. Based on a conversation with a colleague some time ago, I think learning classifier systems (https://en.wikipedia.org/wiki/Learning_classifier_system) are particularly useful when working with large, sparse binary vectors (like from a one-hot encoding). I am really not into LCS's, and only know the basics (read through the first chapters of the Intro to Learning Classifier Systems draft; the print version will be out later this year). Also, I saw an interesting poster on a Set Covering Machine algorithm once, which they benchmarked against SVMs, random forests and the like for categorical (genomics data). Looked promising. Best, Sebastian > On Jul 21, 2017, at 2:37 PM, Raga Markely wrote: > > Thank you, Jacob. Appreciate it. > > Regarding 'perform better', I was referring to better accuracy, precision, recall, F1 score, etc. 
> > Thanks, > Raga > > On Fri, Jul 21, 2017 at 2:27 PM, Jacob Schreiber wrote: > Traditionally tree based methods are very good when it comes to categorical variables and can handle them appropriately. There is a current WIP PR to add this support to sklearn. I'm not exactly sure what you mean that "perform better" though. Estimators that ignore the categorical aspect of these variables and treat them as discrete will likely perform worse than those that treat them appropriately. > > On Fri, Jul 21, 2017 at 8:11 AM, Raga Markely wrote: > Hello, > > I am wondering if there are some classifiers that perform better for datasets with categorical features (converted into sparse input matrix with pd.get_dummies())? The data for the categorical features are nominal (order doesn't matter, e.g. country, occupation, etc). > > If you could provide me some references (papers, books, website, etc), that would be great. > > Thank you very much! > Raga > > > > _______________________________________________ > scikit-learn mailing list > scikit-learn at python.org > https://mail.python.org/mailman/listinfo/scikit-learn > > > > _______________________________________________ > scikit-learn mailing list > scikit-learn at python.org > https://mail.python.org/mailman/listinfo/scikit-learn > > > _______________________________________________ > scikit-learn mailing list > scikit-learn at python.org > https://mail.python.org/mailman/listinfo/scikit-learn From se.raschka at gmail.com Fri Jul 21 14:57:57 2017 From: se.raschka at gmail.com (Sebastian Raschka) Date: Fri, 21 Jul 2017 14:57:57 -0400 Subject: [scikit-learn] Classifiers for dataset with categorical features In-Reply-To: <8434D3C4-503B-4D6B-A4FB-ABB684B7DD71@gmail.com> References: <8434D3C4-503B-4D6B-A4FB-ABB684B7DD71@gmail.com> Message-ID: > Traditionally tree based methods are very good when it comes to categorical variables and can handle them appropriately. There is a current WIP PR to add this support to sklearn. 
I think it's also important to distinguish between nominal and ordinal; it can make a huge difference imho. I.e., treating ordinal variables like continuous variables probably makes more sense than one-hot encoding them.

Looking forward to the PR :)

> On Jul 21, 2017, at 2:52 PM, Sebastian Raschka wrote:
>
> Just to throw some additional ideas in here. Based on a conversation with a colleague some time ago, I think learning classifier systems (https://en.wikipedia.org/wiki/Learning_classifier_system) are particularly useful when working with large, sparse binary vectors (like from a one-hot encoding). I am really not into LCS's, and only know the basics (read through the first chapters of the Intro to Learning Classifier Systems draft; the print version will be out later this year).
> Also, I saw an interesting poster on a Set Covering Machine algorithm once, which they benchmarked against SVMs, random forests and the like for categorical (genomics data). Looked promising.
>
> Best,
> Sebastian
>
>
>> On Jul 21, 2017, at 2:37 PM, Raga Markely wrote:
>>
>> Thank you, Jacob. Appreciate it.
>>
>> Regarding 'perform better', I was referring to better accuracy, precision, recall, F1 score, etc.
>>
>> Thanks,
>> Raga
>>
>> On Fri, Jul 21, 2017 at 2:27 PM, Jacob Schreiber wrote:
>> Traditionally tree based methods are very good when it comes to categorical variables and can handle them appropriately. There is a current WIP PR to add this support to sklearn. I'm not exactly sure what you mean that "perform better" though. Estimators that ignore the categorical aspect of these variables and treat them as discrete will likely perform worse than those that treat them appropriately.
>>
>> On Fri, Jul 21, 2017 at 8:11 AM, Raga Markely wrote:
>> Hello,
>>
>> I am wondering if there are some classifiers that perform better for datasets with categorical features (converted into sparse input matrix with pd.get_dummies())?
The data for the categorical features are nominal (order doesn't matter, e.g. country, occupation, etc). >> >> If you could provide me some references (papers, books, website, etc), that would be great. >> >> Thank you very much! >> Raga >> >> >> >> _______________________________________________ >> scikit-learn mailing list >> scikit-learn at python.org >> https://mail.python.org/mailman/listinfo/scikit-learn >> >> >> >> _______________________________________________ >> scikit-learn mailing list >> scikit-learn at python.org >> https://mail.python.org/mailman/listinfo/scikit-learn >> >> >> _______________________________________________ >> scikit-learn mailing list >> scikit-learn at python.org >> https://mail.python.org/mailman/listinfo/scikit-learn > > _______________________________________________ > scikit-learn mailing list > scikit-learn at python.org > https://mail.python.org/mailman/listinfo/scikit-learn From raga.markely at gmail.com Fri Jul 21 14:59:40 2017 From: raga.markely at gmail.com (Raga Markely) Date: Fri, 21 Jul 2017 14:59:40 -0400 Subject: [scikit-learn] Classifiers for dataset with categorical features In-Reply-To: <8434D3C4-503B-4D6B-A4FB-ABB684B7DD71@gmail.com> References: <8434D3C4-503B-4D6B-A4FB-ABB684B7DD71@gmail.com> Message-ID: Sounds good, Sebastian. Thank you! Raga On Fri, Jul 21, 2017 at 2:52 PM, Sebastian Raschka wrote: > Just to throw some additional ideas in here. Based on a conversation with > a colleague some time ago, I think learning classifier systems ( > https://en.wikipedia.org/wiki/Learning_classifier_system) are > particularly useful when working with large, sparse binary vectors (like > from a one-hot encoding). I am really not into LCS's, and only know the > basics (read through the first chapters of the Intro to Learning Classifier > Systems draft; the print version will be out later this year). 
> Also, I saw an interesting poster on a Set Covering Machine algorithm > once, which they benchmarked against SVMs, random forests and the like for > categorical (genomics data). Looked promising. > > Best, > Sebastian > > > > On Jul 21, 2017, at 2:37 PM, Raga Markely > wrote: > > > > Thank you, Jacob. Appreciate it. > > > > Regarding 'perform better', I was referring to better accuracy, > precision, recall, F1 score, etc. > > > > Thanks, > > Raga > > > > On Fri, Jul 21, 2017 at 2:27 PM, Jacob Schreiber < > jmschreiber91 at gmail.com> wrote: > > Traditionally tree based methods are very good when it comes to > categorical variables and can handle them appropriately. There is a current > WIP PR to add this support to sklearn. I'm not exactly sure what you mean > that "perform better" though. Estimators that ignore the categorical aspect > of these variables and treat them as discrete will likely perform worse > than those that treat them appropriately. > > > > On Fri, Jul 21, 2017 at 8:11 AM, Raga Markely > wrote: > > Hello, > > > > I am wondering if there are some classifiers that perform better for > datasets with categorical features (converted into sparse input matrix with > pd.get_dummies())? The data for the categorical features are nominal (order > doesn't matter, e.g. country, occupation, etc). > > > > If you could provide me some references (papers, books, website, etc), > that would be great. > > > > Thank you very much! 
> > Raga > > > > > > > > _______________________________________________ > > scikit-learn mailing list > > scikit-learn at python.org > > https://mail.python.org/mailman/listinfo/scikit-learn > > > > > > > > _______________________________________________ > > scikit-learn mailing list > > scikit-learn at python.org > > https://mail.python.org/mailman/listinfo/scikit-learn > > > > > > _______________________________________________ > > scikit-learn mailing list > > scikit-learn at python.org > > https://mail.python.org/mailman/listinfo/scikit-learn > > _______________________________________________ > scikit-learn mailing list > scikit-learn at python.org > https://mail.python.org/mailman/listinfo/scikit-learn > -------------- next part -------------- An HTML attachment was scrubbed... URL: From stuart at stuartreynolds.net Fri Jul 21 15:01:47 2017 From: stuart at stuartreynolds.net (Stuart Reynolds) Date: Fri, 21 Jul 2017 12:01:47 -0700 Subject: [scikit-learn] Classifiers for dataset with categorical features In-Reply-To: <8434D3C4-503B-4D6B-A4FB-ABB684B7DD71@gmail.com> References: <8434D3C4-503B-4D6B-A4FB-ABB684B7DD71@gmail.com> Message-ID: +1 LCS and its many many variants seem very practical and adaptable. I'm not sure why they haven't gotten traction. Overshadowed by GBM & random forests? On Fri, Jul 21, 2017 at 11:52 AM, Sebastian Raschka wrote: > Just to throw some additional ideas in here. Based on a conversation with a colleague some time ago, I think learning classifier systems (https://en.wikipedia.org/wiki/Learning_classifier_system) are particularly useful when working with large, sparse binary vectors (like from a one-hot encoding). I am really not into LCS's, and only know the basics (read through the first chapters of the Intro to Learning Classifier Systems draft; the print version will be out later this year). 
> Also, I saw an interesting poster on a Set Covering Machine algorithm once, which they benchmarked against SVMs, random forests and the like for categorical (genomics data). Looked promising. > > Best, > Sebastian > > >> On Jul 21, 2017, at 2:37 PM, Raga Markely wrote: >> >> Thank you, Jacob. Appreciate it. >> >> Regarding 'perform better', I was referring to better accuracy, precision, recall, F1 score, etc. >> >> Thanks, >> Raga >> >> On Fri, Jul 21, 2017 at 2:27 PM, Jacob Schreiber wrote: >> Traditionally tree based methods are very good when it comes to categorical variables and can handle them appropriately. There is a current WIP PR to add this support to sklearn. I'm not exactly sure what you mean that "perform better" though. Estimators that ignore the categorical aspect of these variables and treat them as discrete will likely perform worse than those that treat them appropriately. >> >> On Fri, Jul 21, 2017 at 8:11 AM, Raga Markely wrote: >> Hello, >> >> I am wondering if there are some classifiers that perform better for datasets with categorical features (converted into sparse input matrix with pd.get_dummies())? The data for the categorical features are nominal (order doesn't matter, e.g. country, occupation, etc). >> >> If you could provide me some references (papers, books, website, etc), that would be great. >> >> Thank you very much! 
>> Raga
>>
>>
>>
>> _______________________________________________
>> scikit-learn mailing list
>> scikit-learn at python.org
>> https://mail.python.org/mailman/listinfo/scikit-learn
>>
>>
>>
>> _______________________________________________
>> scikit-learn mailing list
>> scikit-learn at python.org
>> https://mail.python.org/mailman/listinfo/scikit-learn
>>
>>
>> _______________________________________________
>> scikit-learn mailing list
>> scikit-learn at python.org
>> https://mail.python.org/mailman/listinfo/scikit-learn
>
> _______________________________________________
> scikit-learn mailing list
> scikit-learn at python.org
> https://mail.python.org/mailman/listinfo/scikit-learn

From se.raschka at gmail.com Fri Jul 21 19:09:03 2017
From: se.raschka at gmail.com (Sebastian Raschka)
Date: Fri, 21 Jul 2017 19:09:03 -0400
Subject: [scikit-learn] Classifiers for dataset with categorical features
In-Reply-To: References: <8434D3C4-503B-4D6B-A4FB-ABB684B7DD71@gmail.com>
Message-ID:

Maybe because they are genetic algorithms, which are -- for some reason -- not very popular in the ML field in general :P. (People in bioinformatics seem to use them a lot, though.) Also, the name "Learning Classifier Systems" is a bit weird, I must say: I remember that when Ryan introduced me to those, I was like "ah yeah, sure, I know machine learning classifiers" ;)

> On Jul 21, 2017, at 3:01 PM, Stuart Reynolds wrote:
>
> +1
> LCS and its many many variants seem very practical and adaptable. I'm
> not sure why they haven't gotten traction.
> Overshadowed by GBM & random forests?
>
>
> On Fri, Jul 21, 2017 at 11:52 AM, Sebastian Raschka
> wrote:
>> Just to throw some additional ideas in here. Based on a conversation with a colleague some time ago, I think learning classifier systems (https://en.wikipedia.org/wiki/Learning_classifier_system) are particularly useful when working with large, sparse binary vectors (like from a one-hot encoding).
I am really not into LCS's, and only know the basics (read through the first chapters of the Intro to Learning Classifier Systems draft; the print version will be out later this year). >> Also, I saw an interesting poster on a Set Covering Machine algorithm once, which they benchmarked against SVMs, random forests and the like for categorical (genomics data). Looked promising. >> >> Best, >> Sebastian >> >> >>> On Jul 21, 2017, at 2:37 PM, Raga Markely wrote: >>> >>> Thank you, Jacob. Appreciate it. >>> >>> Regarding 'perform better', I was referring to better accuracy, precision, recall, F1 score, etc. >>> >>> Thanks, >>> Raga >>> >>> On Fri, Jul 21, 2017 at 2:27 PM, Jacob Schreiber wrote: >>> Traditionally tree based methods are very good when it comes to categorical variables and can handle them appropriately. There is a current WIP PR to add this support to sklearn. I'm not exactly sure what you mean that "perform better" though. Estimators that ignore the categorical aspect of these variables and treat them as discrete will likely perform worse than those that treat them appropriately. >>> >>> On Fri, Jul 21, 2017 at 8:11 AM, Raga Markely wrote: >>> Hello, >>> >>> I am wondering if there are some classifiers that perform better for datasets with categorical features (converted into sparse input matrix with pd.get_dummies())? The data for the categorical features are nominal (order doesn't matter, e.g. country, occupation, etc). >>> >>> If you could provide me some references (papers, books, website, etc), that would be great. >>> >>> Thank you very much! 
>>> Raga >>> >>> >>> >>> _______________________________________________ >>> scikit-learn mailing list >>> scikit-learn at python.org >>> https://mail.python.org/mailman/listinfo/scikit-learn >>> >>> >>> >>> _______________________________________________ >>> scikit-learn mailing list >>> scikit-learn at python.org >>> https://mail.python.org/mailman/listinfo/scikit-learn >>> >>> >>> _______________________________________________ >>> scikit-learn mailing list >>> scikit-learn at python.org >>> https://mail.python.org/mailman/listinfo/scikit-learn >> >> _______________________________________________ >> scikit-learn mailing list >> scikit-learn at python.org >> https://mail.python.org/mailman/listinfo/scikit-learn > _______________________________________________ > scikit-learn mailing list > scikit-learn at python.org > https://mail.python.org/mailman/listinfo/scikit-learn From jmschreiber91 at gmail.com Sat Jul 22 15:07:15 2017 From: jmschreiber91 at gmail.com (Jacob Schreiber) Date: Sat, 22 Jul 2017 12:07:15 -0700 Subject: [scikit-learn] scikit-learn hits 20k github stars Message-ID: [image: Inline image 1] Many thanks to everyone who has worked on and contributed to the project for the past decade to make it such a great tool! Also a special thanks to Joel Nothman, who has been on top of answering issues and reviewing PRs for years now. ?? -------------- next part -------------- An HTML attachment was scrubbed... URL: -------------- next part -------------- A non-text attachment was scrubbed... Name: image.png Type: image/png Size: 9294 bytes Desc: not available URL: From joel.nothman at gmail.com Sat Jul 22 23:33:42 2017 From: joel.nothman at gmail.com (Joel Nothman) Date: Sun, 23 Jul 2017 13:33:42 +1000 Subject: [scikit-learn] scikit-learn hits 20k github stars In-Reply-To: References: Message-ID: oh, thanks. but the last year isn't years! There has been plenty of great work and personal example, dedication, experience and expertise to build on. 
I'm not naming names. Congratulations everyone! It's no small feat for software to remain relevant for a decade, let alone to keep accruing stars in a domain with so much change as machine learning. that also means processing contributions constantly, almost entirely on volunteered time. We appreciate every bit of help sorting through them, ensuring we stay relevant and high quality. I certainly can't do that by myself. On 23 Jul 2017 5:13 am, "Jacob Schreiber" wrote: > [image: Inline image 1] > > Many thanks to everyone who has worked on and contributed to the project > for the past decade to make it such a great tool! Also a special thanks to > Joel Nothman, who has been on top of answering issues and reviewing PRs for > years now. > > ?? > > _______________________________________________ > scikit-learn mailing list > scikit-learn at python.org > https://mail.python.org/mailman/listinfo/scikit-learn > > -------------- next part -------------- An HTML attachment was scrubbed... URL: -------------- next part -------------- A non-text attachment was scrubbed... Name: image.png Type: image/png Size: 9294 bytes Desc: not available URL: From sambarnett95 at gmail.com Mon Jul 24 08:57:25 2017 From: sambarnett95 at gmail.com (Sam Barnett) Date: Mon, 24 Jul 2017 13:57:25 +0100 Subject: [scikit-learn] Fwd: Custom transformer failing check_estimator test In-Reply-To: References: Message-ID: Dear scikit-learn developers, I am developing a transformer, named Sqizer, that has the ultimate goal of modifying a kernel for use with the sklearn.svm package. When given an input data array X, Sqizer.transform(X) should have as its output the Gram matrix for X using the modified version of the kernel. 
Here is the code for the class so far:

class Sqizer(BaseEstimator, TransformerMixin):

    def __init__(self, C=1.0, kernel='rbf', degree=3, gamma=1,
                 coef0=0.0, cut_ord_pair=(2,1)):
        self.C = C
        self.kernel = kernel
        self.degree = degree
        self.gamma = gamma
        self.coef0 = coef0
        self.cut_ord_pair = cut_ord_pair

    def fit(self, X, y=None):
        # Check that X and y have correct shape
        X, y = check_X_y(X, y)
        # Store the classes seen during fit
        self.classes_ = unique_labels(y)

        self.X_ = X
        self.y_ = y
        return self

    def transform(self, X):

        X = check_array(X, warn_on_dtype=True)

        """Returns Gram matrix corresponding to X, once sqized."""
        def kPolynom(x, y):
            return (self.coef0 + self.gamma*np.inner(x, y))**self.degree
        def kGauss(x, y):
            return np.exp(-self.gamma*np.sum(np.square(x-y)))
        def kLinear(x, y):
            return np.inner(x, y)
        def kSigmoid(x, y):
            return np.tanh(self.gamma*np.inner(x, y) + self.coef0)

        def kernselect(kername):
            switcher = {
                'linear': kPolynom,
                'rbf': kGauss,
                'sigmoid': kLinear,
                'poly': kSigmoid,
            }
            return switcher.get(kername, "nothing")

        cut_off = self.cut_ord_pair[0]
        order = self.cut_ord_pair[1]

        from SeqKernel import hiSeqKernEval

        def getGram(Y):
            gram_matrix = np.zeros((Y.shape[0], Y.shape[0]))
            for row1ind in range(Y.shape[0]):
                for row2ind in range(X.shape[0]):
                    gram_matrix[row1ind, row2ind] = \
                        hiSeqKernEval(Y[row1ind], Y[row2ind],
                                      kernselect(self.kernel),
                                      cut_off, order)
            return gram_matrix

        return getGram(X)

However, when I run the check_estimator method on Sqizer, I get an error with the following check:

# raises error on malformed input for transform
if hasattr(X, 'T'):
    # If it's not an array, it does not have a 'T' property
    assert_raises(ValueError, transformer.transform, X.T)

How do I alter my code to pass this test? Could my estimator trip up on any further tests? I have attached the relevant .py files if you require a bigger picture. This particular snippet comes from the OptimalKernel.py file.
Many thanks, Sam Barnett -------------- next part -------------- An HTML attachment was scrubbed... URL: -------------- next part -------------- A non-text attachment was scrubbed... Name: OptimalKernel.py Type: text/x-python-script Size: 8516 bytes Desc: not available URL: -------------- next part -------------- A non-text attachment was scrubbed... Name: SeqKernel.py Type: text/x-python-script Size: 4199 bytes Desc: not available URL: -------------- next part -------------- A non-text attachment was scrubbed... Name: TensorTools.py Type: text/x-python-script Size: 983 bytes Desc: not available URL: From joel.nothman at gmail.com Mon Jul 24 19:54:32 2017 From: joel.nothman at gmail.com (Joel Nothman) Date: Tue, 25 Jul 2017 09:54:32 +1000 Subject: [scikit-learn] Fwd: Custom transformer failing check_estimator test In-Reply-To: References: Message-ID: what is the failing test? please provide the full traceback. On 24 Jul 2017 10:58 pm, "Sam Barnett" wrote: > Dear scikit-learn developers, > > I am developing a transformer, named Sqizer, that has the ultimate goal > of modifying a kernel for use with the sklearn.svm package. When given an > input data array X, Sqizer.transform(X) should have as its output the > Gram matrix for X using the modified version of the kernel. 
Here is the > code for the class so far: > > class Sqizer(BaseEstimator, TransformerMixin): > > def __init__(self, C=1.0, kernel='rbf', degree=3, gamma=1, > coef0=0.0, cut_ord_pair=(2,1)): > self.C = C > self.kernel = kernel > self.degree = degree > self.gamma = gamma > self.coef0 = coef0 > self.cut_ord_pair = cut_ord_pair > > def fit(self, X, y=None): > # Check that X and y have correct shape > X, y = check_X_y(X, y) > # Store the classes seen during fit > self.classes_ = unique_labels(y) > > self.X_ = X > self.y_ = y > return self > > def transform(self, X): > > X = check_array(X, warn_on_dtype=True) > > """Returns Gram matrix corresponding to X, once sqized.""" > def kPolynom(x,y): > return (self.coef0+self.gamma*np.inner(x,y))**self.degree > def kGauss(x,y): > return np.exp(-self.gamma*np.sum(np.square(x-y))) > def kLinear(x,y): > return np.inner(x,y) > def kSigmoid(x,y): > return np.tanh(self.gamma*np.inner(x,y) +self.coef0) > > def kernselect(kername): > switcher = { > 'linear': kPolynom, > 'rbf': kGauss, > 'sigmoid': kLinear, > 'poly': kSigmoid, > } > return switcher.get(kername, "nothing") > > cut_off = self.cut_ord_pair[0] > order = self.cut_ord_pair[1] > > from SeqKernel import hiSeqKernEval > > def getGram(Y): > gram_matrix = np.zeros((Y.shape[0], Y.shape[0])) > for row1ind in range(Y.shape[0]): > for row2ind in range > > > _______________________________________________ > scikit-learn mailing list > scikit-learn at python.org > https://mail.python.org/mailman/listinfo/scikit-learn > > ... -------------- next part -------------- An HTML attachment was scrubbed... 
URL: 

From sambarnett95 at gmail.com Tue Jul 25 04:41:06 2017
From: sambarnett95 at gmail.com (Sam Barnett)
Date: Tue, 25 Jul 2017 09:41:06 +0100
Subject: [scikit-learn] Fwd: Custom transformer failing check_estimator test
In-Reply-To: References: Message-ID:

This is the Traceback I get:

AssertionErrorTraceback (most recent call last)
 in ()
----> 1 check_estimator(OK.Sqizer)

/Users/Sam/anaconda/lib/python2.7/site-packages/sklearn/utils/estimator_checks.pyc in check_estimator(Estimator)
    253     check_parameters_default_constructible(name, Estimator)
    254     for check in _yield_all_checks(name, Estimator):
--> 255         check(name, Estimator)
    256
    257

/Users/Sam/anaconda/lib/python2.7/site-packages/sklearn/utils/testing.pyc in wrapper(*args, **kwargs)
    353         with warnings.catch_warnings():
    354             warnings.simplefilter("ignore", self.category)
--> 355             return fn(*args, **kwargs)
    356
    357     return wrapper

/Users/Sam/anaconda/lib/python2.7/site-packages/sklearn/utils/estimator_checks.pyc in check_transformer_general(name, Transformer)
    578     X = StandardScaler().fit_transform(X)
    579     X -= X.min()
--> 580     _check_transformer(name, Transformer, X, y)
    581     _check_transformer(name, Transformer, X.tolist(), y.tolist())
    582

/Users/Sam/anaconda/lib/python2.7/site-packages/sklearn/utils/estimator_checks.pyc in _check_transformer(name, Transformer, X, y)
    671     if hasattr(X, 'T'):
    672         # If it's not an array, it does not have a 'T' property
--> 673         assert_raises(ValueError, transformer.transform, X.T)
    674
    675

/Users/Sam/anaconda/lib/python2.7/unittest/case.pyc in assertRaises(self, excClass, callableObj, *args, **kwargs)
    471             return context
    472         with context:
--> 473             callableObj(*args, **kwargs)
    474
    475     def _getAssertEqualityFunc(self, first, second):

/Users/Sam/anaconda/lib/python2.7/unittest/case.pyc in __exit__(self, exc_type, exc_value, tb)
    114                 exc_name = str(self.expected)
    115             raise self.failureException(
--> 116                 "{0} not raised".format(exc_name))
    117         if not issubclass(exc_type, self.expected):
    118             # let
unexpected exceptions pass through AssertionError: ValueError not raised On Tue, Jul 25, 2017 at 12:54 AM, Joel Nothman wrote: > what is the failing test? please provide the full traceback. > > On 24 Jul 2017 10:58 pm, "Sam Barnett" wrote: > >> Dear scikit-learn developers, >> >> I am developing a transformer, named Sqizer, that has the ultimate goal >> of modifying a kernel for use with the sklearn.svm package. When given >> an input data array X, Sqizer.transform(X) should have as its output the >> Gram matrix for X using the modified version of the kernel. Here is the >> code for the class so far: >> >> class Sqizer(BaseEstimator, TransformerMixin): >> >> def __init__(self, C=1.0, kernel='rbf', degree=3, gamma=1, >> coef0=0.0, cut_ord_pair=(2,1)): >> self.C = C >> self.kernel = kernel >> self.degree = degree >> self.gamma = gamma >> self.coef0 = coef0 >> self.cut_ord_pair = cut_ord_pair >> >> def fit(self, X, y=None): >> # Check that X and y have correct shape >> X, y = check_X_y(X, y) >> # Store the classes seen during fit >> self.classes_ = unique_labels(y) >> >> self.X_ = X >> self.y_ = y >> return self >> >> def transform(self, X): >> >> X = check_array(X, warn_on_dtype=True) >> >> """Returns Gram matrix corresponding to X, once sqized.""" >> def kPolynom(x,y): >> return (self.coef0+self.gamma*np.inner(x,y))**self.degree >> def kGauss(x,y): >> return np.exp(-self.gamma*np.sum(np.square(x-y))) >> def kLinear(x,y): >> return np.inner(x,y) >> def kSigmoid(x,y): >> return np.tanh(self.gamma*np.inner(x,y) +self.coef0) >> >> def kernselect(kername): >> switcher = { >> 'linear': kPolynom, >> 'rbf': kGauss, >> 'sigmoid': kLinear, >> 'poly': kSigmoid, >> } >> return switcher.get(kername, "nothing") >> >> cut_off = self.cut_ord_pair[0] >> order = self.cut_ord_pair[1] >> >> from SeqKernel import hiSeqKernEval >> >> def getGram(Y): >> gram_matrix = np.zeros((Y. >> >> ... 
> > [Message clipped]
> _______________________________________________
> scikit-learn mailing list
> scikit-learn at python.org
> https://mail.python.org/mailman/listinfo/scikit-learn
>
>
-------------- next part --------------
An HTML attachment was scrubbed...
URL: 

From sambarnett95 at gmail.com Tue Jul 25 08:15:28 2017
From: sambarnett95 at gmail.com (Sam Barnett)
Date: Tue, 25 Jul 2017 13:15:28 +0100
Subject: [scikit-learn] Fwd: Custom transformer failing check_estimator test
In-Reply-To: References: Message-ID:

Apologies: I've since worked out what the problem was and have resolved this issue. This was what I was missing in my code:

# Check that the input is of the same shape as the one passed
# during fit.
if X.shape != self.input_shape_:
    raise ValueError('Shape of input is different from what was seen '
                     'in `fit`')

On Tue, Jul 25, 2017 at 9:41 AM, Sam Barnett wrote:

> This is the Traceback I get:
>
>
> AssertionErrorTraceback (most recent call last)
>  in ()
> ----> 1 check_estimator(OK.Sqizer)
>
> /Users/Sam/anaconda/lib/python2.7/site-packages/sklearn/utils/estimator_checks.pyc in check_estimator(Estimator)
>     253     check_parameters_default_constructible(name, Estimator)
>     254     for check in _yield_all_checks(name, Estimator):
> --> 255         check(name, Estimator)
>     256
>     257
>
> /Users/Sam/anaconda/lib/python2.7/site-packages/sklearn/utils/testing.pyc in wrapper(*args, **kwargs)
>     353         with warnings.catch_warnings():
>     354             warnings.simplefilter("ignore", self.category)
> --> 355             return fn(*args, **kwargs)
>     356
>     357     return wrapper
>
> /Users/Sam/anaconda/lib/python2.7/site-packages/sklearn/utils/estimator_checks.pyc in check_transformer_general(name, Transformer)
>     578     X = StandardScaler().fit_transform(X)
>     579     X -= X.min()
> --> 580     _check_transformer(name, Transformer, X, y)
>     581     _check_transformer(name, Transformer, X.tolist(), y.tolist())
>     582
>
> /Users/Sam/anaconda/lib/python2.7/site-packages/sklearn/utils/estimator_checks.pyc in
_check_transformer(name, Transformer, X, y) > 671 if hasattr(X, 'T'): > 672 # If it's not an array, it does not have a 'T' property > --> 673 assert_raises(ValueError, transformer.transform, X.T) > 674 > 675 > > /Users/Sam/anaconda/lib/python2.7/unittest/case.pyc in assertRaises(self, > excClass, callableObj, *args, **kwargs) > 471 return context > 472 with context: > --> 473 callableObj(*args, **kwargs) > 474 > 475 def _getAssertEqualityFunc(self, first, second): > > /Users/Sam/anaconda/lib/python2.7/unittest/case.pyc in __exit__(self, > exc_type, exc_value, tb) > 114 exc_name = str(self.expected) > 115 raise self.failureException( > --> 116 "{0} not raised".format(exc_name)) > 117 if not issubclass(exc_type, self.expected): > 118 # let unexpected exceptions pass through > > AssertionError: ValueError not raised > > > On Tue, Jul 25, 2017 at 12:54 AM, Joel Nothman > wrote: > >> what is the failing test? please provide the full traceback. >> >> On 24 Jul 2017 10:58 pm, "Sam Barnett" wrote: >> >>> Dear scikit-learn developers, >>> >>> I am developing a transformer, named Sqizer, that has the ultimate goal >>> of modifying a kernel for use with the sklearn.svm package. When given >>> an input data array X, Sqizer.transform(X) should have as its output >>> the Gram matrix for X using the modified version of the kernel. 
Here is >>> the code for the class so far: >>> >>> class Sqizer(BaseEstimator, TransformerMixin): >>> >>> def __init__(self, C=1.0, kernel='rbf', degree=3, gamma=1, >>> coef0=0.0, cut_ord_pair=(2,1)): >>> self.C = C >>> self.kernel = kernel >>> self.degree = degree >>> self.gamma = gamma >>> self.coef0 = coef0 >>> self.cut_ord_pair = cut_ord_pair >>> >>> def fit(self, X, y=None): >>> # Check that X and y have correct shape >>> X, y = check_X_y(X, y) >>> # Store the classes seen during fit >>> self.classes_ = unique_labels(y) >>> >>> self.X_ = X >>> self.y_ = y >>> return self >>> >>> def transform(self, X): >>> >>> X = check_array(X, warn_on_dtype=True) >>> >>> """Returns Gram matrix corresponding to X, once sqized.""" >>> def kPolynom(x,y): >>> return (self.coef0+self.gamma*np.inner(x,y))**self.degree >>> def kGauss(x,y): >>> return np.exp(-self.gamma*np.sum(np.square(x-y))) >>> def kLinear(x,y): >>> return np.inner(x,y) >>> def kSigmoid(x,y): >>> return np.tanh(self.gamma*np.inner(x,y) +self.coef0) >>> >>> def kernselect(kername): >>> switcher = { >>> 'linear': kPolynom, >>> 'rbf': kGauss, >>> 'sigmoid': kLinear, >>> 'poly': kSigmoid, >>> } >>> return switcher.get(kername, "nothing") >>> >>> cut_off = self.cut_ord_pair[0] >>> order = self.cut_ord_pair[1] >>> >>> from SeqKernel import hiSeqKernEval >>> >>> def getGram(Y): >>> gram_matrix = np.zeros((Y. >>> >>> ... >> >> [Message clipped] >> _______________________________________________ >> scikit-learn mailing list >> scikit-learn at python.org >> https://mail.python.org/mailman/listinfo/scikit-learn >> >> > -------------- next part -------------- An HTML attachment was scrubbed... 
URL: From t3kcit at gmail.com Tue Jul 25 16:31:40 2017 From: t3kcit at gmail.com (Andreas Mueller) Date: Tue, 25 Jul 2017 16:31:40 -0400 Subject: [scikit-learn] Fwd: Custom transformer failing check_estimator test In-Reply-To: References: Message-ID: <42dff767-f7d8-b1c1-84f3-58ea4e4cab16@gmail.com> Indeed, it makes sure that the transform is applied to data with the same number of samples as the input. PR welcome to provide a better error message on this! On 07/25/2017 08:15 AM, Sam Barnett wrote: > Apologies: I've since worked out what the problem was and have > resolved this issue. This was what I was missing in my code: > > > # Check that the input is of the same shape as the one passed > # during fit. > if X.shape != self.input_shape_: > raise ValueError('Shape of input is different from what > was seen' > 'in `fit`') > > > On Tue, Jul 25, 2017 at 9:41 AM, Sam Barnett > wrote: > > This is the Traceback I get: > > > AssertionErrorTraceback (most recent call last) > in () > ----> 1 check_estimator(OK.Sqizer) > > /Users/Sam/anaconda/lib/python2.7/site-packages/sklearn/utils/estimator_checks.pyc > in check_estimator(Estimator) > 253 check_parameters_default_constructible(name, Estimator) > 254 for check in _yield_all_checks(name, Estimator): > --> 255 check(name, Estimator) > 256 > 257 > > /Users/Sam/anaconda/lib/python2.7/site-packages/sklearn/utils/testing.pyc > in wrapper(*args, **kwargs) > 353 with warnings.catch_warnings(): > 354 warnings.simplefilter("ignore", self.category) > --> 355 return fn(*args, **kwargs) > 356 > 357 return wrapper > > /Users/Sam/anaconda/lib/python2.7/site-packages/sklearn/utils/estimator_checks.pyc > in check_transformer_general(name, Transformer) > 578 X = StandardScaler().fit_transform(X) > 579 X -= X.min() > --> 580 _check_transformer(name, Transformer, X, y) > 581 _check_transformer(name, Transformer, X.tolist(), y.tolist()) > 582 > > /Users/Sam/anaconda/lib/python2.7/site-packages/sklearn/utils/estimator_checks.pyc > in 
_check_transformer(name, Transformer, X, y) > 671 if hasattr(X, 'T'): > 672 # If it's not an array, it does not have a 'T' > property > --> 673 assert_raises(ValueError, transformer.transform, X.T) > 674 > 675 > > /Users/Sam/anaconda/lib/python2.7/unittest/case.pyc in > assertRaises(self, excClass, callableObj, *args, **kwargs) > 471 return context > 472 with context: > --> 473 callableObj(*args, **kwargs) > 474 > 475 def _getAssertEqualityFunc(self, first, second): > > /Users/Sam/anaconda/lib/python2.7/unittest/case.pyc in > __exit__(self, exc_type, exc_value, tb) > 114 exc_name = str(self.expected) > 115 raise self.failureException( > --> 116 "{0} not raised".format(exc_name)) > 117 if not issubclass(exc_type, self.expected): > 118 # let unexpected exceptions pass through > > AssertionError: ValueError not raised > > > On Tue, Jul 25, 2017 at 12:54 AM, Joel Nothman > > wrote: > > what is the failing test? please provide the full traceback. > > On 24 Jul 2017 10:58 pm, "Sam Barnett" > wrote: > > Dear scikit-learn developers, > > I am developing a transformer, named |Sqizer|, that has > the ultimate goal of modifying a kernel for use with the > |sklearn.svm| package. When given an input data array |X|, > |Sqizer.transform(X)| should have as its output the Gram > matrix for |X| using the modified version of the kernel. 
> Here is the code for the class so far:
>
>     class Sqizer(BaseEstimator, TransformerMixin):
>
>         def __init__(self, C=1.0, kernel='rbf', degree=3, gamma=1,
>                      coef0=0.0, cut_ord_pair=(2,1)):
>             self.C = C
>             self.kernel = kernel
>             self.degree = degree
>             self.gamma = gamma
>             self.coef0 = coef0
>             self.cut_ord_pair = cut_ord_pair
>
>         def fit(self, X, y=None):
>             # Check that X and y have correct shape
>             X, y = check_X_y(X, y)
>             # Store the classes seen during fit
>             self.classes_ = unique_labels(y)
>
>             self.X_ = X
>             self.y_ = y
>             return self
>
>         def transform(self, X):
>
>             X = check_array(X, warn_on_dtype=True)
>
>             """Returns Gram matrix corresponding to X, once sqized."""
>             def kPolynom(x, y):
>                 return (self.coef0 + self.gamma*np.inner(x, y))**self.degree
>             def kGauss(x, y):
>                 return np.exp(-self.gamma*np.sum(np.square(x - y)))
>             def kLinear(x, y):
>                 return np.inner(x, y)
>             def kSigmoid(x, y):
>                 return np.tanh(self.gamma*np.inner(x, y) + self.coef0)
>
>             def kernselect(kername):
>                 switcher = {
>                     'linear': kPolynom,
>                     'rbf': kGauss,
>                     'sigmoid': kLinear,
>                     'poly': kSigmoid,
>                 }
>                 return switcher.get(kername, "nothing")
>
>             cut_off = self.cut_ord_pair[0]
>             order = self.cut_ord_pair[1]
>
>             from SeqKernel import hiSeqKernEval
>
>             def getGram(Y):
>                 gram_matrix = np.zeros((Y.
>
> ...
>
> [Message clipped]
> _______________________________________________
> scikit-learn mailing list
> scikit-learn at python.org
> https://mail.python.org/mailman/listinfo/scikit-learn
>
> _______________________________________________
> scikit-learn mailing list
> scikit-learn at python.org
> https://mail.python.org/mailman/listinfo/scikit-learn
-------------- next part --------------
An HTML attachment was scrubbed...
URL: From joel.nothman at gmail.com Tue Jul 25 19:58:35 2017 From: joel.nothman at gmail.com (Joel Nothman) Date: Wed, 26 Jul 2017 09:58:35 +1000 Subject: [scikit-learn] Fwd: Custom transformer failing check_estimator test In-Reply-To: <42dff767-f7d8-b1c1-84f3-58ea4e4cab16@gmail.com> References: <42dff767-f7d8-b1c1-84f3-58ea4e4cab16@gmail.com> Message-ID: One advantage of moving to pytest is that we can put messages into pytest.raises, and we should emphasise this in moving the check_estimator assertions to pytest. But I'm also not sure how we do the deprecation of nosetests for check_estimator in a way that is friendly to our contributors... On 26 July 2017 at 06:31, Andreas Mueller wrote: > Indeed, it makes sure that the transform is applied to data with the same > number of samples as the input. > PR welcome to provide a better error message on this! > > On 07/25/2017 08:15 AM, Sam Barnett wrote: > > Apologies: I've since worked out what the problem was and have resolved > this issue. This was what I was missing in my code: > > > # Check that the input is of the same shape as the one passed > # during fit.
> if X.shape != self.input_shape_: > raise ValueError('Shape of input is different from what was > seen' > 'in `fit`') > > > On Tue, Jul 25, 2017 at 9:41 AM, Sam Barnett > wrote: > >> This is the Traceback I get: >> >> >> AssertionErrorTraceback (most recent call last) >> in () >> ----> 1 check_estimator(OK.Sqizer) >> >> /Users/Sam/anaconda/lib/python2.7/site-packages/sklearn/util >> s/estimator_checks.pyc in check_estimator(Estimator) >> 253 check_parameters_default_constructible(name, Estimator) >> 254 for check in _yield_all_checks(name, Estimator): >> --> 255 check(name, Estimator) >> 256 >> 257 >> >> /Users/Sam/anaconda/lib/python2.7/site-packages/sklearn/utils/testing.pyc >> in wrapper(*args, **kwargs) >> 353 with warnings.catch_warnings(): >> 354 warnings.simplefilter("ignore", self.category) >> --> 355 return fn(*args, **kwargs) >> 356 >> 357 return wrapper >> >> /Users/Sam/anaconda/lib/python2.7/site-packages/sklearn/utils/estimator_checks.pyc >> in check_transformer_general(name, Transformer) >> 578 X = StandardScaler().fit_transform(X) >> 579 X -= X.min() >> --> 580 _check_transformer(name, Transformer, X, y) >> 581 _check_transformer(name, Transformer, X.tolist(), y.tolist()) >> 582 >> >> /Users/Sam/anaconda/lib/python2.7/site-packages/sklearn/utils/estimator_checks.pyc >> in _check_transformer(name, Transformer, X, y) >> 671 if hasattr(X, 'T'): >> 672 # If it's not an array, it does not have a 'T' >> property >> --> 673 assert_raises(ValueError, transformer.transform, X.T) >> 674 >> 675 >> >> /Users/Sam/anaconda/lib/python2.7/unittest/case.pyc in assertRaises(self, >> excClass, callableObj, *args, **kwargs) >> 471 return context >> 472 with context: >> --> 473 callableObj(*args, **kwargs) >> 474 >> 475 def _getAssertEqualityFunc(self, first, second): >> >> /Users/Sam/anaconda/lib/python2.7/unittest/case.pyc in __exit__(self, >> exc_type, exc_value, tb) >> 114 exc_name = str(self.expected) >> 115 raise self.failureException( >> --> 116 "{0} not 
raised".format(exc_name)) >> 117 if not issubclass(exc_type, self.expected): >> 118 # let unexpected exceptions pass through >> >> AssertionError: ValueError not raised >> >> >> On Tue, Jul 25, 2017 at 12:54 AM, Joel Nothman >> wrote: >> >>> what is the failing test? please provide the full traceback. >>> >>> On 24 Jul 2017 10:58 pm, "Sam Barnett" wrote: >>> >>>> Dear scikit-learn developers, >>>> >>>> I am developing a transformer, named Sqizer, that has the ultimate >>>> goal of modifying a kernel for use with the sklearn.svm package. When >>>> given an input data array X, Sqizer.transform(X) should have as its >>>> output the Gram matrix for X using the modified version of the kernel. >>>> Here is the code for the class so far: >>>> >>>> class Sqizer(BaseEstimator, TransformerMixin): >>>> >>>> def __init__(self, C=1.0, kernel='rbf', degree=3, gamma=1, >>>> coef0=0.0, cut_ord_pair=(2,1)): >>>> self.C = C >>>> self.kernel = kernel >>>> self.degree = degree >>>> self.gamma = gamma >>>> self.coef0 = coef0 >>>> self.cut_ord_pair = cut_ord_pair >>>> >>>> def fit(self, X, y=None): >>>> # Check that X and y have correct shape >>>> X, y = check_X_y(X, y) >>>> # Store the classes seen during fit >>>> self.classes_ = unique_labels(y) >>>> >>>> self.X_ = X >>>> self.y_ = y >>>> return self >>>> >>>> def transform(self, X): >>>> >>>> X = check_array(X, warn_on_dtype=True) >>>> >>>> """Returns Gram matrix corresponding to X, once sqized.""" >>>> def kPolynom(x,y): >>>> return (self.coef0+self.gamma*np.inner(x,y))**self.degree >>>> def kGauss(x,y): >>>> return np.exp(-self.gamma*np.sum(np.square(x-y))) >>>> def kLinear(x,y): >>>> return np.inner(x,y) >>>> def kSigmoid(x,y): >>>> return np.tanh(self.gamma*np.inner(x,y) +self.coef0) >>>> >>>> def kernselect(kername): >>>> switcher = { >>>> 'linear': kPolynom, >>>> 'rbf': kGauss, >>>> 'sigmoid': kLinear, >>>> 'poly': kSigmoid, >>>> } >>>> return switcher.get(kername, "nothing") >>>> >>>> cut_off = self.cut_ord_pair[0] >>>> order 
= self.cut_ord_pair[1] >>>> >>>> from SeqKernel import hiSeqKernEval >>>> >>>> def getGram(Y): >>>> gram_matrix = np.zeros((Y. >>>> >>>> ... >>> >>> [Message clipped] >>> _______________________________________________ >>> scikit-learn mailing list >>> scikit-learn at python.org >>> https://mail.python.org/mailman/listinfo/scikit-learn >>> >>> >> > > > _______________________________________________ > scikit-learn mailing list scikit-learn at python.org https://mail.python.org/mailman/listinfo/scikit-learn > > > > _______________________________________________ > scikit-learn mailing list > scikit-learn at python.org > https://mail.python.org/mailman/listinfo/scikit-learn > > -------------- next part -------------- An HTML attachment was scrubbed... URL: From gael.varoquaux at normalesup.org Wed Jul 26 03:02:28 2017 From: gael.varoquaux at normalesup.org (Gael Varoquaux) Date: Wed, 26 Jul 2017 09:02:28 +0200 Subject: [scikit-learn] Classifiers for dataset with categorical features In-Reply-To: References: <8434D3C4-503B-4D6B-A4FB-ABB684B7DD71@gmail.com> Message-ID: <20170726070228.GO3579441@phare.normalesup.org> The right thing to do would probably be to write a scikit-learn-contrib package for them and see if they gather traction. If they perform well on e.g. Kaggle competitions, we know that we need them in :). Cheers, Gaël On Fri, Jul 21, 2017 at 07:09:03PM -0400, Sebastian Raschka wrote: > Maybe because they are genetic algorithms, which are -- for some reason -- not very popular in the ML field in general :P. (People in bioinformatics seem to use them a lot, though.) Also, the name "Learning Classifier Systems" is also a bit weird, I must say: I remember that when Ryan introduced me to those, I was like "ah yeah, sure, I know machine learning classifiers" ;) > > On Jul 21, 2017, at 3:01 PM, Stuart Reynolds wrote: > > +1 > > LCS and its many many variants seem very practical and adaptable. I'm > > not sure why they haven't gotten traction.
> > Overshadowed by GBM & random forests? > > On Fri, Jul 21, 2017 at 11:52 AM, Sebastian Raschka > > wrote: > >> Just to throw some additional ideas in here. Based on a conversation with a colleague some time ago, I think learning classifier systems (https://en.wikipedia.org/wiki/Learning_classifier_system) are particularly useful when working with large, sparse binary vectors (like from a one-hot encoding). I am really not into LCS's, and only know the basics (read through the first chapters of the Intro to Learning Classifier Systems draft; the print version will be out later this year). > >> Also, I saw an interesting poster on a Set Covering Machine algorithm once, which they benchmarked against SVMs, random forests and the like for categorical (genomics data). Looked promising. > >> Best, > >> Sebastian > >>> On Jul 21, 2017, at 2:37 PM, Raga Markely wrote: > >>> Thank you, Jacob. Appreciate it. > >>> Regarding 'perform better', I was referring to better accuracy, precision, recall, F1 score, etc. > >>> Thanks, > >>> Raga > >>> On Fri, Jul 21, 2017 at 2:27 PM, Jacob Schreiber wrote: > >>> Traditionally tree based methods are very good when it comes to categorical variables and can handle them appropriately. There is a current WIP PR to add this support to sklearn. I'm not exactly sure what you mean that "perform better" though. Estimators that ignore the categorical aspect of these variables and treat them as discrete will likely perform worse than those that treat them appropriately. > >>> On Fri, Jul 21, 2017 at 8:11 AM, Raga Markely wrote: > >>> Hello, > >>> I am wondering if there are some classifiers that perform better for datasets with categorical features (converted into sparse input matrix with pd.get_dummies())? The data for the categorical features are nominal (order doesn't matter, e.g. country, occupation, etc). > >>> If you could provide me some references (papers, books, website, etc), that would be great. > >>> Thank you very much! 
> >>> Raga > >>> _______________________________________________ > >>> scikit-learn mailing list > >>> scikit-learn at python.org > >>> https://mail.python.org/mailman/listinfo/scikit-learn > >>> _______________________________________________ > >>> scikit-learn mailing list > >>> scikit-learn at python.org > >>> https://mail.python.org/mailman/listinfo/scikit-learn > >>> _______________________________________________ > >>> scikit-learn mailing list > >>> scikit-learn at python.org > >>> https://mail.python.org/mailman/listinfo/scikit-learn > >> _______________________________________________ > >> scikit-learn mailing list > >> scikit-learn at python.org > >> https://mail.python.org/mailman/listinfo/scikit-learn > > _______________________________________________ > > scikit-learn mailing list > > scikit-learn at python.org > > https://mail.python.org/mailman/listinfo/scikit-learn > _______________________________________________ > scikit-learn mailing list > scikit-learn at python.org > https://mail.python.org/mailman/listinfo/scikit-learn -- Gael Varoquaux Researcher, INRIA Parietal NeuroSpin/CEA Saclay , Bat 145, 91191 Gif-sur-Yvette France Phone: ++ 33-1-69-08-79-68 http://gael-varoquaux.info http://twitter.com/GaelVaroquaux From t3kcit at gmail.com Wed Jul 26 10:54:49 2017 From: t3kcit at gmail.com (Andreas Mueller) Date: Wed, 26 Jul 2017 10:54:49 -0400 Subject: [scikit-learn] Fwd: Custom transformer failing check_estimator test In-Reply-To: References: <42dff767-f7d8-b1c1-84f3-58ea4e4cab16@gmail.com> Message-ID: <130377b1-0557-698a-0ef3-b71a201cb2aa@gmail.com> Hm, it would be nice to do this in a way that relies less on pytest, but I guess that would be tricky. One way would be to use assert_raise_message to make clear what the expected error is. But that would make the current test more strict - not necessarily that bad, I guess? It looks like all asserts in unittest have a "msg" argument... 
apart from assertRaises: https://docs.python.org/2/library/unittest.html#unittest.TestCase.assertRaises That has been fixed in Python 3.3, though: https://docs.python.org/3/library/unittest.html#unittest.TestCase.assertRaises So maybe we should just do a backport for assert_raises and assert_raises_regex? On 07/25/2017 07:58 PM, Joel Nothman wrote: > One advantage of moving to pytest is that we can put messages into > pytest.raises, and we should emphasise this in moving the > check_estimator assertions to pytest. But I'm also not sure how we do > the deprecation of nosetests for check_estimator in a way that is > friendly to our contribbers... > > On 26 July 2017 at 06:31, Andreas Mueller > wrote: > > Indeed, it makes sure that the transform is applied to data with > the same number of samples as the input. > PR welcome to provide a better error message on this! > > On 07/25/2017 08:15 AM, Sam Barnett wrote: >> Apologies: I've since worked out what the problem was and have >> resolved this issue. This was what I was missing in my code: >> >> >> # Check that the input is of the same shape as the one passed >> # during fit. 
>> if X.shape != self.input_shape_: >> raise ValueError('Shape of input is different from what was seen' >> 'in `fit`') >> >> >> On Tue, Jul 25, 2017 at 9:41 AM, Sam Barnett >> > wrote: >> >> This is the Traceback I get: >> >> >> AssertionErrorTraceback (most recent call last) >> in () >> ----> 1 check_estimator(OK.Sqizer) >> >> /Users/Sam/anaconda/lib/python2.7/site-packages/sklearn/utils/estimator_checks.pyc >> in check_estimator(Estimator) >> 253 check_parameters_default_constructible(name, Estimator) >> 254 for check in _yield_all_checks(name, Estimator): >> --> 255 check(name, Estimator) >> 256 >> 257 >> >> /Users/Sam/anaconda/lib/python2.7/site-packages/sklearn/utils/testing.pyc >> in wrapper(*args, **kwargs) >> 353 with warnings.catch_warnings(): >> 354 warnings.simplefilter("ignore", >> self.category) >> --> 355 return fn(*args, **kwargs) >> 356 >> 357 return wrapper >> >> /Users/Sam/anaconda/lib/python2.7/site-packages/sklearn/utils/estimator_checks.pyc >> in check_transformer_general(name, Transformer) >> 578 X = StandardScaler().fit_transform(X) >> 579 X -= X.min() >> --> 580 _check_transformer(name, Transformer, X, y) >> 581 _check_transformer(name, Transformer, X.tolist(), y.tolist()) >> 582 >> >> /Users/Sam/anaconda/lib/python2.7/site-packages/sklearn/utils/estimator_checks.pyc >> in _check_transformer(name, Transformer, X, y) >> 671 if hasattr(X, 'T'): >> 672 # If it's not an array, it does not have a >> 'T' property >> --> 673 assert_raises(ValueError, transformer.transform, X.T) >> 674 >> 675 >> >> /Users/Sam/anaconda/lib/python2.7/unittest/case.pyc in >> assertRaises(self, excClass, callableObj, *args, **kwargs) >> 471 return context >> 472 with context: >> --> 473 callableObj(*args, **kwargs) >> 474 >> 475 def _getAssertEqualityFunc(self, first, second): >> >> /Users/Sam/anaconda/lib/python2.7/unittest/case.pyc in >> __exit__(self, exc_type, exc_value, tb) >> 114 exc_name = str(self.expected) >> 115 raise self.failureException( >> --> 116 "{0} 
not raised".format(exc_name))
>> 117 if not issubclass(exc_type, self.expected):
>> 118 # let unexpected exceptions pass through
>>
>> AssertionError: ValueError not raised
>>
>> On Tue, Jul 25, 2017 at 12:54 AM, Joel Nothman wrote:
>>
>> what is the failing test? please provide the full traceback.
>>
>> On 24 Jul 2017 10:58 pm, "Sam Barnett" wrote:
>>
>> Dear scikit-learn developers,
>>
>> I am developing a transformer, named Sqizer, that has the
>> ultimate goal of modifying a kernel for use with the
>> sklearn.svm package. When given an input data array X,
>> Sqizer.transform(X) should have as its output the Gram
>> matrix for X using the modified version of the kernel.
>> Here is the code for the class so far:
>>
>>     class Sqizer(BaseEstimator, TransformerMixin):
>>
>>         def __init__(self, C=1.0, kernel='rbf', degree=3, gamma=1,
>>                      coef0=0.0, cut_ord_pair=(2,1)):
>>             self.C = C
>>             self.kernel = kernel
>>             self.degree = degree
>>             self.gamma = gamma
>>             self.coef0 = coef0
>>             self.cut_ord_pair = cut_ord_pair
>>
>>         def fit(self, X, y=None):
>>             # Check that X and y have correct shape
>>             X, y = check_X_y(X, y)
>>             # Store the classes seen during fit
>>             self.classes_ = unique_labels(y)
>>
>>             self.X_ = X
>>             self.y_ = y
>>             return self
>>
>>         def transform(self, X):
>>
>>             X = check_array(X, warn_on_dtype=True)
>>
>>             """Returns Gram matrix corresponding to X, once sqized."""
>>             def kPolynom(x, y):
>>                 return (self.coef0 + self.gamma*np.inner(x, y))**self.degree
>>             def kGauss(x, y):
>>                 return np.exp(-self.gamma*np.sum(np.square(x - y)))
>>             def kLinear(x, y):
>>                 return np.inner(x, y)
>>             def kSigmoid(x, y):
>>                 return np.tanh(self.gamma*np.inner(x, y) + self.coef0)
>>
>>             def kernselect(kername):
>>                 switcher = {
>>                     'linear': kPolynom,
>>                     'rbf': kGauss,
>>                     'sigmoid': kLinear,
>>                     'poly': kSigmoid,
>>                 }
>>                 return switcher.get(kername, "nothing")
>>
>>             cut_off = self.cut_ord_pair[0]
>>             order = self.cut_ord_pair[1]
>>
>>             from SeqKernel import hiSeqKernEval
>>
>>             def getGram(Y):
>>                 gram_matrix = np.zeros((Y.
>>
>> ...
>> >> [Message clipped] >> _______________________________________________ >> scikit-learn mailing list >> scikit-learn at python.org >> https://mail.python.org/mailman/listinfo/scikit-learn >> >> _______________________________________________ >> scikit-learn mailing list >> scikit-learn at python.org >> https://mail.python.org/mailman/listinfo/scikit-learn >> > > > _______________________________________________ > scikit-learn mailing list > scikit-learn at python.org > https://mail.python.org/mailman/listinfo/scikit-learn > > _______________________________________________ > scikit-learn mailing list > scikit-learn at python.org > https://mail.python.org/mailman/listinfo/scikit-learn -------------- next part -------------- An HTML attachment was scrubbed... URL: From renato.deleone at gmail.com Wed Jul 26 12:26:11 2017 From: renato.deleone at gmail.com (Renato De Leone) Date: Wed, 26 Jul 2017 18:26:11 +0200 Subject: [scikit-learn] Showing loss value in MLPClassifier Message-ID: Is it possible to show additional information, such as the current value of the loss function, etc., in MLPClassifier? Apparently verbose=True does not make any difference. Thanks -- Renato -------------- next part -------------- An HTML attachment was scrubbed... URL: From joel.nothman at gmail.com Wed Jul 26 18:04:58 2017 From: joel.nothman at gmail.com (Joel Nothman) Date: Thu, 27 Jul 2017 08:04:58 +1000 Subject: [scikit-learn] Fwd: Custom transformer failing check_estimator test In-Reply-To: <130377b1-0557-698a-0ef3-b71a201cb2aa@gmail.com> References: <42dff767-f7d8-b1c1-84f3-58ea4e4cab16@gmail.com> <130377b1-0557-698a-0ef3-b71a201cb2aa@gmail.com> Message-ID: The difference is the functional form versus the context manager. You can't add extra parameters to the function, only to the context manager. On 27 Jul 2017 12:56 am, "Andreas Mueller" wrote: > Hm, it would be nice to do this in a way that relies less on pytest, but I > guess that would be tricky.
> One way would be to use assert_raise_message to make clear what the > expected error is. > But that would make the current test more strict - not necessarily that > bad, I guess? > > It looks like all asserts in unittest have a "msg" argument... apart from > assertRaises: > https://docs.python.org/2/library/unittest.html# > unittest.TestCase.assertRaises > > That has been fixed in Python 3.3, though: > https://docs.python.org/3/library/unittest.html# > unittest.TestCase.assertRaises > > So maybe we should just do a backport for assert_raises and > assert_raises_regex? > > > On 07/25/2017 07:58 PM, Joel Nothman wrote: > > One advantage of moving to pytest is that we can put messages into > pytest.raises, and we should emphasise this in moving the check_estimator > assertions to pytest. But I'm also not sure how we do the deprecation of > nosetests for check_estimator in a way that is friendly to our > contributors... > > On 26 July 2017 at 06:31, Andreas Mueller wrote: > >> Indeed, it makes sure that the transform is applied to data with the same >> number of samples as the input. >> PR welcome to provide a better error message on this! >> >> On 07/25/2017 08:15 AM, Sam Barnett wrote: >> >> Apologies: I've since worked out what the problem was and have resolved >> this issue. This was what I was missing in my code: >> >> >> # Check that the input is of the same shape as the one passed >> # during fit.
>> if X.shape != self.input_shape_: >> raise ValueError('Shape of input is different from what was >> seen' >> 'in `fit`') >> >> >> On Tue, Jul 25, 2017 at 9:41 AM, Sam Barnett >> wrote: >> >>> This is the Traceback I get: >>> >>> >>> AssertionErrorTraceback (most recent call last) >>> in () >>> ----> 1 check_estimator(OK.Sqizer) >>> >>> /Users/Sam/anaconda/lib/python2.7/site-packages/sklearn/util >>> s/estimator_checks.pyc in check_estimator(Estimator) >>> 253 check_parameters_default_constructible(name, Estimator) >>> 254 for check in _yield_all_checks(name, Estimator): >>> --> 255 check(name, Estimator) >>> 256 >>> 257 >>> >>> /Users/Sam/anaconda/lib/python2.7/site-packages/sklearn/utils/testing.pyc >>> in wrapper(*args, **kwargs) >>> 353 with warnings.catch_warnings(): >>> 354 warnings.simplefilter("ignore", self.category) >>> --> 355 return fn(*args, **kwargs) >>> 356 >>> 357 return wrapper >>> >>> /Users/Sam/anaconda/lib/python2.7/site-packages/sklearn/utils/estimator_checks.pyc >>> in check_transformer_general(name, Transformer) >>> 578 X = StandardScaler().fit_transform(X) >>> 579 X -= X.min() >>> --> 580 _check_transformer(name, Transformer, X, y) >>> 581 _check_transformer(name, Transformer, X.tolist(), >>> y.tolist()) >>> 582 >>> >>> /Users/Sam/anaconda/lib/python2.7/site-packages/sklearn/utils/estimator_checks.pyc >>> in _check_transformer(name, Transformer, X, y) >>> 671 if hasattr(X, 'T'): >>> 672 # If it's not an array, it does not have a 'T' >>> property >>> --> 673 assert_raises(ValueError, transformer.transform, X.T >>> ) >>> 674 >>> 675 >>> >>> /Users/Sam/anaconda/lib/python2.7/unittest/case.pyc in assertRaises(self, >>> excClass, callableObj, *args, **kwargs) >>> 471 return context >>> 472 with context: >>> --> 473 callableObj(*args, **kwargs) >>> 474 >>> 475 def _getAssertEqualityFunc(self, first, second): >>> >>> /Users/Sam/anaconda/lib/python2.7/unittest/case.pyc in __exit__(self, >>> exc_type, exc_value, tb) >>> 114 exc_name = 
str(self.expected) >>> 115 raise self.failureException( >>> --> 116 "{0} not raised".format(exc_name)) >>> 117 if not issubclass(exc_type, self.expected): >>> 118 # let unexpected exceptions pass through >>> >>> AssertionError: ValueError not raised >>> >>> >>> On Tue, Jul 25, 2017 at 12:54 AM, Joel Nothman >>> wrote: >>> >>>> what is the failing test? please provide the full traceback. >>>> >>>> On 24 Jul 2017 10:58 pm, "Sam Barnett" wrote: >>>> >>>>> Dear scikit-learn developers, >>>>> >>>>> I am developing a transformer, named Sqizer, that has the ultimate >>>>> goal of modifying a kernel for use with the sklearn.svm package. When >>>>> given an input data array X, Sqizer.transform(X) should have as its >>>>> output the Gram matrix for X using the modified version of the >>>>> kernel. Here is the code for the class so far: >>>>> >>>>> class Sqizer(BaseEstimator, TransformerMixin): >>>>> >>>>> def __init__(self, C=1.0, kernel='rbf', degree=3, gamma=1, >>>>> coef0=0.0, cut_ord_pair=(2,1)): >>>>> self.C = C >>>>> self.kernel = kernel >>>>> self.degree = degree >>>>> self.gamma = gamma >>>>> self.coef0 = coef0 >>>>> self.cut_ord_pair = cut_ord_pair >>>>> >>>>> def fit(self, X, y=None): >>>>> # Check that X and y have correct shape >>>>> X, y = check_X_y(X, y) >>>>> # Store the classes seen during fit >>>>> self.classes_ = unique_labels(y) >>>>> >>>>> self.X_ = X >>>>> self.y_ = y >>>>> return self >>>>> >>>>> def transform(self, X): >>>>> >>>>> X = check_array(X, warn_on_dtype=True) >>>>> >>>>> """Returns Gram matrix corresponding to X, once sqized.""" >>>>> def kPolynom(x,y): >>>>> return (self.coef0+self.gamma*np.inner(x,y))**self.degree >>>>> def kGauss(x,y): >>>>> return np.exp(-self.gamma*np.sum(np.square(x-y))) >>>>> def kLinear( >>>>> >>>>> > _______________________________________________ > scikit-learn mailing list > scikit-learn at python.org > https://mail.python.org/mailman/listinfo/scikit-learn > > ... 
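[Editorial note: the functional-versus-context-manager distinction discussed above can be sketched with plain unittest. This is a minimal standalone example, not sklearn's actual check code; the `transform` below is a hypothetical stand-in that always raises.]

```python
import unittest

def transform(X):
    # Hypothetical stand-in for a transformer that validates its input,
    # rejecting it the way check_estimator expects for transposed data.
    raise ValueError('Shape of input is different from what was seen in `fit`')

class TestTransform(unittest.TestCase):
    def test_functional_form(self):
        # Functional form: extra positional arguments are forwarded to the
        # callable, so there is no slot for a custom assertion message.
        self.assertRaises(ValueError, transform, [[1, 2]])

    def test_context_manager_form(self):
        # Context-manager form: since Python 3.3 it also accepts a `msg`
        # argument, shown if the expected exception is *not* raised.
        with self.assertRaises(ValueError, msg='transform should reject bad input'):
            transform([[1, 2]])

# Run both tests programmatically.
result = unittest.TextTestRunner(verbosity=0).run(
    unittest.defaultTestLoader.loadTestsFromTestCase(TestTransform))
```

Both tests pass here; the two forms assert the same thing and differ only in where a custom message can be attached.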
-------------- next part -------------- An HTML attachment was scrubbed... URL: From saladi at caltech.edu Thu Jul 27 14:47:33 2017 From: saladi at caltech.edu (Shyam Saladi) Date: Thu, 27 Jul 2017 11:47:33 -0700 Subject: [scikit-learn] Maximum Dissimilarity Sampling Message-ID: Hello all, I'm looking to sample a large dataset for a subset that best covers the space. One way of doing this would be maximum dissimilarity, say as implemented in R as part of caret::maxDissim. Is anyone aware of similar functionality available as part of a common Python package, perhaps in scikit-learn? Many thanks in advance, Shyam -------------- next part -------------- An HTML attachment was scrubbed... URL: From masa.kondo5 at gmail.com Thu Jul 27 16:16:06 2017 From: masa.kondo5 at gmail.com (Masanari Kondo) Date: Thu, 27 Jul 2017 16:16:06 -0400 Subject: [scikit-learn] Question about the Library of "sklearn.neural_network.BernoulliRBM" that Creates Highly Correlated Features. Message-ID: Dear all, I'm using the sklearn library to generate new features of a dataset using a Restricted Boltzmann Machine (RBM, sklearn.neural_network.BernoulliRBM). I use the following environment: python 3.5.0 numpy==1.11.1 scikit-learn==0.18 I have already tried a large number of iterations (n_iter=6000) and a low learning rate (0.0001) for all training data (373 samples). However, the new features that are generated by the RBM are all highly correlated. Can anyone explain why this happens?
Below is a MWE:

import numpy as np
import csv
from sklearn.neural_network import BernoulliRBM

# train data
train_data = np.array(
    [[0.0326086956522,0.0,0.0,0.0200400801603,0.0674157303371,0.000805152979066,0.00200803212851,0.243243243243,0.0123456790123,0.55,0.0233428760185,0.0,0.0,0.0,0.444444444,0.0,0.0,0.157556270138,0.0188679245283,0.0983652512615],
    [0.0108695652174,0.2,0.0,0.00200400801603,0.0112359550562,0.0,0.0,0.027027027027,0.0123456790123,1.0,0.00154151068047,0.0,0.0,1.0,1.0,0.0,0.0,0.0289389067571,0.0,0.0],
    [0.0869565217391,0.0,0.152542372881,0.0260521042084,0.0749063670412,0.00322061191626,0.0180722891566,0.108108108108,0.0987654320988,0.4,0.022241796961,0.2,0.0909090909091,0.0,0.40625,0.0,0.0,0.053054662388,0.0188679245283,0.129097937384],
    [0.0326086956522,0.2,0.0847457627119,0.0140280561122,0.0149812734082,0.000268384326355,0.0120481927711,0.027027027027,0.0246913580247,0.25,0.00352345298392,1.0,0.0,0.75,0.555555556,0.0,0.0,0.0192926045047,0.0188679245283,0.0983652512615],
    [0.0978260869565,0.0,0.0,0.0100200400802,0.0711610486891,0.00214707461084,0.00803212851406,0.027027027027,0.111111111111,0.265625,0.0262056815679,1.0,0.0,0.0,0.518518519,0.0,0.0,0.0568060021635,0.0566037735849,0.213107498008],
    [0.0760869565217,0.8,0.0,0.0180360721443,0.0936329588015,0.0,0.0120481927711,0.0810810810811,0.0864197530864,0.3333333335,0.0561550319313,0.0,0.0,0.863636364,0.342857143,0.5,0.333333333333,0.168121267841,0.169811320755,0.463705037033],
    [0.0978260869565,1.0,0.0,0.0100200400802,0.063670411985,0.00697799248524,0.0,0.135135135135,0.0740740740741,0.4166666665,0.0156353226162,0.0,0.0,0.949367089,0.333333333,0.25,0.266666666667,0.0316184351626,0.0566037735849,0.163932249402],
    [0.0326086956522,0.2,0.0,0.0380761523046,0.0374531835206,0.000805152979066,0.0281124497992,0.135135135135,0.037037037037,1.0,0.00836820083682,0.0,0.0,0.923076923,0.583333333,0.0,0.0,0.0562700964881,0.0188679245283,0.0491752486057],
    [0.0108695652174,0.0,0.0,0.0200400801603,0.00374531835206,0.0,0.0160642570281,0.0540540540541,0.0123456790123,1.0,0.000220215811495,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0188679245283,0.147540499867],
    [0.217391304348,0.0,0.0,0.0140280561122,0.295880149813,0.0365002683843,0.0100401606426,0.135135135135,0.123456790123,0.4487534625,0.183880202599,1.0,0.0909090909091,0.0,0.19375,0.0,0.0,0.191961414822,0.188679245283,0.287703974741],
    [0.0652173913043,0.0,0.0,0.0160320641283,0.0224719101124,0.00402576489533,0.0140562248996,0.027027027027,0.0740740740741,1.0,0.00132129486897,0.0,0.0,0.0,0.444444444,0.0,0.0,0.0,0.0188679245283,0.147540499867],
    [0.0326086956522,0.6,0.0,0.0100200400802,0.0411985018727,0.000268384326355,0.00200803212851,0.108108108108,0.0123456790123,0.25,0.00902884827131,1.0,0.0909090909091,0.971428571,0.75,0.25,0.133333333333,0.0594855305401,0.0566037735849,0.147540499867],
    [0.119565217391,0.2,0.0,0.0140280561122,0.0973782771536,0.0,0.0100401606426,0.0540540540541,0.135802469136,0.29,0.0398590618806,1.0,0.0,0.529411765,0.409090909,0.0,0.0,0.0723472668927,0.0188679245283,0.107306205553],
    [0.0326086956522,0.2,0.0,0.0100200400802,0.0262172284644,0.000268384326355,0.00200803212851,0.108108108108,0.037037037037,0.25,0.00638625853336,1.0,0.0,0.818181818,0.666666667,0.0,0.0,0.0401929260499,0.0188679245283,0.0983652512615],
    [0.173913043478,0.4,0.0,0.0300601202405,0.243445692884,0.020397208803,0.0,0.405405405405,0.16049382716,0.46,0.106364236952,1.0,0.0,0.725490196,0.311111111,0.0,0.0,0.136254019315,0.169811320755,0.230532031043],
    [0.163043478261,0.4,0.0,0.0180360721443,0.153558052434,0.0,0.0,0.243243243243,0.185185185185,0.3392857145,0.044924025545,1.0,0.0909090909091,0.725490196,0.225,0.25,0.133333333333,0.0594855305401,0.0377358490566,0.226223848446],
    [0.152173913043,0.6,0.0508474576271,0.0220440881764,0.10861423221,0.0228126677402,0.00602409638554,0.216216216216,0.135802469136,0.2884615385,0.0237833076415,1.0,0.0909090909091,0.759259259,0.321428571,0.0,0.0,0.0316949931128,0.0754716981132,0.189692820679],
    [0.29347826087,0.4,0.0,0.0160320641283,0.378277153558,0.0421363392378,0.0100401606426,0.0810810810811,0.185185185185,0.4123931625,0.283197533583,0.888888889,0.0909090909091,0.294117647,0.183760684,0.25,0.466666666667,0.220078599537,0.0754716981132,0.163932249402],
    [0.0326086956522,0.0,0.0,0.00400801603206,0.0112359550562,0.000805152979066,0.00401606425703,0.0,0.037037037037,0.75,0.000880863245981,0.0,0.0,0.0,0.666666667,0.0,0.0,0.0,0.0188679245283,0.147540499867],
    [0.597826086957,0.4,0.135593220339,0.0400801603206,0.397003745318,0.352388620505,0.0160642570281,0.324324324324,0.111111111111,0.4782763535,0.249504514424,1.0,0.181818181818,0.406593407,0.195454545,0.0,0.0,0.0922537270084,0.188679245283,0.273613857004]]
    )

# define the RBM model
random_state = 200
model = BernoulliRBM(n_components=10, n_iter=10, random_state=random_state)

# building RBM and creating RBM features
# Each column means one feature, each row means one line of the train data.
RBM_feature_data = model.fit_transform(train_data)
print(RBM_feature_data)

Thank you!
Masanari Kondo

-------------- next part -------------- An HTML attachment was scrubbed... URL: From abhishekraj10 at yahoo.com Fri Jul 28 13:01:25 2017 From: abhishekraj10 at yahoo.com (Abhishek Raj) Date: Fri, 28 Jul 2017 22:31:25 +0530 Subject: [scikit-learn] Are sample weights normalized? Message-ID: Hi, I am using one class svm for binary classification and was just curious what is the range/scale for sample weights? Are they normalized internally? For example -
Sample 1, weight - 1
Sample 2, weight - 10
Sample 3, weight - 100
Does this mean Sample 3 will always be predicted as positive and sample 1 will never be predicted as positive? What about sample 2?
Also, what would happen if I assigned a high weight to the majority of the
samples and low weights to the rest? E.g., if 80% of my samples were weighted
1000 and 20% were weighted 1.

A clarification, or a link to read up on how exactly weights affect the
training process, would be really helpful.

Thanks,
Abhishek
-------------- next part --------------
An HTML attachment was scrubbed...
URL: 

From michael.eickenberg at gmail.com  Fri Jul 28 13:11:00 2017
From: michael.eickenberg at gmail.com (Michael Eickenberg)
Date: Fri, 28 Jul 2017 10:11:00 -0700
Subject: [scikit-learn] Are sample weights normalized?
In-Reply-To: 
References: 
Message-ID: 

Hi Abhishek,

think of your example as being equivalent to putting 1 copy of sample 1, 10
copies of sample 2, and 100 copies of sample 3 in a dataset and then running
your SVM. This is exactly true for some estimators and approximately true
for others, but it is always a good intuition.

Hope this helps!
Michael

On Fri, Jul 28, 2017 at 10:01 AM, Abhishek Raj via scikit-learn <
scikit-learn at python.org> wrote:

> Hi,
>
> I am using one class svm for binary classification and was just curious
> what is the range/scale for sample weights? Are they normalized internally?
> For example -
>
> Sample 1, weight - 1
> Sample 2, weight - 10
> Sample 3, weight - 100
>
> Does this mean Sample 3 will always be predicted as positive and sample 1
> will never be predicted as positive? What about sample 2?
>
> Also, what would happen if I assign a high weight to majority of the
> samples and low weights to the rest. Eg if 80% of my samples were weighted
> 1000 and 20% were weighted 1.
>
> A clarification or a link to read up on how exactly weights affect the
> training process would be really helpful.
>
> Thanks,
> Abhishek
>
> _______________________________________________
> scikit-learn mailing list
> scikit-learn at python.org
> https://mail.python.org/mailman/listinfo/scikit-learn
>
-------------- next part --------------
An HTML attachment was scrubbed...
URL: 

From abhishekraj10 at yahoo.com  Fri Jul 28 16:06:49 2017
From: abhishekraj10 at yahoo.com (Abhishek Raj)
Date: Sat, 29 Jul 2017 01:36:49 +0530
Subject: [scikit-learn] Are sample weights normalized?
In-Reply-To: 
References: 
Message-ID: 

Hi Michael, thanks for the response. Based on what you said, is it correct
to assume that weights are relative to the size of the dataset? E.g.:

if my dataset size is 200 and I have 1 of sample 1, 10 of sample 2, and 100
of sample 3, sample 3 will be given a lot of focus during training because
it is in the majority; but if my dataset size were, say, 1 million, these
weights wouldn't really affect much?

Thanks,
Abhishek

On Jul 28, 2017 10:41 PM, "Michael Eickenberg" wrote:

> Hi Abhishek,
>
> think of your example as being equivalent to putting 1 of sample 1, 10 of
> sample 2 and 100 of sample 3 in a dataset and then run your SVM.
> This is exactly true for some estimators and approximately true for
> others, but always a good intuition.
>
> Hope this helps!
> Michael
>
> On Fri, Jul 28, 2017 at 10:01 AM, Abhishek Raj via scikit-learn <
> scikit-learn at python.org> wrote:
>
>> Hi,
>>
>> I am using one class svm for binary classification and was just curious
>> what is the range/scale for sample weights? Are they normalized internally?
>> For example -
>>
>> Sample 1, weight - 1
>> Sample 2, weight - 10
>> Sample 3, weight - 100
>>
>> Does this mean Sample 3 will always be predicted as positive and sample 1
>> will never be predicted as positive? What about sample 2?
>>
>> Also, what would happen if I assign a high weight to majority of the
>> samples and low weights to the rest. Eg if 80% of my samples were weighted
>> 1000 and 20% were weighted 1.
>>
>> A clarification or a link to read up on how exactly weights affect the
>> training process would be really helpful.
>>
>> Thanks,
>> Abhishek
>>
>> _______________________________________________
>> scikit-learn mailing list
>> scikit-learn at python.org
>> https://mail.python.org/mailman/listinfo/scikit-learn
>>
>>
>
> _______________________________________________
> scikit-learn mailing list
> scikit-learn at python.org
> https://mail.python.org/mailman/listinfo/scikit-learn
>
>
-------------- next part --------------
An HTML attachment was scrubbed...
URL: 

From michael.eickenberg at gmail.com  Fri Jul 28 16:29:24 2017
From: michael.eickenberg at gmail.com (Michael Eickenberg)
Date: Fri, 28 Jul 2017 13:29:24 -0700
Subject: [scikit-learn] Are sample weights normalized?
In-Reply-To: 
References: 
Message-ID: 

Well, that will depend on how your estimator works. But in general you are
right - if you assume that samples 4 to N are weighted with the same weight
(e.g. 1) in both cases, then sample 3 will be relatively less important in
the larger training set.

On Fri, Jul 28, 2017 at 1:06 PM, Abhishek Raj via scikit-learn <
scikit-learn at python.org> wrote:

> Hi Michael, thanks for the response. Based on what you said, is it correct
> to assume that weights are relative to the size of the data set? Eg
>
> If my dataset size is 200 and I have 1 of sample 1, 10 of sample 2 and 100
> of sample 3, sample 3 will be given a lot of focus during training because
> it exists in majority, but if my dataset size was say 1 million, these
> weights wouldn't really affect much?
>
> Thanks,
> Abhishek
>
> On Jul 28, 2017 10:41 PM, "Michael Eickenberg" <
> michael.eickenberg at gmail.com> wrote:
>
>> Hi Abhishek,
>>
>> think of your example as being equivalent to putting 1 of sample 1, 10 of
>> sample 2 and 100 of sample 3 in a dataset and then run your SVM.
>> This is exactly true for some estimators and approximately true for
>> others, but always a good intuition.
>>
>> Hope this helps!
>> Michael
>>
>>
>> On Fri, Jul 28, 2017 at 10:01 AM, Abhishek Raj via scikit-learn <
>> scikit-learn at python.org> wrote:
>>
>>> Hi,
>>>
>>> I am using one class svm for binary classification and was just curious
>>> what is the range/scale for sample weights? Are they normalized internally?
>>> For example -
>>>
>>> Sample 1, weight - 1
>>> Sample 2, weight - 10
>>> Sample 3, weight - 100
>>>
>>> Does this mean Sample 3 will always be predicted as positive and sample
>>> 1 will never be predicted as positive? What about sample 2?
>>>
>>> Also, what would happen if I assign a high weight to majority of the
>>> samples and low weights to the rest. Eg if 80% of my samples were weighted
>>> 1000 and 20% were weighted 1.
>>>
>>> A clarification or a link to read up on how exactly weights affect the
>>> training process would be really helpful.
>>>
>>> Thanks,
>>> Abhishek
>>>
>>> _______________________________________________
>>> scikit-learn mailing list
>>> scikit-learn at python.org
>>> https://mail.python.org/mailman/listinfo/scikit-learn
>>>
>>>
>>
>> _______________________________________________
>> scikit-learn mailing list
>> scikit-learn at python.org
>> https://mail.python.org/mailman/listinfo/scikit-learn
>>
>>
> _______________________________________________
> scikit-learn mailing list
> scikit-learn at python.org
> https://mail.python.org/mailman/listinfo/scikit-learn
>
>
-------------- next part --------------
An HTML attachment was scrubbed...
URL: 

From yrohinkumar at gmail.com  Sun Jul 30 13:38:15 2017
From: yrohinkumar at gmail.com (Rohin Kumar)
Date: Sun, 30 Jul 2017 23:08:15 +0530
Subject: [scikit-learn] Nearest neighbor search with 2 distance measures
Message-ID: 

Dear all,

This is my first post on this forum. Maybe it is a feature request, or maybe
it is something I don't know how to get to work. My question is about the
BallTree algorithm with custom metrics.
I am working with a dataset for which I was calculating the two-point
correlation with one distance metric using the BallTree algorithm. Say:

import numpy as np
from sklearn.neighbors import BallTree

np.random.seed(0)
X = np.random.random((30, 3))
r = np.linspace(0, 1, 5)
tree = BallTree(X, metric='euclidean')
tree.two_point_correlation(X, r)

Now, I want to calculate the two-point correlation based on two different
metrics. Imagine I want to find the correlation based on the distances in
the XZ and YZ planes - grouping neighbors based on two distances instead of
one. Say I want to find the correlation within r1 and r2 bins based on two
different distance metrics, something like:

r1 = np.linspace(0, 1, 5)
r2 = np.linspace(0, 1, 5)
tree = BallTree(X, metric1='euclidean2D', metric2='euclidean2D')
tree.two_point_correlation(X, r1, r2)

How can I go about doing that? The goal is to get a contour plot of the
two-point correlation with r1 and r2 as axes.

Any help on this would be great!

Thanks in advance,
Rohin.
-------------- next part --------------
An HTML attachment was scrubbed...
URL: 

From AlexeyUm at yandex.ru  Sun Jul 30 13:40:21 2017
From: AlexeyUm at yandex.ru (Alexey Umnov)
Date: Sun, 30 Jul 2017 20:40:21 +0300
Subject: [scikit-learn] Nearest neighbor search with 2 distance measures
Message-ID: <379121501436421@mxfront4j.mail.yandex.net>

Hello! I am currently on vacation and will be back on August 15.

-- 
Alexey Umnov

From yrohinkumar at gmail.com  Sun Jul 30 14:18:29 2017
From: yrohinkumar at gmail.com (Rohin Kumar)
Date: Sun, 30 Jul 2017 23:48:29 +0530
Subject: [scikit-learn] Nearest neighbor search with 2 distance measures
In-Reply-To: <379121501436421@mxfront4j.mail.yandex.net>
References: <379121501436421@mxfront4j.mail.yandex.net>
Message-ID: 

*update*

Maybe it doesn't have to be done at the tree creation level. It could be
done using loops and creating two different ball trees.
Something like:

tree1 = BallTree(X, metric='metric1')  # for the x-z plane
tree2 = BallTree(X, metric='metric2')  # for the y-z plane

And then calculate the correlation functions in a loop to get tpcf(X, r1, r2)
using tree1.two_point_correlation(X, r1) and tree2.two_point_correlation(X, r2).
-------------- next part --------------
An HTML attachment was scrubbed...
URL: 

From jakevdp at cs.washington.edu  Mon Jul 31 10:46:31 2017
From: jakevdp at cs.washington.edu (Jacob Vanderplas)
Date: Mon, 31 Jul 2017 07:46:31 -0700
Subject: [scikit-learn] Nearest neighbor search with 2 distance measures
In-Reply-To: 
References: <379121501436421@mxfront4j.mail.yandex.net>
Message-ID: 

On Sun, Jul 30, 2017 at 11:18 AM, Rohin Kumar wrote:

> *update*
>
> May be it doesn't have to be done at the tree creation level. It could be
> using loops and creating two different balltrees. Something like
>
> tree1=BallTree(X,metric='metric1') #for x-z plane
> tree2=BallTree(X,metric='metric2') #for y-z plane
>
> And then calculate correlation functions in a loop to get tpcf(X,r1,r2)
> using tree1.two_point_correlation(X,r1) and tree2.two_point_correlation(X,r2)
>
Hi Rohin,
It's not exactly clear to me what you wish the tree to do with the two
different metrics, but in any case the ball tree only supports one metric
at a time.
If you can construct your desired result from two ball trees each with its own metric, then that's probably the best way to proceed, Jake > > _______________________________________________ > scikit-learn mailing list > scikit-learn at python.org > https://mail.python.org/mailman/listinfo/scikit-learn > > -------------- next part -------------- An HTML attachment was scrubbed... URL:
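The two-tree approach Jake describes can be sketched concretely. This is only a sketch, not part of BallTree's API: instead of a custom "euclidean2D" metric (a hypothetical name from the question), each tree is built on a column projection of X, which gives the same planar distances with the built-in Euclidean metric; and because the two marginal pair counts alone do not determine the joint (r1, r2) counts needed for a contour plot, those are brute-forced with scipy, which is fine for a small dataset:

```python
import numpy as np
from scipy.spatial.distance import pdist
from sklearn.neighbors import BallTree

np.random.seed(0)
X = np.random.random((30, 3))
r = np.linspace(0, 1, 5)

# One tree per metric: project out the ignored coordinate and use an
# ordinary Euclidean tree on the remaining two columns
# (x-z plane and y-z plane, respectively).
X_xz = X[:, [0, 2]]
X_yz = X[:, [1, 2]]
tree_xz = BallTree(X_xz, metric='euclidean')
tree_yz = BallTree(X_yz, metric='euclidean')

# Marginal pair counts, one radius grid per metric.
counts_xz = tree_xz.two_point_correlation(X_xz, r)
counts_yz = tree_yz.two_point_correlation(X_yz, r)

# Joint (r1, r2) pair counts, brute-forced over all pairs.
d_xz = pdist(X_xz)  # condensed pairwise distances in the x-z plane
d_yz = pdist(X_yz)  # distances for the same pairs in the y-z plane
joint, _, _ = np.histogram2d(d_xz, d_yz, bins=[r, r])
# joint[i, j] counts pairs with x-z distance in [r[i], r[i+1]) and
# y-z distance in [r[j], r[j+1]) - the grid to feed a contour plot.
```

For large datasets the pdist step costs O(n^2) time and memory, so the pairwise computation would need chunking (or a dual-tree traversal) there; the two single-metric trees remain useful for the marginal counts either way.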