From yusuke.nishioka.0713 at gmail.com  Mon Nov  6 21:42:45 2017
From: yusuke.nishioka.0713 at gmail.com (Yusuke Nishioka)
Date: Tue, 7 Nov 2017 11:42:45 +0900
Subject: [scikit-learn] Question about dummy coding using DictVectorizer or FeatureHasher: generating correlated dimensions
Message-ID:

Hello,

I have a question about dummy coding using DictVectorizer or FeatureHasher.

```
>>> from sklearn.feature_extraction import DictVectorizer, FeatureHasher
>>> D = [{'age': 23, 'gender': 'm'},{'age': 34, 'gender': 'f'},{'age': 18, 'gender': 'f'},{'age': 50, 'gender': 'm'}]
>>> m1 = FeatureHasher(n_features=10)
>>> m1.fit_transform(D).toarray()
array([[ 0., 0., -1., 0., 0., 0., 0., 0., 0., 23.],
       [ 0., 0., 0., 0., 0., 0., 0., 0., 1., 34.],
       [ 0., 0., 0., 0., 0., 0., 0., 0., 1., 18.],
       [ 0., 0., -1., 0., 0., 0., 0., 0., 0., 50.]])
>>> m2 = DictVectorizer(sparse=False)
>>> m2.fit_transform(D)
array([[ 23., 0., 1.],
       [ 34., 1., 0.],
       [ 18., 1., 0.],
       [ 50., 0., 1.]])
>>> m2.feature_names_
['age', 'gender=f', 'gender=m']
```

Since both DictVectorizer and FeatureHasher generate dimensions for 'gender=m' and 'gender=f', these dimensions are perfectly correlated. This is because DictVectorizer and FeatureHasher by default generate n dimensions for the n categorical values of one feature.

My questions are as follows:

1. I expected them to generate n-1 dimensions for n categorical values. Is there any way to do this using DictVectorizer or FeatureHasher?
2. How should I handle these correlated dimensions? In my understanding, training on data with collinearity makes predictions unstable. Will L1 or L2 regularization work for this problem?

If there is any issue or article related to these questions, would you please tell me the URL?

Thank you.

Regards,
Yusuke
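Regarding question 1, one possible workaround is to post-process the vectorizer output rather than rely on an option of the vectorizer itself. The sketch below is only an illustration under stated assumptions: it takes the `feature_names_` list and the array printed in the message above, treats every `name=value` column as a dummy column, and drops the first level of each feature as the reference level. The helper name `drop_first_level` is hypothetical, and plain Python lists stand in for the NumPy array.

```python
def drop_first_level(feature_names, rows, separator="="):
    """Drop the first dummy column of each categorical feature.

    Columns named 'feature=value' are treated as dummy columns; the first
    one seen for each feature becomes the reference level and is removed.
    """
    seen = set()
    keep = []
    for name in feature_names:
        prefix, sep, _ = name.partition(separator)
        if sep and prefix not in seen:
            seen.add(prefix)      # first level of this feature: drop it
            keep.append(False)
        else:
            keep.append(True)     # numeric column or a later level: keep it
    kept_names = [n for n, k in zip(feature_names, keep) if k]
    kept_rows = [[v for v, k in zip(row, keep) if k] for row in rows]
    return kept_names, kept_rows


# Values copied from the DictVectorizer output in the message above.
names = ["age", "gender=f", "gender=m"]
X = [[23.0, 0.0, 1.0], [34.0, 1.0, 0.0], [18.0, 1.0, 0.0], [50.0, 0.0, 1.0]]

reduced_names, reduced_X = drop_first_level(names, X)
print(reduced_names)
print(reduced_X)
```

Here 'gender=f' becomes the implicit reference level and only 'gender=m' remains next to 'age'. This approach relies on named columns, so it does not carry over to FeatureHasher, whose hashed columns have no feature names.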
From gael.varoquaux at normalesup.org  Thu Nov  9 10:58:46 2017
From: gael.varoquaux at normalesup.org (Gael Varoquaux)
Date: Thu, 9 Nov 2017 16:58:46 +0100
Subject: [scikit-learn] New core devs: Hanmin Qin, Guillaume Lemaître, and Roman Yurchak
Message-ID: <20171109155846.GF1150313@phare.normalesup.org>

Hi scikit-learn community,

A week ago, we added 3 core developers, but I think that we forgot to announce it. So let me please welcome on board Hanmin Qin, Guillaume Lemaître, and Roman Yurchak. They have been very active in the development of the project, and very helpful in the review process.

It's a pleasure to see the team growing.

Gaël

From olivier.grisel at ensta.org  Thu Nov  9 11:36:22 2017
From: olivier.grisel at ensta.org (Olivier Grisel)
Date: Thu, 9 Nov 2017 17:36:22 +0100
Subject: [scikit-learn] New core devs: Hanmin Qin, Guillaume Lemaître, and Roman Yurchak
In-Reply-To: <20171109155846.GF1150313@phare.normalesup.org>
References: <20171109155846.GF1150313@phare.normalesup.org>
Message-ID:

Congrats to all three of you! Thank you very much for your contributions and in particular in reviewing contributions by others.

--
Olivier

From jmschreiber91 at gmail.com  Fri Nov 10 03:34:56 2017
From: jmschreiber91 at gmail.com (Jacob Schreiber)
Date: Fri, 10 Nov 2017 00:34:56 -0800
Subject: [scikit-learn] New core devs: Hanmin Qin, Guillaume Lemaître, and Roman Yurchak
In-Reply-To:
References: <20171109155846.GF1150313@phare.normalesup.org>
Message-ID:

Congrats! Welcome to the team, and thanks for your hard work so far.

On Thu, Nov 9, 2017 at 8:36 AM, Olivier Grisel wrote:

> Congrats to all three of you! Thank you very much for your contributions
> and in particular in reviewing contributions by others.
>
> --
> Olivier
>
> _______________________________________________
> scikit-learn mailing list
> scikit-learn at python.org
> https://mail.python.org/mailman/listinfo/scikit-learn

From shane.grigsby at colorado.edu  Tue Nov 14 18:27:12 2017
From: shane.grigsby at colorado.edu (Shane Grigsby)
Date: Tue, 14 Nov 2017 16:27:12 -0700
Subject: [scikit-learn] Custom Distance Metric / Distance Matrix with K-means?
Message-ID: <20171114232712.hkew6wjy2drarl7n@espgs-MacBook-Pro.local>

Hello,

I'd like to be able to cluster data using either k-means or mini-batch-kmeans for a toroidal geometry. I know that if I was using DBSCAN I could pass in a pre-computed distance matrix to do this; if I was using OPTICS I could pass in a 'metric' keyword for distance and specify a custom distance metric.

Is this possible for K-means / minibatch-kmeans? I don't see distance metrics documented as possible keyword arguments... but perhaps they're allowed as **kwargs that pass to the underlying distance calculation call?

Thanks,
Shane

--
*PhD candidate & Research Assistant*
*Cooperative Institute for Research in Environmental Sciences (CIRES)*
*University of Colorado at Boulder*

From joel.nothman at gmail.com  Tue Nov 14 18:50:16 2017
From: joel.nothman at gmail.com (Joel Nothman)
Date: Wed, 15 Nov 2017 10:50:16 +1100
Subject: [scikit-learn] Custom Distance Metric / Distance Matrix with K-means?
In-Reply-To: <20171114232712.hkew6wjy2drarl7n@espgs-MacBook-Pro.local>
References: <20171114232712.hkew6wjy2drarl7n@espgs-MacBook-Pro.local>
Message-ID:

No, it's not applicable to KMeans. There are related algorithms that support custom metrics, e.g. K Medoids (a pull request to scikit-learn is here https://github.com/scikit-learn/scikit-learn/pull/7694 but implementations exist in other libraries).

Cheers,
Joel
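For the precomputed-matrix route that Shane mentions for DBSCAN, the toroidal distances could be computed along the following lines. This is a minimal sketch, assuming a flat torus in which every coordinate wraps around with a known period; the function names `torus_distance` and `torus_distance_matrix` and the example periods are hypothetical, not part of scikit-learn.

```python
import math


def torus_distance(a, b, periods):
    """Euclidean distance on a flat torus with the given period per axis."""
    total = 0.0
    for x, y, period in zip(a, b, periods):
        d = abs(x - y) % period
        d = min(d, period - d)   # go the shorter way around the axis
        total += d * d
    return math.sqrt(total)


def torus_distance_matrix(points, periods):
    """Full symmetric matrix of pairwise toroidal distances."""
    return [[torus_distance(p, q, periods) for q in points] for p in points]


# Two angles in degrees, both wrapping at 360.
points = [(10.0, 350.0), (350.0, 10.0), (180.0, 180.0)]
D = torus_distance_matrix(points, periods=(360.0, 360.0))
print(D[0][1])
```

The first two points are 340 degrees apart on each axis when measured naively, but only 20 degrees apart on the torus. A matrix like `D` is the kind of input that `DBSCAN(metric='precomputed')` accepts; as Joel notes, KMeans offers no equivalent.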
From timo.erkkila at gmail.com  Wed Nov 15 00:06:21 2017
From: timo.erkkila at gmail.com (Timo Erkkilä)
Date: Wed, 15 Nov 2017 07:06:21 +0200
Subject: [scikit-learn] Custom Distance Metric / Distance Matrix with K-means?
In-Reply-To:
References: <20171114232712.hkew6wjy2drarl7n@espgs-MacBook-Pro.local>
Message-ID:

Shall we finish that PR? :) I would have time to work on it again. I recall the only work left is to ensure the code works with the latest sklearn version.

-Timo

From joel.nothman at gmail.com  Wed Nov 15 00:17:18 2017
From: joel.nothman at gmail.com (Joel Nothman)
Date: Wed, 15 Nov 2017 16:17:18 +1100
Subject: [scikit-learn] Custom Distance Metric / Distance Matrix with K-means?
In-Reply-To:
References: <20171114232712.hkew6wjy2drarl7n@espgs-MacBook-Pro.local>
Message-ID:

There was certainly not much more to do on #7694, but that's where Kornel Kiełczewski had taken on completing your work. I suppose you could take it back again!

From shiduan at ucdavis.edu  Thu Nov 16 03:18:30 2017
From: shiduan at ucdavis.edu (Shiheng Duan)
Date: Thu, 16 Nov 2017 00:18:30 -0800
Subject: [scikit-learn] Issue with Sihouette_samples
Message-ID:

Hi all,

I am doing clustering work and want to use the silhouette score to determine the number of clusters, but I get a MemoryError when executing silhouette_samples. I searched for it and found something related to numpy, but I cannot reproduce the numpy error. Is there any solution to it?

The data is 621*1405*12.

Thanks!

From l.lomasto at innovationengineering.eu  Thu Nov 16 04:14:02 2017
From: l.lomasto at innovationengineering.eu (Luigi Lomasto)
Date: Thu, 16 Nov 2017 10:14:02 +0100
Subject: [scikit-learn] Issue with Sihouette_samples
In-Reply-To:
References:
Message-ID: <609B3BDB-B8CA-4C82-A075-F9351FF0627E@innovationengineering.eu>

Hi Shiheng,

You can take a look at this link: https://github.com/biolab/orange3/issues/1502

You have a 3-dimensional problem, right? For each feature you have 12 values, so your RAM is probably too small. How much RAM does your PC have?

Let me know,

Luigi

From shiduan at ucdavis.edu  Thu Nov 16 13:46:10 2017
From: shiduan at ucdavis.edu (Shiheng Duan)
Date: Thu, 16 Nov 2017 10:46:10 -0800
Subject: [scikit-learn] Issue with Sihouette_samples
In-Reply-To: <609B3BDB-B8CA-4C82-A075-F9351FF0627E@innovationengineering.eu>
References: <609B3BDB-B8CA-4C82-A075-F9351FF0627E@innovationengineering.eu>
Message-ID:

Hi Luigi,

Actually my data has 621*1405 points and each point has 12 features. I made it into a 2-D array and kmeans works well. The last time I ran it, it used 64G of RAM on a cluster. I don't know how much more RAM I can use.

BTW, issue 1502 is about Orange. Is it the same with sklearn?

Thanks.
From joel.nothman at gmail.com  Thu Nov 16 15:44:46 2017
From: joel.nothman at gmail.com (Joel Nothman)
Date: Fri, 17 Nov 2017 07:44:46 +1100
Subject: [scikit-learn] Issue with Sihouette_samples
In-Reply-To:
References: <609B3BDB-B8CA-4C82-A075-F9351FF0627E@innovationengineering.eu>
Message-ID:

https://github.com/scikit-learn/scikit-learn/pull/7177 makes silhouette more memory-efficient. Try that branch?
From info at orges-leka.de  Sat Nov 25 13:34:42 2017
From: info at orges-leka.de (Orges Leka)
Date: Sat, 25 Nov 2017 19:34:42 +0100
Subject: [scikit-learn] Rapid Outlier Detection via Sampling
Message-ID:

Dear scikit-learn Developers,

My name is Orges Leka and I would like to implement "Rapid Outlier Detection via Sampling" [1] in scikit-learn. In R this method is already available [2] from the authors of the method. In Python I have not seen any implementation yet.

The method is very simple yet effective, as the authors show. First one selects, say, 20 points. Then, for every other point, one computes the shortest distance to these 20 points; this distance is that point's outlier score. It would be nice to implement this with different metrics / distances (Euclidean, Manhattan, or others).

How would I start the implementation? I have already git-cloned scikit-learn on my PC. Do I need to write object oriented or are functions also ok?

If this succeeds, I would also like to extend the "example-outliers" doc with the above method.

Kind regards
Dipl. Math. Orges Leka

[1] https://papers.nips.cc/paper/5127-rapid-distance-based-outlier-detection-via-sampling.pdf
[2] https://github.com/mahito-sugiyama/sampling-outlier-detection
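The scoring step described in the message above can be sketched in a few lines. This follows only the description given in this thread (sample a few points, then score each point by its distance to the nearest sampled point) and is not the authors' reference implementation; the function name `sampling_outlier_scores`, the default sample size, the seed handling, and the choice not to compare a sampled point with itself are all assumptions.

```python
import math
import random


def sampling_outlier_scores(points, n_samples=20, metric=math.dist, seed=0):
    """Score each point by its distance to the nearest sampled point."""
    rng = random.Random(seed)
    sample_idx = rng.sample(range(len(points)), min(n_samples, len(points)))
    if len(sample_idx) < 2:
        raise ValueError("need at least two sampled points")
    scores = []
    for i, p in enumerate(points):
        # Assumption: a sampled point is not compared with itself.
        dists = [metric(p, points[j]) for j in sample_idx if j != i]
        scores.append(min(dists))
    return scores


def manhattan(a, b):
    """Alternative metric, to show that the distance is pluggable."""
    return sum(abs(x - y) for x, y in zip(a, b))


# A 6 x 5 grid of ordinary points plus one point far away from the grid.
points = [(float(i), float(j)) for i in range(6) for j in range(5)]
points.append((100.0, 100.0))

scores = sampling_outlier_scores(points, n_samples=20)
scores_l1 = sampling_outlier_scores(points, n_samples=20, metric=manhattan)
print(scores.index(max(scores)), scores_l1.index(max(scores_l1)))
```

With either metric the far-away point receives the largest score, because every grid point has a sampled neighbour within the grid while the isolated point does not.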
From gael.varoquaux at normalesup.org  Sat Nov 25 14:28:24 2017
From: gael.varoquaux at normalesup.org (Gael Varoquaux)
Date: Sat, 25 Nov 2017 20:28:24 +0100
Subject: [scikit-learn] Rapid Outlier Detection via Sampling
In-Reply-To:
References:
Message-ID: <20171125192824.GG3969112@phare.normalesup.org>

Dear Orges,

I can see only 33 citations on Google Scholar for this paper. As detailed in the inclusion criteria of scikit-learn:
http://scikit-learn.org/stable/faq.html#what-are-the-inclusion-criteria-for-new-algorithms
I am afraid that we need many more citations to include this algorithm.

However, you could submit it for inclusion to scikit-learn-contrib:
http://contrib.scikit-learn.org/

Best,

Gaël

--
Gael Varoquaux
Researcher, INRIA Parietal
NeuroSpin/CEA Saclay, Bat 145, 91191 Gif-sur-Yvette France
Phone: ++ 33-1-69-08-79-68
http://gael-varoquaux.info http://twitter.com/GaelVaroquaux

From olivier.grisel at ensta.org  Mon Nov 27 03:45:22 2017
From: olivier.grisel at ensta.org (Olivier Grisel)
Date: Mon, 27 Nov 2017 09:45:22 +0100
Subject: [scikit-learn] Rapid Outlier Detection via Sampling
In-Reply-To: <20171125192824.GG3969112@phare.normalesup.org>
References: <20171125192824.GG3969112@phare.normalesup.org>
Message-ID:

> Do I need to write object oriented or are functions also ok?

If you want to contribute an implementation as a new project on scikit-learn contrib, you should be careful to follow the scikit-learn estimators API:

http://scikit-learn.org/dev/developers/contributing.html#apis-of-scikit-learn-objects

For outlier detection in particular, you should make sure your new estimator is consistent with the API conventions of other methods already in scikit-learn:

http://scikit-learn.org/dev/modules/outlier_detection.html

One of the primary goals of the scikit-learn ecosystem is to provide a simple homogeneous API to a very heterogeneous set of methods.

--
Olivier
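To illustrate the conventions Olivier points to, the skeleton below shows how such a detector might be shaped as an estimator: the constructor only stores hyperparameters, `fit` returns `self`, learned attributes end with an underscore, `decision_function` is negative for outliers, and `predict` returns +1 for inliers and -1 for outliers. It is a hedged sketch rather than a ready contribution: the class name `SamplingOutlierDetector`, its parameters, and the threshold rule are hypothetical, and a real estimator would likely subclass `sklearn.base.BaseEstimator` and operate on arrays instead of lists of tuples.

```python
import math
import random


class SamplingOutlierDetector:
    """Hypothetical detector shaped like a scikit-learn outlier estimator."""

    def __init__(self, n_samples=20, contamination=0.1, random_state=0):
        # The constructor only stores hyperparameters.
        self.n_samples = n_samples
        self.contamination = contamination
        self.random_state = random_state

    def _raw_scores(self, X):
        # Distance to the nearest sampled point; larger means more abnormal.
        return [min(math.dist(x, s) for s in self.samples_) for x in X]

    def fit(self, X, y=None):
        rng = random.Random(self.random_state)
        k = min(self.n_samples, len(X))
        self.samples_ = rng.sample(list(X), k)
        train = sorted(self._raw_scores(X))
        # Threshold set so that roughly `contamination` of the training
        # points score above it.
        cut = min(len(train) - 1, int((1.0 - self.contamination) * len(train)))
        self.threshold_ = train[cut]
        return self

    def decision_function(self, X):
        # Negative values mark outliers.
        return [self.threshold_ - s for s in self._raw_scores(X)]

    def predict(self, X):
        return [1 if d >= 0 else -1 for d in self.decision_function(X)]


X_train = [(float(i), float(j)) for i in range(6) for j in range(5)]
detector = SamplingOutlierDetector(n_samples=10).fit(X_train)
train_labels = detector.predict(X_train)
far_label = detector.predict([(100.0, 100.0)])
print(far_label, train_labels.count(-1))
```

Keeping the sampled points and the threshold as fitted attributes means the same object can score new data after `fit`, which is what lets it be swapped for the detectors already in the library.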
From jeff1evesque at yahoo.com  Mon Nov 27 18:26:26 2017
From: jeff1evesque at yahoo.com (Jeffrey Levesque)
Date: Mon, 27 Nov 2017 18:26:26 -0500
Subject: [scikit-learn] Jeff Levesque: sklearn + D3JS
Message-ID: <67A950D6-C54E-4353-AE28-A6D03EAE4AEC@yahoo.com>

Hi,

I'm developing an API for sklearn:

- https://github.com/jeff1evesque/machine-learning

I was wondering if anyone had integrated visualization tools, like D3JS, with results from sklearn predictions? If so, would any of you be willing to show how the backend results were piped into JavaScript?

PS. If anyone is willing to contribute, or help in any way, the codebase is BSD-licensed.

Thank you,

Jeff Levesque
https://github.com/jeff1evesque

From info at orges-leka.de  Tue Nov 28 03:04:07 2017
From: info at orges-leka.de (Orges Leka)
Date: Tue, 28 Nov 2017 09:04:07 +0100
Subject: [scikit-learn] 1. Re: Rapid Outlier Detection via Sampling (Olivier Grisel)
Message-ID:

Dear Olivier and Gael,

Thank you for your answer. I started a request for inclusion in scikit-learn-contrib. The repo can be found here:

https://github.com/orgesleka/rapid-outlier-detection

Kind regards
Orges Leka
From info at orges-leka.de  Thu Nov 30 03:21:12 2017
From: info at orges-leka.de (Orges Leka)
Date: Thu, 30 Nov 2017 09:21:12 +0100
Subject: [scikit-learn] Webservice that uses scikit-learn
Message-ID:

Dear scikit-learn developers,

I have developed a small webservice which can hold multiple scikit-learn models and serve JSON POST requests for prediction. A model must have model.metadata and must implement model.transform_predict(newdata).

There are two examples:

- BostonModel, where only predict is overridden from WebModel
- IrisModel, where predict and transform are overridden from WebModel

The idea is that, while fitting a model, you may produce some metadata which are needed for prediction. These metadata are stored as a Python dictionary. The metadata could hold, for example:

- the version of the model when it was created
- additional pandas.DataFrames needed for prediction
- constants needed in the predict computation
- metrics about the model
- etc.

The repo can be found here: https://github.com/orgesleka/webscikit

It comes with two examples: iris and boston. The server can load other models at runtime, in case one is changing the models.

The repo is meant as a proof of concept. If somebody has ideas on how to improve things or add new features, that would be great.

To get started, see: https://github.com/orgesleka/webscikit/wiki/Getting-started

Kind regards
Orges Leka
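The contract described above (a model that carries `metadata` and implements `transform_predict(newdata)`) might look roughly like the sketch below. This is not code from the webscikit repository: `WebModel` is named in the message, but the method bodies, the toy `MeanModel` subclass, the `handle_request` helper, and the JSON layout are assumptions made purely for illustration.

```python
import json


class WebModel:
    """Hypothetical base class: metadata plus a transform/predict pipeline."""

    def __init__(self, metadata=None):
        # Anything gathered at fit time that prediction needs later.
        self.metadata = dict(metadata or {})

    def transform(self, newdata):
        return newdata            # default: no preprocessing

    def predict(self, features):
        raise NotImplementedError

    def transform_predict(self, newdata):
        return self.predict(self.transform(newdata))


class MeanModel(WebModel):
    """Toy model that predicts the mean of each shifted input row."""

    def transform(self, newdata):
        offset = self.metadata["offset"]     # constant stored at fit time
        return [[v - offset for v in row] for row in newdata]

    def predict(self, features):
        return [sum(row) / len(row) for row in features]


def handle_request(models, body):
    """Turn a JSON request body into a JSON response body."""
    request = json.loads(body)
    model = models[request["model"]]
    predictions = model.transform_predict(request["data"])
    return json.dumps({"version": model.metadata.get("version"),
                       "predictions": predictions})


models = {"mean": MeanModel({"version": "0.1", "offset": 1.0})}
response = handle_request(
    models, '{"model": "mean", "data": [[1, 2, 3], [4, 4, 4]]}')
print(response)
```

Keeping the models in a plain dictionary keyed by name is one simple way to let a server add or replace models at runtime, as the message describes.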