From chris at aridas.eu  Sun Apr  1 19:47:21 2018
From: chris at aridas.eu (Chris Aridas)
Date: Mon, 2 Apr 2018 02:47:21 +0300
Subject: [scikit-learn] Get parameters of classes in a Pipeline within cross_validate

Hi Roberto,

One option could be to make a wrapper and serialize your pipeline in your wrapper's fit method. After the serialization you could load the pipeline anytime and inspect whatever you want. I have coded an example in the following gist.

https://gist.github.com/chkoar/2993a6e3f6bae1887eabc3fa27bb06a6

Best,
Chris

On Thu, Mar 29, 2018 at 12:16 PM, Roberto Guidotti wrote:
> Hi scikit-learners,
>
> I have a simple Pipeline with Feature Selection and SVC classifier and I use it in a cross validation schema with cross_validate / cross_validation_score functions.
> I need to extract the selected features for each fold of the CV and in general get information about the fitted elements of the pipeline in each of the CV folds.
>
> Is there a way to get this information (e.g. fs.get_support() or fs.scores_) or do I need to build my own cross_validate function?
>
> Thank you,
> Roberto
>
> --
> Ing. Roberto Guidotti, PhD.
> PostDoc Fellow
> Institute for Advanced Biomedical Technologies - ITAB
> Department of Neuroscience and Imaging
> University of Chieti "G. D'Annunzio"
> Via dei Vestini, 33
> 66013 Chieti, Italy
> tel: +39 0871 3556919
> e-mail: r.guidotti at unich.it; rguidotti at acm.org
> linkedin: http://it.linkedin.com/in/robertogui/
> twitter: @robbisg
> github: https://github.com/robbisg
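A minimal sketch of the serialize-in-fit wrapper Chris describes above, distinct from the linked gist; the names here (DumpingClassifier, dump_dir) are made up for illustration:

```python
import os
import pickle
import uuid

from sklearn.base import BaseEstimator, ClassifierMixin, clone


class DumpingClassifier(BaseEstimator, ClassifierMixin):
    """Fit a clone of the wrapped estimator and pickle the fitted result."""

    def __init__(self, estimator, dump_dir="cv_dumps"):
        self.estimator = estimator
        self.dump_dir = dump_dir

    def fit(self, X, y=None, **fit_params):
        self.estimator_ = clone(self.estimator).fit(X, y, **fit_params)
        os.makedirs(self.dump_dir, exist_ok=True)
        path = os.path.join(self.dump_dir, "fold_%s.pkl" % uuid.uuid4().hex)
        with open(path, "wb") as f:
            pickle.dump(self.estimator_, f)  # one pickle per fit, i.e. per CV fold
        return self

    def predict(self, X):
        return self.estimator_.predict(X)

    def score(self, X, y):
        return self.estimator_.score(X, y)
```

Passing `DumpingClassifier(pipeline)` to cross_validate leaves one pickle per fold in `cv_dumps/`; loading a pickle and calling, e.g., `named_steps["fs"].get_support()` on it recovers the features selected in that fold (assuming the feature-selection step is named "fs").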
From chris at aridas.eu  Sun Apr  1 19:47:55 2018
From: chris at aridas.eu (Chris Aridas)
Date: Mon, 2 Apr 2018 02:47:55 +0300
Subject: [scikit-learn] Error random_state parameter changed by estimator

Hey Manoj,

I think that the following link can help you to solve your problem.

http://scikit-learn.org/stable/developers/contributing.html#random-numbers

Best,
Chris

On Sat, Mar 31, 2018 at 5:38 AM, Manoj Karthick wrote:
> I am working on adding a new estimator to the scikit-learn library, but the make command always exits with the below error message:
>
> AssertionError: Estimator XYZ should not change or mutate the parameter random_state from 0 to during fit.
>
> Can you help me understand what the issue is?
>
> Error log:
>
> self =
> msg = 'Estimator XYZ should not change or mutate the parameter random_state from 0 to during fit.'
>
> def fail(self, msg=None):
>     """Fail immediately, with the given message."""
>     raise self.failureException(msg)
> E AssertionError: Estimator XYZ should not change or mutate the parameter random_state from 0 to during fit.
>
> Thanks in advance,
> Manoj Karthick Selva Kumar

From randalljellis at gmail.com  Sun Apr  1 21:36:43 2018
From: randalljellis at gmail.com (Randy Ellis)
Date: Sun, 1 Apr 2018 21:36:43 -0400
Subject: [scikit-learn] NearestNeighbors without replacement

Hello to the Scikit-learn community!

I am doing case-control matching for an electronic health records study. My question is, is it possible to run Sklearn's NearestNeighbors function without replacement? As in, match the treated group to the untreated group without re-using any of the untreated group data points? If so, how? By default, it uses replacement. I know this because I tested it on some data of mine.

The code I used is in the confirmed answer here:
https://stats.stackexchange.com/questions/206832/matched-pairs-in-python-propensity-score-matching

Thanks so much in advance,

--
Randall J. Ellis, B.S.
PhD Student, Biomedical Science, Mount Sinai
Special Volunteer, http://www.michaelideslab.org/, NIDA IRP
Cell: (954)-260-9891

From jakevdp at cs.washington.edu  Sun Apr  1 22:13:01 2018
From: jakevdp at cs.washington.edu (Jacob Vanderplas)
Date: Sun, 1 Apr 2018 19:13:01 -0700
Subject: [scikit-learn] NearestNeighbors without replacement

On Sun, Apr 1, 2018 at 6:36 PM, Randy Ellis wrote:
> Hello to the Scikit-learn community!
> [...]

No, pairwise matching without replacement is not implemented within scikit-learn's nearest neighbors routines.

It seems like an algorithm you would have to think carefully about because the number of potential pairs grows exponentially with the number of points, and I don't think it's true that choosing the nearest available neighbor of points in sequence will guarantee you to find the optimal configuration. You'd also have to carefully define what you mean by "optimal"... are you seeking to minimize the sum of all distances? The sum of squared distances? The maximum distance? The results would change depending on the metric you define. And you'd probably have to figure out some way to reduce the exponential search space in order to calculate the result in a reasonable amount of time for your data.

You might look into the literature on propensity score matching; I think that's one area where this kind of neighbors-without-replacement algorithm is often used.

Best,
Jake
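The "nearest available neighbor in sequence" strategy Jake mentions is easy to sketch, and writing it out makes his caveat concrete: the result depends on the order in which the cases are visited, so it is a heuristic rather than an optimal matching. A rough sketch, not part of scikit-learn:

```python
import numpy as np
from sklearn.metrics import pairwise_distances


def greedy_match_without_replacement(cases, controls):
    """Match each case to its nearest not-yet-used control (greedy)."""
    D = pairwise_distances(cases, controls)   # shape (n_cases, n_controls)
    used = np.zeros(D.shape[1], dtype=bool)
    matches = np.empty(D.shape[0], dtype=int)
    for i in range(D.shape[0]):               # visiting order matters!
        row = np.where(used, np.inf, D[i])    # mask controls already taken
        matches[i] = np.argmin(row)
        used[matches[i]] = True
    return matches                            # matches[i] = control index for case i
```

This needs at least as many controls as cases, and cases visited late can be left with poor matches even when a better global pairing exists.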
From randalljellis at gmail.com  Mon Apr  2 13:47:51 2018
From: randalljellis at gmail.com (Randy Ellis)
Date: Mon, 2 Apr 2018 13:47:51 -0400
Subject: [scikit-learn] NearestNeighbors without replacement

Hi Jake,

Thanks for the reply. Yes, trying this out resulted from looking for ways in python to implement propensity score matching. I found a package, pscore_match (http://www.kellieottoboni.com/pscore_match/), but the matching was really terrible. Specifically, I'm matching based on age, race, gender, HIV status, hepatitis C status, and sickle-cell disease status. Using NearestNeighbors for matching performed WAY better; I was so surprised at how well every factor was matched for. The only issue is that it uses replacement.

Here's what I'm currently testing. I need each case to match to 20 controls, so since NearestNeighbors uses replacement, I'm matching each case to many controls (15000), taking all of the distances for all of the pairs, and retaining only the smallest distances for each control. Since many controls are re-used (since the algorithm uses replacement), the hope is that enough controls are matched to many different cases so that each case ends up being matched to 20 unique controls. Does this method make sense??

Best,

Randy

On Sun, Apr 1, 2018 at 10:13 PM, Jacob Vanderplas wrote:
> No, pairwise matching without replacement is not implemented within scikit-learn's nearest neighbors routines.
> [...]
From jakevdp at cs.washington.edu  Mon Apr  2 14:15:29 2018
From: jakevdp at cs.washington.edu (Jacob Vanderplas)
Date: Mon, 2 Apr 2018 11:15:29 -0700
Subject: [scikit-learn] NearestNeighbors without replacement

Hi Randy,
I think that approach is probably a good heuristic, but it will not necessarily find the optimal result. That said, if you don't care about having guarantees that you're finding the optimal pairing, but only that you can find a reasonable set of pairs, it will probably work out fine.
Jake

Jake VanderPlas
Senior Data Science Fellow
Director of Open Software
University of Washington eScience Institute

On Mon, Apr 2, 2018 at 10:47 AM, Randy Ellis wrote:
> Hi Jake,
>
> Thanks for the reply. Yes, trying this out resulted from looking for ways in python to implement propensity score matching.
> [...]
From randalljellis at gmail.com  Mon Apr  2 14:18:28 2018
From: randalljellis at gmail.com (Randy Ellis)
Date: Mon, 2 Apr 2018 14:18:28 -0400
Subject: [scikit-learn] NearestNeighbors without replacement

Hi Jake,

Thank you for the feedback. Yeah, working without replacement, certain cases are going to get more appropriate matches than others. I proposed the idea of using replacement and compensating for the re-use of controls with frequency weighting, but you gotta do what your PI tells you sometimes! :P

Best,

Randy

On Mon, Apr 2, 2018 at 2:15 PM, Jacob Vanderplas wrote:
> Hi Randy,
> I think that approach is probably a good heuristic, but it will not necessarily find the optimal result.
> [...]
--
Randall J. Ellis, B.S.
PhD Student, Biomedical Science, Mount Sinai
Special Volunteer, http://www.michaelideslab.org/, NIDA IRP
Cell: (954)-260-9891
From robbenson18 at gmail.com  Tue Apr  3 04:44:53 2018
From: robbenson18 at gmail.com (Roberto Guidotti)
Date: Tue, 3 Apr 2018 10:44:53 +0200
Subject: [scikit-learn] Get parameters of classes in a Pipeline within cross_validate

Hi Chris,

Cool! I will try it very soon!!

Thank you,
Roberto

On 2 April 2018 at 01:47, Chris Aridas wrote:
> Hi Roberto,
>
> One option could be to make a wrapper and serialize your pipeline in your wrapper's fit method.
> [...]

--
Ing. Roberto Guidotti, PhD.
PostDoc Fellow
Institute for Advanced Biomedical Technologies - ITAB
Department of Neuroscience and Imaging
University of Chieti "G. D'Annunzio"
Via dei Vestini, 33
66013 Chieti, Italy
tel: +39 0871 3556919
e-mail: r.guidotti at unich.it; rguidotti at acm.org
linkedin: http://it.linkedin.com/in/robertogui/
twitter: @robbisg
github: https://github.com/robbisg
From gael.varoquaux at normalesup.org  Tue Apr  3 08:57:07 2018
From: gael.varoquaux at normalesup.org (Gael Varoquaux)
Date: Tue, 3 Apr 2018 14:57:07 +0200
Subject: [scikit-learn] NearestNeighbors without replacement
Message-ID: <20180403125707.GC1312094@phare.normalesup.org>

Matching to minimize a cost is known as the linear assignment problem, can be solved in n^3 cost, and is implemented in scikit-learn in sklearn.utils.linear_assignment_.linear_assignment or in recent versions of scipy as scipy.optimize.linear_sum_assignment

Of course, this problem will require much more coding (you need to build your pairwise cost matrix) and much more computing cost (n^3 instead of n^2) than a standard nearest-neighbor.

Gaël

On Mon, Apr 02, 2018 at 01:47:51PM -0400, Randy Ellis wrote:
> Hi Jake,
> Thanks for the reply. Yes, trying this out resulted from looking for ways in python to implement propensity score matching.
> [...]
--
Gael Varoquaux
Senior Researcher, INRIA Parietal
NeuroSpin/CEA Saclay, Bat 145, 91191 Gif-sur-Yvette France
Phone: ++ 33-1-69-08-79-68
http://gael-varoquaux.info  http://twitter.com/GaelVaroquaux
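A minimal, self-contained sketch of Gaël's suggestion, using toy arrays in place of real case/control covariates: build the pairwise cost matrix, then let scipy solve the assignment (scipy >= 0.17 for linear_sum_assignment).

```python
import numpy as np
from scipy.optimize import linear_sum_assignment
from sklearn.metrics import pairwise_distances

rng = np.random.RandomState(0)
cases = rng.rand(10, 6)       # 10 cases, 6 matching covariates (toy data)
controls = rng.rand(500, 6)   # 500 candidate controls

cost = pairwise_distances(cases, controls)         # shape (10, 500)
case_idx, control_idx = linear_sum_assignment(cost)
# control_idx[i] is the control matched to case case_idx[i]; every control
# is used at most once, and the total matched distance is minimal.
print(cost[case_idx, control_idx].sum())
```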
> > > It seems like an algorithm you would have to think carefully about > because > > the number of potential pairs grows exponentially with the number of > > points, and I don't think it's true that choosing the nearest > available > > neighbor of points in sequence will guarantee you to find the optimal > > configuration. You'd also have to carefully define what you mean by > > "optimal"... are you seeking to minimize the sum of all distances? > The sum > > of squared distances? The maximum distance? The results would change > > depending on the metric you define. And you'd probably have to > figure out > > some way to reduce the exponential search space in order to > calculate the > > result in a reasonable amount of time for your data. > > > You might look into the literature on propensity score matching; I > think > > that's one area where this kind of neighbors-without-replacement > algorithm > > is often used. > > > Best, > > Jake > > > -- > Gael Varoquaux > Senior Researcher, INRIA Parietal > NeuroSpin/CEA Saclay , Bat 145, 91191 Gif-sur-Yvette France > Phone: ++ 33-1-69-08-79-68 > http://gael-varoquaux.info http://twitter.com/GaelVaroquaux > _______________________________________________ > scikit-learn mailing list > scikit-learn at python.org > https://mail.python.org/mailman/listinfo/scikit-learn > -- *Randall J. Ellis, B.S.* PhD Student, Biomedical Science, Mount Sinai Special Volunteer, http://www.michaelideslab.org/, NIDA IRP Cell: (954)-260-9891 -------------- next part -------------- An HTML attachment was scrubbed... URL: From randalljellis at gmail.com Tue Apr 3 09:58:25 2018 From: randalljellis at gmail.com (Randy Ellis) Date: Tue, 3 Apr 2018 09:58:25 -0400 Subject: [scikit-learn] NearestNeighbors without replacement In-Reply-To: <20180403125707.GC1312094@phare.normalesup.org> References: <20180403125707.GC1312094@phare.normalesup.org> Message-ID: Hi Dr. Varoquaux, It seems like the SciPy function only assigns one row to one column. I need to assign 20 controls to each case. Does the linear_sum_assignment function, since it assigns unique pairs, depend on the order of the rows and columns? If so, perhaps I could shuffle and then combine the pairs together until each case has 20 unique controls. Any thoughts on this are greatly appreciated. Best, Randy On Tue, Apr 3, 2018 at 8:57 AM, Gael Varoquaux < gael.varoquaux at normalesup.org> wrote: > Matching to minimize a cost is known as the linear assignment problem, > can be solved in n^3 cost, and is implemented in scikit-learn in > sklearn.utils.linear_assignment_.linear_assignment or in recent versions > of scipy as scipy.optimize.linear_sum_assignment > > Of course, this problem will require much more coding (you need to build > your pairwise cost matrix) and much more computing cost (n^3 instead of > n^2) than a standard nearest-neighbor. > > Ga?l > > On Mon, Apr 02, 2018 at 01:47:51PM -0400, Randy Ellis wrote: > > Hi Jake, > > > Thanks for the reply. Yes, trying this out resulted from looking for > ways in > > python to implement propensity score matching. I found a package, > pscore_match > > (http://www.kellieottoboni.com/pscore_match/), but the matching was > really > > terrible. Specifically, I'm matching based on age, race, gender, HIV > status, > > hepatitis C status, and sickle-cell disease status. Using > NearestNeighbors for > > matching performed WAY better, I was so surprised at how well every > factor was > > matched for. The only issue is that it uses replacement. > > > Here's what I'm currently testing. 
From alex at garel.org  Wed Apr  4 07:33:29 2018
From: alex at garel.org (Alex Garel)
Date: Wed, 4 Apr 2018 12:08:29 +0100
Subject: [scikit-learn] Outliers removal
Message-ID: <969838c7-fd81-8490-9cda-3ee726d68301@garel.org>

Hello,

First, thanks for the fantastic scikit-learn library.

I have the following use case: for a classification problem, I have a list of sentences and use word2vec and a method (e.g. mean, weighted mean, or attention and mean) to transform sentences to vectors.
Because my dataset is very noisy, I may come up with sentences full of words that are not part of word2vec, hence I can't vectorize them. I would like to remove those sentences from my dataset X, but this would mean also removing the corresponding target classes in y. Afaik, scikit-learn does not implement this possibility.

I've seen a couple of issues about that, but they all seem stalled:
https://github.com/scikit-learn/scikit-learn/issues/9630,
https://github.com/scikit-learn/scikit-learn/issues/3855,
https://github.com/scikit-learn/scikit-learn/pull/4552,
https://github.com/scikit-learn/scikit-learn/issues/4143

I would like to be able to search for hyper-parameters in a simple way, so I really would like to be able to use a single pipeline taking text as input. My actual conclusion is this one:

* vectorizer should return None for bad samples (or a specific vector, like numpy.zeros, or add an extra column marking valid/invalid samples)
* make all my transformers down the pipeline accept those entries and leave them untouched (can be done with a generic wrapper class)
* have a wrapper around my classifier, to avoid fitting on those, like jnothman suggested here: https://github.com/scikit-learn/scikit-learn/issues/9630#issuecomment-325202441

It's a bit tedious, but I can see it working. Is there any better suggestion?
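A rough sketch of the last bullet in Alex's plan, with all names hypothetical: the vectorizer marks unvectorizable sentences with an all-zero row, and a small wrapper drops those rows before fitting the final classifier while still producing predictions for every input row.

```python
import numpy as np
from sklearn.base import BaseEstimator, ClassifierMixin, clone
from sklearn.svm import SVC


class SkipInvalidClassifier(BaseEstimator, ClassifierMixin):
    """Fit only on rows that are not all-zero; predict on everything."""

    def __init__(self, estimator):
        self.estimator = estimator

    def fit(self, X, y):
        X, y = np.asarray(X), np.asarray(y)
        valid = ~np.all(X == 0, axis=1)   # zero rows = unvectorizable sentences
        self.estimator_ = clone(self.estimator).fit(X[valid], y[valid])
        return self

    def predict(self, X):
        return self.estimator_.predict(np.asarray(X))


# e.g. Pipeline([("vec", SentenceVectorizer()), ("clf", SkipInvalidClassifier(SVC()))])
# where SentenceVectorizer stands in for the user's transformer that emits
# zero rows for bad samples.
```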
From g.lemaitre58 at gmail.com  Wed Apr  4 07:53:04 2018
From: g.lemaitre58 at gmail.com (Guillaume Lemaitre)
Date: Wed, 04 Apr 2018 13:53:04 +0200
Subject: [scikit-learn] Outliers removal
In-Reply-To: <969838c7-fd81-8490-9cda-3ee726d68301@garel.org>
Message-ID: <20180404115304.5124179.93165.52424@gmail.com>

An HTML attachment was scrubbed...

From t3kcit at gmail.com  Fri Apr  6 16:09:09 2018
From: t3kcit at gmail.com (Andreas Mueller)
Date: Fri, 6 Apr 2018 16:09:09 -0400
Subject: [scikit-learn] (no subject)

Try this: https://jakevdp.github.io/PythonDataScienceHandbook/

On 03/28/2018 11:49 PM, PARK Jinwoo wrote:
> Dear scikit-learn experts
>
> Hello, I am a graduate school student majoring in doping control analysis in Korea. Now I'm in a research institute that carries out doping control analyses.
>
> I received a project from my advising doctor. It's about operating an AI project. A workshop is scheduled in April, so it needs to be done in a month. However, I haven't learned computer science at all and I'm totally ignorant of it. So I desperately need your advice.
>
> To be specific, the 3 xml files shown in the picture are analysis results named positive, negative, and unknown from top to bottom. We'd like to let AI learn positive and negative data, input the unknown datum, and then see what result will turn out.
>
> I came to know that there's a module called 'iris classification' in scikit-learn and I'm thinking of utilizing that as it seems similar to my assignment. However, while the database of iris is a csv file with 150 data and labels inside, what I have are 3 xml files each one of which represents one data point, which are stored in C:\Users\Jinwoo\Documents\Python Scripts\mzdata. The training process is not shuffling randomly the 150 data and dividing into training set and test set. The data are already assigned into training ones and a testing one. Also, when training the program, training labels named positive and negative should be inserted on my own.
>
> What I know is that it will be appropriate to use the fit() function and predict() function to train and test. But I have no idea on what to import, how to write the code correctly, and so on.
>
> It would be great if you could give me some help.

From t3kcit at gmail.com  Fri Apr  6 16:16:17 2018
From: t3kcit at gmail.com (Andreas Mueller)
Date: Fri, 6 Apr 2018 16:16:17 -0400
Subject: [scikit-learn] Get parameters of classes in a Pipeline within cross_validate
Message-ID: <0e9f1edb-6e48-4d37-46f4-f102eb868e58@gmail.com>

This is implemented in the current development version:
https://github.com/scikit-learn/scikit-learn/pull/9686

On 03/29/2018 05:16 AM, Roberto Guidotti wrote:
> Hi scikit-learners,
>
> I have a simple Pipeline with Feature Selection and SVC classifier and I use it in a cross validation schema with cross_validate / cross_validation_score functions.
> [...]

From rekhi.manmohit at gmail.com  Sun Apr  8 09:21:34 2018
From: rekhi.manmohit at gmail.com (Manmohit Rekhi)
Date: Sun, 8 Apr 2018 18:51:34 +0530
Subject: [scikit-learn] Contributing ST-DBSCAN to the clustering module

Hi Everyone,

This is my first open source contribution and I am really excited about scikit-learn. I was reading a paper on ST-DBSCAN and found that there are no good implementations of it online.

Thought this could be a good opportunity to contribute. Had a few questions before I started:
1) Is there any problem with this algorithm being in scikit-learn?
2) Is there someone already working on this algorithm?
3) Any things I need to take care of while contributing (like code optimisation/libraries to be used)?

link to the paper: https://www.sciencedirect.com/science/article/pii/S0169023X06000218

Thanks,
Manmohit

From t3kcit at gmail.com  Tue Apr 10 10:56:11 2018
From: t3kcit at gmail.com (Andreas Mueller)
Date: Tue, 10 Apr 2018 10:56:11 -0400
Subject: [scikit-learn] Contributing ST-DBSCAN to the clustering module
Message-ID: <7bec697f-df30-27c5-f7fb-727b4387c8ec@gmail.com>

Hi.
It looks like the paper proposes two unrelated improvements, right? We don't really include algorithms that are specific for time-series data or spatio-temporal data. How does the other improvement compare against hdbscan?

There are many, many different clustering algorithms out there, and I don't think we want to implement all of them. It's not entirely clear to me when this algorithm improves over what we have already.

Btw, adding a new algorithm to scikit-learn is probably not a good idea for a first contribution, see http://scikit-learn.org/dev/faq.html#how-can-i-contribute-to-scikit-learn

Andy

On 04/08/2018 09:21 AM, Manmohit Rekhi wrote:
> Hi Everyone,
>
> This is my first open source contribution and I am really excited about scikit-learn.
> [...]

From hamidizade.s at gmail.com  Wed Apr 11 04:14:03 2018
From: hamidizade.s at gmail.com (S Hamidizade)
Date: Wed, 11 Apr 2018 12:44:03 +0430
Subject: [scikit-learn] imbalanced data

Hi

Could you please let me know if the algorithms (including Robust Under-sampling, Cluster-Classify, MKL for high-class skew, ...) discussed in the following thesis have been implemented in Python?

Robust Learning with Highly Skewed Category Distributions by Selen Uguroglu
https://pdfs.semanticscholar.org/c792/d83d78ff10b7a944d2c4e534c2c55bdf59b8.pdf

Best regards,

From chris at aridas.eu  Wed Apr 11 05:10:46 2018
From: chris at aridas.eu (Chris Aridas)
Date: Wed, 11 Apr 2018 09:10:46 +0000
Subject: [scikit-learn] imbalanced data

Hola,

You should check out http://imbalanced-learn.org

Best,
Chris

On Wed, 11 Apr 2018 11:22, S Hamidizade wrote:
> Hi
>
> Could you please let me know if the algorithms (including Robust Under-sampling, Cluster-Classify, MKL for high-class skew, ...) discussed in the following thesis have been implemented in Python?
> [...]

From drraph at gmail.com  Thu Apr 12 15:22:44 2018
From: drraph at gmail.com (Raphael C)
Date: Thu, 12 Apr 2018 20:22:44 +0100
Subject: [scikit-learn] Finding a single cluster in 1d data

I have a set of points in 1d represented by a list X of floating point numbers. The list has one dense section and the rest is sparse and I want to find the dense part.
I can't release the actual data but here is a simulation:

import random
import matplotlib.pyplot as plt

N = 100

start = 0
points = []
rate = 0.1
for i in range(N):
    points.append(start)
    start = start + random.expovariate(rate)
rate = 10
for i in range(N*10):
    points.append(start)
    start = start + random.expovariate(rate)
rate = 0.1
for i in range(N):
    points.append(start)
    start = start + random.expovariate(rate)
plt.hist(points, bins=100)
plt.show()

I would like to use scikit-learn to find the dense region. This feels a little like outlier detection or the task of finding one cluster with noise.

Is there a suitable method in scikit-learn for this task?

Raphael

From pedropazzini at gmail.com  Thu Apr 12 21:19:54 2018
From: pedropazzini at gmail.com (Pedro Pazzini)
Date: Thu, 12 Apr 2018 22:19:54 -0300
Subject: [scikit-learn] Finding a single cluster in 1d data

Hi Raphael.

An option to highlight a dense region in your vector is to use a density estimator (http://scikit-learn.org/stable/modules/density.html).

But I think that the python module jenkspy (https://pypi.python.org/pypi/jenkspy and https://github.com/mthh/jenkspy) can help you also. The method finds the natural breaks of data in 1d (https://en.wikipedia.org/wiki/Jenks_natural_breaks_optimization). I think that if you find a good value for the 'nb_class' parameter you can separate the dense region of your data from the sparse one.

K-means is a generalization of Jenks break optimization for multivariate data, so, maybe, you could use the K-means module of scikit-learn for that also. On this approach, personally, I think the jenkspy module is more straightforward.

I hope it helps.

Pedro Pazzini

2018-04-12 16:22 GMT-03:00 Raphael C:
> I have a set of points in 1d represented by a list X of floating point numbers. The list has one dense section and the rest is sparse and I want to find the dense part.
> [...]
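A small sketch of the density-estimation route Pedro mentions, run on the `points` list from Raphael's simulation above; the bandwidth and the cutoff are ad hoc choices, not tuned values:

```python
import numpy as np
from sklearn.neighbors import KernelDensity

X = np.asarray(points)[:, None]          # 'points' from the simulation above
kde = KernelDensity(kernel="gaussian", bandwidth=5.0).fit(X)
log_density = kde.score_samples(X)

dense = X[log_density > np.median(log_density)]   # crude threshold
print("dense region roughly spans [%.1f, %.1f]" % (dense.min(), dense.max()))
```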
From manuel.castejon at gmail.com  Fri Apr 13 06:18:13 2018
From: manuel.castejon at gmail.com (Manuel Castejón Limas)
Date: Fri, 13 Apr 2018 12:18:13 +0200
Subject: [scikit-learn] Pipegraph feedback

Hi all!

As you know by now :-) we submitted PipeGraph as a contrib-project proposal. We believe that this tool can be interesting not only for end users wanting to encapsulate their arbitrarily complex workflows but also for sklearn developers, as some internal developments could be easily expressed as pipegraphs, for example ensemble methods.

We would love to have some feedback in terms of:
- whether we would have to change anything in order to be more in line with sklearn's philosophy
- any development you core developers are working on that could be treated as a pipegraph
- possible scenarios not implemented yet by pipegraph, such as recurrent graphs, that might be potentially useful

Moreover, in case any core developer is interested in joining the project you are more than welcome! This would provide a great opportunity for collaboration!

Best wishes,
Manuel Castejón-Limas

From jlopez at ende.cc  Fri Apr 13 11:51:29 2018
From: jlopez at ende.cc (Javier López)
Date: Fri, 13 Apr 2018 15:51:29 +0000
Subject: [scikit-learn] Delegating "get_params" and "set_params" to a wrapped estimator when parameter is not defined.

I have a class `FancyEstimator(BaseEstimator, MetaEstimatorMixin): ...` that wraps around an arbitrary sklearn estimator to add some functionality I am interested in. This class contains an attribute `self.estimator` that contains the wrapped estimator. Delegation of the main methods, such as `fit` and `transform`, works just fine, but I am having some issues with `get_params` and `set_params`.

The main idea is, I would like to use my wrapped class as a drop-in replacement for the original estimator, but this raises some issues with some functions that try using `get_params` and `set_params` straight on my class, as the original parameters now have prefixed names (for instance `estimator__verbose` instead of `verbose`). I would like to delegate calls to `set_params` and `get_params` in a smart way, so that if a parameter is unknown to my wrapper class, then it automatically goes looking for it in the wrapped estimator.

I am not concerned about my class parameter names, as there are only a couple of very specific names on it, so it is safe to assume that any unknown parameter name should refer to the base estimator. Is there an easy way of doing that?

Cheers,
J

From t3kcit at gmail.com  Fri Apr 13 12:50:04 2018
From: t3kcit at gmail.com (Andreas Mueller)
Date: Fri, 13 Apr 2018 12:50:04 -0400
Subject: [scikit-learn] Delegating "get_params" and "set_params" to a wrapped estimator when parameter is not defined.

Please stay on the mailing list :)

I'm not sure if ValueError is the right error that would be raised (I think it's not). And it's hard to say if this will break something in some edge cases. I would probably rather explicitly encode the parameters of FancyEstimator instead of the try, or get them using super.get_params. You also should rewrite get_params. In principle something like that should work, but I wouldn't go as far as saying it's "safe" and you should test it extensively.

On 04/13/2018 12:41 PM, Javier López wrote:
> Is something like this safe, or might I be breaking some important functionality?
>
> ```
>     def set_params(self, **params):
>         try:
>             super(FancyEstimator, self).set_params(**params)
>         except ValueError:
>             self.wrapped_estimator_.set_params(**params)
> ```
>
> On Fri, Apr 13, 2018 at 5:05 PM Andreas Mueller wrote:
>> You just need to implement get_params and set_params yourself to delegate in this way, right?
>>
>> On 04/13/2018 11:51 AM, Javier López wrote:
>>> I have a class `FancyEstimator(BaseEstimator, MetaEstimatorMixin): ...` that wraps around an arbitrary sklearn estimator to add some functionality I am interested in.
>>> [...]
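A sketch of the explicit version Andreas describes, spelling out the wrapper's own parameters by hand and pushing everything else down to the wrapped estimator; the parameter name `fancy_option` is invented for illustration, and exposing the inner parameters under their bare names is one possible design choice, not the only one:

```python
from sklearn.base import BaseEstimator, MetaEstimatorMixin


class FancyEstimator(BaseEstimator, MetaEstimatorMixin):
    _own_params = frozenset(["estimator", "fancy_option"])

    def __init__(self, estimator, fancy_option=True):
        self.estimator = estimator
        self.fancy_option = fancy_option

    def get_params(self, deep=True):
        params = super(FancyEstimator, self).get_params(deep=deep)
        if deep:
            # also expose the wrapped estimator's params under their bare
            # names so the wrapper acts as a drop-in replacement; beware of
            # clashes with the wrapper's own parameter names
            params.update(self.estimator.get_params(deep=True))
        return params

    def set_params(self, **params):
        own = {k: v for k, v in params.items() if k in self._own_params}
        rest = {k: v for k, v in params.items() if k not in self._own_params}
        if own:
            super(FancyEstimator, self).set_params(**own)
        if rest:
            self.estimator.set_params(**rest)
        return self
```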
From zane.dufour at gmail.com  Fri Apr 13 21:35:59 2018
From: zane.dufour at gmail.com (Zane DuFour)
Date: Sat, 14 Apr 2018 01:35:59 +0000
Subject: [scikit-learn] K Medoids Clustering Implementation

Is someone working on an implementation of K-Medoids clustering at the moment? If not, I would like to implement it in sklearn.

Thanks,
Zane Dufour

From drraph at gmail.com  Sat Apr 14 06:06:05 2018
From: drraph at gmail.com (Raphael C)
Date: Sat, 14 Apr 2018 11:06:05 +0100
Subject: [scikit-learn] Finding a single cluster in 1d data

Thank you very much! I didn't know about jenkspy.

Raphael

On 13 April 2018 at 02:19, Pedro Pazzini wrote:
> Hi Raphael.
>
> An option to highlight a dense region in your vector is to use a density estimator (http://scikit-learn.org/stable/modules/density.html).
> [...]
>> I can't release the actual data, but here is a simulation:
>>
>> import random
>> import matplotlib.pyplot as plt
>>
>> N = 100
>>
>> start = 0
>> points = []
>> rate = 0.1
>> for i in range(N):
>>     points.append(start)
>>     start = start + random.expovariate(rate)
>> rate = 10
>> for i in range(N*10):
>>     points.append(start)
>>     start = start + random.expovariate(rate)
>> rate = 0.1
>> for i in range(N):
>>     points.append(start)
>>     start = start + random.expovariate(rate)
>> plt.hist(points, bins=100)
>> plt.show()
>>
>> I would like to use scikit-learn to find the dense region. This feels
>> a little like outlier detection, or the task of finding one cluster
>> with noise.
>>
>> Is there a suitable method in scikit-learn for this task?
>>
>> Raphael
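For illustration, a rough sketch of the jenkspy approach from Pedro's message, on simulated data like the above (assumptions: jenkspy is installed and exposes `jenks_breaks(values, nb_class=...)` as in the linked project; `nb_class=3` matches the sparse/dense/sparse structure of the simulation and would need tuning on real data):

```
import random

import jenkspy
import numpy as np

# simulate sparse / dense / sparse arrival times, as in the quoted message
random.seed(0)
points, start = [], 0.0
for rate, n in [(0.1, 100), (10, 1000), (0.1, 100)]:
    for _ in range(n):
        points.append(start)
        start += random.expovariate(rate)

breaks = jenkspy.jenks_breaks(points, nb_class=3)

# keep the break interval with the highest point density
pts = np.asarray(points)
densities = [((pts >= lo) & (pts <= hi)).sum() / (hi - lo)
             for lo, hi in zip(breaks[:-1], breaks[1:])]
best = int(np.argmax(densities))
print("densest region: %.2f to %.2f" % (breaks[best], breaks[best + 1]))
```

Picking the interval with the highest point density is just one heuristic; depending on what "dense" means for the real data, a kernel density estimate plus a threshold may be preferable.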
From mcasl at unileon.es  Sat Apr 14 12:08:50 2018
From: mcasl at unileon.es (Manuel CASTEJÓN LIMAS)
Date: Sat, 14 Apr 2018 16:08:50 +0000
Subject: [scikit-learn] Delegating "get_params" and "set_params" to a wrapped estimator when parameter is not defined.
In-Reply-To:
References:
Message-ID:

Hi Javier!
You can have a look at:

https://github.com/mcasl/PipeGraph/blob/master/pipegraph/adapters.py

There are a few adapters there, and I had to deal with that situation. I solved it by using __getattr__ and __setattr__.

Best
Manolo

From joel.nothman at gmail.com  Sun Apr 15 09:18:19 2018
From: joel.nothman at gmail.com (Joel Nothman)
Date: Sun, 15 Apr 2018 23:18:19 +1000
Subject: [scikit-learn] Delegating "get_params" and "set_params" to a wrapped estimator when parameter is not defined.
In-Reply-To:
References:
Message-ID:

Have you considered whether a mixin is a better model than a wrapper?

From joel.nothman at gmail.com  Sun Apr 15 09:21:08 2018
From: joel.nothman at gmail.com (Joel Nothman)
Date: Sun, 15 Apr 2018 23:21:08 +1000
Subject: [scikit-learn] K Medoids Clustering Implementation
In-Reply-To:
References:
Message-ID:

Did you find https://github.com/scikit-learn/scikit-learn/pull/7694?

On 14 April 2018 at 11:35, Zane DuFour wrote:
> Is someone working on an implementation of K-Medoids clustering at the
> moment? If not, I would like to implement it in sklearn.
>
> Thanks,
> Zane Dufour

From joel.nothman at gmail.com  Sun Apr 15 19:19:16 2018
From: joel.nothman at gmail.com (Joel Nothman)
Date: Mon, 16 Apr 2018 09:19:16 +1000
Subject: [scikit-learn] K Medoids Clustering Implementation
In-Reply-To:
References:
Message-ID:

The current contributor for this is finding it hard to find time to complete the work. I think the remaining issues are quite minor, and we would be keen for someone to take over: respond to my review and hope we get another before the next release.

From mcasl at unileon.es  Mon Apr 16 08:21:44 2018
From: mcasl at unileon.es (Manuel CASTEJÓN LIMAS)
Date: Mon, 16 Apr 2018 14:21:44 +0200
Subject: [scikit-learn] Delegating "get_params" and "set_params" to a wrapped estimator when parameter is not defined.
In-Reply-To:
References:
Message-ID:

Nope! Mostly because of my lack of experience with mixins. I've done some reading, and I think I can come up with a few mixins doing the trick by dynamically adding their methods to an already instantiated object. I'll play with that, and I hope to show you something soon! Or at least I will have better grounds to make an educated decision.
Best
Manuel

Manuel Castejón Limas
Escuela de Ingeniería Industrial e Informática
Universidad de León
Campus de Vegazana s/n. 24071. León. Spain.
e-mail: manuel.castejon at unileon.es
Tel.: +34 987 291 946

2018-04-15 15:18 GMT+02:00 Joel Nothman:
> Have you considered whether a mixin is a better model than a wrapper?

From jlopez at ende.cc  Mon Apr 16 09:33:08 2018
From: jlopez at ende.cc (Javier López)
Date: Mon, 16 Apr 2018 13:33:08 +0000
Subject: [scikit-learn] Delegating "get_params" and "set_params" to a wrapped estimator when parameter is not defined.
In-Reply-To:
References:
Message-ID:

How could I make mixins work in this case? If I define the class `FancyEstimatorMixin`, then in order to get a drop-in replacement for a sklearn object, wouldn't I need to monkey-patch the scikit-learn `BaseEstimator` class to inherit from my mixin? Or am I misunderstanding something? (BTW, monkey-patching is one of the things I am trying to avoid in a "clean" solution.)
From jlopez at ende.cc  Mon Apr 16 09:49:28 2018
From: jlopez at ende.cc (Javier López)
Date: Mon, 16 Apr 2018 13:49:28 +0000
Subject: [scikit-learn] Delegating "get_params" and "set_params" to a wrapped estimator when parameter is not defined.
In-Reply-To:
References:
Message-ID:

Hi Manolo!

Your code looks nice, but my use case is a bit different. I have a mixed set of parameters: some come from my wrapper, and some from the wrapped estimator. The logic I am going for is something like "if you know about this parameter, then deal with it; if not, then pass it along to the wrapped estimator and hope for the best!", which is why I was asking Andreas about the use of `super`.

Joel is right that a mixin would be the natural way of adding functionality, but unless I am getting something wrong, that would require me to modify the base classes from sklearn, either by forking the code (which sounds like a lot of trouble) or by monkey-patching the import (also not an ideal solution).

There are several wrappers in scikit-learn that are similar in spirit to what I am trying to do, for instance `CalibratedClassifierCV`, but none of them deal with `set_params` and `get_params` in a way that delegates to the base estimator, which makes them unsuitable for use with some third-party tools such as BorutaPy [1], which expects an estimator where `estimator.set_params(n_estimators=...)` makes sense.

Cheers,
Javier

[1] https://github.com/scikit-learn-contrib/boruta_py
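For illustration, a minimal self-contained sketch of that "forward unknown names" logic (`ForwardingWrapper` is a hypothetical class; this is untested against BorutaPy itself):

```
from sklearn.base import BaseEstimator
from sklearn.ensemble import RandomForestClassifier


class ForwardingWrapper(BaseEstimator):
    # hypothetical minimal wrapper: 'estimator' is its only own parameter
    def __init__(self, estimator):
        self.estimator = estimator

    def set_params(self, **params):
        own = {k: v for k, v in params.items() if k == 'estimator'}
        rest = {k: v for k, v in params.items() if k != 'estimator'}
        if own:
            super(ForwardingWrapper, self).set_params(**own)
        if rest:
            # any name the wrapper does not know is forwarded as-is
            self.estimator.set_params(**rest)
        return self


rf = RandomForestClassifier(n_estimators=10)
w = ForwardingWrapper(rf)
w.set_params(n_estimators=500)   # unknown to the wrapper -> forwarded
assert rf.n_estimators == 500
```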
From spssachdeva21695 at gmail.com  Mon Apr 16 16:07:03 2018
From: spssachdeva21695 at gmail.com (Sidak Pal Singh)
Date: Mon, 16 Apr 2018 22:07:03 +0200
Subject: [scikit-learn] Issues with kmeans: Difference in centroid values
Message-ID:

Hi everyone,

I was using the scikit-learn KMeans algorithm to cluster pretrained word vectors. There are a few things I found surprising and wanted to get some feedback on.

- Based upon the 'labels_' assigned to each word vector (i.e. cluster memberships), I compute every cluster centroid as the average of the word vectors corresponding to that cluster. Surprisingly, this seems to be pretty different from 'cluster_centers_'. Is there anything that I am missing here?

- I was later using the verbose option to see whether the clustering had converged or not. I saw console log messages such as "center shift 7.994126e-04 within tolerance 1.243425e-06". It seems that this corresponds to some code in _k_means_elkan.pyx (https://github.com/scikit-learn/scikit-learn/blob/master/sklearn/cluster/_k_means_elkan.pyx).

- Lastly, another thing that seems strange is that I hadn't set the tolerance value, so the default of 1e-4 should have been used. But if you look again at the above log, it says "within tolerance 1.243425e-06" instead of 1e-4.

It would be great if you can look into this and help me out. Thank you so much! :)

Best,
Sidak Pal Singh
EPFL

From t3kcit at gmail.com  Mon Apr 16 17:36:14 2018
From: t3kcit at gmail.com (Andreas Mueller)
Date: Mon, 16 Apr 2018 17:36:14 -0400
Subject: [scikit-learn] Issues with kmeans: Difference in centroid values
In-Reply-To:
References:
Message-ID: <16b5ed51-c029-bbc1-ab5f-9edd17fb9dd6@gmail.com>

On 04/16/2018 04:07 PM, Sidak Pal Singh wrote:
> Based upon the 'labels_' assigned to each word vector, I compute every
> cluster centroid as the average of the word vectors corresponding to
> that cluster. Surprisingly, this seems to be pretty different from
> 'cluster_centers_'. Is there anything that I am missing here?

If the algorithm did not fully converge, you just did one more step, so the results are expected to be different.

> Lastly, another thing that seems strange is that I hadn't set the
> tolerance value, so the default of 1e-4 should have been used. But if
> you look again at the above log, it says "within tolerance 1.243425e-06"
> instead of 1e-4.

https://github.com/scikit-learn/scikit-learn/blob/master/sklearn/cluster/k_means_.py#L159
The tolerance is scaled by the variance of the data to be independent of the scale.
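A short sketch of both points (assuming dense input and the scaling used by the `_tolerance` helper linked above; the random matrix is just a stand-in for the word vectors):

```
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.RandomState(0)
X = rng.randn(1000, 50)          # stand-in for pretrained word vectors

km = KMeans(n_clusters=10, tol=1e-4, random_state=0).fit(X)

# one extra M-step: mean of the members of each final cluster; this
# matches cluster_centers_ only if the run fully converged
recomputed = np.vstack([X[km.labels_ == k].mean(axis=0)
                        for k in range(10)])
print(np.abs(recomputed - km.cluster_centers_).max())

# the "within tolerance ..." value reported in verbose mode is tol
# scaled by the mean per-feature variance of the data
print(km.tol * np.mean(np.var(X, axis=0)))
```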
From mlcnworkshop at gmail.com  Wed Apr 18 05:22:04 2018
From: mlcnworkshop at gmail.com (MLCN Workshop)
Date: Wed, 18 Apr 2018 11:22:04 +0200
Subject: [scikit-learn] CFP: The first International Workshop on Machine Learning in Clinical Neuroimaging (MLCN 2018)
Message-ID:

*** Apologies for multiple posting ***

Dear Colleagues,

We are delighted to invite you to join us for the MLCN 2018 workshop, a satellite event of the MICCAI 2018 conference, Granada, Spain.

------------------------------------------------------------
CALL FOR PAPERS:
Recent advances in neuroimaging and statistical machine learning provide an exceptional opportunity for investigators and physicians to discover complex relationships between brain, behavior, and mental and neurological disorders. The MLCN 2018 workshop, as a satellite event of MICCAI 2018, aims to bring together researchers in both theory and application from various fields in the domains of spatial statistics, pattern recognition in neuroimaging, and predictive clinical neuroscience. Topics of interest include but are not limited to:

- Applications of spatio-temporal modeling in predictive clinical neuroscience
- Spatial regularization in decoding clinical neuroimaging data
- Spatial statistics in neuroimaging
- Learning with structured inputs and outputs in clinical neuroscience
- Multi-task learning in analyzing structured neuroimaging data
- Deep learning in analyzing structured neuroimaging data
- Model stability and interpretability in clinical neuroscience

------------------------------------------------------------
CONFIRMED SPEAKERS:
Christos Davatzikos (University of Pennsylvania)
Gaël Varoquaux (Parietal team, INRIA)
Jian Kang (University of Michigan)

------------------------------------------------------------
SUBMISSION PROCESS:
The workshop seeks high-quality, original, and unpublished work on algorithms, theory, and applications of machine learning in clinical neuroimaging and spatially structured data analysis. Papers should be submitted electronically in Springer Lecture Notes in Computer Science (LNCS) style, up to 8 pages, using the CMT system at https://cmt3.research.microsoft.com/MLCN2018. This workshop uses a double-blind review process in the evaluation phase, so authors must ensure their submissions are anonymous. Accepted papers will be published in a joint proceedings with the MICCAI conference.

------------------------------------------------------------
IMPORTANT DATES:
Paper submission deadline: June 11, 2018
Notification of acceptance: July 16, 2018
Camera-ready submission: July 23, 2018
Workshop date: September 20, 2018

Best regards,
MLCN 2018 Organizing Committee
Email: mlcnworkshop at gmail.com
Website: https://mlcn2018.com/
Twitter: @MLCN2018

From daniel.balacek at gmail.com  Wed Apr 18 15:15:54 2018
From: daniel.balacek at gmail.com (Daniel Baláček)
Date: Wed, 18 Apr 2018 21:15:54 +0200
Subject: [scikit-learn] MLPClassifier - Softmax activation function
Message-ID:

Hello everyone,

I have a question regarding MLPClassifier in sklearn. The documentation, in section 1.17 Neural network models (supervised), 1.17.2 Classification, states that "MLPClassifier supports multi-class classification by applying Softmax as the output function." However, it is not clear how to apply the Softmax function.

The way I think (or hope) this works is that if the parameter `activation` is set to `activation='logistic'`, the Softmax function should be automatically applied whenever there are more than two classes. Is this right, or does one have to explicitly specify the use of the Softmax function in some way?

I am sorry if this is a nonsense question. I am new to scikit-learn and machine learning in general, and I was not sure about this one. Thank you for any answers in advance.

With regards,
D. B.

From se.raschka at gmail.com  Wed Apr 18 15:26:44 2018
From: se.raschka at gmail.com (Sebastian Raschka)
Date: Wed, 18 Apr 2018 15:26:44 -0400
Subject: [scikit-learn] MLPClassifier - Softmax activation function
In-Reply-To:
References:
Message-ID: <9C483D01-2785-4ED9-97FE-0F58A3B5ADA8@gmail.com>

That's a good question, since the outputs would be scaled differently depending on whether the logistic sigmoid or the softmax is used in the output layer. I think you don't need to worry about setting anything though, since "activation" only applies to the hidden layers, and the softmax is, regardless of "activation", automatically used in the output layer.

Best,
Sebastian
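A quick check of this behavior (a sketch, assuming a scikit-learn version where the fitted `MLPClassifier` exposes the `out_activation_` attribute; the random data is only there to make the snippet run):

```
import numpy as np
from sklearn.neural_network import MLPClassifier

rng = np.random.RandomState(0)
X = rng.randn(150, 4)
y = rng.randint(0, 3, size=150)              # three classes

clf = MLPClassifier(activation='logistic', max_iter=300).fit(X, y)
print(clf.out_activation_)                   # 'softmax' for more than two classes
print(clf.predict_proba(X[:3]).sum(axis=1))  # rows sum to 1
```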
From ajinnovator at gmail.com  Thu Apr 19 22:05:30 2018
From: ajinnovator at gmail.com (ajay sahu)
Date: Fri, 20 Apr 2018 07:35:30 +0530
Subject: [scikit-learn] Probability estimation
Message-ID:

Thanks for making such a brilliant library. I have some doubts about a project; can you tell me how to achieve the following using scikit-learn or any of your methods?

PROBLEM STATEMENT - I am trying to create a probability tool that assigns scores of importance (e.g. 9/10, 8/10 or 6/10) depending on the degree of importance. From a book text corpus, I am trying to calculate the probability of a given topic appearing in the exam, using previous years' question papers. I have around 10 years of question papers, comprising 300 questions in total, all from the same syllabus, but the books are different. Since the questions are not straightforward, I am unable to do this.

WHAT HAVE I TRIED - I tried topic modelling with the LDA method, but it only gives important topics or words from a text corpus. I was able to generate important topics from text using the GENSIM library, but that couldn't solve my problem of matching and comparing them with the questions from previous years' papers, as the questions were sometimes indirect or twisted.

WHAT I AM HOPING - I am hoping there is a way the scikit-learn library can help with understanding the indirect or twisted questions, so as to generate a topic that can be matched against the topics in the book corpus from the same syllabus, thus producing a probability of each topic occurring in the exam based on previous years' question papers.

Looking forward to your help; any kind of help in solving this particular problem statement would be of great help to us.

Thanks

--
Regards,
Ajay