From t3kcit at gmail.com Wed Mar 1 09:49:36 2017 From: t3kcit at gmail.com (Andreas Mueller) Date: Wed, 1 Mar 2017 09:49:36 -0500 Subject: [scikit-learn] Women in Machine Learning and Data Science Sprint next Weekend (also call for help) In-Reply-To: References: <314038f1-d325-1e0d-8399-aea6a4a47d95@gmail.com> Message-ID: <8ba7aadc-6d3c-a73b-e3f9-e0705d2c5e3d@gmail.com> Yes, on gitter: https://gitter.im/scikit-learn/wimlds On 02/28/2017 11:07 PM, Jacob Schreiber wrote: > Okay. I will be there. Is there going to be a chat channel of some > sort to organize things? > > On Tue, Feb 28, 2017 at 4:28 PM, Andreas Mueller > wrote: > > Thanks! > It's gonna be 9:30 till 4, but I'd be surprised if there's a lot > going on on the issue tracker before 11h with setup etc. > (EST that is). > > Andy > > > On 02/27/2017 11:58 PM, Jacob Schreiber wrote: >> I will try to carve out some time Saturday to review PRs. What >> time is it occuring? >> >> On Mon, Feb 27, 2017 at 8:50 PM, Andreas Mueller >> > wrote: >> >> Hey all. >> >> There's gonna be an introductory scikit-learn sprint at NYC >> on Saturday that a local Women's DS/ML group is organizing >> with me. >> I feel like we could do a bit more to improve (gender) >> diversity in the scipy/pydata space, and so I think this will >> be cool. >> >> If anyone wants to review code on Saturday that would be a >> great help for people getting started. >> Also, if anyone wants to help beforehand, making sure there >> is enough "easy" and "need contributor" issues tagged >> is important, as well as ensuring that all the tagged issues >> actually still need contributors. >> >> I'll try to do as much of these as I can but my time is >> limited these days :( >> >> Thanks y'all! >> >> Andy >> _______________________________________________ >> scikit-learn mailing list >> scikit-learn at python.org >> https://mail.python.org/mailman/listinfo/scikit-learn >> >> >> >> >> >> _______________________________________________ >> scikit-learn mailing list >> scikit-learn at python.org >> https://mail.python.org/mailman/listinfo/scikit-learn >> > _______________________________________________ scikit-learn > mailing list scikit-learn at python.org > > https://mail.python.org/mailman/listinfo/scikit-learn > > > _______________________________________________ > scikit-learn mailing list > scikit-learn at python.org > https://mail.python.org/mailman/listinfo/scikit-learn -------------- next part -------------- An HTML attachment was scrubbed... URL: From raga.markely at gmail.com Wed Mar 1 15:07:05 2017 From: raga.markely at gmail.com (Raga Markely) Date: Wed, 1 Mar 2017 15:07:05 -0500 Subject: [scikit-learn] Confidence and Prediction Intervals of Support Vector Regression Message-ID: Hi everyone, I wonder if you could provide me with some suggestions on how to determine the confidence and prediction intervals of SVR? If you have suggestions for any machine learning algorithms in general, that would be fine too (doesn't have to be specific for SVR). So far, I have found: 1. Bootstrap: http://stats.stackexchange.com/questions/183230/bootstrapping-confidence-interval-from-a-regression-prediction 2. http://journals.plos.org/plosone/article/file?id=10.1371/journal.pone.0048723&type=printable 3. ftp://ftp.esat.kuleuven.ac.be/sista/suykens/reports/10_156_v0.pdf But, I don't fully understand the details in #2 and #3 to the point that I can write a step by step code. If I use bootstrap method, I can get the confidence interval as follows? a. Draw bootstrap sample of size n b. 
Fit the SVR model (with hyperparameters chosen during model selection with grid search cv) to this bootstrap sample c. Use this model to predict the output variable y* from input variable X* d. Repeat step a-c for, for instance, 100 times e. Order the 100 values of y*, and determine, for instance, the 10th percentile and 90th percentile (if we are looking for 0.8 confidence interval) f. Repeat a-e for different values of X* to plot the prediction with confidence interval But, I don't know how to get the prediction interval from here. Thank you very much, Raga -------------- next part -------------- An HTML attachment was scrubbed... URL: From se.raschka at gmail.com Wed Mar 1 15:13:41 2017 From: se.raschka at gmail.com (Sebastian Raschka) Date: Wed, 1 Mar 2017 15:13:41 -0500 Subject: [scikit-learn] Confidence and Prediction Intervals of Support Vector Regression In-Reply-To: References: Message-ID: Hi, Raga, I have a short section on this here (https://sebastianraschka.com/blog/2016/model-evaluation-selection-part2.html#the-bootstrap-method-and-empirical-confidence-intervals) if it helps. Best, Sebastian > On Mar 1, 2017, at 3:07 PM, Raga Markely wrote: > > Hi everyone, > > I wonder if you could provide me with some suggestions on how to determine the confidence and prediction intervals of SVR? If you have suggestions for any machine learning algorithms in general, that would be fine too (doesn't have to be specific for SVR). > > So far, I have found: > 1. Bootstrap: http://stats.stackexchange.com/questions/183230/bootstrapping-confidence-interval-from-a-regression-prediction > 2. http://journals.plos.org/plosone/article/file?id=10.1371/journal.pone.0048723&type=printable > 3. ftp://ftp.esat.kuleuven.ac.be/sista/suykens/reports/10_156_v0.pdf > > But, I don't fully understand the details in #2 and #3 to the point that I can write a step by step code. If I use bootstrap method, I can get the confidence interval as follows? > a. Draw bootstrap sample of size n > b. Fit the SVR model (with hyperparameters chosen during model selection with grid search cv) to this bootstrap sample > c. Use this model to predict the output variable y* from input variable X* > d. Repeat step a-c for, for instance, 100 times > e. Order the 100 values of y*, and determine, for instance, the 10th percentile and 90th percentile (if we are looking for 0.8 confidence interval) > f. Repeat a-e for different values of X* to plot the prediction with confidence interval > > But, I don't know how to get the prediction interval from here. > > Thank you very much, > Raga > _______________________________________________ > scikit-learn mailing list > scikit-learn at python.org > https://mail.python.org/mailman/listinfo/scikit-learn From raga.markely at gmail.com Wed Mar 1 17:39:52 2017 From: raga.markely at gmail.com (Raga Markely) Date: Wed, 1 Mar 2017 17:39:52 -0500 Subject: [scikit-learn] Confidence and Prediction Intervals of Support Vector Regression In-Reply-To: References: Message-ID: Thanks a lot, Sebastian! Very nicely written. I have a few follow-up questions: 1. Just to make sure I understand correctly, using the .632+ bootstrap method, the ACC_lower and ACC_upper are the lower and higher percentile of the ACC_h,i distribution? 2. For regression algorithms, is there a recommended equation for the no-information rate gamma? 3. 
I need to plot the confidence interval and prediction interval for my Support Vector Regression prediction (just to clarify these intervals, please see an analogy from linear model on slide 14: http://www2.stat.duke.edu/~tjl13/s101/slides/unit6lec3H.pdf) - can I derive the intervals from .632+ bootstrap method or is there a different way of getting these intervals? Thank you! Raga On Wed, Mar 1, 2017 at 3:13 PM, Sebastian Raschka wrote: > Hi, Raga, > I have a short section on this here (https://sebastianraschka.com/ > blog/2016/model-evaluation-selection-part2.html#the-bootstrap-method-and- > empirical-confidence-intervals) if it helps. > > Best, > Sebastian > > > On Mar 1, 2017, at 3:07 PM, Raga Markely wrote: > > > > Hi everyone, > > > > I wonder if you could provide me with some suggestions on how to > determine the confidence and prediction intervals of SVR? If you have > suggestions for any machine learning algorithms in general, that would be > fine too (doesn't have to be specific for SVR). > > > > So far, I have found: > > 1. Bootstrap: http://stats.stackexchange.com/questions/183230/ > bootstrapping-confidence-interval-from-a-regression-prediction > > 2. http://journals.plos.org/plosone/article/file?id=10. > 1371/journal.pone.0048723&type=printable > > 3. ftp://ftp.esat.kuleuven.ac.be/sista/suykens/reports/10_156_v0.pdf > > > > But, I don't fully understand the details in #2 and #3 to the point that > I can write a step by step code. If I use bootstrap method, I can get the > confidence interval as follows? > > a. Draw bootstrap sample of size n > > b. Fit the SVR model (with hyperparameters chosen during model selection > with grid search cv) to this bootstrap sample > > c. Use this model to predict the output variable y* from input variable > X* > > d. Repeat step a-c for, for instance, 100 times > > e. Order the 100 values of y*, and determine, for instance, the 10th > percentile and 90th percentile (if we are looking for 0.8 confidence > interval) > > f. Repeat a-e for different values of X* to plot the prediction with > confidence interval > > > > But, I don't know how to get the prediction interval from here. > > > > Thank you very much, > > Raga > > _______________________________________________ > > scikit-learn mailing list > > scikit-learn at python.org > > https://mail.python.org/mailman/listinfo/scikit-learn > > _______________________________________________ > scikit-learn mailing list > scikit-learn at python.org > https://mail.python.org/mailman/listinfo/scikit-learn > -------------- next part -------------- An HTML attachment was scrubbed... URL: From mail at sebastianraschka.com Wed Mar 1 21:44:13 2017 From: mail at sebastianraschka.com (Sebastian Raschka) Date: Wed, 1 Mar 2017 21:44:13 -0500 Subject: [scikit-learn] Confidence and Prediction Intervals of Support Vector Regression In-Reply-To: References: Message-ID: <7B05F1AE-FCE4-413E-B96A-773EEF2D7947@sebastianraschka.com> Hi, Raga, > 1. Just to make sure I understand correctly, using the .632+ bootstrap method, the ACC_lower and ACC_upper are the lower and higher percentile of the ACC_h,i distribution? phew, I am actually not sure anymore ? I think it?s the percentile of the ACC_boot distribution, similar to the ?classic? bootstrap but where ACC_boot got computed from weighted ACC_h,i and ACC_r,i > 2. For regression algorithms, is there a recommended equation for the no-information rate gamma? 
Sorry, can?t be of much help here; I am not sure what the equivalent of the no-information rate for regression would be ... > On Mar 1, 2017, at 5:39 PM, Raga Markely wrote: > > Thanks a lot, Sebastian! Very nicely written. > > I have a few follow-up questions: > 1. Just to make sure I understand correctly, using the .632+ bootstrap method, the ACC_lower and ACC_upper are the lower and higher percentile of the ACC_h,i distribution? > 2. For regression algorithms, is there a recommended equation for the no-information rate gamma? > 3. I need to plot the confidence interval and prediction interval for my Support Vector Regression prediction (just to clarify these intervals, please see an analogy from linear model on slide 14: http://www2.stat.duke.edu/~tjl13/s101/slides/unit6lec3H.pdf) - can I derive the intervals from .632+ bootstrap method or is there a different way of getting these intervals? > > Thank you! > Raga > > > On Wed, Mar 1, 2017 at 3:13 PM, Sebastian Raschka wrote: > Hi, Raga, > I have a short section on this here (https://sebastianraschka.com/blog/2016/model-evaluation-selection-part2.html#the-bootstrap-method-and-empirical-confidence-intervals) if it helps. > > Best, > Sebastian > > > On Mar 1, 2017, at 3:07 PM, Raga Markely wrote: > > > > Hi everyone, > > > > I wonder if you could provide me with some suggestions on how to determine the confidence and prediction intervals of SVR? If you have suggestions for any machine learning algorithms in general, that would be fine too (doesn't have to be specific for SVR). > > > > So far, I have found: > > 1. Bootstrap: http://stats.stackexchange.com/questions/183230/bootstrapping-confidence-interval-from-a-regression-prediction > > 2. http://journals.plos.org/plosone/article/file?id=10.1371/journal.pone.0048723&type=printable > > 3. ftp://ftp.esat.kuleuven.ac.be/sista/suykens/reports/10_156_v0.pdf > > > > But, I don't fully understand the details in #2 and #3 to the point that I can write a step by step code. If I use bootstrap method, I can get the confidence interval as follows? > > a. Draw bootstrap sample of size n > > b. Fit the SVR model (with hyperparameters chosen during model selection with grid search cv) to this bootstrap sample > > c. Use this model to predict the output variable y* from input variable X* > > d. Repeat step a-c for, for instance, 100 times > > e. Order the 100 values of y*, and determine, for instance, the 10th percentile and 90th percentile (if we are looking for 0.8 confidence interval) > > f. Repeat a-e for different values of X* to plot the prediction with confidence interval > > > > But, I don't know how to get the prediction interval from here. > > > > Thank you very much, > > Raga > > _______________________________________________ > > scikit-learn mailing list > > scikit-learn at python.org > > https://mail.python.org/mailman/listinfo/scikit-learn > > _______________________________________________ > scikit-learn mailing list > scikit-learn at python.org > https://mail.python.org/mailman/listinfo/scikit-learn > > _______________________________________________ > scikit-learn mailing list > scikit-learn at python.org > https://mail.python.org/mailman/listinfo/scikit-learn From se.raschka at gmail.com Wed Mar 1 21:46:51 2017 From: se.raschka at gmail.com (Sebastian Raschka) Date: Wed, 1 Mar 2017 21:46:51 -0500 Subject: [scikit-learn] Confidence and Prediction Intervals of Support Vector Regression In-Reply-To: References: Message-ID: <1D9AA4D5-7429-4CE6-BF91-B6B303AC3530@gmail.com> Hi, Raga, > 1. 
Just to make sure I understand correctly, using the .632+ bootstrap method, the ACC_lower and ACC_upper are the lower and higher percentile of the ACC_h,i distribution? phew, I am actually not sure anymore ? I think it?s the percentile of the ACC_boot distribution, similar to the ?classic? bootstrap but where ACC_boot got computed from weighted ACC_h,i and ACC_r,i > 2. For regression algorithms, is there a recommended equation for the no-information rate gamma? Sorry, can?t be of much help here; I am not sure what the equivalent of the no-information rate for regression would be ... > On Mar 1, 2017, at 5:39 PM, Raga Markely wrote: > > Thanks a lot, Sebastian! Very nicely written. > > I have a few follow-up questions: > 1. Just to make sure I understand correctly, using the .632+ bootstrap method, the ACC_lower and ACC_upper are the lower and higher percentile of the ACC_h,i distribution? > 2. For regression algorithms, is there a recommended equation for the no-information rate gamma? > 3. I need to plot the confidence interval and prediction interval for my Support Vector Regression prediction (just to clarify these intervals, please see an analogy from linear model on slide 14: http://www2.stat.duke.edu/~tjl13/s101/slides/unit6lec3H.pdf) - can I derive the intervals from .632+ bootstrap method or is there a different way of getting these intervals? > > Thank you! > Raga > > > On Wed, Mar 1, 2017 at 3:13 PM, Sebastian Raschka wrote: > Hi, Raga, > I have a short section on this here (https://sebastianraschka.com/blog/2016/model-evaluation-selection-part2.html#the-bootstrap-method-and-empirical-confidence-intervals) if it helps. > > Best, > Sebastian > > > On Mar 1, 2017, at 3:07 PM, Raga Markely wrote: > > > > Hi everyone, > > > > I wonder if you could provide me with some suggestions on how to determine the confidence and prediction intervals of SVR? If you have suggestions for any machine learning algorithms in general, that would be fine too (doesn't have to be specific for SVR). > > > > So far, I have found: > > 1. Bootstrap: http://stats.stackexchange.com/questions/183230/bootstrapping-confidence-interval-from-a-regression-prediction > > 2. http://journals.plos.org/plosone/article/file?id=10.1371/journal.pone.0048723&type=printable > > 3. ftp://ftp.esat.kuleuven.ac.be/sista/suykens/reports/10_156_v0.pdf > > > > But, I don't fully understand the details in #2 and #3 to the point that I can write a step by step code. If I use bootstrap method, I can get the confidence interval as follows? > > a. Draw bootstrap sample of size n > > b. Fit the SVR model (with hyperparameters chosen during model selection with grid search cv) to this bootstrap sample > > c. Use this model to predict the output variable y* from input variable X* > > d. Repeat step a-c for, for instance, 100 times > > e. Order the 100 values of y*, and determine, for instance, the 10th percentile and 90th percentile (if we are looking for 0.8 confidence interval) > > f. Repeat a-e for different values of X* to plot the prediction with confidence interval > > > > But, I don't know how to get the prediction interval from here. 
> > > > Thank you very much, > > Raga > > _______________________________________________ > > scikit-learn mailing list > > scikit-learn at python.org > > https://mail.python.org/mailman/listinfo/scikit-learn > > _______________________________________________ > scikit-learn mailing list > scikit-learn at python.org > https://mail.python.org/mailman/listinfo/scikit-learn > > _______________________________________________ > scikit-learn mailing list > scikit-learn at python.org > https://mail.python.org/mailman/listinfo/scikit-learn From raga.markely at gmail.com Wed Mar 1 22:07:35 2017 From: raga.markely at gmail.com (Raga Markely) Date: Wed, 1 Mar 2017 22:07:35 -0500 Subject: [scikit-learn] Confidence and Prediction Intervals of Support Vector Regression In-Reply-To: <7B05F1AE-FCE4-413E-B96A-773EEF2D7947@sebastianraschka.com> References: <7B05F1AE-FCE4-413E-B96A-773EEF2D7947@sebastianraschka.com> Message-ID: No worries, Sebastian :) .. thank you very much for your help.. I learned a lot of new things from your site today.. it led me to some relevant chapters in "The Elements of Statistical Learning", which then led me to chapter 8 page 264 about non-parametric & parametric bootstrap.. I think I will just go with the non-parametric bootstrap for my problem.. similar to the bootstrap steps i mentioned earlier.. Thank you! Raga On Wed, Mar 1, 2017 at 9:44 PM, Sebastian Raschka wrote: > Hi, Raga, > > > 1. Just to make sure I understand correctly, using the .632+ bootstrap > method, the ACC_lower and ACC_upper are the lower and higher percentile of > the ACC_h,i distribution? > > phew, I am actually not sure anymore ? I think it?s the percentile of the > ACC_boot distribution, similar to the ?classic? bootstrap but where > ACC_boot got computed from weighted ACC_h,i and ACC_r,i > > > 2. For regression algorithms, is there a recommended equation for the > no-information rate gamma? > > > Sorry, can?t be of much help here; I am not sure what the equivalent of > the no-information rate for regression would be ... > > > > > On Mar 1, 2017, at 5:39 PM, Raga Markely wrote: > > > > Thanks a lot, Sebastian! Very nicely written. > > > > I have a few follow-up questions: > > 1. Just to make sure I understand correctly, using the .632+ bootstrap > method, the ACC_lower and ACC_upper are the lower and higher percentile of > the ACC_h,i distribution? > > 2. For regression algorithms, is there a recommended equation for the > no-information rate gamma? > > 3. I need to plot the confidence interval and prediction interval for my > Support Vector Regression prediction (just to clarify these intervals, > please see an analogy from linear model on slide 14: > http://www2.stat.duke.edu/~tjl13/s101/slides/unit6lec3H.pdf) - can I > derive the intervals from .632+ bootstrap method or is there a different > way of getting these intervals? > > > > Thank you! > > Raga > > > > > > On Wed, Mar 1, 2017 at 3:13 PM, Sebastian Raschka > wrote: > > Hi, Raga, > > I have a short section on this here (https://sebastianraschka.com/ > blog/2016/model-evaluation-selection-part2.html#the-bootstrap-method-and- > empirical-confidence-intervals) if it helps. > > > > Best, > > Sebastian > > > > > On Mar 1, 2017, at 3:07 PM, Raga Markely > wrote: > > > > > > Hi everyone, > > > > > > I wonder if you could provide me with some suggestions on how to > determine the confidence and prediction intervals of SVR? 
If you have > suggestions for any machine learning algorithms in general, that would be > fine too (doesn't have to be specific for SVR). > > > > > > So far, I have found: > > > 1. Bootstrap: http://stats.stackexchange.com/questions/183230/ > bootstrapping-confidence-interval-from-a-regression-prediction > > > 2. http://journals.plos.org/plosone/article/file?id=10. > 1371/journal.pone.0048723&type=printable > > > 3. ftp://ftp.esat.kuleuven.ac.be/sista/suykens/reports/10_156_v0.pdf > > > > > > But, I don't fully understand the details in #2 and #3 to the point > that I can write a step by step code. If I use bootstrap method, I can get > the confidence interval as follows? > > > a. Draw bootstrap sample of size n > > > b. Fit the SVR model (with hyperparameters chosen during model > selection with grid search cv) to this bootstrap sample > > > c. Use this model to predict the output variable y* from input > variable X* > > > d. Repeat step a-c for, for instance, 100 times > > > e. Order the 100 values of y*, and determine, for instance, the 10th > percentile and 90th percentile (if we are looking for 0.8 confidence > interval) > > > f. Repeat a-e for different values of X* to plot the prediction with > confidence interval > > > > > > But, I don't know how to get the prediction interval from here. > > > > > > Thank you very much, > > > Raga > > > _______________________________________________ > > > scikit-learn mailing list > > > scikit-learn at python.org > > > https://mail.python.org/mailman/listinfo/scikit-learn > > > > _______________________________________________ > > scikit-learn mailing list > > scikit-learn at python.org > > https://mail.python.org/mailman/listinfo/scikit-learn > > > > _______________________________________________ > > scikit-learn mailing list > > scikit-learn at python.org > > https://mail.python.org/mailman/listinfo/scikit-learn > > _______________________________________________ > scikit-learn mailing list > scikit-learn at python.org > https://mail.python.org/mailman/listinfo/scikit-learn > -------------- next part -------------- An HTML attachment was scrubbed... URL: From se.raschka at gmail.com Wed Mar 1 22:13:02 2017 From: se.raschka at gmail.com (Sebastian Raschka) Date: Wed, 1 Mar 2017 22:13:02 -0500 Subject: [scikit-learn] Confidence and Prediction Intervals of Support Vector Regression In-Reply-To: References: <7B05F1AE-FCE4-413E-B96A-773EEF2D7947@sebastianraschka.com> Message-ID: <34DD2CC4-D6E1-4FFD-B53E-095EF1DAB7B8@gmail.com> Glad to hear that it was at least a little bit helpful :) (haha, Efron and Tibshirani even have a whole ~500 pg book on bootstrap if you have the time and patience ? :) https://www.crcpress.com/An-Introduction-to-the-Bootstrap/Efron-Tibshirani/p/book/9780412042317) > On Mar 1, 2017, at 10:07 PM, Raga Markely wrote: > > No worries, Sebastian :) .. thank you very much for your help.. I learned a lot of new things from your site today.. it led me to some relevant chapters in "The Elements of Statistical Learning", which then led me to chapter 8 page 264 about non-parametric & parametric bootstrap.. > > I think I will just go with the non-parametric bootstrap for my problem.. similar to the bootstrap steps i mentioned earlier.. > > Thank you! > Raga > > On Wed, Mar 1, 2017 at 9:44 PM, Sebastian Raschka wrote: > Hi, Raga, > > > 1. Just to make sure I understand correctly, using the .632+ bootstrap method, the ACC_lower and ACC_upper are the lower and higher percentile of the ACC_h,i distribution? 
> > phew, I am actually not sure anymore ? I think it?s the percentile of the ACC_boot distribution, similar to the ?classic? bootstrap but where ACC_boot got computed from weighted ACC_h,i and ACC_r,i > > > 2. For regression algorithms, is there a recommended equation for the no-information rate gamma? > > > Sorry, can?t be of much help here; I am not sure what the equivalent of the no-information rate for regression would be ... > > > > > On Mar 1, 2017, at 5:39 PM, Raga Markely wrote: > > > > Thanks a lot, Sebastian! Very nicely written. > > > > I have a few follow-up questions: > > 1. Just to make sure I understand correctly, using the .632+ bootstrap method, the ACC_lower and ACC_upper are the lower and higher percentile of the ACC_h,i distribution? > > 2. For regression algorithms, is there a recommended equation for the no-information rate gamma? > > 3. I need to plot the confidence interval and prediction interval for my Support Vector Regression prediction (just to clarify these intervals, please see an analogy from linear model on slide 14: http://www2.stat.duke.edu/~tjl13/s101/slides/unit6lec3H.pdf) - can I derive the intervals from .632+ bootstrap method or is there a different way of getting these intervals? > > > > Thank you! > > Raga > > > > > > On Wed, Mar 1, 2017 at 3:13 PM, Sebastian Raschka wrote: > > Hi, Raga, > > I have a short section on this here (https://sebastianraschka.com/blog/2016/model-evaluation-selection-part2.html#the-bootstrap-method-and-empirical-confidence-intervals) if it helps. > > > > Best, > > Sebastian > > > > > On Mar 1, 2017, at 3:07 PM, Raga Markely wrote: > > > > > > Hi everyone, > > > > > > I wonder if you could provide me with some suggestions on how to determine the confidence and prediction intervals of SVR? If you have suggestions for any machine learning algorithms in general, that would be fine too (doesn't have to be specific for SVR). > > > > > > So far, I have found: > > > 1. Bootstrap: http://stats.stackexchange.com/questions/183230/bootstrapping-confidence-interval-from-a-regression-prediction > > > 2. http://journals.plos.org/plosone/article/file?id=10.1371/journal.pone.0048723&type=printable > > > 3. ftp://ftp.esat.kuleuven.ac.be/sista/suykens/reports/10_156_v0.pdf > > > > > > But, I don't fully understand the details in #2 and #3 to the point that I can write a step by step code. If I use bootstrap method, I can get the confidence interval as follows? > > > a. Draw bootstrap sample of size n > > > b. Fit the SVR model (with hyperparameters chosen during model selection with grid search cv) to this bootstrap sample > > > c. Use this model to predict the output variable y* from input variable X* > > > d. Repeat step a-c for, for instance, 100 times > > > e. Order the 100 values of y*, and determine, for instance, the 10th percentile and 90th percentile (if we are looking for 0.8 confidence interval) > > > f. Repeat a-e for different values of X* to plot the prediction with confidence interval > > > > > > But, I don't know how to get the prediction interval from here. 
> > > > > > Thank you very much, > > > Raga > > > _______________________________________________ > > > scikit-learn mailing list > > > scikit-learn at python.org > > > https://mail.python.org/mailman/listinfo/scikit-learn > > > > _______________________________________________ > > scikit-learn mailing list > > scikit-learn at python.org > > https://mail.python.org/mailman/listinfo/scikit-learn > > > > _______________________________________________ > > scikit-learn mailing list > > scikit-learn at python.org > > https://mail.python.org/mailman/listinfo/scikit-learn > > _______________________________________________ > scikit-learn mailing list > scikit-learn at python.org > https://mail.python.org/mailman/listinfo/scikit-learn > > _______________________________________________ > scikit-learn mailing list > scikit-learn at python.org > https://mail.python.org/mailman/listinfo/scikit-learn From raga.markely at gmail.com Wed Mar 1 22:17:13 2017 From: raga.markely at gmail.com (Raga Markely) Date: Wed, 1 Mar 2017 22:17:13 -0500 Subject: [scikit-learn] Confidence and Prediction Intervals of Support Vector Regression In-Reply-To: <34DD2CC4-D6E1-4FFD-B53E-095EF1DAB7B8@gmail.com> References: <7B05F1AE-FCE4-413E-B96A-773EEF2D7947@sebastianraschka.com> <34DD2CC4-D6E1-4FFD-B53E-095EF1DAB7B8@gmail.com> Message-ID: that's a very serious dedication to bootstrap :) On Wed, Mar 1, 2017 at 10:13 PM, Sebastian Raschka wrote: > Glad to hear that it was at least a little bit helpful :) > (haha, Efron and Tibshirani even have a whole ~500 pg book on bootstrap if > you have the time and patience ? :) https://www.crcpress.com/An- > Introduction-to-the-Bootstrap/Efron-Tibshirani/p/book/9780412042317) > > > On Mar 1, 2017, at 10:07 PM, Raga Markely > wrote: > > > > No worries, Sebastian :) .. thank you very much for your help.. I > learned a lot of new things from your site today.. it led me to some > relevant chapters in "The Elements of Statistical Learning", which then led > me to chapter 8 page 264 about non-parametric & parametric bootstrap.. > > > > I think I will just go with the non-parametric bootstrap for my > problem.. similar to the bootstrap steps i mentioned earlier.. > > > > Thank you! > > Raga > > > > On Wed, Mar 1, 2017 at 9:44 PM, Sebastian Raschka < > mail at sebastianraschka.com> wrote: > > Hi, Raga, > > > > > 1. Just to make sure I understand correctly, using the .632+ bootstrap > method, the ACC_lower and ACC_upper are the lower and higher percentile of > the ACC_h,i distribution? > > > > phew, I am actually not sure anymore ? I think it?s the percentile of > the ACC_boot distribution, similar to the ?classic? bootstrap but where > ACC_boot got computed from weighted ACC_h,i and ACC_r,i > > > > > 2. For regression algorithms, is there a recommended equation for the > no-information rate gamma? > > > > > > Sorry, can?t be of much help here; I am not sure what the equivalent of > the no-information rate for regression would be ... > > > > > > > > > On Mar 1, 2017, at 5:39 PM, Raga Markely > wrote: > > > > > > Thanks a lot, Sebastian! Very nicely written. > > > > > > I have a few follow-up questions: > > > 1. Just to make sure I understand correctly, using the .632+ bootstrap > method, the ACC_lower and ACC_upper are the lower and higher percentile of > the ACC_h,i distribution? > > > 2. For regression algorithms, is there a recommended equation for the > no-information rate gamma? > > > 3. 
I need to plot the confidence interval and prediction interval for > my Support Vector Regression prediction (just to clarify these intervals, > please see an analogy from linear model on slide 14: > http://www2.stat.duke.edu/~tjl13/s101/slides/unit6lec3H.pdf) - can I > derive the intervals from .632+ bootstrap method or is there a different > way of getting these intervals? > > > > > > Thank you! > > > Raga > > > > > > > > > On Wed, Mar 1, 2017 at 3:13 PM, Sebastian Raschka < > se.raschka at gmail.com> wrote: > > > Hi, Raga, > > > I have a short section on this here (https://sebastianraschka.com/ > blog/2016/model-evaluation-selection-part2.html#the-bootstrap-method-and- > empirical-confidence-intervals) if it helps. > > > > > > Best, > > > Sebastian > > > > > > > On Mar 1, 2017, at 3:07 PM, Raga Markely > wrote: > > > > > > > > Hi everyone, > > > > > > > > I wonder if you could provide me with some suggestions on how to > determine the confidence and prediction intervals of SVR? If you have > suggestions for any machine learning algorithms in general, that would be > fine too (doesn't have to be specific for SVR). > > > > > > > > So far, I have found: > > > > 1. Bootstrap: http://stats.stackexchange.com/questions/183230/ > bootstrapping-confidence-interval-from-a-regression-prediction > > > > 2. http://journals.plos.org/plosone/article/file?id=10. > 1371/journal.pone.0048723&type=printable > > > > 3. ftp://ftp.esat.kuleuven.ac.be/sista/suykens/reports/10_156_v0.pdf > > > > > > > > But, I don't fully understand the details in #2 and #3 to the point > that I can write a step by step code. If I use bootstrap method, I can get > the confidence interval as follows? > > > > a. Draw bootstrap sample of size n > > > > b. Fit the SVR model (with hyperparameters chosen during model > selection with grid search cv) to this bootstrap sample > > > > c. Use this model to predict the output variable y* from input > variable X* > > > > d. Repeat step a-c for, for instance, 100 times > > > > e. Order the 100 values of y*, and determine, for instance, the 10th > percentile and 90th percentile (if we are looking for 0.8 confidence > interval) > > > > f. Repeat a-e for different values of X* to plot the prediction with > confidence interval > > > > > > > > But, I don't know how to get the prediction interval from here. > > > > > > > > Thank you very much, > > > > Raga > > > > _______________________________________________ > > > > scikit-learn mailing list > > > > scikit-learn at python.org > > > > https://mail.python.org/mailman/listinfo/scikit-learn > > > > > > _______________________________________________ > > > scikit-learn mailing list > > > scikit-learn at python.org > > > https://mail.python.org/mailman/listinfo/scikit-learn > > > > > > _______________________________________________ > > > scikit-learn mailing list > > > scikit-learn at python.org > > > https://mail.python.org/mailman/listinfo/scikit-learn > > > > _______________________________________________ > > scikit-learn mailing list > > scikit-learn at python.org > > https://mail.python.org/mailman/listinfo/scikit-learn > > > > _______________________________________________ > > scikit-learn mailing list > > scikit-learn at python.org > > https://mail.python.org/mailman/listinfo/scikit-learn > > _______________________________________________ > scikit-learn mailing list > scikit-learn at python.org > https://mail.python.org/mailman/listinfo/scikit-learn > -------------- next part -------------- An HTML attachment was scrubbed... 
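A minimal sketch of the non-parametric bootstrap procedure (steps a-f) discussed in the thread above, assuming scikit-learn's SVR and sklearn.utils.resample are used; the kernel/C/epsilon values, the 100 resampling rounds, and the 10th/90th percentiles below are illustrative placeholders, not values fixed by the discussion. Note this produces the percentile confidence band for the fitted regression function at each X*; how to widen it into a prediction interval for new observations was left open in the thread.

import numpy as np
from sklearn.svm import SVR
from sklearn.utils import resample

def bootstrap_confidence_band(X, y, X_star, n_rounds=100, lower=10.0, upper=90.0):
    """Percentile confidence band for SVR predictions at the query points X_star."""
    preds = np.empty((n_rounds, len(X_star)))
    for i in range(n_rounds):
        # a. draw a bootstrap sample of size n (sampling rows with replacement)
        X_boot, y_boot = resample(X, y)
        # b. fit SVR to the bootstrap sample (hyperparameters assumed to have been
        #    chosen beforehand, e.g. by grid-search cross-validation)
        model = SVR(kernel='rbf', C=1.0, epsilon=0.1).fit(X_boot, y_boot)
        # c./d. predict y* at the query points X*, repeated for n_rounds resamples
        preds[i] = model.predict(X_star)
    # e./f. order the n_rounds predictions at each X* and take the chosen percentiles
    return np.percentile(preds, [lower, upper], axis=0)

Calling this with lower=10 and upper=90, as in step e, corresponds to the 0.8 interval mentioned in the thread.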
URL: From shubham.bhardwaj2015 at vit.ac.in Thu Mar 2 04:07:36 2017 From: shubham.bhardwaj2015 at vit.ac.in (SHUBHAM BHARDWAJ 15BCE0704) Date: Thu, 2 Mar 2017 14:37:36 +0530 Subject: [scikit-learn] GSoc, 2017 (proposal idea and intro) .reg Message-ID: Hello Sir, My introduction : I am a 2nd year student studying Computer Science and engineering from VIT, Vellore. I work in Google Developers Group VIT. All my experience has been about collaborating with a lot of people ,working as a team, building products and learning along the way. Since scikit-learn is participating this time I am too planning to submit a proposal. Proposal idea: I am really interested in implementing kmeans++ algorithm.I was doing some work on DT but I found this very appealing. Just wanted to know if it can be a good project idea. Regards Shubham Bhardwaj -------------- next part -------------- An HTML attachment was scrubbed... URL: From jmschreiber91 at gmail.com Thu Mar 2 13:31:46 2017 From: jmschreiber91 at gmail.com (Jacob Schreiber) Date: Thu, 2 Mar 2017 10:31:46 -0800 Subject: [scikit-learn] GSoc, 2017 (proposal idea and intro) .reg In-Reply-To: References: Message-ID: Hi Shubham Thanks for your interest. I'm eager to see your contributions to sklearn in the future. However, I'm pretty sure kmeans++ is already implemented: http://scikit-learn.org/stable/modules/generated/sklearn.cluster.KMeans.html Jacob On Thu, Mar 2, 2017 at 1:07 AM, SHUBHAM BHARDWAJ 15BCE0704 < shubham.bhardwaj2015 at vit.ac.in> wrote: > Hello Sir, > > My introduction : > I am a 2nd year student studying Computer Science and engineering from > VIT, Vellore. I work in Google Developers Group VIT. All my experience has > been about collaborating with a lot of people ,working as a team, building > products and learning along the way. > Since scikit-learn is participating this time I am too planning to submit > a proposal. > > Proposal idea: > I am really interested in implementing kmeans++ algorithm.I was doing some > work on DT but I found this very appealing. Just wanted to know if it can > be a good project idea. > > Regards > Shubham Bhardwaj > > > > _______________________________________________ > scikit-learn mailing list > scikit-learn at python.org > https://mail.python.org/mailman/listinfo/scikit-learn > > -------------- next part -------------- An HTML attachment was scrubbed... URL: From shubham.bhardwaj2015 at vit.ac.in Thu Mar 2 20:00:12 2017 From: shubham.bhardwaj2015 at vit.ac.in (SHUBHAM BHARDWAJ 15BCE0704) Date: Fri, 3 Mar 2017 06:30:12 +0530 Subject: [scikit-learn] GSoc, 2017 (proposal idea and intro) .reg In-Reply-To: References: Message-ID: Hello Sir, Thanks a lot for the reply. Sorry for not being elaborate about what I was trying to address. I wanted to implement this [ http://ilpubs.stanford.edu:8090/778/1/2006-13.pdf] (1200+citations)- mentioned in comments. This pertains to the stalled issue #4357 .Proposal idea - implementing a scalable kmeans++. Regards Shubham Bhardwaj On Fri, Mar 3, 2017 at 12:01 AM, Jacob Schreiber wrote: > Hi Shubham > > Thanks for your interest. I'm eager to see your contributions to sklearn > in the future. However, I'm pretty sure kmeans++ is already implemented: > http://scikit-learn.org/stable/modules/generated/sklearn.cluster. > KMeans.html > > Jacob > > On Thu, Mar 2, 2017 at 1:07 AM, SHUBHAM BHARDWAJ 15BCE0704 < > shubham.bhardwaj2015 at vit.ac.in> wrote: > >> Hello Sir, >> >> My introduction : >> I am a 2nd year student studying Computer Science and engineering from >> VIT, Vellore. 
I work in Google Developers Group VIT. All my experience has >> been about collaborating with a lot of people ,working as a team, building >> products and learning along the way. >> Since scikit-learn is participating this time I am too planning to submit >> a proposal. >> >> Proposal idea: >> I am really interested in implementing kmeans++ algorithm.I was doing >> some work on DT but I found this very appealing. Just wanted to know if it >> can be a good project idea. >> >> Regards >> Shubham Bhardwaj >> >> >> >> _______________________________________________ >> scikit-learn mailing list >> scikit-learn at python.org >> https://mail.python.org/mailman/listinfo/scikit-learn >> >> > > _______________________________________________ > scikit-learn mailing list > scikit-learn at python.org > https://mail.python.org/mailman/listinfo/scikit-learn > > -------------- next part -------------- An HTML attachment was scrubbed... URL: From g.lemaitre58 at gmail.com Thu Mar 2 20:10:32 2017 From: g.lemaitre58 at gmail.com (=?UTF-8?Q?Guillaume_Lema=C3=AEtre?=) Date: Fri, 3 Mar 2017 02:10:32 +0100 Subject: [scikit-learn] GSoc, 2017 (proposal idea and intro) .reg In-Reply-To: References: Message-ID: I think that you mean this paper -> Scalable K-Means++ -> 218 citations On 3 March 2017 at 02:00, SHUBHAM BHARDWAJ 15BCE0704 < shubham.bhardwaj2015 at vit.ac.in> wrote: > Hello Sir, > > Thanks a lot for the reply. Sorry for not being elaborate about what I was > trying to address. I wanted to implement this [http://ilpubs.stanford.edu: > 8090/778/1/2006-13.pdf] (1200+citations)- mentioned in comments. This > pertains to the stalled issue #4357 .Proposal idea - implementing a > scalable kmeans++. > > Regards > Shubham Bhardwaj > > On Fri, Mar 3, 2017 at 12:01 AM, Jacob Schreiber > wrote: > >> Hi Shubham >> >> Thanks for your interest. I'm eager to see your contributions to sklearn >> in the future. However, I'm pretty sure kmeans++ is already implemented: >> http://scikit-learn.org/stable/modules/generate >> d/sklearn.cluster.KMeans.html >> >> Jacob >> >> On Thu, Mar 2, 2017 at 1:07 AM, SHUBHAM BHARDWAJ 15BCE0704 < >> shubham.bhardwaj2015 at vit.ac.in> wrote: >> >>> Hello Sir, >>> >>> My introduction : >>> I am a 2nd year student studying Computer Science and engineering from >>> VIT, Vellore. I work in Google Developers Group VIT. All my experience has >>> been about collaborating with a lot of people ,working as a team, building >>> products and learning along the way. >>> Since scikit-learn is participating this time I am too planning to >>> submit a proposal. >>> >>> Proposal idea: >>> I am really interested in implementing kmeans++ algorithm.I was doing >>> some work on DT but I found this very appealing. Just wanted to know if it >>> can be a good project idea. 
>>> >>> Regards >>> Shubham Bhardwaj >>> >>> >>> >>> _______________________________________________ >>> scikit-learn mailing list >>> scikit-learn at python.org >>> https://mail.python.org/mailman/listinfo/scikit-learn >>> >>> >> >> _______________________________________________ >> scikit-learn mailing list >> scikit-learn at python.org >> https://mail.python.org/mailman/listinfo/scikit-learn >> >> > > _______________________________________________ > scikit-learn mailing list > scikit-learn at python.org > https://mail.python.org/mailman/listinfo/scikit-learn > > -- Guillaume Lemaitre INRIA Saclay - Ile-de-France Equipe PARIETAL guillaume.lemaitre at inria.f r --- https://glemaitre.github.io/ -------------- next part -------------- An HTML attachment was scrubbed... URL: From shubham.bhardwaj2015 at vit.ac.in Thu Mar 2 20:23:16 2017 From: shubham.bhardwaj2015 at vit.ac.in (SHUBHAM BHARDWAJ 15BCE0704) Date: Fri, 3 Mar 2017 06:53:16 +0530 Subject: [scikit-learn] GSoc, 2017 (proposal idea and intro) .reg In-Reply-To: References: Message-ID: Hello Sir, Very Sorry for the numbers I saw this written in the comments.I assumed -Given the person who suggested the paper might have taken a look into the number of citations.I will make sure to personally check myself. Regards Shubham Bhardwaj On Fri, Mar 3, 2017 at 6:40 AM, Guillaume Lema?tre wrote: > I think that you mean this paper -> Scalable K-Means++ -> 218 citations > > On 3 March 2017 at 02:00, SHUBHAM BHARDWAJ 15BCE0704 < > shubham.bhardwaj2015 at vit.ac.in> wrote: > >> Hello Sir, >> >> Thanks a lot for the reply. Sorry for not being elaborate about what I >> was trying to address. I wanted to implement this [ >> http://ilpubs.stanford.edu:8090/778/1/2006-13.pdf] (1200+citations)- >> mentioned in comments. This pertains to the stalled issue #4357 .Proposal >> idea - implementing a scalable kmeans++. >> >> Regards >> Shubham Bhardwaj >> >> On Fri, Mar 3, 2017 at 12:01 AM, Jacob Schreiber > > wrote: >> >>> Hi Shubham >>> >>> Thanks for your interest. I'm eager to see your contributions to sklearn >>> in the future. However, I'm pretty sure kmeans++ is already implemented: >>> http://scikit-learn.org/stable/modules/generate >>> d/sklearn.cluster.KMeans.html >>> >>> Jacob >>> >>> On Thu, Mar 2, 2017 at 1:07 AM, SHUBHAM BHARDWAJ 15BCE0704 < >>> shubham.bhardwaj2015 at vit.ac.in> wrote: >>> >>>> Hello Sir, >>>> >>>> My introduction : >>>> I am a 2nd year student studying Computer Science and engineering from >>>> VIT, Vellore. I work in Google Developers Group VIT. All my experience has >>>> been about collaborating with a lot of people ,working as a team, building >>>> products and learning along the way. >>>> Since scikit-learn is participating this time I am too planning to >>>> submit a proposal. >>>> >>>> Proposal idea: >>>> I am really interested in implementing kmeans++ algorithm.I was doing >>>> some work on DT but I found this very appealing. Just wanted to know if it >>>> can be a good project idea. 
>>>> >>>> Regards >>>> Shubham Bhardwaj >>>> >>>> >>>> >>>> _______________________________________________ >>>> scikit-learn mailing list >>>> scikit-learn at python.org >>>> https://mail.python.org/mailman/listinfo/scikit-learn >>>> >>>> >>> >>> _______________________________________________ >>> scikit-learn mailing list >>> scikit-learn at python.org >>> https://mail.python.org/mailman/listinfo/scikit-learn >>> >>> >> >> _______________________________________________ >> scikit-learn mailing list >> scikit-learn at python.org >> https://mail.python.org/mailman/listinfo/scikit-learn >> >> > > > -- > Guillaume Lemaitre > INRIA Saclay - Ile-de-France > Equipe PARIETAL > guillaume.lemaitre at inria.f r --- > https://glemaitre.github.io/ > > _______________________________________________ > scikit-learn mailing list > scikit-learn at python.org > https://mail.python.org/mailman/listinfo/scikit-learn > > -------------- next part -------------- An HTML attachment was scrubbed... URL: From ragvrv at gmail.com Fri Mar 3 12:56:27 2017 From: ragvrv at gmail.com (Raghav R V) Date: Fri, 3 Mar 2017 18:56:27 +0100 Subject: [scikit-learn] MAPE in scikit-learn? Message-ID: Hi all, Do we want Median Absolute Percentage Error in scikit-learn? Ref: KDD2017 - https://tianchi.shuju.aliyun.com/competition/information.htm?spm=5176.100067.5678.2.8CnCPt&raceId=231597 Thanks -- Raghav RV https://github.com/raghavrv -------------- next part -------------- An HTML attachment was scrubbed... URL: From ahowe42 at gmail.com Fri Mar 3 13:18:22 2017 From: ahowe42 at gmail.com (Andrew Howe) Date: Fri, 3 Mar 2017 21:18:22 +0300 Subject: [scikit-learn] MAPE in scikit-learn? In-Reply-To: References: Message-ID: I would think so. I've used it in research before. Andrew <~~~~~~~~~~~~~~~~~~~~~~~~~~~> J. Andrew Howe, PhD www.andrewhowe.com http://www.linkedin.com/in/ahowe42 https://www.researchgate.net/profile/John_Howe12/ I live to learn, so I can learn to live. - me <~~~~~~~~~~~~~~~~~~~~~~~~~~~> On Fri, Mar 3, 2017 at 8:56 PM, Raghav R V wrote: > Hi all, > > Do we want Median Absolute Percentage Error in scikit-learn? > > Ref: KDD2017 - https://tianchi.shuju.aliyun.com/competition/ > information.htm?spm=5176.100067.5678.2.8CnCPt&raceId=231597 > > Thanks > > -- > Raghav RV > https://github.com/raghavrv > > > _______________________________________________ > scikit-learn mailing list > scikit-learn at python.org > https://mail.python.org/mailman/listinfo/scikit-learn > > -------------- next part -------------- An HTML attachment was scrubbed... URL: From amanpratik10 at gmail.com Fri Mar 3 14:54:13 2017 From: amanpratik10 at gmail.com (Aman Pratik) Date: Sat, 4 Mar 2017 01:24:13 +0530 Subject: [scikit-learn] GSoC 2017 Message-ID: Hello Developers, This is Aman Pratik. I am currently pursuing my B.Tech from Indian Institute of Technology, Varanasi. I am a keen software developer and not very new to the open source community. I am interested in your project "*Improve online learning for linear models*" for GSoC 2017. I have been working in Python for the past 2 years and have good idea about Machine Learning algorithms. I am quite familiar with scikit-learn both as a user and a developer. These are the PRs I have worked/working on for the past few months. 
[MRG+1] Issue#5803 : Regression Test added #8112 [MRG] Issue#6673 : Make a wrapper around functions that score an individual feature [MRG] Issue #7987: Embarrassingly parallel "n_restarts_optimizer" in GaussianProcessRegressor My GitHub Profile: https://www.github.com/amanp10 I have basic knowledge about SGD (Stochastic Gradient Descent) and related algorithms. Also, I am familiar with Benchmark tests, Unit tests and other technical knowledge I would require for this project. I have started my study for the subject and am looking forward to guidance from the potential mentors or anyone willing to help. Thank You -------------- next part -------------- An HTML attachment was scrubbed... URL: From jmschreiber91 at gmail.com Fri Mar 3 17:36:21 2017 From: jmschreiber91 at gmail.com (Jacob Schreiber) Date: Fri, 3 Mar 2017 14:36:21 -0800 Subject: [scikit-learn] Scipy 2017 In-Reply-To: References: <65ef1d1c-28a9-0772-6da1-3b54feb7cfd1@gmail.com> Message-ID: Do you still need someone to help with the tutorial? I may be able to attend. On Tue, Feb 28, 2017 at 9:43 AM, Nelson Liu wrote: > The conference generally (at least for the last three years) uploads > recordings of the tutorials afterwards, e.g. here > is part one of the > scikit-learn tutorial at Scipy 2016. I would assume that they are doing > this again. > > Nelson Liu > > On Tue, Feb 28, 2017 at 9:37 AM, Ruchika Nayyar > wrote: > >> Hello >> >> Will there be a video link ? >> >> Thanks, >> Ruchika >> ---------------------------------------- >> Dr Ruchika Nayyar, >> Post Doctoral Fellow for ATLAS Collaboration >> University of Arizona >> Arizona, USA. >> -------------------------------------------- >> >> On Mon, Feb 27, 2017 at 2:20 PM, Alexandre Gramfort < >> alexandre.gramfort at telecom-paristech.fr> wrote: >> >>> Hi Andy, >>> >>> I'll be happy to share the stage with you for a tutorial. >>> >>> Alex >>> >>> >>> On Tue, Feb 21, 2017 at 3:52 PM, Andreas Mueller >>> wrote: >>> > Hey folks. >>> > Who's coming to scipy this year? >>> > Any volunteers for tutorials? I'm happy to be part of it but doing 7h >>> by >>> > myself is a bit much ;) >>> > >>> > >>> > Andy >>> > _______________________________________________ >>> > scikit-learn mailing list >>> > scikit-learn at python.org >>> > https://mail.python.org/mailman/listinfo/scikit-learn >>> _______________________________________________ >>> scikit-learn mailing list >>> scikit-learn at python.org >>> https://mail.python.org/mailman/listinfo/scikit-learn >>> >> >> >> _______________________________________________ >> scikit-learn mailing list >> scikit-learn at python.org >> https://mail.python.org/mailman/listinfo/scikit-learn >> >> > > _______________________________________________ > scikit-learn mailing list > scikit-learn at python.org > https://mail.python.org/mailman/listinfo/scikit-learn > > -------------- next part -------------- An HTML attachment was scrubbed... URL: From mikhail at combust.ml Sat Mar 4 17:12:11 2017 From: mikhail at combust.ml (Mikhail Semeniuk) Date: Sat, 4 Mar 2017 14:12:11 -0800 Subject: [scikit-learn] MAPE in scikit-learn? In-Reply-To: References: Message-ID: +1 On Fri, Mar 3, 2017 at 10:18 AM, Andrew Howe wrote: > I would think so. I've used it in research before. > > Andrew > > <~~~~~~~~~~~~~~~~~~~~~~~~~~~> > J. Andrew Howe, PhD > www.andrewhowe.com > http://www.linkedin.com/in/ahowe42 > https://www.researchgate.net/profile/John_Howe12/ > I live to learn, so I can learn to live. 
- me > <~~~~~~~~~~~~~~~~~~~~~~~~~~~> > > On Fri, Mar 3, 2017 at 8:56 PM, Raghav R V wrote: > >> Hi all, >> >> Do we want Median Absolute Percentage Error in scikit-learn? >> >> Ref: KDD2017 - https://tianchi.shuju.aliyun.com/competition/information. >> htm?spm=5176.100067.5678.2.8CnCPt&raceId=231597 >> >> Thanks >> >> -- >> Raghav RV >> https://github.com/raghavrv >> >> >> _______________________________________________ >> scikit-learn mailing list >> scikit-learn at python.org >> https://mail.python.org/mailman/listinfo/scikit-learn >> >> > > _______________________________________________ > scikit-learn mailing list > scikit-learn at python.org > https://mail.python.org/mailman/listinfo/scikit-learn > > -------------- next part -------------- An HTML attachment was scrubbed... URL: From t3kcit at gmail.com Sun Mar 5 12:42:07 2017 From: t3kcit at gmail.com (Andreas Mueller) Date: Sun, 5 Mar 2017 12:42:07 -0500 Subject: [scikit-learn] MAPE in scikit-learn? In-Reply-To: References: Message-ID: <03202134-9d33-f31d-b448-ba00f1bb6a54@gmail.com> +1 On 03/04/2017 05:12 PM, Mikhail Semeniuk wrote: > +1 > > On Fri, Mar 3, 2017 at 10:18 AM, Andrew Howe > wrote: > > I would think so. I've used it in research before. > > Andrew > > <~~~~~~~~~~~~~~~~~~~~~~~~~~~> > J. Andrew Howe, PhD > www.andrewhowe.com > http://www.linkedin.com/in/ahowe42 > > https://www.researchgate.net/profile/John_Howe12/ > > I live to learn, so I can learn to live. - me > <~~~~~~~~~~~~~~~~~~~~~~~~~~~> > > On Fri, Mar 3, 2017 at 8:56 PM, Raghav R V > wrote: > > Hi all, > > Do we want Median Absolute Percentage Error in scikit-learn? > > Ref: KDD2017 - > https://tianchi.shuju.aliyun.com/competition/information.htm?spm=5176.100067.5678.2.8CnCPt&raceId=231597 > > > Thanks > > -- > Raghav RV > https://github.com/raghavrv > > > _______________________________________________ > scikit-learn mailing list > scikit-learn at python.org > https://mail.python.org/mailman/listinfo/scikit-learn > > > > > _______________________________________________ > scikit-learn mailing list > scikit-learn at python.org > https://mail.python.org/mailman/listinfo/scikit-learn > > > > > > _______________________________________________ > scikit-learn mailing list > scikit-learn at python.org > https://mail.python.org/mailman/listinfo/scikit-learn -------------- next part -------------- An HTML attachment was scrubbed... URL: From t3kcit at gmail.com Sun Mar 5 12:42:20 2017 From: t3kcit at gmail.com (Andreas Mueller) Date: Sun, 5 Mar 2017 12:42:20 -0500 Subject: [scikit-learn] Scipy 2017 In-Reply-To: References: <65ef1d1c-28a9-0772-6da1-3b54feb7cfd1@gmail.com> Message-ID: I'm gonna do it with Alex :) On 03/03/2017 05:36 PM, Jacob Schreiber wrote: > Do you still need someone to help with the tutorial? I may be able to > attend. > > On Tue, Feb 28, 2017 at 9:43 AM, Nelson Liu > wrote: > > The conference generally (at least for the last three years) > uploads recordings of the tutorials afterwards, e.g. here > is part one of the > scikit-learn tutorial at Scipy 2016. I would assume that they are > doing this again. > > Nelson Liu > > On Tue, Feb 28, 2017 at 9:37 AM, Ruchika Nayyar > > wrote: > > Hello > > Will there be a video link ? > > Thanks, > Ruchika > ---------------------------------------- > Dr Ruchika Nayyar, > Post Doctoral Fellow for ATLAS Collaboration > University of Arizona > Arizona, USA. 
> -------------------------------------------- > > On Mon, Feb 27, 2017 at 2:20 PM, Alexandre Gramfort > > wrote: > > Hi Andy, > > I'll be happy to share the stage with you for a tutorial. > > Alex > > > On Tue, Feb 21, 2017 at 3:52 PM, Andreas Mueller > > wrote: > > Hey folks. > > Who's coming to scipy this year? > > Any volunteers for tutorials? I'm happy to be part of it > but doing 7h by > > myself is a bit much ;) > > > > > > Andy > > _______________________________________________ > > scikit-learn mailing list > > scikit-learn at python.org > > https://mail.python.org/mailman/listinfo/scikit-learn > > _______________________________________________ > scikit-learn mailing list > scikit-learn at python.org > https://mail.python.org/mailman/listinfo/scikit-learn > > > > > _______________________________________________ > scikit-learn mailing list > scikit-learn at python.org > https://mail.python.org/mailman/listinfo/scikit-learn > > > > > _______________________________________________ > scikit-learn mailing list > scikit-learn at python.org > https://mail.python.org/mailman/listinfo/scikit-learn > > > > > > _______________________________________________ > scikit-learn mailing list > scikit-learn at python.org > https://mail.python.org/mailman/listinfo/scikit-learn -------------- next part -------------- An HTML attachment was scrubbed... URL: From t3kcit at gmail.com Sun Mar 5 12:47:09 2017 From: t3kcit at gmail.com (Andreas Mueller) Date: Sun, 5 Mar 2017 12:47:09 -0500 Subject: [scikit-learn] Scikit-learn survey results Message-ID: <23963584-ae33-db28-90d8-6e1479e3f862@gmail.com> Hey all. In case you're interested, here is a summary view of the scikit-learn survey I posted recently: https://www.surveymonkey.com/results/SM-RHGZVZ73/ tldr; Preprocessing takes the most time, people want out-of-core learning, better integration with pandas and easier visualization of models and data. People would use automatic machine learning if it was there, but it's not the highest priority item. There is also a lot of interesting info in the comments, but because I was not able to go through all of them yet, I don't want to publish them publicly in case there is sensitive information included (and if anyone knows if there are legal implications if there wasn't a disclaimer, please let me know). Cheers, Andy From t3kcit at gmail.com Sun Mar 5 14:02:01 2017 From: t3kcit at gmail.com (Andreas Mueller) Date: Sun, 5 Mar 2017 14:02:01 -0500 Subject: [scikit-learn] GSoc, 2017 (proposal idea and intro) .reg In-Reply-To: References: Message-ID: There was a PR here: https://github.com/scikit-learn/scikit-learn/pull/5530 but it didn't seem to work. Feel free to convince us otherwise ;) On 03/02/2017 08:23 PM, SHUBHAM BHARDWAJ 15BCE0704 wrote: > Hello Sir, > Very Sorry for the numbers I saw this written in the comments.I > assumed -Given the person who suggested the paper might have taken a > look into the number of citations.I will make sure to personally check > myself. > > Regards > Shubham Bhardwaj > > On Fri, Mar 3, 2017 at 6:40 AM, Guillaume Lema?tre > > wrote: > > I think that you mean this paper -> Scalable K-Means++ -> 218 > citations > > On 3 March 2017 at 02:00, SHUBHAM BHARDWAJ 15BCE0704 > > wrote: > > Hello Sir, > > Thanks a lot for the reply. Sorry for not being elaborate > about what I was trying to address. I wanted to implement this > [http://ilpubs.stanford.edu:8090/778/1/2006-13.pdf > ] > (1200+citations)- mentioned in comments. 
This pertains to the > stalled issue #4357 .Proposal idea - implementing a scalable > kmeans++. > > Regards > Shubham Bhardwaj > > On Fri, Mar 3, 2017 at 12:01 AM, Jacob Schreiber > > wrote: > > Hi Shubham > > Thanks for your interest. I'm eager to see your > contributions to sklearn in the future. However, I'm > pretty sure kmeans++ is already implemented: > http://scikit-learn.org/stable/modules/generated/sklearn.cluster.KMeans.html > > > Jacob > > On Thu, Mar 2, 2017 at 1:07 AM, SHUBHAM BHARDWAJ 15BCE0704 > > wrote: > > Hello Sir, > > My introduction : > I am a 2nd year student studying Computer Science and > engineering from VIT, Vellore. I work in Google > Developers Group VIT. All my experience has been about > collaborating with a lot of people ,working as a team, > building products and learning along the way. > Since scikit-learn is participating this time I am too > planning to submit a proposal. > > Proposal idea: > I am really interested in implementing kmeans++ > algorithm.I was doing some work on DT but I found this > very appealing. Just wanted to know if it can be a > good project idea. > > Regards > Shubham Bhardwaj > > _______________________________________________ > scikit-learn mailing list > scikit-learn at python.org > https://mail.python.org/mailman/listinfo/scikit-learn > > > > > _______________________________________________ > scikit-learn mailing list > scikit-learn at python.org > https://mail.python.org/mailman/listinfo/scikit-learn > > > > > _______________________________________________ > scikit-learn mailing list > scikit-learn at python.org > https://mail.python.org/mailman/listinfo/scikit-learn > > > > > > -- > Guillaume Lemaitre > INRIA Saclay - Ile-de-France > Equipe PARIETAL > guillaume.lemaitre at inria.f r > --- https://glemaitre.github.io/ > > _______________________________________________ > scikit-learn mailing list > scikit-learn at python.org > https://mail.python.org/mailman/listinfo/scikit-learn > > > > > > _______________________________________________ > scikit-learn mailing list > scikit-learn at python.org > https://mail.python.org/mailman/listinfo/scikit-learn -------------- next part -------------- An HTML attachment was scrubbed... URL: From shubham.bhardwaj2015 at vit.ac.in Mon Mar 6 06:25:05 2017 From: shubham.bhardwaj2015 at vit.ac.in (SHUBHAM BHARDWAJ 15BCE0704) Date: Mon, 6 Mar 2017 16:55:05 +0530 Subject: [scikit-learn] GSoc, 2017 (proposal idea and intro) .reg In-Reply-To: References: Message-ID: Hello Sir, Thanks for the reply, I will try to reproduce the claims of the paper and would update about my progress. Regards Shubham On Mon, Mar 6, 2017 at 12:32 AM, Andreas Mueller wrote: > There was a PR here: > https://github.com/scikit-learn/scikit-learn/pull/5530 > > but it didn't seem to work. Feel free to convince us otherwise ;) > > > > On 03/02/2017 08:23 PM, SHUBHAM BHARDWAJ 15BCE0704 wrote: > > Hello Sir, > Very Sorry for the numbers I saw this written in the comments.I assumed > -Given the person who suggested the paper might have taken a look into the > number of citations.I will make sure to personally check myself. > > Regards > Shubham Bhardwaj > > On Fri, Mar 3, 2017 at 6:40 AM, Guillaume Lema?tre > wrote: > >> I think that you mean this paper -> Scalable K-Means++ -> 218 citations >> >> On 3 March 2017 at 02:00, SHUBHAM BHARDWAJ 15BCE0704 < >> shubham.bhardwaj2015 at vit.ac.in> wrote: >> >>> Hello Sir, >>> >>> Thanks a lot for the reply. Sorry for not being elaborate about what I >>> was trying to address. 
I wanted to implement this [ >>> http://ilpubs.stanford.edu:8090/778/1/2006-13.pdf] (1200+citations)- >>> mentioned in comments. This pertains to the stalled issue #4357 .Proposal >>> idea - implementing a scalable kmeans++. >>> >>> Regards >>> Shubham Bhardwaj >>> >>> On Fri, Mar 3, 2017 at 12:01 AM, Jacob Schreiber < >>> jmschreiber91 at gmail.com> wrote: >>> >>>> Hi Shubham >>>> >>>> Thanks for your interest. I'm eager to see your contributions to >>>> sklearn in the future. However, I'm pretty sure kmeans++ is already >>>> implemented: http://scikit-learn.org/stable/modules/generate >>>> d/sklearn.cluster.KMeans.html >>>> >>>> Jacob >>>> >>>> On Thu, Mar 2, 2017 at 1:07 AM, SHUBHAM BHARDWAJ 15BCE0704 < >>>> shubham.bhardwaj2015 at vit.ac.in> wrote: >>>> >>>>> Hello Sir, >>>>> >>>>> My introduction : >>>>> I am a 2nd year student studying Computer Science and engineering from >>>>> VIT, Vellore. I work in Google Developers Group VIT. All my experience has >>>>> been about collaborating with a lot of people ,working as a team, building >>>>> products and learning along the way. >>>>> Since scikit-learn is participating this time I am too planning to >>>>> submit a proposal. >>>>> >>>>> Proposal idea: >>>>> I am really interested in implementing kmeans++ algorithm.I was doing >>>>> some work on DT but I found this very appealing. Just wanted to know if it >>>>> can be a good project idea. >>>>> >>>>> Regards >>>>> Shubham Bhardwaj >>>>> >>>>> >>>>> >>>>> _______________________________________________ >>>>> scikit-learn mailing list >>>>> scikit-learn at python.org >>>>> https://mail.python.org/mailman/listinfo/scikit-learn >>>>> >>>>> >>>> >>>> _______________________________________________ >>>> scikit-learn mailing list >>>> scikit-learn at python.org >>>> https://mail.python.org/mailman/listinfo/scikit-learn >>>> >>>> >>> >>> _______________________________________________ >>> scikit-learn mailing list >>> scikit-learn at python.org >>> https://mail.python.org/mailman/listinfo/scikit-learn >>> >>> >> >> >> -- >> Guillaume Lemaitre >> INRIA Saclay - Ile-de-France >> Equipe PARIETAL >> guillaume.lemaitre at inria.f r --- >> https://glemaitre.github.io/ >> >> _______________________________________________ >> scikit-learn mailing list >> scikit-learn at python.org >> https://mail.python.org/mailman/listinfo/scikit-learn >> >> > > > _______________________________________________ > scikit-learn mailing listscikit-learn at python.orghttps://mail.python.org/mailman/listinfo/scikit-learn > > > > _______________________________________________ > scikit-learn mailing list > scikit-learn at python.org > https://mail.python.org/mailman/listinfo/scikit-learn > > -------------- next part -------------- An HTML attachment was scrubbed... URL: From bdholt1 at gmail.com Mon Mar 6 06:46:37 2017 From: bdholt1 at gmail.com (Brian Holt) Date: Mon, 6 Mar 2017 11:46:37 +0000 Subject: [scikit-learn] Scikit-learn survey results In-Reply-To: <23963584-ae33-db28-90d8-6e1479e3f862@gmail.com> References: <23963584-ae33-db28-90d8-6e1479e3f862@gmail.com> Message-ID: Thanks Andy, That's really interesting and gives some hints for future direction. As an initial suggestion, I wonder if incremental decision tree learning would be welcomed by the project? My personal experience building trees was very often frustrated by memory constraints and an alternative that uses batches would allow the technique to scale up to much larger datasets that don't fit in memory. 
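For reference, scikit-learn's existing out-of-core interface is partial_fit; the sketch below (using SGDClassifier purely as a stand-in, since no incremental tree estimator currently exists in scikit-learn, and with a simulated data source) shows the batch-wise fitting pattern such a tree learner could plug into:

```
import numpy as np
from sklearn.linear_model import SGDClassifier

# Stand-in estimator: SGDClassifier already implements partial_fit;
# an incremental tree learner would presumably expose the same interface.
clf = SGDClassifier(random_state=0)
all_classes = np.array([0, 1])

def iter_batches(n_batches=10, batch_size=1000, n_features=20):
    # Simulated data source; in practice each batch would be read from disk.
    rng = np.random.RandomState(0)
    for _ in range(n_batches):
        X = rng.randn(batch_size, n_features)
        y = (X[:, 0] > 0).astype(int)
        yield X, y

for X_batch, y_batch in iter_batches():
    # The full set of classes has to be declared on the first call.
    clf.partial_fit(X_batch, y_batch, classes=all_classes)
```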
Regards Brian On 5 March 2017 at 17:47, Andreas Mueller wrote: > Hey all. > In case you're interested, here is a summary view of the scikit-learn > survey I posted recently: > https://www.surveymonkey.com/results/SM-RHGZVZ73/ > > tldr; > Preprocessing takes the most time, people want out-of-core learning, > better integration with pandas > and easier visualization of models and data. > People would use automatic machine learning if it was there, but it's not > the highest priority item. > > There is also a lot of interesting info in the comments, but because I was > not able to go through all of them yet, > I don't want to publish them publicly in case there is sensitive > information included (and if anyone knows if there are > legal implications if there wasn't a disclaimer, please let me know). > > Cheers, > Andy > _______________________________________________ > scikit-learn mailing list > scikit-learn at python.org > https://mail.python.org/mailman/listinfo/scikit-learn > -------------- next part -------------- An HTML attachment was scrubbed... URL: From t3kcit at gmail.com Mon Mar 6 10:37:01 2017 From: t3kcit at gmail.com (Andreas Mueller) Date: Mon, 6 Mar 2017 10:37:01 -0500 Subject: [scikit-learn] Scikit-learn survey results In-Reply-To: References: <23963584-ae33-db28-90d8-6e1479e3f862@gmail.com> Message-ID: <711efdaa-388e-681a-a09c-7ecd9a2acb5b@gmail.com> Hi Brian. How about mondrian forests? ;) And I think Gilles has thought about parallelizing trees a bit. It's definitely something that people are interested in. Andy On 03/06/2017 06:46 AM, Brian Holt wrote: > Thanks Andy, > > That's really interesting and gives some hints for future direction. > As an initial suggestion, I wonder if incremental decision tree > learning would be welcomed by the project? My personal experience > building trees was very often frustrated by memory constraints and an > alternative that uses batches would allow the technique to scale up to > much larger datasets that don't fit in memory. > > Regards > Brian > > On 5 March 2017 at 17:47, Andreas Mueller > wrote: > > Hey all. > In case you're interested, here is a summary view of the > scikit-learn survey I posted recently: > https://www.surveymonkey.com/results/SM-RHGZVZ73/ > > > tldr; > Preprocessing takes the most time, people want out-of-core > learning, better integration with pandas > and easier visualization of models and data. > People would use automatic machine learning if it was there, but > it's not the highest priority item. > > There is also a lot of interesting info in the comments, but > because I was not able to go through all of them yet, > I don't want to publish them publicly in case there is sensitive > information included (and if anyone knows if there are > legal implications if there wasn't a disclaimer, please let me know). > > Cheers, > Andy > _______________________________________________ > scikit-learn mailing list > scikit-learn at python.org > https://mail.python.org/mailman/listinfo/scikit-learn > > > > > > _______________________________________________ > scikit-learn mailing list > scikit-learn at python.org > https://mail.python.org/mailman/listinfo/scikit-learn -------------- next part -------------- An HTML attachment was scrubbed... 
URL: From betatim at gmail.com Mon Mar 6 10:43:30 2017 From: betatim at gmail.com (Tim Head) Date: Mon, 06 Mar 2017 15:43:30 +0000 Subject: [scikit-learn] Scikit-learn survey results In-Reply-To: <711efdaa-388e-681a-a09c-7ecd9a2acb5b@gmail.com> References: <23963584-ae33-db28-90d8-6e1479e3f862@gmail.com> <711efdaa-388e-681a-a09c-7ecd9a2acb5b@gmail.com> Message-ID: On Mon, Mar 6, 2017 at 10:37 AM Andreas Mueller wrote: > Hi Brian. > > How about mondrian forests? ;) > Talk to Manoj (CCed) about those. He recently started an implementation while exploring them for scikit-optimize. T -------------- next part -------------- An HTML attachment was scrubbed... URL: From konst.katrioplas at gmail.com Mon Mar 6 11:34:34 2017 From: konst.katrioplas at gmail.com (Konstantinos Katrioplas) Date: Mon, 6 Mar 2017 18:34:34 +0200 Subject: [scikit-learn] contribution to scikit-learn - questions Message-ID: <2f3910b8-dd14-0980-d174-daa0d663a15e@gmail.com> Hello all, My name is Konstantinos and I would like to contribute to scikit-learn. I am relatively new to open source development and I want to work on some easy bug-fixing to get used to the github workflow. Firstly, is this issue open and should I try working on it? https://github.com/scikit-learn/scikit-learn/issues/8425 If not, would you suggest another? Furthermore, when trying to build with make I get this: make: nosetests: Command not found Makefile:32: recipe for target 'test-code' failed make: *** [test-code] Error 127 Is this in any way expected and do you know what I might be missing? Finally, is there an IRC channel particularly for scikit-learn? Thanks in advance, Konstantinos From t3kcit at gmail.com Mon Mar 6 11:42:35 2017 From: t3kcit at gmail.com (Andreas Mueller) Date: Mon, 6 Mar 2017 11:42:35 -0500 Subject: [scikit-learn] contribution to scikit-learn - questions In-Reply-To: <2f3910b8-dd14-0980-d174-daa0d663a15e@gmail.com> References: <2f3910b8-dd14-0980-d174-daa0d663a15e@gmail.com> Message-ID: Hi Konstantinos. There is an IRC channel but it's not that busy any more. You could try the gitter channel at http://gitter.im/scikit-learn/scikit-learn The issue that you cited is ok, but this one might be easier to start with: https://github.com/scikit-learn/scikit-learn/issues/8194 You need to install nosetests to run it. Andy On 03/06/2017 11:34 AM, Konstantinos Katrioplas wrote: > Hello all, > > My name is Konstantinos and I would like to contribute to > scikit-learn. I am relatively new to open source development and I > want to work on some easy bug-fixing to get used to the github workflow. > > Firstly, is this issue open and should I try working on it? > https://github.com/scikit-learn/scikit-learn/issues/8425 If not, > would you suggest another? > > Furthermore, when trying to build with make I get this: > > make: nosetests: Command not found > Makefile:32: recipe for target 'test-code' failed > make: *** [test-code] Error 127 > > Is this in any way expected and do you know what I might be missing? > > Finally, is there an IRC channel particularly for scikit-learn? 
> > Thanks in advance, > Konstantinos > _______________________________________________ > scikit-learn mailing list > scikit-learn at python.org > https://mail.python.org/mailman/listinfo/scikit-learn From seralouk at gmail.com Tue Mar 7 04:48:27 2017 From: seralouk at gmail.com (Serafeim Loukas) Date: Tue, 7 Mar 2017 10:48:27 +0100 Subject: [scikit-learn] Linear Discriminant Analysis with Cross Validation in Python Message-ID: Dear scikit members, I would like to ask if there is any function that implements Linear Discriminant Analysis with Cross Validation (leave one out). Thank you in advance, S -------------- next part -------------- An HTML attachment was scrubbed... URL: From maheshak04 at gmail.com Tue Mar 7 04:56:24 2017 From: maheshak04 at gmail.com (Mahesh Kulkarni) Date: Tue, 7 Mar 2017 15:26:24 +0530 Subject: [scikit-learn] Linear Discriminant Analysis with Cross Validation in Python In-Reply-To: References: Message-ID: Yes. Please see following link: http://scikit-learn.org/stable/modules/generated/sklearn.discriminant_analysis.LinearDiscriminantAnalysis.html On Tue, Mar 7, 2017 at 3:18 PM, Serafeim Loukas wrote: > Dear scikit members, > > > I would like to ask if there is any function that implements > Linear Discriminant Analysis with Cross Validation (leave one out). > > Thank you in advance, > S > > _______________________________________________ > scikit-learn mailing list > scikit-learn at python.org > https://mail.python.org/mailman/listinfo/scikit-learn > > -------------- next part -------------- An HTML attachment was scrubbed... URL: From tomarshubham24 at gmail.com Tue Mar 7 08:24:07 2017 From: tomarshubham24 at gmail.com (Shubham Singh Tomar) Date: Tue, 7 Mar 2017 18:54:07 +0530 Subject: [scikit-learn] Error while using GridSearchCV. Message-ID: Hi, I'm trying to use GridSearchCV to tune the parameters for DecisionTreeRegressor. I'm using sklearn 0.18.1 I'm getting the following error:

---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
 in ()
      1 # Fit the training data to the model using grid search
----> 2 reg = fit_model(X_train, y_train)
      3
      4 # Produce the value for 'max_depth'
      5 print "Parameter 'max_depth' is {} for the optimal model.".format(reg.get_params()['max_depth'])

 in fit_model(X, y)
     11
     12 # Create cross-validation sets from the training data
---> 13 cv_sets = ShuffleSplit(X.shape[0], n_splits = 10, test_size = 0.20, random_state = 0)
     14
     15 # TODO: Create a decision tree regressor object

TypeError: __init__() got multiple values for keyword argument 'n_splits'

-- *Thanks,* *Shubham Singh Tomar* *Autodidact24.github.io * -------------- next part -------------- An HTML attachment was scrubbed... URL: From rth.yurchak at gmail.com Tue Mar 7 08:43:00 2017 From: rth.yurchak at gmail.com (Roman Yurchak) Date: Tue, 7 Mar 2017 14:43:00 +0100 Subject: [scikit-learn] Error while using GridSearchCV. In-Reply-To: References: Message-ID: <58BEB8E4.9060909@gmail.com> Shubham, the definition of ShuffleSplit.__init__ is ShuffleSplit(n_splits=10, test_size=0.1, train_size=None, random_state=None). You are passing the n_splits parameter twice (once by name and once as the first positional argument), as the exception you are getting says. -- Roman On 07/03/17 14:24, Shubham Singh Tomar wrote: > Hi, > > I'm trying to use GridSearchCV to tune the parameters for > DecisionTreeRegressor.
I'm using sklearn 0.18.1 > > I'm getting the following error: > > --------------------------------------------------------------------------- > TypeError Traceback (most recent call last) > in () 1 # Fit the training data to the model using grid search----> > 2reg = fit_model(X_train, y_train)3 4 # Produce the value for > 'max_depth'5 print "Parameter 'max_depth' is {} for the optimal > model.".format(reg.get_params()['max_depth']) > in fit_model(X, y) 11 12 # Create cross-validation sets from the > training data---> 13cv_sets = ShuffleSplit(X.shape[0], n_splits = 10, > test_size = 0.20, random_state = 0)14 15 # TODO: Create a decision tree > regressor objectTypeError: __init__() got multiple values for keyword > argument 'n_splits' > > > > > -- > *Thanks,* > *Shubham Singh Tomar* > *Autodidact24.github.io * > > > _______________________________________________ > scikit-learn mailing list > scikit-learn at python.org > https://mail.python.org/mailman/listinfo/scikit-learn > From seralouk at gmail.com Tue Mar 7 10:01:09 2017 From: seralouk at gmail.com (Serafeim Loukas) Date: Tue, 7 Mar 2017 16:01:09 +0100 Subject: [scikit-learn] Linear Discriminant Analysis with Cross Validation in Python In-Reply-To: References: Message-ID: Dear Mahesh, Thank you for your response. I read the documentation however I did not find anything related to cross-validation (leave one out). Can you give me a hint? Thank you, S ............................................. Loukas Serafeim University of Geneva email: seralouk at gmail.com 2017-03-07 10:56 GMT+01:00 Mahesh Kulkarni : > Yes. Please see following link: > > http://scikit-learn.org/stable/modules/generated/ > sklearn.discriminant_analysis.LinearDiscriminantAnalysis.html > > On Tue, Mar 7, 2017 at 3:18 PM, Serafeim Loukas > wrote: > >> Dear scikit members, >> >> >> I would like to ask if there is any function that implements >> Linear Discriminant Analysis with Cross Validation (leave one out). >> >> Thank you in advance, >> S >> >> _______________________________________________ >> scikit-learn mailing list >> scikit-learn at python.org >> https://mail.python.org/mailman/listinfo/scikit-learn >> >> > > _______________________________________________ > scikit-learn mailing list > scikit-learn at python.org > https://mail.python.org/mailman/listinfo/scikit-learn > > -------------- next part -------------- An HTML attachment was scrubbed... URL: From se.raschka at gmail.com Tue Mar 7 11:56:55 2017 From: se.raschka at gmail.com (Sebastian Raschka) Date: Tue, 7 Mar 2017 11:56:55 -0500 Subject: [scikit-learn] Linear Discriminant Analysis with Cross Validation in Python In-Reply-To: References: Message-ID: Hi, Loukas and Mahesh, for LOOCV, you could e.g., use the LeaveOneOut class ``` from sklearn.discriminant_analysis import LinearDiscriminantAnalysis from sklearn.model_selection import LeaveOneOut loo = LeaveOneOut() lda = LinearDiscriminantAnalysis() test_fold_predictions = [] for train_index, test_index in loo.split(X): X_train, X_test = X[train_index], X[test_index] y_train, y_test = y[train_index], y[test_index] lda.fit(X_train, y_train) test_fold_predictions.append(lda.predict(X_test)) ``` or you could pass the loo to the cross_val_score function directly: ``` from sklearn.model_selection import cross_val_score cross_val_score(estimator=lda, X=X, y=y, cv=loo) ``` Best, Sebastian > On Mar 7, 2017, at 10:01 AM, Serafeim Loukas wrote: > > Dear Mahesh, > > Thank you for your response. 
> > I read the documentation however I did not find anything related to cross-validation (leave one out). > Can you give me a hint? > > Thank you, > S > > ............................................. > Loukas Serafeim > University of Geneva > email: seralouk at gmail.com > > > 2017-03-07 10:56 GMT+01:00 Mahesh Kulkarni : > Yes. Please see following link: > > http://scikit-learn.org/stable/modules/generated/sklearn.discriminant_analysis.LinearDiscriminantAnalysis.html > > On Tue, Mar 7, 2017 at 3:18 PM, Serafeim Loukas wrote: > Dear scikit members, > > > I would like to ask if there is any function that implements Linear Discriminant Analysis with Cross Validation (leave one out). > > Thank you in advance, > S > > _______________________________________________ > scikit-learn mailing list > scikit-learn at python.org > https://mail.python.org/mailman/listinfo/scikit-learn > > > > _______________________________________________ > scikit-learn mailing list > scikit-learn at python.org > https://mail.python.org/mailman/listinfo/scikit-learn > > > _______________________________________________ > scikit-learn mailing list > scikit-learn at python.org > https://mail.python.org/mailman/listinfo/scikit-learn From fernando.wittmann at gmail.com Tue Mar 7 16:02:26 2017 From: fernando.wittmann at gmail.com (Fernando Marcos Wittmann) Date: Tue, 7 Mar 2017 18:02:26 -0300 Subject: [scikit-learn] Error while using GridSearchCV. In-Reply-To: References: Message-ID: Hey Shubham, I am a project reviewer at Udacity. This code seems to be part of one of our projects (P1 - Boston Housing ). I think that you have updated the old module sklearn.cross_validation to the module sklearn.model_detection, is that correct? If yes, then you should also update the parameters in ShuffleSplit to match with this new version (check the docs ). Try to update ShuffleSplit to the following line of code: cv_sets = ShuffleSplit(n_splits=10, test_size=0.2, random_state=0) I hope that helps! Feel free to send me a PM. On Tue, Mar 7, 2017 at 10:24 AM, Shubham Singh Tomar < tomarshubham24 at gmail.com> wrote: > Hi, > > I'm trying to use GridSearchCV to tune the parameters for > DecisionTreeRegressor. I'm using sklearn 0.18.1 > > I'm getting the following error: > > ---------------------------------------------------------------------------TypeError Traceback (most recent call last) in () 1 # Fit the training data to the model using grid search----> 2 reg = fit_model(X_train, y_train) 3 4 # Produce the value for 'max_depth' 5 print "Parameter 'max_depth' is {} for the optimal model.".format(reg.get_params()['max_depth']) > in fit_model(X, y) 11 12 # Create cross-validation sets from the training data---> 13 cv_sets = ShuffleSplit(X.shape[0], n_splits = 10, test_size = 0.20, random_state = 0) 14 15 # TODO: Create a decision tree regressor object > TypeError: __init__() got multiple values for keyword argument 'n_splits' > > > > > -- > *Thanks,* > *Shubham Singh Tomar* > *Autodidact24.github.io * > > _______________________________________________ > scikit-learn mailing list > scikit-learn at python.org > https://mail.python.org/mailman/listinfo/scikit-learn > > -- Fernando Marcos Wittmann -------------- next part -------------- An HTML attachment was scrubbed... 
URL: From maheshak04 at gmail.com Tue Mar 7 20:30:54 2017 From: maheshak04 at gmail.com (Mahesh Kulkarni) Date: Wed, 8 Mar 2017 07:00:54 +0530 Subject: [scikit-learn] Linear Discriminant Analysis with Cross Validation in Python In-Reply-To: References: Message-ID: Hi Sebastian, Thank you On 7 Mar 2017 10:28 p.m., "Sebastian Raschka" wrote: > Hi, Loukas and Mahesh, > for LOOCV, you could e.g., use the LeaveOneOut class > > ``` > from sklearn.discriminant_analysis import LinearDiscriminantAnalysis > from sklearn.model_selection import LeaveOneOut > > loo = LeaveOneOut() > lda = LinearDiscriminantAnalysis() > > test_fold_predictions = [] > > for train_index, test_index in loo.split(X): > X_train, X_test = X[train_index], X[test_index] > y_train, y_test = y[train_index], y[test_index] > lda.fit(X_train, y_train) > test_fold_predictions.append(lda.predict(X_test)) > ``` > > or you could pass the loo to the cross_val_score function directly: > > ``` > from sklearn.model_selection import cross_val_score > cross_val_score(estimator=lda, X=X, y=y, cv=loo) > ``` > > > Best, > Sebastian > > > > On Mar 7, 2017, at 10:01 AM, Serafeim Loukas wrote: > > > > Dear Mahesh, > > > > Thank you for your response. > > > > I read the documentation however I did not find anything related to > cross-validation (leave one out). > > Can you give me a hint? > > > > Thank you, > > S > > > > ............................................. > > Loukas Serafeim > > University of Geneva > > email: seralouk at gmail.com > > > > > > 2017-03-07 10:56 GMT+01:00 Mahesh Kulkarni : > > Yes. Please see following link: > > > > http://scikit-learn.org/stable/modules/generated/ > sklearn.discriminant_analysis.LinearDiscriminantAnalysis.html > > > > On Tue, Mar 7, 2017 at 3:18 PM, Serafeim Loukas > wrote: > > Dear scikit members, > > > > > > I would like to ask if there is any function that implements Linear > Discriminant Analysis with Cross Validation (leave one out). > > > > Thank you in advance, > > S > > > > _______________________________________________ > > scikit-learn mailing list > > scikit-learn at python.org > > https://mail.python.org/mailman/listinfo/scikit-learn > > > > > > > > _______________________________________________ > > scikit-learn mailing list > > scikit-learn at python.org > > https://mail.python.org/mailman/listinfo/scikit-learn > > > > > > _______________________________________________ > > scikit-learn mailing list > > scikit-learn at python.org > > https://mail.python.org/mailman/listinfo/scikit-learn > > _______________________________________________ > scikit-learn mailing list > scikit-learn at python.org > https://mail.python.org/mailman/listinfo/scikit-learn > -------------- next part -------------- An HTML attachment was scrubbed... URL: From seralouk at gmail.com Wed Mar 8 04:16:44 2017 From: seralouk at gmail.com (Serafeim Loukas) Date: Wed, 8 Mar 2017 10:16:44 +0100 Subject: [scikit-learn] Linear Discriminant Analysis with Cross Validation in Python In-Reply-To: References: Message-ID: Dear Sebastian, Thank you for your response. Best, S ............................................. 
Loukas Serafeim University of Geneva email: seralouk at gmail.com 2017-03-07 17:56 GMT+01:00 Sebastian Raschka : > Hi, Loukas and Mahesh, > for LOOCV, you could e.g., use the LeaveOneOut class > > ``` > from sklearn.discriminant_analysis import LinearDiscriminantAnalysis > from sklearn.model_selection import LeaveOneOut > > loo = LeaveOneOut() > lda = LinearDiscriminantAnalysis() > > test_fold_predictions = [] > > for train_index, test_index in loo.split(X): > X_train, X_test = X[train_index], X[test_index] > y_train, y_test = y[train_index], y[test_index] > lda.fit(X_train, y_train) > test_fold_predictions.append(lda.predict(X_test)) > ``` > > or you could pass the loo to the cross_val_score function directly: > > ``` > from sklearn.model_selection import cross_val_score > cross_val_score(estimator=lda, X=X, y=y, cv=loo) > ``` > > > Best, > Sebastian > > > > On Mar 7, 2017, at 10:01 AM, Serafeim Loukas wrote: > > > > Dear Mahesh, > > > > Thank you for your response. > > > > I read the documentation however I did not find anything related to > cross-validation (leave one out). > > Can you give me a hint? > > > > Thank you, > > S > > > > ............................................. > > Loukas Serafeim > > University of Geneva > > email: seralouk at gmail.com > > > > > > 2017-03-07 10:56 GMT+01:00 Mahesh Kulkarni : > > Yes. Please see following link: > > > > http://scikit-learn.org/stable/modules/generated/ > sklearn.discriminant_analysis.LinearDiscriminantAnalysis.html > > > > On Tue, Mar 7, 2017 at 3:18 PM, Serafeim Loukas > wrote: > > Dear scikit members, > > > > > > I would like to ask if there is any function that implements Linear > Discriminant Analysis with Cross Validation (leave one out). > > > > Thank you in advance, > > S > > > > _______________________________________________ > > scikit-learn mailing list > > scikit-learn at python.org > > https://mail.python.org/mailman/listinfo/scikit-learn > > > > > > > > _______________________________________________ > > scikit-learn mailing list > > scikit-learn at python.org > > https://mail.python.org/mailman/listinfo/scikit-learn > > > > > > _______________________________________________ > > scikit-learn mailing list > > scikit-learn at python.org > > https://mail.python.org/mailman/listinfo/scikit-learn > > _______________________________________________ > scikit-learn mailing list > scikit-learn at python.org > https://mail.python.org/mailman/listinfo/scikit-learn > -------------- next part -------------- An HTML attachment was scrubbed... URL: From msuzen at gmail.com Wed Mar 8 16:50:53 2017 From: msuzen at gmail.com (Suzen, Mehmet) Date: Wed, 8 Mar 2017 22:50:53 +0100 Subject: [scikit-learn] MAPE in scikit-learn? In-Reply-To: References: Message-ID: Hi Raghav; I suggest forecast package's code if you can read R [*]. A collection of measures related to time-series forecasting would be nice [*] Best, Mehmet [*] https://cran.r-project.org/web/packages/forecast/forecast.pdf From jlopez at ende.cc Sat Mar 11 08:04:57 2017 From: jlopez at ende.cc (=?utf-8?Q?Javier_L=C3=B3pez_Pe=C3=B1a?=) Date: Sat, 11 Mar 2017 13:04:57 +0000 Subject: [scikit-learn] Label encoding for classifiers and soft targets Message-ID: <542B0BDD-F329-4F26-9001-9F535426306C@ende.cc> Hi there! I have been recently experimenting with model regularization through the use of soft targets, and I?d like to be able to play with that from sklearn. 
The main idea is as follows: imagine I want to fit a (probabilistic) classifier with three possible targets, 0, 1, 2. If I pass my training set (X, y) to a sklearn classifier, the target vector y gets encoded so that each target becomes an array, [1, 0, 0], [0, 1, 0], or [0, 0, 1]. What I would like to do is to be able to pass the targets directly in the encoded form, and avoid any further encoding. This allows for instance to pass targets as [0.9, 0.5, 0.5] if I want to prevent my classifier from becoming too opinionated on its predicted probabilities. Ideally I would like to do something like this: ``` clf = SomeClassifier(*parameters, encode_targets=False) ``` and then call ``` clf.fit(X, encoded_y) ``` Would it be simple to modify sklearn code to do this, or would it require a lot of tinkering such as modifying every single classifier under the sun? Cheers, J From konst.katrioplas at gmail.com Sat Mar 11 08:29:30 2017 From: konst.katrioplas at gmail.com (Konstantinos Katrioplas) Date: Sat, 11 Mar 2017 15:29:30 +0200 Subject: [scikit-learn] issue suggestion - decision trees - GSoC Message-ID: <33a3a5bf-37dd-1cad-c4ae-ef4b62294a8c@gmail.com> Hello all, While I am waiting for the PR that I have submitted to be evaluated (https://github.com/scikit-learn/scikit-learn/pull/8563), would you suggest another (easy) issue for me to work on? Ideally something for which I will write some substantial code, so as to present it in my application for GSoC? Is anyone interested to mentor me in the parallelization of decision trees? I admit I am not yet really familiar with the current tree code (although I have been using the method for regression on a research project) but I am very much intrigued by the idea and willing to learn all about it until the summer. Regards, Konstantinos From gborad at gmail.com Sun Mar 12 01:38:20 2017 From: gborad at gmail.com (Gautam Borad) Date: Sun, 12 Mar 2017 12:08:20 +0530 Subject: [scikit-learn] scikit-learn Digest, Vol 12, Issue 18 In-Reply-To: References: Message-ID: On 11 Mar 2017 22:32, wrote: > Send scikit-learn mailing list submissions to > scikit-learn at python.org > > To subscribe or unsubscribe via the World Wide Web, visit > https://mail.python.org/mailman/listinfo/scikit-learn > or, via email, send a message with subject or body 'help' to > scikit-learn-request at python.org > > You can reach the person managing the list at > scikit-learn-owner at python.org > > When replying, please edit your Subject line so it is more specific > than "Re: Contents of scikit-learn digest..." > > > Today's Topics: > > 1. Label encoding for classifiers and soft targets > (Javier López Peña) > 2. issue suggestion - decision trees - GSoC (Konstantinos Katrioplas) > > > ---------------------------------------------------------------------- > > Message: 1 > Date: Sat, 11 Mar 2017 13:04:57 +0000 > From: Javier López Peña > To: scikit-learn at python.org > Subject: [scikit-learn] Label encoding for classifiers and soft > targets > Message-ID: <542B0BDD-F329-4F26-9001-9F535426306C at ende.cc> > Content-Type: text/plain; charset=utf-8 > > Hi there! > > I have been recently experimenting with model regularization through the > use of soft targets, > and I'd like to be able to play with that from sklearn.
> > The main idea is as follows: imagine I want to fit a (probabilisitic) > classifier with three possible > targets, 0, 1, 2 > > If I pass my training set (X, y) to a sklearn classifier, the target > vector y gets encoded so that > each target becomes an array, [1, 0, 0], [0, 1, 0], or [0, 0, 1] > > What I would like to do is to be able to pass the targets directly in the > encoded form, and avoid > any further encoding. This allows for instance to pass targets as [0.9, > 0.5, 0.5] if I want to prevent > my classifier from becoming too opinionated on its predicted probabilities. > > Ideally I would like to do something like this: > ``` > clf = SomeClassifier(*parameters, encode_targets=False) > ``` > > and then call > ``` > elf.fit(X, encoded_y) > ``` > > Would it be simple to modify sklearn code to do this, or would it require > a lot of tinkering > such as modifying every single classifier under the sun? > > Cheers, > J > > ------------------------------ > > Message: 2 > Date: Sat, 11 Mar 2017 15:29:30 +0200 > From: Konstantinos Katrioplas > To: scikit-learn at python.org > Subject: [scikit-learn] issue suggestion - decision trees - GSoC > Message-ID: <33a3a5bf-37dd-1cad-c4ae-ef4b62294a8c at gmail.com> > Content-Type: text/plain; charset=utf-8; format=flowed > > Hello all, > > While I am waiting for the PR that I have submitted to be evaluated > (https://github.com/scikit-learn/scikit-learn/pull/8563), would you > suggest another (easy) issue for me to work on? Ideally something for > which I will write some substantial code, so as to present it in my > application for GSoC? > > Is anyone interested to mentor me in the parallelization of decision > trees? I admit I am not yet really familiar with the current tree code > (although I have been using the method for regression on a research > project) but I am very much intrigued by the idea and willing to learn > all about it until the summer. > > Regards, > Konstantinos > > > ------------------------------ > > Subject: Digest Footer > > _______________________________________________ > scikit-learn mailing list > scikit-learn at python.org > https://mail.python.org/mailman/listinfo/scikit-learn > > > ------------------------------ > > End of scikit-learn Digest, Vol 12, Issue 18 > ******************************************** > -------------- next part -------------- An HTML attachment was scrubbed... URL: From tomarshubham24 at gmail.com Sun Mar 12 09:59:35 2017 From: tomarshubham24 at gmail.com (Shubham Singh Tomar) Date: Sun, 12 Mar 2017 19:29:35 +0530 Subject: [scikit-learn] Error while using GridSearchCV. In-Reply-To: References: Message-ID: Hi, guys! Thanks for the responses. @Fernando: Yes, this code is, in fact, part of Udacity's Boston Housing project. I'm currently working on my MLE Nanodegree. I was able to modify the code to go with *sklearn.model_selection*, as you suggested. And, it's great to see you help Udacity students here as well :) Do you think we should update the code and project description in main Udacity repository to support the newer sklearn versions? On Wed, Mar 8, 2017 at 2:32 AM, Fernando Marcos Wittmann < fernando.wittmann at gmail.com> wrote: > Hey Shubham, > > I am a project reviewer at Udacity. This code seems to be part of one of > our projects (P1 - Boston Housing > ). > I think that you have updated the old module sklearn.cross_validation to > the module sklearn.model_detection, is that correct? 
If yes, then you > should also update the parameters in ShuffleSplit to match with this new > version (check the docs > ). > Try to update ShuffleSplit to the following line of code: > > cv_sets = ShuffleSplit(n_splits=10, test_size=0.2, random_state=0) > > I hope that helps! Feel free to send me a PM. > > > On Tue, Mar 7, 2017 at 10:24 AM, Shubham Singh Tomar < > tomarshubham24 at gmail.com> wrote: > >> Hi, >> >> I'm trying to use GridSearchCV to tune the parameters for >> DecisionTreeRegressor. I'm using sklearn 0.18.1 >> >> I'm getting the following error: >> >> ---------------------------------------------------------------------------TypeError Traceback (most recent call last) in () 1 # Fit the training data to the model using grid search----> 2 reg = fit_model(X_train, y_train) 3 4 # Produce the value for 'max_depth' 5 print "Parameter 'max_depth' is {} for the optimal model.".format(reg.get_params()['max_depth']) >> in fit_model(X, y) 11 12 # Create cross-validation sets from the training data---> 13 cv_sets = ShuffleSplit(X.shape[0], n_splits = 10, test_size = 0.20, random_state = 0) 14 15 # TODO: Create a decision tree regressor object >> TypeError: __init__() got multiple values for keyword argument 'n_splits' >> >> >> >> >> -- >> *Thanks,* >> *Shubham Singh Tomar* >> *Autodidact24.github.io * >> >> _______________________________________________ >> scikit-learn mailing list >> scikit-learn at python.org >> https://mail.python.org/mailman/listinfo/scikit-learn >> >> > > > -- > > Fernando Marcos Wittmann > > > _______________________________________________ > scikit-learn mailing list > scikit-learn at python.org > https://mail.python.org/mailman/listinfo/scikit-learn > > -- *Thanks,* *Shubham Singh Tomar* *Autodidact24.github.io * -------------- next part -------------- An HTML attachment was scrubbed... URL: From gael.varoquaux at normalesup.org Sun Mar 12 14:38:44 2017 From: gael.varoquaux at normalesup.org (Gael Varoquaux) Date: Sun, 12 Mar 2017 19:38:44 +0100 Subject: [scikit-learn] Label encoding for classifiers and soft targets In-Reply-To: <542B0BDD-F329-4F26-9001-9F535426306C@ende.cc> References: <542B0BDD-F329-4F26-9001-9F535426306C@ende.cc> Message-ID: <20170312183844.GD694569@phare.normalesup.org> > Would it be simple to modify sklearn code to do this, or would it require a lot of tinkering > such as modifying every single classifier under the sun? You can use sample weights to go a bit in this direction. But in general, the mathematical meaning of your intuitions will depend on the classifier, so they will not be general ways of implementing them without a lot of tinkering. From jlopez at ende.cc Sun Mar 12 15:11:02 2017 From: jlopez at ende.cc (=?utf-8?Q?Javier_L=C3=B3pez_Pe=C3=B1a?=) Date: Sun, 12 Mar 2017 19:11:02 +0000 Subject: [scikit-learn] Label encoding for classifiers and soft targets In-Reply-To: <20170312183844.GD694569@phare.normalesup.org> References: <542B0BDD-F329-4F26-9001-9F535426306C@ende.cc> <20170312183844.GD694569@phare.normalesup.org> Message-ID: <72559155-CB35-441E-9F9D-6FD679033E17@ende.cc> > On 12 Mar 2017, at 18:38, Gael Varoquaux wrote: > > You can use sample weights to go a bit in this direction. But in general, > the mathematical meaning of your intuitions will depend on the > classifier, so they will not be general ways of implementing them without > a lot of tinkering. I see? 
to be honest for my purposes it would be enough to bypass the target binarization for the MLP classifier, so maybe I will just fork my own copy of that class for this. The purpose is two-fold, on the one hand use the probabilities generated by a very complex model (e.g. a massive ensemble) to train a simpler one that achieves comparable performance at a fraction of the cost. Any universal classifier will do (neural networks are the prime example). The second purpose it to use classes probabilities instead of observed classes at training time. In some problems this helps with model regularization (see section 6 of [1]) Cheers, J [1] https://arxiv.org/pdf/1503.02531v1.pdf -------------- next part -------------- An HTML attachment was scrubbed... URL: From fastier at linkedin.com Sun Mar 12 22:07:22 2017 From: fastier at linkedin.com (Frank Astier) Date: Sun, 12 Mar 2017 19:07:22 -0700 Subject: [scikit-learn] Differences between scikit-learn and Spark.ml for regression toy problem Message-ID: (this was also posted to stackoverflow on 03/10) I am setting up a very simple logistic regression problem in scikit-learn and in spark.ml, and the results diverge: the models they learn are different, but I can't figure out why (data is the same, model type is the same, regularization is the same...). No doubt I am missing some setting on one side or the other. Which setting? How should I set up either scikit or spark.ml to find the same model as its counterpart? I give the sklearn code and spark.ml code below. Both should be ready to cut-and-paste and run. scikit-learn code: ---------------------- import numpy as np from sklearn.linear_model import LogisticRegression, Ridge X = np.array([ [-0.7306653538519616, 0.0], [0.6750417712898752, -0.4232874171873786], [0.1863463229359709, -0.8163423997075965], [-0.6719842051493347, 0.0], [0.9699938346531928, 0.0], [0.22759406190283604, 0.0], [0.9688721028330911, 0.0], [0.5993795346650845, 0.0], [0.9219423508390701, -0.8972778242305388], [0.7006904841584055, -0.5607635619919824] ]) y = np.array([ 0.0, 1.0, 1.0, 0.0, 1.0, 1.0, 1.0, 0.0, 0.0, 0.0 ]) m, n = X.shape # Add intercept term to simulate inputs to GameEstimator X_with_intercept = np.hstack((X, np.ones(m)[:,np.newaxis])) l = 0.3 e = LogisticRegression( fit_intercept=False, penalty='l2', C=1/l, max_iter=100, tol=1e-11) e.fit(X_with_intercept, y) print e.coef_ # => [[ 0.98662189 0.45571052 -0.23467255]] # Linear regression is called Ridge in sklearn e = Ridge( fit_intercept=False, alpha=l, max_iter=100, tol=1e-11) e.fit(X_with_intercept, y) print e.coef_ # =>[ 0.32155545 0.17904355 0.41222418] spark.ml code: ------------------- import org.apache.spark.{SparkConf, SparkContext} import org.apache.spark.ml.classification.LogisticRegression import org.apache.spark.ml.regression.LinearRegression import org.apache.spark.mllib.linalg.Vectors import org.apache.spark.mllib.regression.LabeledPoint import org.apache.spark.sql.SQLContext object TestSparkRegression { def main(args: Array[String]): Unit = { import org.apache.log4j.{Level, Logger} Logger.getLogger("org").setLevel(Level.OFF) Logger.getLogger("akka").setLevel(Level.OFF) val conf = new SparkConf().setAppName("test").setMaster("local") val sc = new SparkContext(conf) val sparkTrainingData = new SQLContext(sc) .createDataFrame(Seq( LabeledPoint(0.0, Vectors.dense(-0.7306653538519616, 0.0)), LabeledPoint(1.0, Vectors.dense(0.6750417712898752, -0.4232874171873786)), LabeledPoint(1.0, Vectors.dense(0.1863463229359709, -0.8163423997075965)), LabeledPoint(0.0, 
Vectors.dense(-0.6719842051493347, 0.0)), LabeledPoint(1.0, Vectors.dense(0.9699938346531928, 0.0)), LabeledPoint(1.0, Vectors.dense(0.22759406190283604, 0.0)), LabeledPoint(1.0, Vectors.dense(0.9688721028330911, 0.0)), LabeledPoint(0.0, Vectors.dense(0.5993795346650845, 0.0)), LabeledPoint(0.0, Vectors.dense(0.9219423508390701, -0.8972778242305388)), LabeledPoint(0.0, Vectors.dense(0.7006904841584055, -0.5607635619919824)))) .toDF("label", "features") val logisticModel = new LogisticRegression() .setRegParam(0.3) .setLabelCol("label") .setFeaturesCol("features") .fit(sparkTrainingData) println(s"Spark logistic model coefficients: ${logisticModel.coefficients} Intercept: ${logisticModel.intercept}") // Spark logistic model coefficients: [0.5451588538376263,0.26740606573584713] Intercept: -0.13897955358689987 val linearModel = new LinearRegression() .setRegParam(0.3) .setLabelCol("label") .setFeaturesCol("features") .setSolver("l-bfgs") .fit(sparkTrainingData) println(s"Spark linear model coefficients: ${linearModel.coefficients} Intercept: ${linearModel.intercept}") // Spark linear model coefficients: [0.19852664861346023,0.11501200541407802] Intercept: 0.45464906876832323 sc.stop() } } Thanks, Frank -------------- next part -------------- An HTML attachment was scrubbed... URL: From g.louppe at gmail.com Mon Mar 13 03:43:29 2017 From: g.louppe at gmail.com (Gilles Louppe) Date: Mon, 13 Mar 2017 08:43:29 +0100 Subject: [scikit-learn] Label encoding for classifiers and soft targets In-Reply-To: <72559155-CB35-441E-9F9D-6FD679033E17@ende.cc> References: <542B0BDD-F329-4F26-9001-9F535426306C@ende.cc> <20170312183844.GD694569@phare.normalesup.org> <72559155-CB35-441E-9F9D-6FD679033E17@ende.cc> Message-ID: Hi Javier, In the particular case of tree-based models, you case use the soft labels to create a multi-output regression problem, which would yield an equivalent classifier (one can show that reduction of variance and the gini index would yield the same trees). So basically, reg = RandomForestRegressor() reg.fit(X, encoded_y) should work. Gilles On 12 March 2017 at 20:11, Javier L?pez Pe?a wrote: > > On 12 Mar 2017, at 18:38, Gael Varoquaux > wrote: > > You can use sample weights to go a bit in this direction. But in general, > the mathematical meaning of your intuitions will depend on the > classifier, so they will not be general ways of implementing them without > a lot of tinkering. > > > I see? to be honest for my purposes it would be enough to bypass the target > binarization for > the MLP classifier, so maybe I will just fork my own copy of that class for > this. > > The purpose is two-fold, on the one hand use the probabilities generated by > a very complex > model (e.g. a massive ensemble) to train a simpler one that achieves > comparable performance at a > fraction of the cost. Any universal classifier will do (neural networks are > the prime example). > > The second purpose it to use classes probabilities instead of observed > classes at training time. 
> In some problems this helps with model regularization (see section 6 of > [1]) > > Cheers, > J > > [1] https://arxiv.org/pdf/1503.02531v1.pdf > > > > _______________________________________________ > scikit-learn mailing list > scikit-learn at python.org > https://mail.python.org/mailman/listinfo/scikit-learn > From jlopez at ende.cc Mon Mar 13 08:35:22 2017 From: jlopez at ende.cc (=?utf-8?Q?Javier_L=C3=B3pez_Pe=C3=B1a?=) Date: Mon, 13 Mar 2017 12:35:22 +0000 Subject: [scikit-learn] Label encoding for classifiers and soft targets In-Reply-To: References: <542B0BDD-F329-4F26-9001-9F535426306C@ende.cc> <20170312183844.GD694569@phare.normalesup.org> <72559155-CB35-441E-9F9D-6FD679033E17@ende.cc> Message-ID: Hi Giles, thanks for the suggestion! Training a regression tree would require sticking some kind of probability normaliser at the end to ensure proper probabilities, this might somehow hurt sharpness or calibration. Unfortunately, one of the things I am trying to do with this is moving away from RF and they humongous memory requirements? Anyway, I think I have a fairly good idea on how to modify the MLPClassifier to get what I need. When I get around to do it I?ll drop a line to see if there might be any interest on pushing the code upstream. Cheers, J > On 13 Mar 2017, at 07:43, Gilles Louppe wrote: > > Hi Javier, > > In the particular case of tree-based models, you case use the soft > labels to create a multi-output regression problem, which would yield > an equivalent classifier (one can show that reduction of variance and > the gini index would yield the same trees). > > So basically, > > reg = RandomForestRegressor() > reg.fit(X, encoded_y) > > should work. > > Gilles From stuart at stuartreynolds.net Mon Mar 13 12:57:56 2017 From: stuart at stuartreynolds.net (Stuart Reynolds) Date: Mon, 13 Mar 2017 09:57:56 -0700 Subject: [scikit-learn] Logistic regression with elastic net regularization Message-ID: Is there an implementation of logistic regression with elastic net regularization in scikit? (or pointers on implementing this - its seems non-convex and so you might expect poor behavior with some optimizers) - Stuart -------------- next part -------------- An HTML attachment was scrubbed... URL: From jmschreiber91 at gmail.com Mon Mar 13 13:04:28 2017 From: jmschreiber91 at gmail.com (Jacob Schreiber) Date: Mon, 13 Mar 2017 10:04:28 -0700 Subject: [scikit-learn] Logistic regression with elastic net regularization In-Reply-To: References: Message-ID: Hi Stuart Take a look at this issue: https://github.com/scikit-learn/scikit-learn/issues/2968 On Mon, Mar 13, 2017 at 9:57 AM, Stuart Reynolds wrote: > Is there an implementation of logistic regression with elastic net > regularization in scikit? > (or pointers on implementing this - its seems non-convex and so you might > expect poor behavior with some optimizers) > > > - Stuart > > _______________________________________________ > scikit-learn mailing list > scikit-learn at python.org > https://mail.python.org/mailman/listinfo/scikit-learn > > -------------- next part -------------- An HTML attachment was scrubbed... URL: From stuart at stuartreynolds.net Mon Mar 13 13:06:08 2017 From: stuart at stuartreynolds.net (Stuart Reynolds) Date: Mon, 13 Mar 2017 10:06:08 -0700 Subject: [scikit-learn] Differences between scikit-learn and Spark.ml for regression toy problem In-Reply-To: References: Message-ID: Both libraries are heavily parameterized. You should check what the defaults are for both. 
Some ideas: - What regularization is being used. L1/L2? - Does the regularization parameter have the same interpretation? 1/C = lambda? Some libraries use C. Some use lambda. - Also, some libraries regularize the intercept (scikit), other do not. (It doesn't seem like a particularly good idea to regularize the intercept if your optimizer permits not doing it). On Sun, Mar 12, 2017 at 7:07 PM, Frank Astier via scikit-learn < scikit-learn at python.org> wrote: > (this was also posted to stackoverflow on 03/10) > > I am setting up a very simple logistic regression problem in scikit-learn > and in spark.ml, and the results diverge: the models they learn are > different, but I can't figure out why (data is the same, model type is the > same, regularization is the same...). > > No doubt I am missing some setting on one side or the other. Which > setting? How should I set up either scikit or spark.ml to find the same > model as its counterpart? > > I give the sklearn code and spark.ml code below. Both should be ready to > cut-and-paste and run. > > scikit-learn code: > ---------------------- > > import numpy as np > from sklearn.linear_model import LogisticRegression, Ridge > > X = np.array([ > [-0.7306653538519616, 0.0], > [0.6750417712898752, -0.4232874171873786], > [0.1863463229359709, -0.8163423997075965], > [-0.6719842051493347, 0.0], > [0.9699938346531928, 0.0], > [0.22759406190283604, 0.0], > [0.9688721028330911, 0.0], > [0.5993795346650845, 0.0], > [0.9219423508390701, -0.8972778242305388], > [0.7006904841584055, -0.5607635619919824] > ]) > > y = np.array([ > 0.0, > 1.0, > 1.0, > 0.0, > 1.0, > 1.0, > 1.0, > 0.0, > 0.0, > 0.0 > ]) > > m, n = X.shape > > # Add intercept term to simulate inputs to GameEstimator > X_with_intercept = np.hstack((X, np.ones(m)[:,np.newaxis])) > > l = 0.3 > e = LogisticRegression( > fit_intercept=False, > penalty='l2', > C=1/l, > max_iter=100, > tol=1e-11) > > e.fit(X_with_intercept, y) > > print e.coef_ > # => [[ 0.98662189 0.45571052 -0.23467255]] > > # Linear regression is called Ridge in sklearn > e = Ridge( > fit_intercept=False, > alpha=l, > max_iter=100, > tol=1e-11) > > e.fit(X_with_intercept, y) > > print e.coef_ > # =>[ 0.32155545 0.17904355 0.41222418] > > spark.ml code: > ------------------- > > import org.apache.spark.{SparkConf, SparkContext} > import org.apache.spark.ml.classification.LogisticRegression > import org.apache.spark.ml.regression.LinearRegression > import org.apache.spark.mllib.linalg.Vectors > import org.apache.spark.mllib.regression.LabeledPoint > import org.apache.spark.sql.SQLContext > > object TestSparkRegression { > def main(args: Array[String]): Unit = { > import org.apache.log4j.{Level, Logger} > > Logger.getLogger("org").setLevel(Level.OFF) > Logger.getLogger("akka").setLevel(Level.OFF) > > val conf = new SparkConf().setAppName("test").setMaster("local") > val sc = new SparkContext(conf) > > val sparkTrainingData = new SQLContext(sc) > .createDataFrame(Seq( > LabeledPoint(0.0, Vectors.dense(-0.7306653538519616, 0.0)), > LabeledPoint(1.0, Vectors.dense(0.6750417712898752, > -0.4232874171873786)), > LabeledPoint(1.0, Vectors.dense(0.1863463229359709, > -0.8163423997075965)), > LabeledPoint(0.0, Vectors.dense(-0.6719842051493347, 0.0)), > LabeledPoint(1.0, Vectors.dense(0.9699938346531928, 0.0)), > LabeledPoint(1.0, Vectors.dense(0.22759406190283604, 0.0)), > LabeledPoint(1.0, Vectors.dense(0.9688721028330911, 0.0)), > LabeledPoint(0.0, Vectors.dense(0.5993795346650845, 0.0)), > LabeledPoint(0.0, Vectors.dense(0.9219423508390701, 
> -0.8972778242305388)), > LabeledPoint(0.0, Vectors.dense(0.7006904841584055, > -0.5607635619919824)))) > .toDF("label", "features") > > val logisticModel = new LogisticRegression() > .setRegParam(0.3) > .setLabelCol("label") > .setFeaturesCol("features") > .fit(sparkTrainingData) > > println(s"Spark logistic model coefficients: > ${logisticModel.coefficients} Intercept: ${logisticModel.intercept}") > // Spark logistic model coefficients: [0.5451588538376263,0.26740606573584713] > Intercept: -0.13897955358689987 > > val linearModel = new LinearRegression() > .setRegParam(0.3) > .setLabelCol("label") > .setFeaturesCol("features") > .setSolver("l-bfgs") > .fit(sparkTrainingData) > > println(s"Spark linear model coefficients: > ${linearModel.coefficients} Intercept: ${linearModel.intercept}") > // Spark linear model coefficients: [0.19852664861346023,0.11501200541407802] > Intercept: 0.45464906876832323 > > sc.stop() > } > } > > Thanks, > > Frank > > _______________________________________________ > scikit-learn mailing list > scikit-learn at python.org > https://mail.python.org/mailman/listinfo/scikit-learn > > -------------- next part -------------- An HTML attachment was scrubbed... URL: From g.lemaitre58 at gmail.com Mon Mar 13 13:07:55 2017 From: g.lemaitre58 at gmail.com (Guillaume Lemaitre) Date: Mon, 13 Mar 2017 18:07:55 +0100 Subject: [scikit-learn] Logistic regression with elastic net regularization In-Reply-To: (Stuart Reynolds's message of "Mon, 13 Mar 2017 09:57:56 -0700") References: Message-ID: <874lyx9kb8.fsf@gmail.com> Recently, there are some issues/PRs tackling the topic: https://github.com/scikit-learn/scikit-learn/issues/8288 https://github.com/scikit-learn/scikit-learn/issues/8446 From stuart at stuartreynolds.net Mon Mar 13 13:07:57 2017 From: stuart at stuartreynolds.net (Stuart Reynolds) Date: Mon, 13 Mar 2017 10:07:57 -0700 Subject: [scikit-learn] Logistic regression with elastic net regularization In-Reply-To: References: Message-ID: Perfect. Thanks -- will give it a go. On Mon, Mar 13, 2017 at 10:04 AM, Jacob Schreiber wrote: > Hi Stuart > > Take a look at this issue: https://github.com/scikit-learn/scikit-learn/ > issues/2968 > > On Mon, Mar 13, 2017 at 9:57 AM, Stuart Reynolds < > stuart at stuartreynolds.net> wrote: > >> Is there an implementation of logistic regression with elastic net >> regularization in scikit? >> (or pointers on implementing this - its seems non-convex and so you might >> expect poor behavior with some optimizers) >> >> >> - Stuart >> >> _______________________________________________ >> scikit-learn mailing list >> scikit-learn at python.org >> https://mail.python.org/mailman/listinfo/scikit-learn >> >> > > _______________________________________________ > scikit-learn mailing list > scikit-learn at python.org > https://mail.python.org/mailman/listinfo/scikit-learn > > -------------- next part -------------- An HTML attachment was scrubbed... URL: From se.raschka at gmail.com Mon Mar 13 13:08:07 2017 From: se.raschka at gmail.com (Sebastian Raschka) Date: Mon, 13 Mar 2017 13:08:07 -0400 Subject: [scikit-learn] Logistic regression with elastic net regularization In-Reply-To: References: Message-ID: <98AA67A8-71D6-402C-8F99-5CAB64D28525@gmail.com> Hi, Stuart, I think the only way to do that right now would be through the SGD classifier, e.g., sklearn.linear_model.SGDClassifier(loss='log', penalty='elasticnet' ?) 
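For instance, a minimal sketch on toy data (the hyperparameters are illustrative, not tuned, and in practice the features should be standardized first; with loss='log' the fitted model also exposes predict_proba):

```
from sklearn.datasets import make_classification
from sklearn.linear_model import SGDClassifier

# Toy data purely for illustration
X, y = make_classification(n_samples=1000, n_features=20, random_state=0)

# Logistic loss + elastic net penalty: alpha sets the overall regularization
# strength, l1_ratio the mix between the L1 and L2 terms.
clf = SGDClassifier(loss='log', penalty='elasticnet',
                    alpha=1e-4, l1_ratio=0.15, random_state=0)
clf.fit(X, y)
print(clf.predict_proba(X[:5]))
```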
Best, Sebastian > On Mar 13, 2017, at 12:57 PM, Stuart Reynolds wrote: > > Is there an implementation of logistic regression with elastic net regularization in scikit? > (or pointers on implementing this - its seems non-convex and so you might expect poor behavior with some optimizers) > > > - Stuart > _______________________________________________ > scikit-learn mailing list > scikit-learn at python.org > https://mail.python.org/mailman/listinfo/scikit-learn From t3kcit at gmail.com Mon Mar 13 17:17:22 2017 From: t3kcit at gmail.com (Andreas Mueller) Date: Mon, 13 Mar 2017 17:17:22 -0400 Subject: [scikit-learn] Label encoding for classifiers and soft targets In-Reply-To: <72559155-CB35-441E-9F9D-6FD679033E17@ende.cc> References: <542B0BDD-F329-4F26-9001-9F535426306C@ende.cc> <20170312183844.GD694569@phare.normalesup.org> <72559155-CB35-441E-9F9D-6FD679033E17@ende.cc> Message-ID: On 03/12/2017 03:11 PM, Javier L?pez Pe?a wrote: > The purpose is two-fold, on the one hand use the probabilities > generated by a very complex > model (e.g. a massive ensemble) to train a simpler one that achieves > comparable performance at a > fraction of the cost. Any universal classifier will do (neural > networks are the prime example). You could use a regression model with a logistic sigmoid in the output layer. From t3kcit at gmail.com Mon Mar 13 17:18:33 2017 From: t3kcit at gmail.com (Andreas Mueller) Date: Mon, 13 Mar 2017 17:18:33 -0400 Subject: [scikit-learn] Label encoding for classifiers and soft targets In-Reply-To: References: <542B0BDD-F329-4F26-9001-9F535426306C@ende.cc> <20170312183844.GD694569@phare.normalesup.org> <72559155-CB35-441E-9F9D-6FD679033E17@ende.cc> Message-ID: <0520bb5d-6d1c-e14e-ed26-fcef4725d167@gmail.com> On 03/13/2017 08:35 AM, Javier L?pez Pe?a wrote: > Training a regression tree would require sticking some kind of > probability normaliser at the end to ensure proper probabilities, > this might somehow hurt sharpness or calibration. No, if all the samples are normalized and your aggregation function is sane (like the mean), the output will also be normalized. From jlopez at ende.cc Mon Mar 13 17:54:24 2017 From: jlopez at ende.cc (=?windows-1252?Q?Javier_L=F3pez_Pe=F1a?=) Date: Mon, 13 Mar 2017 21:54:24 +0000 Subject: [scikit-learn] Label encoding for classifiers and soft targets In-Reply-To: References: <542B0BDD-F329-4F26-9001-9F535426306C@ende.cc> <20170312183844.GD694569@phare.normalesup.org> <72559155-CB35-441E-9F9D-6FD679033E17@ende.cc> Message-ID: > You could use a regression model with a logistic sigmoid in the output layer. By training a regression network with logistic activation the outputs do not add to 1. I just checked on a minimal example on the iris dataset. From jlopez at ende.cc Mon Mar 13 17:56:14 2017 From: jlopez at ende.cc (=?utf-8?Q?Javier_L=C3=B3pez_Pe=C3=B1a?=) Date: Mon, 13 Mar 2017 21:56:14 +0000 Subject: [scikit-learn] Label encoding for classifiers and soft targets In-Reply-To: <0520bb5d-6d1c-e14e-ed26-fcef4725d167@gmail.com> References: <542B0BDD-F329-4F26-9001-9F535426306C@ende.cc> <20170312183844.GD694569@phare.normalesup.org> <72559155-CB35-441E-9F9D-6FD679033E17@ende.cc> <0520bb5d-6d1c-e14e-ed26-fcef4725d167@gmail.com> Message-ID: <4D74F250-C79C-4900-8670-5C420B620C2B@ende.cc> > On 13 Mar 2017, at 21:18, Andreas Mueller wrote: > > No, if all the samples are normalized and your aggregation function is sane (like the mean), the output will also be normalised. You are completely right, I hadn?t checked this for random forests. 
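A toy check of that normalization point, with iris standing in for the real problem: a single regression tree plays the simpler student model, and the forest below merely plays the role of the large teacher ensemble.

import numpy as np
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from sklearn.tree import DecisionTreeRegressor

X, y = load_iris(return_X_y=True)

teacher = RandomForestClassifier(n_estimators=100, random_state=0).fit(X, y)
soft_targets = teacher.predict_proba(X)      # shape (n_samples, n_classes), rows sum to 1

student = DecisionTreeRegressor(max_depth=3, random_state=0).fit(X, soft_targets)
pred = student.predict(X)

print(pred.sum(axis=1)[:5])   # ~1.0 for every row: leaf means of normalized rows stay normalized

A student whose per-class outputs are produced independently (an MLPRegressor, say) does not preserve this, which is where an explicit softmax or renormalization step would come in.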
Still, my purpose is to reduce model complexity, and RF require too much memory to be used in my production environment. From ssaligra at hawk.iit.edu Mon Mar 13 18:29:10 2017 From: ssaligra at hawk.iit.edu (Shreyas Saligrama chandrakan) Date: Mon, 13 Mar 2017 15:29:10 -0700 Subject: [scikit-learn] GSoc, 2017 (proposal idea and intro) .reg In-Reply-To: References: Message-ID: Hi, Is it possible for me to contribute a library to introduce SVM's with tree kernel (like current available one in svmlight) which is currently missing in scikit-learn? Best, Shreyas On 5 Mar 2017 11:03 a.m., "Andreas Mueller" wrote: > There was a PR here: > https://github.com/scikit-learn/scikit-learn/pull/5530 > > but it didn't seem to work. Feel free to convince us otherwise ;) > > > On 03/02/2017 08:23 PM, SHUBHAM BHARDWAJ 15BCE0704 wrote: > > Hello Sir, > Very Sorry for the numbers I saw this written in the comments.I assumed > -Given the person who suggested the paper might have taken a look into the > number of citations.I will make sure to personally check myself. > > Regards > Shubham Bhardwaj > > On Fri, Mar 3, 2017 at 6:40 AM, Guillaume Lema?tre > wrote: > >> I think that you mean this paper -> Scalable K-Means++ -> 218 citations >> >> On 3 March 2017 at 02:00, SHUBHAM BHARDWAJ 15BCE0704 < >> shubham.bhardwaj2015 at vit.ac.in> wrote: >> >>> Hello Sir, >>> >>> Thanks a lot for the reply. Sorry for not being elaborate about what I >>> was trying to address. I wanted to implement this [ >>> http://ilpubs.stanford.edu:8090/778/1/2006-13.pdf] (1200+citations)- >>> mentioned in comments. This pertains to the stalled issue #4357 .Proposal >>> idea - implementing a scalable kmeans++. >>> >>> Regards >>> Shubham Bhardwaj >>> >>> On Fri, Mar 3, 2017 at 12:01 AM, Jacob Schreiber < >>> jmschreiber91 at gmail.com> wrote: >>> >>>> Hi Shubham >>>> >>>> Thanks for your interest. I'm eager to see your contributions to >>>> sklearn in the future. However, I'm pretty sure kmeans++ is already >>>> implemented: http://scikit-learn.org/stable/modules/generate >>>> d/sklearn.cluster.KMeans.html >>>> >>>> Jacob >>>> >>>> On Thu, Mar 2, 2017 at 1:07 AM, SHUBHAM BHARDWAJ 15BCE0704 < >>>> shubham.bhardwaj2015 at vit.ac.in> wrote: >>>> >>>>> Hello Sir, >>>>> >>>>> My introduction : >>>>> I am a 2nd year student studying Computer Science and engineering from >>>>> VIT, Vellore. I work in Google Developers Group VIT. All my experience has >>>>> been about collaborating with a lot of people ,working as a team, building >>>>> products and learning along the way. >>>>> Since scikit-learn is participating this time I am too planning to >>>>> submit a proposal. >>>>> >>>>> Proposal idea: >>>>> I am really interested in implementing kmeans++ algorithm.I was doing >>>>> some work on DT but I found this very appealing. Just wanted to know if it >>>>> can be a good project idea. 
>>>>> >>>>> Regards >>>>> Shubham Bhardwaj >>>>> >>>>> >>>>> >>>>> _______________________________________________ >>>>> scikit-learn mailing list >>>>> scikit-learn at python.org >>>>> https://mail.python.org/mailman/listinfo/scikit-learn >>>>> >>>>> >>>> >>>> _______________________________________________ >>>> scikit-learn mailing list >>>> scikit-learn at python.org >>>> https://mail.python.org/mailman/listinfo/scikit-learn >>>> >>>> >>> >>> _______________________________________________ >>> scikit-learn mailing list >>> scikit-learn at python.org >>> https://mail.python.org/mailman/listinfo/scikit-learn >>> >>> >> >> >> -- >> Guillaume Lemaitre >> INRIA Saclay - Ile-de-France >> Equipe PARIETAL >> guillaume.lemaitre at inria.f r --- >> https://glemaitre.github.io/ >> >> _______________________________________________ >> scikit-learn mailing list >> scikit-learn at python.org >> https://mail.python.org/mailman/listinfo/scikit-learn >> >> > > > _______________________________________________ > scikit-learn mailing listscikit-learn at python.orghttps://mail.python.org/mailman/listinfo/scikit-learn > > > > _______________________________________________ > scikit-learn mailing list > scikit-learn at python.org > https://mail.python.org/mailman/listinfo/scikit-learn > > -------------- next part -------------- An HTML attachment was scrubbed... URL: From stuart at stuartreynolds.net Tue Mar 14 12:17:14 2017 From: stuart at stuartreynolds.net (Stuart Reynolds) Date: Tue, 14 Mar 2017 09:17:14 -0700 Subject: [scikit-learn] Logistic regression with elastic net regularization In-Reply-To: <98AA67A8-71D6-402C-8F99-5CAB64D28525@gmail.com> References: <98AA67A8-71D6-402C-8F99-5CAB64D28525@gmail.com> Message-ID: Many thanks. On Mon, Mar 13, 2017 at 10:08 AM, Sebastian Raschka wrote: > Hi, Stuart, > I think the only way to do that right now would be through the SGD > classifier, e.g., > > sklearn.linear_model.SGDClassifier(loss='log', penalty='elasticnet' ?) > > Best, > Sebastian > > > On Mar 13, 2017, at 12:57 PM, Stuart Reynolds > wrote: > > > > Is there an implementation of logistic regression with elastic net > regularization in scikit? > > (or pointers on implementing this - its seems non-convex and so you > might expect poor behavior with some optimizers) > > > > > > - Stuart > > _______________________________________________ > > scikit-learn mailing list > > scikit-learn at python.org > > https://mail.python.org/mailman/listinfo/scikit-learn > > _______________________________________________ > scikit-learn mailing list > scikit-learn at python.org > https://mail.python.org/mailman/listinfo/scikit-learn > -------------- next part -------------- An HTML attachment was scrubbed... URL: From olivier.grisel at ensta.org Tue Mar 14 16:39:39 2017 From: olivier.grisel at ensta.org (Olivier Grisel) Date: Tue, 14 Mar 2017 21:39:39 +0100 Subject: [scikit-learn] Logistic regression with elastic net regularization In-Reply-To: References: <98AA67A8-71D6-402C-8F99-5CAB64D28525@gmail.com> Message-ID: Note that SGD is not very good at optimizing finely with a non-smooth penalty (e.g. l1 or elasticnet). The future SAGA solver is going to be much better at finding the optimal sparsity support (although this support is not guaranteed to be stable across re-sampling of the training set if the training set is small). 
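A rough way to see this on synthetic data; the regularization strengths are arbitrary, and liblinear's L1 solver stands in for a solver that does drive coefficients exactly to zero:

import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression, SGDClassifier

X, y = make_classification(n_samples=1000, n_features=50, n_informative=5,
                           random_state=0)

sgd = SGDClassifier(loss='log', penalty='elasticnet', alpha=0.01,
                    l1_ratio=0.9, random_state=0).fit(X, y)
lib = LogisticRegression(penalty='l1', solver='liblinear', C=0.1).fit(X, y)

print('exact zeros, SGD elasticnet:', np.sum(sgd.coef_ == 0))
print('exact zeros, liblinear L1:  ', np.sum(lib.coef_ == 0))

Repeating the fit on a few resampled training sets and counting which coefficients stay at zero gives a feel for how stable the selected support is.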
-- Olivier From olivier.grisel at ensta.org Tue Mar 14 16:41:29 2017 From: olivier.grisel at ensta.org (Olivier Grisel) Date: Tue, 14 Mar 2017 21:41:29 +0100 Subject: [scikit-learn] Logistic regression with elastic net regularization In-Reply-To: References: <98AA67A8-71D6-402C-8F99-5CAB64D28525@gmail.com> Message-ID: >From a generalization point of view (test accuracy), the optimal sparsity support should not matter much though, but it can be helpful to find a the optimally sparsest solution for either computational constraints (smaller models with a lower prediction latency) and interpretation of the weights (domain specific). -- Olivier From karandesai281196 at gmail.com Wed Mar 15 04:48:28 2017 From: karandesai281196 at gmail.com (Karan Desai) Date: Wed, 15 Mar 2017 14:18:28 +0530 Subject: [scikit-learn] [GSoC 2017] First Draft, request for suggestions - Improve Online Learning of Linear Models. Message-ID: Hello developers, I'm Karan Desai, an Electrical Engineering Undergraduate at IIT Roorkee. I was following the community since October and initially planned to work on Pytest Migration idea. But on meticulous discussions, it was concluded that the migration task might be short for a three month wide timeline. Besides work is in progress on that. I particularly found the first project idea appealing, and went about gathering ingredients to make the perfect recipe for summers. Finally I can outline it as stated below. The description was quite short, so I will be happy to include more in it if need be. 1. There's a gradient descent optimizer, but I could not find an optimizer for adaptive learning strategies (I saw a method for adam in MLP though). So adding that can be a part of my project. 2. I looked into benchmarks directory, and checked a comparison of SGD against coordinate descent and ridge regression. Similar type of benching should be done with this new Optimizer/s as well. 3. There's a lack of multinomial logloss as mentioned in description (categorical cross entropy for classification tasks). I can work on adding that as well. As an addition, I can work on KL divergence, poisson and cosine proximity losses, to name a few. In my opinion, these are pretty standard and can be a nice to have. They already exist as metrics, just need to be ported to Cython and used as an optimization objective for linear classifiers. 4. About a tool to anneal learning rate: I suggest a new approach to look at this - as a callback. I searched through the documentation and I could not find this way of handling tidbits during training of models. We should be able to provide a callback to the constructor of a linear model which can do any dedicated job after every epoch, be it learning rate annealing, saving model checkpoint, getting custom verbose output, or as creative as uploading data to server for real time plots on any website. If this gets working in place, we can generalize this to many classes of scikit-learn. As a part of my project, I am planning to enrich scikit-learn to be shipping some ready made callback helpers for easy plug and play. I am still not sure whether this is sufficient for a three months timeline, because I am assuming the review cycles might take slightly longer time because of scikit-learn being such a huge community. As far as the math is concerned, I have searched for some good references, some of which are listed below: 1. First two points will heavily rely on @mblondel's lightning package, and this blog post: http://sebastianruder.com/optimizing-gradient-descent/ 2. 
For the losses (third point), I have seen the way existing losses are written in cython, as well as in the metrics submodule. That should help a lot. 3. About the fourth point, first of all I would be happy to get some suggestions from the community. Once satisfied, I should implement a very basic prototype with some existing class, maybe convert verbose logging of some class to a callback structure. Will include that in the second draft of proposal which would be a preliminary version of what I shall submit on GSoC website. More about me: 1. Github Profile: https://www.github.com/karandesai-96 2. GSoC 2016 Project: https://goo.gl/mdFZ6m 3. Joblib Contributions: https://git.io/vyMSx 4. Scikit-learn Contributions: https://git.io/vyMSF I'll be eagerly waiting for feedback. Thanks. Regards, Karan Desai, Department of Electrical Engineering, IIT Roorkee, India. -------------- next part -------------- An HTML attachment was scrubbed... URL: From t3kcit at gmail.com Wed Mar 15 10:42:58 2017 From: t3kcit at gmail.com (Andreas Mueller) Date: Wed, 15 Mar 2017 10:42:58 -0400 Subject: [scikit-learn] Label encoding for classifiers and soft targets In-Reply-To: References: <542B0BDD-F329-4F26-9001-9F535426306C@ende.cc> <20170312183844.GD694569@phare.normalesup.org> <72559155-CB35-441E-9F9D-6FD679033E17@ende.cc> Message-ID: On 03/13/2017 05:54 PM, Javier L?pez Pe?a wrote: >> You could use a regression model with a logistic sigmoid in the output layer. > By training a regression network with logistic activation the outputs do not add to 1. > I just checked on a minimal example on the iris dataset. Sorry meant softmax ;) From t3kcit at gmail.com Wed Mar 15 10:48:23 2017 From: t3kcit at gmail.com (Andreas Mueller) Date: Wed, 15 Mar 2017 10:48:23 -0400 Subject: [scikit-learn] [GSoC 2017] First Draft, request for suggestions - Improve Online Learning of Linear Models. In-Reply-To: References: Message-ID: On 03/15/2017 04:48 AM, Karan Desai wrote: > 4. About a tool to anneal learning rate: I suggest a new approach to > look at this - as a callback. I searched through the documentation and > I could not find this way of handling tidbits during training of > models. We should be able to provide a callback to the constructor of > a linear model which can do any dedicated job after every epoch, be it > learning rate annealing, saving model checkpoint, getting custom > verbose output, or as creative as uploading data to server for real > time plots on any website. There has been some effort on doing adagrad but it was ultimately discontinued, I think. There was quite a bit of complexity to handle. The problem with callbacks is that for callbacks on each iteration to be feasible, they need to be cython functions. Otherwise they will be too slow. You could do python callbacks, but they could not be called at every iteration, and so they wouldn't be suitable to implement something like adagrad or adam. Best, Andy From shubham.bhardwaj2015 at vit.ac.in Wed Mar 15 13:28:00 2017 From: shubham.bhardwaj2015 at vit.ac.in (SHUBHAM BHARDWAJ 15BCE0704) Date: Wed, 15 Mar 2017 22:58:00 +0530 Subject: [scikit-learn] GSoc, 2017 (proposal idea and intro) .reg In-Reply-To: References: Message-ID: Hello Sir, Greetings. I have coded a sequential version of Scalable Kmeans++ (#8585) and have included a test script for testing it in the pr's description. https://github.com/scikit-learn/scikit-learn/pull/8585. 
Regards Shubham Bhardwaj On Tue, Mar 14, 2017 at 3:59 AM, Shreyas Saligrama chandrakan < ssaligra at hawk.iit.edu> wrote: > Hi, > > Is it possible for me to contribute a library to introduce SVM's with tree > kernel (like current available one in svmlight) which is currently missing > in scikit-learn? > > Best, > Shreyas > > On 5 Mar 2017 11:03 a.m., "Andreas Mueller" wrote: > >> There was a PR here: >> https://github.com/scikit-learn/scikit-learn/pull/5530 >> >> but it didn't seem to work. Feel free to convince us otherwise ;) >> >> >> On 03/02/2017 08:23 PM, SHUBHAM BHARDWAJ 15BCE0704 wrote: >> >> Hello Sir, >> Very Sorry for the numbers I saw this written in the comments.I assumed >> -Given the person who suggested the paper might have taken a look into the >> number of citations.I will make sure to personally check myself. >> >> Regards >> Shubham Bhardwaj >> >> On Fri, Mar 3, 2017 at 6:40 AM, Guillaume Lema?tre < >> g.lemaitre58 at gmail.com> wrote: >> >>> I think that you mean this paper -> Scalable K-Means++ -> 218 citations >>> >>> On 3 March 2017 at 02:00, SHUBHAM BHARDWAJ 15BCE0704 < >>> shubham.bhardwaj2015 at vit.ac.in> wrote: >>> >>>> Hello Sir, >>>> >>>> Thanks a lot for the reply. Sorry for not being elaborate about what I >>>> was trying to address. I wanted to implement this [ >>>> http://ilpubs.stanford.edu:8090/778/1/2006-13.pdf] (1200+citations)- >>>> mentioned in comments. This pertains to the stalled issue #4357 .Proposal >>>> idea - implementing a scalable kmeans++. >>>> >>>> Regards >>>> Shubham Bhardwaj >>>> >>>> On Fri, Mar 3, 2017 at 12:01 AM, Jacob Schreiber < >>>> jmschreiber91 at gmail.com> wrote: >>>> >>>>> Hi Shubham >>>>> >>>>> Thanks for your interest. I'm eager to see your contributions to >>>>> sklearn in the future. However, I'm pretty sure kmeans++ is already >>>>> implemented: http://scikit-learn.org/stable/modules/generate >>>>> d/sklearn.cluster.KMeans.html >>>>> >>>>> Jacob >>>>> >>>>> On Thu, Mar 2, 2017 at 1:07 AM, SHUBHAM BHARDWAJ 15BCE0704 < >>>>> shubham.bhardwaj2015 at vit.ac.in> wrote: >>>>> >>>>>> Hello Sir, >>>>>> >>>>>> My introduction : >>>>>> I am a 2nd year student studying Computer Science and engineering >>>>>> from VIT, Vellore. I work in Google Developers Group VIT. All my experience >>>>>> has been about collaborating with a lot of people ,working as a team, >>>>>> building products and learning along the way. >>>>>> Since scikit-learn is participating this time I am too planning to >>>>>> submit a proposal. >>>>>> >>>>>> Proposal idea: >>>>>> I am really interested in implementing kmeans++ algorithm.I was doing >>>>>> some work on DT but I found this very appealing. Just wanted to know if it >>>>>> can be a good project idea. 
>>>>>> >>>>>> Regards >>>>>> Shubham Bhardwaj >>>>>> >>>>>> >>>>>> >>>>>> _______________________________________________ >>>>>> scikit-learn mailing list >>>>>> scikit-learn at python.org >>>>>> https://mail.python.org/mailman/listinfo/scikit-learn >>>>>> >>>>>> >>>>> >>>>> _______________________________________________ >>>>> scikit-learn mailing list >>>>> scikit-learn at python.org >>>>> https://mail.python.org/mailman/listinfo/scikit-learn >>>>> >>>>> >>>> >>>> _______________________________________________ >>>> scikit-learn mailing list >>>> scikit-learn at python.org >>>> https://mail.python.org/mailman/listinfo/scikit-learn >>>> >>>> >>> >>> >>> -- >>> Guillaume Lemaitre >>> INRIA Saclay - Ile-de-France >>> Equipe PARIETAL >>> guillaume.lemaitre at inria.f r --- >>> https://glemaitre.github.io/ >>> >>> _______________________________________________ >>> scikit-learn mailing list >>> scikit-learn at python.org >>> https://mail.python.org/mailman/listinfo/scikit-learn >>> >>> >> >> >> _______________________________________________ >> scikit-learn mailing listscikit-learn at python.orghttps://mail.python.org/mailman/listinfo/scikit-learn >> >> >> >> _______________________________________________ >> scikit-learn mailing list >> scikit-learn at python.org >> https://mail.python.org/mailman/listinfo/scikit-learn >> >> > _______________________________________________ > scikit-learn mailing list > scikit-learn at python.org > https://mail.python.org/mailman/listinfo/scikit-learn > > -------------- next part -------------- An HTML attachment was scrubbed... URL: From skacanski at gmail.com Wed Mar 15 21:20:55 2017 From: skacanski at gmail.com (Sasha Kacanski) Date: Wed, 15 Mar 2017 21:20:55 -0400 Subject: [scikit-learn] best way to scale on the random forest for text w bag of words ... Message-ID: Hi, As soon as number of trees and features goes higher, 70Gb of ram is gone and i am getting out of memory errors. file size is 700Mb. Dataframe quickly shrinks from 14 to 2 columns but there is ton of text ... with 10 estimators and 100 features per word I can't tackle ~900 k of records ... Training set, about 15% of data does perfectly fine but when test come that is it. i can split stuff and multiprocess it but I believe that will simply skew results... Any ideas? -- Aleksandar Kacanski -------------- next part -------------- An HTML attachment was scrubbed... URL: From joel.nothman at gmail.com Wed Mar 15 21:44:05 2017 From: joel.nothman at gmail.com (Joel Nothman) Date: Thu, 16 Mar 2017 12:44:05 +1100 Subject: [scikit-learn] best way to scale on the random forest for text w bag of words ... In-Reply-To: References: Message-ID: Trees are not a traditional choice for bag of words models, but you should make sure you are at least using the parameters of the random forest to limit the size (depth, branching) of the trees. On 16 March 2017 at 12:20, Sasha Kacanski wrote: > Hi, > As soon as number of trees and features goes higher, 70Gb of ram is gone > and i am getting out of memory errors. > file size is 700Mb. Dataframe quickly shrinks from 14 to 2 columns but > there is ton of text ... > with 10 estimators and 100 features per word I can't tackle ~900 k of > records ... > Training set, about 15% of data does perfectly fine but when test come > that is it. > > i can split stuff and multiprocess it but I believe that will simply skew > results... > > Any ideas? 
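A sketch of the kind of capping Joel describes, on toy documents; every number below is an illustrative placeholder rather than a tuned value:

from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import make_pipeline

docs = ["free money now", "meeting at noon", "cheap pills", "project update"] * 100
labels = [1, 0, 1, 0] * 100

model = make_pipeline(
    TfidfVectorizer(max_features=20000),        # cap the vocabulary size
    RandomForestClassifier(n_estimators=50,
                           max_depth=20,        # limit tree depth
                           max_leaf_nodes=2000, # and the number of leaves per tree
                           max_features='sqrt', # fewer features considered per split
                           n_jobs=1,            # avoid per-process copies of the data
                           random_state=0),
)
model.fit(docs, labels)
print(model.score(docs, labels))

A HashingVectorizer would bound memory on the vectorizer side as well; and, as Joel notes, linear models are the more traditional choice for bag-of-words features.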
> > > -- > Aleksandar Kacanski > > _______________________________________________ > scikit-learn mailing list > scikit-learn at python.org > https://mail.python.org/mailman/listinfo/scikit-learn > > -------------- next part -------------- An HTML attachment was scrubbed... URL: From se.raschka at gmail.com Wed Mar 15 23:27:08 2017 From: se.raschka at gmail.com (Sebastian Raschka) Date: Wed, 15 Mar 2017 23:27:08 -0400 Subject: [scikit-learn] Differences between scikit-learn and Spark.ml for regression toy problem In-Reply-To: References: Message-ID: <2F90C7CA-32D1-48A6-9C44-A658A92FF97F@gmail.com> I think the liblinear solver (default in LogisticRegression) does regularize the bias. So, even if both solutions (sklearn, spark) anneal to the global cost optimum, the model parameters would be different. Maybe a better way to make that comparison would be to turn off regularization completely for now. And when you run the LogisticRegression, maybe run it multiple times with different random seeds to see if your solutions are generally stable. Best, Sebastian > On Mar 13, 2017, at 1:06 PM, Stuart Reynolds wrote: > > Both libraries are heavily parameterized. You should check what the defaults are for both. > > Some ideas: > - What regularization is being used. L1/L2? > - Does the regularization parameter have the same interpretation? 1/C = lambda? Some libraries use C. Some use lambda. > - Also, some libraries regularize the intercept (scikit), other do not. (It doesn't seem like a particularly good idea to regularize the intercept if your optimizer permits not doing it). > > > > On Sun, Mar 12, 2017 at 7:07 PM, Frank Astier via scikit-learn wrote: > (this was also posted to stackoverflow on 03/10) > > I am setting up a very simple logistic regression problem in scikit-learn and in spark.ml, and the results diverge: the models they learn are different, but I can't figure out why (data is the same, model type is the same, regularization is the same...). > > No doubt I am missing some setting on one side or the other. Which setting? How should I set up either scikit or spark.ml to find the same model as its counterpart? > > I give the sklearn code and spark.ml code below. Both should be ready to cut-and-paste and run. 
> > scikit-learn code: > ---------------------- > > import numpy as np > from sklearn.linear_model import LogisticRegression, Ridge > > X = np.array([ > [-0.7306653538519616, 0.0], > [0.6750417712898752, -0.4232874171873786], > [0.1863463229359709, -0.8163423997075965], > [-0.6719842051493347, 0.0], > [0.9699938346531928, 0.0], > [0.22759406190283604, 0.0], > [0.9688721028330911, 0.0], > [0.5993795346650845, 0.0], > [0.9219423508390701, -0.8972778242305388], > [0.7006904841584055, -0.5607635619919824] > ]) > > y = np.array([ > 0.0, > 1.0, > 1.0, > 0.0, > 1.0, > 1.0, > 1.0, > 0.0, > 0.0, > 0.0 > ]) > > m, n = X.shape > > # Add intercept term to simulate inputs to GameEstimator > X_with_intercept = np.hstack((X, np.ones(m)[:,np.newaxis])) > > l = 0.3 > e = LogisticRegression( > fit_intercept=False, > penalty='l2', > C=1/l, > max_iter=100, > tol=1e-11) > > e.fit(X_with_intercept, y) > > print e.coef_ > # => [[ 0.98662189 0.45571052 -0.23467255]] > > # Linear regression is called Ridge in sklearn > e = Ridge( > fit_intercept=False, > alpha=l, > max_iter=100, > tol=1e-11) > > e.fit(X_with_intercept, y) > > print e.coef_ > # =>[ 0.32155545 0.17904355 0.41222418] > > spark.ml code: > ------------------- > > import org.apache.spark.{SparkConf, SparkContext} > import org.apache.spark.ml.classification.LogisticRegression > import org.apache.spark.ml.regression.LinearRegression > import org.apache.spark.mllib.linalg.Vectors > import org.apache.spark.mllib.regression.LabeledPoint > import org.apache.spark.sql.SQLContext > > object TestSparkRegression { > def main(args: Array[String]): Unit = { > import org.apache.log4j.{Level, Logger} > > Logger.getLogger("org").setLevel(Level.OFF) > Logger.getLogger("akka").setLevel(Level.OFF) > > val conf = new SparkConf().setAppName("test").setMaster("local") > val sc = new SparkContext(conf) > > val sparkTrainingData = new SQLContext(sc) > .createDataFrame(Seq( > LabeledPoint(0.0, Vectors.dense(-0.7306653538519616, 0.0)), > LabeledPoint(1.0, Vectors.dense(0.6750417712898752, -0.4232874171873786)), > LabeledPoint(1.0, Vectors.dense(0.1863463229359709, -0.8163423997075965)), > LabeledPoint(0.0, Vectors.dense(-0.6719842051493347, 0.0)), > LabeledPoint(1.0, Vectors.dense(0.9699938346531928, 0.0)), > LabeledPoint(1.0, Vectors.dense(0.22759406190283604, 0.0)), > LabeledPoint(1.0, Vectors.dense(0.9688721028330911, 0.0)), > LabeledPoint(0.0, Vectors.dense(0.5993795346650845, 0.0)), > LabeledPoint(0.0, Vectors.dense(0.9219423508390701, -0.8972778242305388)), > LabeledPoint(0.0, Vectors.dense(0.7006904841584055, -0.5607635619919824)))) > .toDF("label", "features") > > val logisticModel = new LogisticRegression() > .setRegParam(0.3) > .setLabelCol("label") > .setFeaturesCol("features") > .fit(sparkTrainingData) > > println(s"Spark logistic model coefficients: ${logisticModel.coefficients} Intercept: ${logisticModel.intercept}") > // Spark logistic model coefficients: [0.5451588538376263,0.26740606573584713] Intercept: -0.13897955358689987 > > val linearModel = new LinearRegression() > .setRegParam(0.3) > .setLabelCol("label") > .setFeaturesCol("features") > .setSolver("l-bfgs") > .fit(sparkTrainingData) > > println(s"Spark linear model coefficients: ${linearModel.coefficients} Intercept: ${linearModel.intercept}") > // Spark linear model coefficients: [0.19852664861346023,0.11501200541407802] Intercept: 0.45464906876832323 > > sc.stop() > } > } > > Thanks, > > Frank > > _______________________________________________ > scikit-learn mailing list > scikit-learn at python.org > 
https://mail.python.org/mailman/listinfo/scikit-learn > > > _______________________________________________ > scikit-learn mailing list > scikit-learn at python.org > https://mail.python.org/mailman/listinfo/scikit-learn From joel.nothman at gmail.com Wed Mar 15 23:57:35 2017 From: joel.nothman at gmail.com (Joel Nothman) Date: Thu, 16 Mar 2017 14:57:35 +1100 Subject: [scikit-learn] Differences between scikit-learn and Spark.ml for regression toy problem In-Reply-To: <2F90C7CA-32D1-48A6-9C44-A658A92FF97F@gmail.com> References: <2F90C7CA-32D1-48A6-9C44-A658A92FF97F@gmail.com> Message-ID: sklearn's (and hence liblinear's) intercept is not being used here, but a feature is added in Python to represent the bias, so it's being regularised in any case. On 16 March 2017 at 14:27, Sebastian Raschka wrote: > I think the liblinear solver (default in LogisticRegression) does > regularize the bias. So, even if both solutions (sklearn, spark) anneal to > the global cost optimum, the model parameters would be different. > Maybe a better way to make that comparison would be to turn off > regularization completely for now. And when you run the LogisticRegression, > maybe run it multiple times with different random seeds to see if your > solutions are generally stable. > > Best, > Sebastian > > > On Mar 13, 2017, at 1:06 PM, Stuart Reynolds > wrote: > > > > Both libraries are heavily parameterized. You should check what the > defaults are for both. > > > > Some ideas: > > - What regularization is being used. L1/L2? > > - Does the regularization parameter have the same interpretation? 1/C = > lambda? Some libraries use C. Some use lambda. > > - Also, some libraries regularize the intercept (scikit), other do not. > (It doesn't seem like a particularly good idea to regularize the intercept > if your optimizer permits not doing it). > > > > > > > > On Sun, Mar 12, 2017 at 7:07 PM, Frank Astier via scikit-learn < > scikit-learn at python.org> wrote: > > (this was also posted to stackoverflow on 03/10) > > > > I am setting up a very simple logistic regression problem in > scikit-learn and in spark.ml, and the results diverge: the models they > learn are different, but I can't figure out why (data is the same, model > type is the same, regularization is the same...). > > > > No doubt I am missing some setting on one side or the other. Which > setting? How should I set up either scikit or spark.ml to find the same > model as its counterpart? > > > > I give the sklearn code and spark.ml code below. Both should be ready > to cut-and-paste and run. 
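Putting those two remarks together, a sketch of the scikit-learn side of Frank's example with the penalty made negligible, so that neither library's treatment of the intercept matters much; the C = 1 / (n * lambda) translation in the comments is an assumption about how the two objectives are scaled and should be checked against the Spark docs for the version in use:

import numpy as np
from sklearn.linear_model import LogisticRegression

X = np.array([
    [-0.7306653538519616, 0.0],
    [0.6750417712898752, -0.4232874171873786],
    [0.1863463229359709, -0.8163423997075965],
    [-0.6719842051493347, 0.0],
    [0.9699938346531928, 0.0],
    [0.22759406190283604, 0.0],
    [0.9688721028330911, 0.0],
    [0.5993795346650845, 0.0],
    [0.9219423508390701, -0.8972778242305388],
    [0.7006904841584055, -0.5607635619919824]])
y = np.array([0., 1., 1., 0., 1., 1., 1., 0., 0., 0.])

# Large C ~ (almost) no penalty, so the extra regularized intercept column that
# liblinear adds stops mattering; compare against Spark fitted with regParam=0.0.
clf = LogisticRegression(fit_intercept=True, C=1e6, tol=1e-11)
clf.fit(X, y)
print(clf.coef_, clf.intercept_)

# With regularization back on, a rough translation of Spark's regParam=lambda
# would be C = 1.0 / (n_samples * lambda), e.g. C = 1.0 / (10 * 0.3) here,
# assuming Spark averages the log-loss over samples while scikit-learn sums it.
# Spark's LogisticRegression also standardizes features by default, which would
# need to be matched or switched off for the coefficients to agree.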
> > > > scikit-learn code: > > ---------------------- > > > > import numpy as np > > from sklearn.linear_model import LogisticRegression, Ridge > > > > X = np.array([ > > [-0.7306653538519616, 0.0], > > [0.6750417712898752, -0.4232874171873786], > > [0.1863463229359709, -0.8163423997075965], > > [-0.6719842051493347, 0.0], > > [0.9699938346531928, 0.0], > > [0.22759406190283604, 0.0], > > [0.9688721028330911, 0.0], > > [0.5993795346650845, 0.0], > > [0.9219423508390701, -0.8972778242305388], > > [0.7006904841584055, -0.5607635619919824] > > ]) > > > > y = np.array([ > > 0.0, > > 1.0, > > 1.0, > > 0.0, > > 1.0, > > 1.0, > > 1.0, > > 0.0, > > 0.0, > > 0.0 > > ]) > > > > m, n = X.shape > > > > # Add intercept term to simulate inputs to GameEstimator > > X_with_intercept = np.hstack((X, np.ones(m)[:,np.newaxis])) > > > > l = 0.3 > > e = LogisticRegression( > > fit_intercept=False, > > penalty='l2', > > C=1/l, > > max_iter=100, > > tol=1e-11) > > > > e.fit(X_with_intercept, y) > > > > print e.coef_ > > # => [[ 0.98662189 0.45571052 -0.23467255]] > > > > # Linear regression is called Ridge in sklearn > > e = Ridge( > > fit_intercept=False, > > alpha=l, > > max_iter=100, > > tol=1e-11) > > > > e.fit(X_with_intercept, y) > > > > print e.coef_ > > # =>[ 0.32155545 0.17904355 0.41222418] > > > > spark.ml code: > > ------------------- > > > > import org.apache.spark.{SparkConf, SparkContext} > > import org.apache.spark.ml.classification.LogisticRegression > > import org.apache.spark.ml.regression.LinearRegression > > import org.apache.spark.mllib.linalg.Vectors > > import org.apache.spark.mllib.regression.LabeledPoint > > import org.apache.spark.sql.SQLContext > > > > object TestSparkRegression { > > def main(args: Array[String]): Unit = { > > import org.apache.log4j.{Level, Logger} > > > > Logger.getLogger("org").setLevel(Level.OFF) > > Logger.getLogger("akka").setLevel(Level.OFF) > > > > val conf = new SparkConf().setAppName("test").setMaster("local") > > val sc = new SparkContext(conf) > > > > val sparkTrainingData = new SQLContext(sc) > > .createDataFrame(Seq( > > LabeledPoint(0.0, Vectors.dense(-0.7306653538519616, 0.0)), > > LabeledPoint(1.0, Vectors.dense(0.6750417712898752, > -0.4232874171873786)), > > LabeledPoint(1.0, Vectors.dense(0.1863463229359709, > -0.8163423997075965)), > > LabeledPoint(0.0, Vectors.dense(-0.6719842051493347, 0.0)), > > LabeledPoint(1.0, Vectors.dense(0.9699938346531928, 0.0)), > > LabeledPoint(1.0, Vectors.dense(0.22759406190283604, 0.0)), > > LabeledPoint(1.0, Vectors.dense(0.9688721028330911, 0.0)), > > LabeledPoint(0.0, Vectors.dense(0.5993795346650845, 0.0)), > > LabeledPoint(0.0, Vectors.dense(0.9219423508390701, > -0.8972778242305388)), > > LabeledPoint(0.0, Vectors.dense(0.7006904841584055, > -0.5607635619919824)))) > > .toDF("label", "features") > > > > val logisticModel = new LogisticRegression() > > .setRegParam(0.3) > > .setLabelCol("label") > > .setFeaturesCol("features") > > .fit(sparkTrainingData) > > > > println(s"Spark logistic model coefficients: > ${logisticModel.coefficients} Intercept: ${logisticModel.intercept}") > > // Spark logistic model coefficients: [0.5451588538376263,0.26740606573584713] > Intercept: -0.13897955358689987 > > > > val linearModel = new LinearRegression() > > .setRegParam(0.3) > > .setLabelCol("label") > > .setFeaturesCol("features") > > .setSolver("l-bfgs") > > .fit(sparkTrainingData) > > > > println(s"Spark linear model coefficients: > ${linearModel.coefficients} Intercept: ${linearModel.intercept}") > > // Spark linear 
model coefficients: [0.19852664861346023,0.11501200541407802] > Intercept: 0.45464906876832323 > > > > sc.stop() > > } > > } > > > > Thanks, > > > > Frank > > > > _______________________________________________ > > scikit-learn mailing list > > scikit-learn at python.org > > https://mail.python.org/mailman/listinfo/scikit-learn > > > > > > _______________________________________________ > > scikit-learn mailing list > > scikit-learn at python.org > > https://mail.python.org/mailman/listinfo/scikit-learn > > _______________________________________________ > scikit-learn mailing list > scikit-learn at python.org > https://mail.python.org/mailman/listinfo/scikit-learn > -------------- next part -------------- An HTML attachment was scrubbed... URL: From noflaco at gmail.com Thu Mar 16 00:00:12 2017 From: noflaco at gmail.com (Carlton Banks) Date: Thu, 16 Mar 2017 05:00:12 +0100 Subject: [scikit-learn] GridsearchCV Message-ID: <4FEBA91C-07A4-4AFB-932B-1B175A89D592@gmail.com> Hi? I currently trying to optimize my CNN model using gridsearchCV, but seem to have some problems feading my input data.. My training data is stored as a list of Np.ndarrays of shape (6,3,3) and my output is stored as a list of np.array with one entry. Why am I having problems parsing my data to it? best regards Carl B. From se.raschka at gmail.com Thu Mar 16 00:30:44 2017 From: se.raschka at gmail.com (Sebastian Raschka) Date: Thu, 16 Mar 2017 00:30:44 -0400 Subject: [scikit-learn] GridsearchCV In-Reply-To: <4FEBA91C-07A4-4AFB-932B-1B175A89D592@gmail.com> References: <4FEBA91C-07A4-4AFB-932B-1B175A89D592@gmail.com> Message-ID: <93F23983-0958-4975-883E-2A6747799150@gmail.com> Sklearn estimators typically assume 2d inputs (as numpy arrays) with shape=[n_samples, n_features]. > list of Np.ndarrays of shape (6,3,3) I assume you mean a 3D tensor (3D numpy array) with shape=[n_samples, n_pixels, n_pixels]? What you could do is to reshape it before you put it in, i.e., data_ary = your_ary.reshape(n_samples, -1).shape then, you need to add a line at the beginning your CNN class that does the reverse, i.e., data_ary.reshape(6, n_pixels, n_pixels).shape. Numpy?s reshape typically returns view objects, so that these additional steps shouldn?t be ?too? expensive. Best, Sebastian > On Mar 16, 2017, at 12:00 AM, Carlton Banks wrote: > > Hi? > > I currently trying to optimize my CNN model using gridsearchCV, but seem to have some problems feading my input data.. > > My training data is stored as a list of Np.ndarrays of shape (6,3,3) and my output is stored as a list of np.array with one entry. > > Why am I having problems parsing my data to it? > > best regards > Carl B. > _______________________________________________ > scikit-learn mailing list > scikit-learn at python.org > https://mail.python.org/mailman/listinfo/scikit-learn From noflaco at gmail.com Thu Mar 16 00:46:39 2017 From: noflaco at gmail.com (Carlton Banks) Date: Thu, 16 Mar 2017 05:46:39 +0100 Subject: [scikit-learn] GridsearchCV In-Reply-To: <93F23983-0958-4975-883E-2A6747799150@gmail.com> References: <4FEBA91C-07A4-4AFB-932B-1B175A89D592@gmail.com> <93F23983-0958-4975-883E-2A6747799150@gmail.com> Message-ID: The ndarray (6,3,3) => (row, col,color channels) I tried fixing it converting the list of numpy.ndarray to numpy.asarray(list) but this causes a different problem: is grid use a lot a memory.. I am running on a super computer, and seem to have problems with memory.. already used 62 gb ram.. > Den 16. mar. 2017 kl. 
05.30 skrev Sebastian Raschka : > > Sklearn estimators typically assume 2d inputs (as numpy arrays) with shape=[n_samples, n_features]. > >> list of Np.ndarrays of shape (6,3,3) > > I assume you mean a 3D tensor (3D numpy array) with shape=[n_samples, n_pixels, n_pixels]? What you could do is to reshape it before you put it in, i.e., > > data_ary = your_ary.reshape(n_samples, -1).shape > > then, you need to add a line at the beginning your CNN class that does the reverse, i.e., data_ary.reshape(6, n_pixels, n_pixels).shape. Numpy?s reshape typically returns view objects, so that these additional steps shouldn?t be ?too? expensive. > > Best, > Sebastian > > > >> On Mar 16, 2017, at 12:00 AM, Carlton Banks wrote: >> >> Hi? >> >> I currently trying to optimize my CNN model using gridsearchCV, but seem to have some problems feading my input data.. >> >> My training data is stored as a list of Np.ndarrays of shape (6,3,3) and my output is stored as a list of np.array with one entry. >> >> Why am I having problems parsing my data to it? >> >> best regards >> Carl B. >> _______________________________________________ >> scikit-learn mailing list >> scikit-learn at python.org >> https://mail.python.org/mailman/listinfo/scikit-learn > > _______________________________________________ > scikit-learn mailing list > scikit-learn at python.org > https://mail.python.org/mailman/listinfo/scikit-learn From joel.nothman at gmail.com Thu Mar 16 00:58:20 2017 From: joel.nothman at gmail.com (Joel Nothman) Date: Thu, 16 Mar 2017 15:58:20 +1100 Subject: [scikit-learn] GridsearchCV In-Reply-To: References: <4FEBA91C-07A4-4AFB-932B-1B175A89D592@gmail.com> <93F23983-0958-4975-883E-2A6747799150@gmail.com> Message-ID: If you're using something like n_jobs=-1, that will explode memory usage in proportion to the number of cores, and particularly so if you're passing the data as a list rather than array and hence can't take advantage of memmapped data parallelism. On 16 March 2017 at 15:46, Carlton Banks wrote: > The ndarray (6,3,3) => (row, col,color channels) > > I tried fixing it converting the list of numpy.ndarray to > numpy.asarray(list) > > but this causes a different problem: > > is grid use a lot a memory.. I am running on a super computer, and seem to > have problems with memory.. already used 62 gb ram.. > > > Den 16. mar. 2017 kl. 05.30 skrev Sebastian Raschka < > se.raschka at gmail.com>: > > > > Sklearn estimators typically assume 2d inputs (as numpy arrays) with > shape=[n_samples, n_features]. > > > >> list of Np.ndarrays of shape (6,3,3) > > > > I assume you mean a 3D tensor (3D numpy array) with shape=[n_samples, > n_pixels, n_pixels]? What you could do is to reshape it before you put it > in, i.e., > > > > data_ary = your_ary.reshape(n_samples, -1).shape > > > > then, you need to add a line at the beginning your CNN class that does > the reverse, i.e., data_ary.reshape(6, n_pixels, n_pixels).shape. Numpy?s > reshape typically returns view objects, so that these additional steps > shouldn?t be ?too? expensive. > > > > Best, > > Sebastian > > > > > > > >> On Mar 16, 2017, at 12:00 AM, Carlton Banks wrote: > >> > >> Hi? > >> > >> I currently trying to optimize my CNN model using gridsearchCV, but > seem to have some problems feading my input data.. > >> > >> My training data is stored as a list of Np.ndarrays of shape (6,3,3) > and my output is stored as a list of np.array with one entry. > >> > >> Why am I having problems parsing my data to it? > >> > >> best regards > >> Carl B. 
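A numpy-only sketch of that reshape round trip, assuming samples of shape (6, 3, 3) as in the question; the random data is just a stand-in:

import numpy as np

samples = [np.random.rand(6, 3, 3) for _ in range(100)]   # stand-in for the real inputs
X = np.asarray(samples)                    # shape (100, 6, 3, 3)

X_2d = X.reshape(len(X), -1)               # shape (100, 54): what GridSearchCV expects

# ... pass X_2d (and y) to GridSearchCV / cross_val_score ...

X_restored = X_2d.reshape(-1, 6, 3, 3)     # undo the flattening inside the wrapped model
print(X_restored.shape, np.array_equal(X, X_restored))    # (100, 6, 3, 3) True

The un-flattening line belongs at the top of the wrapped estimator's fit() and predict(), so that the grid search itself only ever sees the 2-D array.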
> >> _______________________________________________ > >> scikit-learn mailing list > >> scikit-learn at python.org > >> https://mail.python.org/mailman/listinfo/scikit-learn > > > > _______________________________________________ > > scikit-learn mailing list > > scikit-learn at python.org > > https://mail.python.org/mailman/listinfo/scikit-learn > > _______________________________________________ > scikit-learn mailing list > scikit-learn at python.org > https://mail.python.org/mailman/listinfo/scikit-learn > -------------- next part -------------- An HTML attachment was scrubbed... URL: From se.raschka at gmail.com Thu Mar 16 01:00:17 2017 From: se.raschka at gmail.com (Sebastian Raschka) Date: Thu, 16 Mar 2017 01:00:17 -0400 Subject: [scikit-learn] GridsearchCV In-Reply-To: References: <4FEBA91C-07A4-4AFB-932B-1B175A89D592@gmail.com> <93F23983-0958-4975-883E-2A6747799150@gmail.com> Message-ID: Hm, if you set n_jobs>1, then I think it?s using multiprocessing, which will pass a copy of the input data to each process. That could be one reason for the relatively large memory consumption. > On Mar 16, 2017, at 12:46 AM, Carlton Banks wrote: > > The ndarray (6,3,3) => (row, col,color channels) > > I tried fixing it converting the list of numpy.ndarray to numpy.asarray(list) > > but this causes a different problem: > > is grid use a lot a memory.. I am running on a super computer, and seem to have problems with memory.. already used 62 gb ram.. > >> Den 16. mar. 2017 kl. 05.30 skrev Sebastian Raschka : >> >> Sklearn estimators typically assume 2d inputs (as numpy arrays) with shape=[n_samples, n_features]. >> >>> list of Np.ndarrays of shape (6,3,3) >> >> I assume you mean a 3D tensor (3D numpy array) with shape=[n_samples, n_pixels, n_pixels]? What you could do is to reshape it before you put it in, i.e., >> >> data_ary = your_ary.reshape(n_samples, -1).shape >> >> then, you need to add a line at the beginning your CNN class that does the reverse, i.e., data_ary.reshape(6, n_pixels, n_pixels).shape. Numpy?s reshape typically returns view objects, so that these additional steps shouldn?t be ?too? expensive. >> >> Best, >> Sebastian >> >> >> >>> On Mar 16, 2017, at 12:00 AM, Carlton Banks wrote: >>> >>> Hi? >>> >>> I currently trying to optimize my CNN model using gridsearchCV, but seem to have some problems feading my input data.. >>> >>> My training data is stored as a list of Np.ndarrays of shape (6,3,3) and my output is stored as a list of np.array with one entry. >>> >>> Why am I having problems parsing my data to it? >>> >>> best regards >>> Carl B. >>> _______________________________________________ >>> scikit-learn mailing list >>> scikit-learn at python.org >>> https://mail.python.org/mailman/listinfo/scikit-learn >> >> _______________________________________________ >> scikit-learn mailing list >> scikit-learn at python.org >> https://mail.python.org/mailman/listinfo/scikit-learn > > _______________________________________________ > scikit-learn mailing list > scikit-learn at python.org > https://mail.python.org/mailman/listinfo/scikit-learn From noflaco at gmail.com Thu Mar 16 01:01:18 2017 From: noflaco at gmail.com (Carlton Banks) Date: Thu, 16 Mar 2017 06:01:18 +0100 Subject: [scikit-learn] GridsearchCV In-Reply-To: References: <4FEBA91C-07A4-4AFB-932B-1B175A89D592@gmail.com> <93F23983-0958-4975-883E-2A6747799150@gmail.com> Message-ID: Oh? totally forgot about that.. why -1? > Den 16. mar. 2017 kl. 
05.58 skrev Joel Nothman : > > If you're using something like n_jobs=-1, that will explode memory usage in proportion to the number of cores, and particularly so if you're passing the data as a list rather than array and hence can't take advantage of memmapped data parallelism. > > On 16 March 2017 at 15:46, Carlton Banks > wrote: > The ndarray (6,3,3) => (row, col,color channels) > > I tried fixing it converting the list of numpy.ndarray to numpy.asarray(list) > > but this causes a different problem: > > is grid use a lot a memory.. I am running on a super computer, and seem to have problems with memory.. already used 62 gb ram.. > > > Den 16. mar. 2017 kl. 05.30 skrev Sebastian Raschka >: > > > > Sklearn estimators typically assume 2d inputs (as numpy arrays) with shape=[n_samples, n_features]. > > > >> list of Np.ndarrays of shape (6,3,3) > > > > I assume you mean a 3D tensor (3D numpy array) with shape=[n_samples, n_pixels, n_pixels]? What you could do is to reshape it before you put it in, i.e., > > > > data_ary = your_ary.reshape(n_samples, -1).shape > > > > then, you need to add a line at the beginning your CNN class that does the reverse, i.e., data_ary.reshape(6, n_pixels, n_pixels).shape. Numpy?s reshape typically returns view objects, so that these additional steps shouldn?t be ?too? expensive. > > > > Best, > > Sebastian > > > > > > > >> On Mar 16, 2017, at 12:00 AM, Carlton Banks > wrote: > >> > >> Hi? > >> > >> I currently trying to optimize my CNN model using gridsearchCV, but seem to have some problems feading my input data.. > >> > >> My training data is stored as a list of Np.ndarrays of shape (6,3,3) and my output is stored as a list of np.array with one entry. > >> > >> Why am I having problems parsing my data to it? > >> > >> best regards > >> Carl B. > >> _______________________________________________ > >> scikit-learn mailing list > >> scikit-learn at python.org > >> https://mail.python.org/mailman/listinfo/scikit-learn > > > > _______________________________________________ > > scikit-learn mailing list > > scikit-learn at python.org > > https://mail.python.org/mailman/listinfo/scikit-learn > > _______________________________________________ > scikit-learn mailing list > scikit-learn at python.org > https://mail.python.org/mailman/listinfo/scikit-learn > > _______________________________________________ > scikit-learn mailing list > scikit-learn at python.org > https://mail.python.org/mailman/listinfo/scikit-learn -------------- next part -------------- An HTML attachment was scrubbed... URL: From noflaco at gmail.com Thu Mar 16 01:03:24 2017 From: noflaco at gmail.com (Carlton Banks) Date: Thu, 16 Mar 2017 06:03:24 +0100 Subject: [scikit-learn] GridsearchCV In-Reply-To: References: <4FEBA91C-07A4-4AFB-932B-1B175A89D592@gmail.com> <93F23983-0958-4975-883E-2A6747799150@gmail.com> Message-ID: i was wondering about the minus in front? > Den 16. mar. 2017 kl. 06.00 skrev Sebastian Raschka : > > Hm, if you set n_jobs>1, then I think it?s using multiprocessing, which will pass a copy of the input data to each process. That could be one reason for the relatively large memory consumption. > >> On Mar 16, 2017, at 12:46 AM, Carlton Banks wrote: >> >> The ndarray (6,3,3) => (row, col,color channels) >> >> I tried fixing it converting the list of numpy.ndarray to numpy.asarray(list) >> >> but this causes a different problem: >> >> is grid use a lot a memory.. I am running on a super computer, and seem to have problems with memory.. already used 62 gb ram.. >> >>> Den 16. mar. 
2017 kl. 05.30 skrev Sebastian Raschka : >>> >>> Sklearn estimators typically assume 2d inputs (as numpy arrays) with shape=[n_samples, n_features]. >>> >>>> list of Np.ndarrays of shape (6,3,3) >>> >>> I assume you mean a 3D tensor (3D numpy array) with shape=[n_samples, n_pixels, n_pixels]? What you could do is to reshape it before you put it in, i.e., >>> >>> data_ary = your_ary.reshape(n_samples, -1).shape >>> >>> then, you need to add a line at the beginning your CNN class that does the reverse, i.e., data_ary.reshape(6, n_pixels, n_pixels).shape. Numpy?s reshape typically returns view objects, so that these additional steps shouldn?t be ?too? expensive. >>> >>> Best, >>> Sebastian >>> >>> >>> >>>> On Mar 16, 2017, at 12:00 AM, Carlton Banks wrote: >>>> >>>> Hi? >>>> >>>> I currently trying to optimize my CNN model using gridsearchCV, but seem to have some problems feading my input data.. >>>> >>>> My training data is stored as a list of Np.ndarrays of shape (6,3,3) and my output is stored as a list of np.array with one entry. >>>> >>>> Why am I having problems parsing my data to it? >>>> >>>> best regards >>>> Carl B. >>>> _______________________________________________ >>>> scikit-learn mailing list >>>> scikit-learn at python.org >>>> https://mail.python.org/mailman/listinfo/scikit-learn >>> >>> _______________________________________________ >>> scikit-learn mailing list >>> scikit-learn at python.org >>> https://mail.python.org/mailman/listinfo/scikit-learn >> >> _______________________________________________ >> scikit-learn mailing list >> scikit-learn at python.org >> https://mail.python.org/mailman/listinfo/scikit-learn > > _______________________________________________ > scikit-learn mailing list > scikit-learn at python.org > https://mail.python.org/mailman/listinfo/scikit-learn From se.raschka at gmail.com Thu Mar 16 01:06:17 2017 From: se.raschka at gmail.com (Sebastian Raschka) Date: Thu, 16 Mar 2017 01:06:17 -0400 Subject: [scikit-learn] GridsearchCV In-Reply-To: References: <4FEBA91C-07A4-4AFB-932B-1B175A89D592@gmail.com> <93F23983-0958-4975-883E-2A6747799150@gmail.com> Message-ID: the ?-1? means that it will run on all processors that are available > On Mar 16, 2017, at 1:01 AM, Carlton Banks wrote: > > Oh? totally forgot about that.. why -1? >> Den 16. mar. 2017 kl. 05.58 skrev Joel Nothman : >> >> If you're using something like n_jobs=-1, that will explode memory usage in proportion to the number of cores, and particularly so if you're passing the data as a list rather than array and hence can't take advantage of memmapped data parallelism. >> >> On 16 March 2017 at 15:46, Carlton Banks wrote: >> The ndarray (6,3,3) => (row, col,color channels) >> >> I tried fixing it converting the list of numpy.ndarray to numpy.asarray(list) >> >> but this causes a different problem: >> >> is grid use a lot a memory.. I am running on a super computer, and seem to have problems with memory.. already used 62 gb ram.. >> >> > Den 16. mar. 2017 kl. 05.30 skrev Sebastian Raschka : >> > >> > Sklearn estimators typically assume 2d inputs (as numpy arrays) with shape=[n_samples, n_features]. >> > >> >> list of Np.ndarrays of shape (6,3,3) >> > >> > I assume you mean a 3D tensor (3D numpy array) with shape=[n_samples, n_pixels, n_pixels]? 
What you could do is to reshape it before you put it in, i.e., >> > >> > data_ary = your_ary.reshape(n_samples, -1).shape >> > >> > then, you need to add a line at the beginning your CNN class that does the reverse, i.e., data_ary.reshape(6, n_pixels, n_pixels).shape. Numpy?s reshape typically returns view objects, so that these additional steps shouldn?t be ?too? expensive. >> > >> > Best, >> > Sebastian >> > >> > >> > >> >> On Mar 16, 2017, at 12:00 AM, Carlton Banks wrote: >> >> >> >> Hi? >> >> >> >> I currently trying to optimize my CNN model using gridsearchCV, but seem to have some problems feading my input data.. >> >> >> >> My training data is stored as a list of Np.ndarrays of shape (6,3,3) and my output is stored as a list of np.array with one entry. >> >> >> >> Why am I having problems parsing my data to it? >> >> >> >> best regards >> >> Carl B. >> >> _______________________________________________ >> >> scikit-learn mailing list >> >> scikit-learn at python.org >> >> https://mail.python.org/mailman/listinfo/scikit-learn >> > >> > _______________________________________________ >> > scikit-learn mailing list >> > scikit-learn at python.org >> > https://mail.python.org/mailman/listinfo/scikit-learn >> >> _______________________________________________ >> scikit-learn mailing list >> scikit-learn at python.org >> https://mail.python.org/mailman/listinfo/scikit-learn >> >> _______________________________________________ >> scikit-learn mailing list >> scikit-learn at python.org >> https://mail.python.org/mailman/listinfo/scikit-learn > > _______________________________________________ > scikit-learn mailing list > scikit-learn at python.org > https://mail.python.org/mailman/listinfo/scikit-learn From noflaco at gmail.com Thu Mar 16 01:08:14 2017 From: noflaco at gmail.com (Carlton Banks) Date: Thu, 16 Mar 2017 06:08:14 +0100 Subject: [scikit-learn] GridsearchCV In-Reply-To: References: <4FEBA91C-07A4-4AFB-932B-1B175A89D592@gmail.com> <93F23983-0958-4975-883E-2A6747799150@gmail.com> Message-ID: I changed it to -48?.. and it seem to be running.. > Den 16. mar. 2017 kl. 06.06 skrev Sebastian Raschka : > > the ?-1? means that it will run on all processors that are available > >> On Mar 16, 2017, at 1:01 AM, Carlton Banks wrote: >> >> Oh? totally forgot about that.. why -1? >>> Den 16. mar. 2017 kl. 05.58 skrev Joel Nothman : >>> >>> If you're using something like n_jobs=-1, that will explode memory usage in proportion to the number of cores, and particularly so if you're passing the data as a list rather than array and hence can't take advantage of memmapped data parallelism. >>> >>> On 16 March 2017 at 15:46, Carlton Banks wrote: >>> The ndarray (6,3,3) => (row, col,color channels) >>> >>> I tried fixing it converting the list of numpy.ndarray to numpy.asarray(list) >>> >>> but this causes a different problem: >>> >>> is grid use a lot a memory.. I am running on a super computer, and seem to have problems with memory.. already used 62 gb ram.. >>> >>>> Den 16. mar. 2017 kl. 05.30 skrev Sebastian Raschka : >>>> >>>> Sklearn estimators typically assume 2d inputs (as numpy arrays) with shape=[n_samples, n_features]. >>>> >>>>> list of Np.ndarrays of shape (6,3,3) >>>> >>>> I assume you mean a 3D tensor (3D numpy array) with shape=[n_samples, n_pixels, n_pixels]? 
What you could do is to reshape it before you put it in, i.e., >>>> >>>> data_ary = your_ary.reshape(n_samples, -1).shape >>>> >>>> then, you need to add a line at the beginning your CNN class that does the reverse, i.e., data_ary.reshape(6, n_pixels, n_pixels).shape. Numpy?s reshape typically returns view objects, so that these additional steps shouldn?t be ?too? expensive. >>>> >>>> Best, >>>> Sebastian >>>> >>>> >>>> >>>>> On Mar 16, 2017, at 12:00 AM, Carlton Banks wrote: >>>>> >>>>> Hi? >>>>> >>>>> I currently trying to optimize my CNN model using gridsearchCV, but seem to have some problems feading my input data.. >>>>> >>>>> My training data is stored as a list of Np.ndarrays of shape (6,3,3) and my output is stored as a list of np.array with one entry. >>>>> >>>>> Why am I having problems parsing my data to it? >>>>> >>>>> best regards >>>>> Carl B. >>>>> _______________________________________________ >>>>> scikit-learn mailing list >>>>> scikit-learn at python.org >>>>> https://mail.python.org/mailman/listinfo/scikit-learn >>>> >>>> _______________________________________________ >>>> scikit-learn mailing list >>>> scikit-learn at python.org >>>> https://mail.python.org/mailman/listinfo/scikit-learn >>> >>> _______________________________________________ >>> scikit-learn mailing list >>> scikit-learn at python.org >>> https://mail.python.org/mailman/listinfo/scikit-learn >>> >>> _______________________________________________ >>> scikit-learn mailing list >>> scikit-learn at python.org >>> https://mail.python.org/mailman/listinfo/scikit-learn >> >> _______________________________________________ >> scikit-learn mailing list >> scikit-learn at python.org >> https://mail.python.org/mailman/listinfo/scikit-learn > > _______________________________________________ > scikit-learn mailing list > scikit-learn at python.org > https://mail.python.org/mailman/listinfo/scikit-learn From noflaco at gmail.com Thu Mar 16 01:14:51 2017 From: noflaco at gmail.com (Carlton Banks) Date: Thu, 16 Mar 2017 06:14:51 +0100 Subject: [scikit-learn] GridsearchCV In-Reply-To: References: <4FEBA91C-07A4-4AFB-932B-1B175A89D592@gmail.com> <93F23983-0958-4975-883E-2A6747799150@gmail.com> Message-ID: What is the highest level of verbose? > Den 16. mar. 2017 kl. 06.08 skrev Carlton Banks : > > I changed it to -48?.. and it seem to be running.. >> Den 16. mar. 2017 kl. 06.06 skrev Sebastian Raschka : >> >> the ?-1? means that it will run on all processors that are available >> >>> On Mar 16, 2017, at 1:01 AM, Carlton Banks wrote: >>> >>> Oh? totally forgot about that.. why -1? >>>> Den 16. mar. 2017 kl. 05.58 skrev Joel Nothman : >>>> >>>> If you're using something like n_jobs=-1, that will explode memory usage in proportion to the number of cores, and particularly so if you're passing the data as a list rather than array and hence can't take advantage of memmapped data parallelism. >>>> >>>> On 16 March 2017 at 15:46, Carlton Banks wrote: >>>> The ndarray (6,3,3) => (row, col,color channels) >>>> >>>> I tried fixing it converting the list of numpy.ndarray to numpy.asarray(list) >>>> >>>> but this causes a different problem: >>>> >>>> is grid use a lot a memory.. I am running on a super computer, and seem to have problems with memory.. already used 62 gb ram.. >>>> >>>>> Den 16. mar. 2017 kl. 05.30 skrev Sebastian Raschka : >>>>> >>>>> Sklearn estimators typically assume 2d inputs (as numpy arrays) with shape=[n_samples, n_features]. 
>>>>> >>>>>> list of Np.ndarrays of shape (6,3,3) >>>>> >>>>> I assume you mean a 3D tensor (3D numpy array) with shape=[n_samples, n_pixels, n_pixels]? What you could do is to reshape it before you put it in, i.e., >>>>> >>>>> data_ary = your_ary.reshape(n_samples, -1).shape >>>>> >>>>> then, you need to add a line at the beginning your CNN class that does the reverse, i.e., data_ary.reshape(6, n_pixels, n_pixels).shape. Numpy?s reshape typically returns view objects, so that these additional steps shouldn?t be ?too? expensive. >>>>> >>>>> Best, >>>>> Sebastian >>>>> >>>>> >>>>> >>>>>> On Mar 16, 2017, at 12:00 AM, Carlton Banks wrote: >>>>>> >>>>>> Hi? >>>>>> >>>>>> I currently trying to optimize my CNN model using gridsearchCV, but seem to have some problems feading my input data.. >>>>>> >>>>>> My training data is stored as a list of Np.ndarrays of shape (6,3,3) and my output is stored as a list of np.array with one entry. >>>>>> >>>>>> Why am I having problems parsing my data to it? >>>>>> >>>>>> best regards >>>>>> Carl B. >>>>>> _______________________________________________ >>>>>> scikit-learn mailing list >>>>>> scikit-learn at python.org >>>>>> https://mail.python.org/mailman/listinfo/scikit-learn >>>>> >>>>> _______________________________________________ >>>>> scikit-learn mailing list >>>>> scikit-learn at python.org >>>>> https://mail.python.org/mailman/listinfo/scikit-learn >>>> >>>> _______________________________________________ >>>> scikit-learn mailing list >>>> scikit-learn at python.org >>>> https://mail.python.org/mailman/listinfo/scikit-learn >>>> >>>> _______________________________________________ >>>> scikit-learn mailing list >>>> scikit-learn at python.org >>>> https://mail.python.org/mailman/listinfo/scikit-learn >>> >>> _______________________________________________ >>> scikit-learn mailing list >>> scikit-learn at python.org >>> https://mail.python.org/mailman/listinfo/scikit-learn >> >> _______________________________________________ >> scikit-learn mailing list >> scikit-learn at python.org >> https://mail.python.org/mailman/listinfo/scikit-learn > From se.raschka at gmail.com Thu Mar 16 01:33:32 2017 From: se.raschka at gmail.com (Sebastian Raschka) Date: Thu, 16 Mar 2017 01:33:32 -0400 Subject: [scikit-learn] GridsearchCV In-Reply-To: References: <4FEBA91C-07A4-4AFB-932B-1B175A89D592@gmail.com> <93F23983-0958-4975-883E-2A6747799150@gmail.com> Message-ID: <69CC85FA-C763-4676-BE51-B12CA88B2A95@gmail.com> I am not sure what actually happens if you choose negative integers other than -1. Typically, you would choose either -1, 1 or a positive integer, sth like -1: all available cpus 1: 1 process 2: 2 processes ? 10: 10 process ? > On Mar 16, 2017, at 1:08 AM, Carlton Banks wrote: > > I changed it to -48?.. and it seem to be running.. >> Den 16. mar. 2017 kl. 06.06 skrev Sebastian Raschka : >> >> the ?-1? means that it will run on all processors that are available >> >>> On Mar 16, 2017, at 1:01 AM, Carlton Banks wrote: >>> >>> Oh? totally forgot about that.. why -1? >>>> Den 16. mar. 2017 kl. 05.58 skrev Joel Nothman : >>>> >>>> If you're using something like n_jobs=-1, that will explode memory usage in proportion to the number of cores, and particularly so if you're passing the data as a list rather than array and hence can't take advantage of memmapped data parallelism. 
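A small sketch of that point, with an arbitrary estimator and made-up sizes; the data is converted to a single ndarray up front so the parallel workers can share one block of memory instead of each pickling a Python list:

import numpy as np
from sklearn.linear_model import Ridge
from sklearn.model_selection import GridSearchCV

X = np.asarray([np.random.rand(54) for _ in range(1000)])   # one contiguous array
y = np.random.rand(1000)

# n_jobs=1  -> a single process, lowest peak memory
# n_jobs=4  -> four workers, roughly four times the peak memory
# n_jobs=-1 -> one worker per available CPU core
grid = GridSearchCV(Ridge(), {'alpha': [0.1, 1.0, 10.0]}, n_jobs=4)
grid.fit(X, y)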
>>>> >>>> On 16 March 2017 at 15:46, Carlton Banks wrote: >>>> The ndarray (6,3,3) => (row, col,color channels) >>>> >>>> I tried fixing it converting the list of numpy.ndarray to numpy.asarray(list) >>>> >>>> but this causes a different problem: >>>> >>>> is grid use a lot a memory.. I am running on a super computer, and seem to have problems with memory.. already used 62 gb ram.. >>>> >>>>> Den 16. mar. 2017 kl. 05.30 skrev Sebastian Raschka : >>>>> >>>>> Sklearn estimators typically assume 2d inputs (as numpy arrays) with shape=[n_samples, n_features]. >>>>> >>>>>> list of Np.ndarrays of shape (6,3,3) >>>>> >>>>> I assume you mean a 3D tensor (3D numpy array) with shape=[n_samples, n_pixels, n_pixels]? What you could do is to reshape it before you put it in, i.e., >>>>> >>>>> data_ary = your_ary.reshape(n_samples, -1).shape >>>>> >>>>> then, you need to add a line at the beginning your CNN class that does the reverse, i.e., data_ary.reshape(6, n_pixels, n_pixels).shape. Numpy?s reshape typically returns view objects, so that these additional steps shouldn?t be ?too? expensive. >>>>> >>>>> Best, >>>>> Sebastian >>>>> >>>>> >>>>> >>>>>> On Mar 16, 2017, at 12:00 AM, Carlton Banks wrote: >>>>>> >>>>>> Hi? >>>>>> >>>>>> I currently trying to optimize my CNN model using gridsearchCV, but seem to have some problems feading my input data.. >>>>>> >>>>>> My training data is stored as a list of Np.ndarrays of shape (6,3,3) and my output is stored as a list of np.array with one entry. >>>>>> >>>>>> Why am I having problems parsing my data to it? >>>>>> >>>>>> best regards >>>>>> Carl B. >>>>>> _______________________________________________ >>>>>> scikit-learn mailing list >>>>>> scikit-learn at python.org >>>>>> https://mail.python.org/mailman/listinfo/scikit-learn >>>>> >>>>> _______________________________________________ >>>>> scikit-learn mailing list >>>>> scikit-learn at python.org >>>>> https://mail.python.org/mailman/listinfo/scikit-learn >>>> >>>> _______________________________________________ >>>> scikit-learn mailing list >>>> scikit-learn at python.org >>>> https://mail.python.org/mailman/listinfo/scikit-learn >>>> >>>> _______________________________________________ >>>> scikit-learn mailing list >>>> scikit-learn at python.org >>>> https://mail.python.org/mailman/listinfo/scikit-learn >>> >>> _______________________________________________ >>> scikit-learn mailing list >>> scikit-learn at python.org >>> https://mail.python.org/mailman/listinfo/scikit-learn >> >> _______________________________________________ >> scikit-learn mailing list >> scikit-learn at python.org >> https://mail.python.org/mailman/listinfo/scikit-learn > > _______________________________________________ > scikit-learn mailing list > scikit-learn at python.org > https://mail.python.org/mailman/listinfo/scikit-learn From gael.varoquaux at normalesup.org Thu Mar 16 03:18:30 2017 From: gael.varoquaux at normalesup.org (Gael Varoquaux) Date: Thu, 16 Mar 2017 08:18:30 +0100 Subject: [scikit-learn] PyParis 2017 Message-ID: <20170316071830.GF2442192@phare.normalesup.org> The PyParis conference will be held in ... Paris! June 12-13: http://pyparis.org/cfp.html There will be a data track, that will be very scikit-learn related. Scikit-learn users or developers are most welcomed to come and talk about what they are doing. 
Cheers, Ga?l -- Gael Varoquaux Researcher, INRIA Parietal NeuroSpin/CEA Saclay , Bat 145, 91191 Gif-sur-Yvette France Phone: ++ 33-1-69-08-79-68 http://gael-varoquaux.info http://twitter.com/GaelVaroquaux From rth.yurchak at gmail.com Thu Mar 16 08:25:44 2017 From: rth.yurchak at gmail.com (Roman Yurchak) Date: Thu, 16 Mar 2017 13:25:44 +0100 Subject: [scikit-learn] best way to scale on the random forest for text w bag of words ... In-Reply-To: References: Message-ID: <095e98ac-beab-7aa0-e635-a6fc41c0323c@gmail.com> If you run out of memory at the prediction step, splitting the test dataset in batches, then concatenating the results should work fine. Why would it "skew" the results? 70GB RAM seems huge: for comparison here is some categorization benchmarks on a 700k text dataset, that use more in the order of 5-10 GB RAM, https://github.com/FreeDiscovery/FreeDiscovery/issues/58 though with fairly short documents, for other algorithms and with a smaller training set. You could also try reducing the size of your dictionary with hashing. If you really want to use random forest and have memory constraints, you might want to use n_jobs=1 to avoid memory copies, https://www.quora.com/Why-is-scikit-learns-random-forest-using-so-much-memory But as Joel was saying, random forest might not the best choice for huge sparse arrays; NaiveBayes, LogisticRegression or SVM could be better suited, or gradient boosting if you want to go that way... On 16/03/17 02:44, Joel Nothman wrote: > Trees are not a traditional choice for bag of words models, but you > should make sure you are at least using the parameters of the random > forest to limit the size (depth, branching) of the trees. > > On 16 March 2017 at 12:20, Sasha Kacanski > wrote: > > Hi, > As soon as number of trees and features goes higher, 70Gb of ram is > gone and i am getting out of memory errors. > file size is 700Mb. Dataframe quickly shrinks from 14 to 2 columns > but there is ton of text ... > with 10 estimators and 100 features per word I can't tackle ~900 k > of records ... > Training set, about 15% of data does perfectly fine but when test > come that is it. > > i can split stuff and multiprocess it but I believe that will simply > skew results... > > Any ideas? > > > -- > Aleksandar Kacanski > > _______________________________________________ > scikit-learn mailing list > scikit-learn at python.org > https://mail.python.org/mailman/listinfo/scikit-learn > > > > > > _______________________________________________ > scikit-learn mailing list > scikit-learn at python.org > https://mail.python.org/mailman/listinfo/scikit-learn > From skacanski at gmail.com Thu Mar 16 08:38:03 2017 From: skacanski at gmail.com (Sasha Kacanski) Date: Thu, 16 Mar 2017 08:38:03 -0400 Subject: [scikit-learn] best way to scale on the random forest for text w bag of words ... In-Reply-To: References: Message-ID: Thanks Joel, what would be your approach? Sasha Kacanski On Mar 15, 2017 9:46 PM, "Joel Nothman" wrote: > Trees are not a traditional choice for bag of words models, but you should > make sure you are at least using the parameters of the random forest to > limit the size (depth, branching) of the trees. > > On 16 March 2017 at 12:20, Sasha Kacanski wrote: > >> Hi, >> As soon as number of trees and features goes higher, 70Gb of ram is gone >> and i am getting out of memory errors. >> file size is 700Mb. Dataframe quickly shrinks from 14 to 2 columns but >> there is ton of text ... 
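A hedged sketch of the batching idea from the reply above; clf is assumed to be an already fitted classifier, and because prediction works row by row the concatenated output matches a single predict call, so nothing is skewed:

import numpy as np

def predict_in_batches(clf, X_test, batch_size=10000):
    # Predict on row slices of the test matrix (works for dense and CSR inputs)
    # and stitch the pieces back together.
    parts = []
    for start in range(0, X_test.shape[0], batch_size):
        parts.append(clf.predict(X_test[start:start + batch_size]))
    return np.concatenate(parts)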
>> with 10 estimators and 100 features per word I can't tackle ~900 k of >> records ... >> Training set, about 15% of data does perfectly fine but when test come >> that is it. >> >> i can split stuff and multiprocess it but I believe that will simply skew >> results... >> >> Any ideas? >> >> >> -- >> Aleksandar Kacanski >> >> _______________________________________________ >> scikit-learn mailing list >> scikit-learn at python.org >> https://mail.python.org/mailman/listinfo/scikit-learn >> >> > > _______________________________________________ > scikit-learn mailing list > scikit-learn at python.org > https://mail.python.org/mailman/listinfo/scikit-learn > > -------------- next part -------------- An HTML attachment was scrubbed... URL: From skacanski at gmail.com Thu Mar 16 10:23:36 2017 From: skacanski at gmail.com (Sasha Kacanski) Date: Thu, 16 Mar 2017 10:23:36 -0400 Subject: [scikit-learn] best way to scale on the random forest for text w bag of words ... In-Reply-To: <095e98ac-beab-7aa0-e635-a6fc41c0323c@gmail.com> References: <095e98ac-beab-7aa0-e635-a6fc41c0323c@gmail.com> Message-ID: Thank you very much... I will try alternatives Sasha Kacanski On Mar 16, 2017 8:28 AM, "Roman Yurchak" wrote: > If you run out of memory at the prediction step, splitting the test > dataset in batches, then concatenating the results should work fine. Why > would it "skew" the results? > > 70GB RAM seems huge: for comparison here is some categorization benchmarks > on a 700k text dataset, that use more in the order of 5-10 GB RAM, > https://github.com/FreeDiscovery/FreeDiscovery/issues/58 > though with fairly short documents, for other algorithms and with a > smaller training set. > > You could also try reducing the size of your dictionary with hashing. > If you really want to use random forest and have memory constraints, you > might want to use n_jobs=1 to avoid memory copies, > > https://www.quora.com/Why-is-scikit-learns-random-forest-usi > ng-so-much-memory > > But as Joel was saying, random forest might not the best choice for huge > sparse arrays; NaiveBayes, LogisticRegression or SVM could be better > suited, or gradient boosting if you want to go that way... > > > On 16/03/17 02:44, Joel Nothman wrote: > >> Trees are not a traditional choice for bag of words models, but you >> should make sure you are at least using the parameters of the random >> forest to limit the size (depth, branching) of the trees. >> >> On 16 March 2017 at 12:20, Sasha Kacanski > > wrote: >> >> Hi, >> As soon as number of trees and features goes higher, 70Gb of ram is >> gone and i am getting out of memory errors. >> file size is 700Mb. Dataframe quickly shrinks from 14 to 2 columns >> but there is ton of text ... >> with 10 estimators and 100 features per word I can't tackle ~900 k >> of records ... >> Training set, about 15% of data does perfectly fine but when test >> come that is it. >> >> i can split stuff and multiprocess it but I believe that will simply >> skew results... >> >> Any ideas? 
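For the hashing and linear-model suggestions above, a minimal sketch; the documents and the 2**18 feature count are placeholders:

from sklearn.feature_extraction.text import HashingVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

docs = ["first example document", "second example document", "more text", "and so on"]
labels = [0, 1, 0, 1]

# HashingVectorizer keeps the feature space at a fixed size without storing
# a vocabulary, and LogisticRegression handles the resulting sparse matrix well.
clf = make_pipeline(HashingVectorizer(n_features=2 ** 18),
                    LogisticRegression())
clf.fit(docs, labels)
print(clf.predict(["yet another document"]))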
>> >> >> -- >> Aleksandar Kacanski >> >> _______________________________________________ >> scikit-learn mailing list >> scikit-learn at python.org >> https://mail.python.org/mailman/listinfo/scikit-learn >> >> >> >> >> >> _______________________________________________ >> scikit-learn mailing list >> scikit-learn at python.org >> https://mail.python.org/mailman/listinfo/scikit-learn >> >> > _______________________________________________ > scikit-learn mailing list > scikit-learn at python.org > https://mail.python.org/mailman/listinfo/scikit-learn > -------------- next part -------------- An HTML attachment was scrubbed... URL: From noflaco at gmail.com Thu Mar 16 11:50:49 2017 From: noflaco at gmail.com (Carlton Banks) Date: Thu, 16 Mar 2017 16:50:49 +0100 Subject: [scikit-learn] Is something wrong with this gridsearchCV? Message-ID: I am currently using grid search to optimize my keras model? Something seemed a bit off during the training? https://www.dropbox.com/s/da0ztv2kqtkrfpu/Screenshot%20from%202017-03-16%2016%3A43%3A42.png?dl=0 For some reason is the training for each epoch not done for all datapoints?? What could be wrong? Here is the code: http://pastebin.com/raw/itJFm5a6 Anything that seems off? -------------- next part -------------- An HTML attachment was scrubbed... URL: From se.raschka at gmail.com Thu Mar 16 12:27:02 2017 From: se.raschka at gmail.com (Sebastian Raschka) Date: Thu, 16 Mar 2017 12:27:02 -0400 Subject: [scikit-learn] Is something wrong with this gridsearchCV? In-Reply-To: References: Message-ID: <3665E87A-9A82-4AA5-9759-9283804F52BE@gmail.com> I am not using Keras and don?t know how nicely it plays with sklearn objects these days, but you are not giving all the data to the grid search object, which is why your model doesn?t get to see the whole dataset during grid search; i.e., you have `np.asarray(input_train[:-(len(input_train)/1000)]` > On Mar 16, 2017, at 11:50 AM, Carlton Banks wrote: > > I am currently using grid search to optimize my keras model? > > Something seemed a bit off during the training? > > https://www.dropbox.com/s/da0ztv2kqtkrfpu/Screenshot%20from%202017-03-16%2016%3A43%3A42.png?dl=0 > > For some reason is the training for each epoch not done for all datapoints?? > > What could be wrong? > > Here is the code: > > http://pastebin.com/raw/itJFm5a6 > > Anything that seems off? > _______________________________________________ > scikit-learn mailing list > scikit-learn at python.org > https://mail.python.org/mailman/listinfo/scikit-learn From julio at esbet.es Thu Mar 16 12:30:48 2017 From: julio at esbet.es (Julio Antonio Soto de Vicente) Date: Thu, 16 Mar 2017 17:30:48 +0100 Subject: [scikit-learn] Is something wrong with this gridsearchCV? In-Reply-To: References: Message-ID: IMO this has nothing to do with GridSearchCV itself... It rather looks like different (verbose) keras models are being trained simultaneously, and therefore "collapsing" your stdout. I recommend setting Keras verbosity level to 3, in order to avoid printing the progress bars during GridSearchCV (which can be misleading). -- Julio > El 16 mar 2017, a las 16:50, Carlton Banks escribi?: > > I am currently using grid search to optimize my keras model? > > Something seemed a bit off during the training? > > https://www.dropbox.com/s/da0ztv2kqtkrfpu/Screenshot%20from%202017-03-16%2016%3A43%3A42.png?dl=0 > > For some reason is the training for each epoch not done for all datapoints?? > > What could be wrong? 
> > Here is the code: > > http://pastebin.com/raw/itJFm5a6 > > Anything that seems off? > _______________________________________________ > scikit-learn mailing list > scikit-learn at python.org > https://mail.python.org/mailman/listinfo/scikit-learn -------------- next part -------------- An HTML attachment was scrubbed... URL: From noflaco at gmail.com Thu Mar 16 12:31:28 2017 From: noflaco at gmail.com (Carlton Banks) Date: Thu, 16 Mar 2017 17:31:28 +0100 Subject: [scikit-learn] Is something wrong with this gridsearchCV? In-Reply-To: <3665E87A-9A82-4AA5-9759-9283804F52BE@gmail.com> References: <3665E87A-9A82-4AA5-9759-9283804F52BE@gmail.com> Message-ID: <97070B35-5127-4EBF-AF0C-67A8B3BFCFE6@gmail.com> My intention with this was to shrink my dataset, to make the grid search a bit faster, and easier to go through? I guess I?ve tackled the wrong way... > Den 16. mar. 2017 kl. 17.27 skrev Sebastian Raschka : > > I am not using Keras and don?t know how nicely it plays with sklearn objects these days, but you are not giving all the data to the grid search object, which is why your model doesn?t get to see the whole dataset during grid search; i.e., you have `np.asarray(input_train[:-(len(input_train)/1000)]` > >> On Mar 16, 2017, at 11:50 AM, Carlton Banks wrote: >> >> I am currently using grid search to optimize my keras model? >> >> Something seemed a bit off during the training? >> >> https://www.dropbox.com/s/da0ztv2kqtkrfpu/Screenshot%20from%202017-03-16%2016%3A43%3A42.png?dl=0 >> >> For some reason is the training for each epoch not done for all datapoints?? >> >> What could be wrong? >> >> Here is the code: >> >> http://pastebin.com/raw/itJFm5a6 >> >> Anything that seems off? >> _______________________________________________ >> scikit-learn mailing list >> scikit-learn at python.org >> https://mail.python.org/mailman/listinfo/scikit-learn > > _______________________________________________ > scikit-learn mailing list > scikit-learn at python.org > https://mail.python.org/mailman/listinfo/scikit-learn From noflaco at gmail.com Thu Mar 16 12:33:21 2017 From: noflaco at gmail.com (Carlton Banks) Date: Thu, 16 Mar 2017 17:33:21 +0100 Subject: [scikit-learn] Is something wrong with this gridsearchCV? In-Reply-To: References: Message-ID: <19F76B1C-3011-4FD0-A356-C31638AEB85E@gmail.com> I am running this on a super computer, so yes I am running a few training sessions. I guess i will look at the verbose, and the adjust the training data size. > Den 16. mar. 2017 kl. 17.30 skrev Julio Antonio Soto de Vicente : > > IMO this has nothing to do with GridSearchCV itself... > > It rather looks like different (verbose) keras models are being trained simultaneously, and therefore "collapsing" your stdout. > > I recommend setting Keras verbosity level to 3, in order to avoid printing the progress bars during GridSearchCV (which can be misleading). > > -- > Julio > > El 16 mar 2017, a las 16:50, Carlton Banks > escribi?: > >> I am currently using grid search to optimize my keras model? >> >> Something seemed a bit off during the training? >> >> https://www.dropbox.com/s/da0ztv2kqtkrfpu/Screenshot%20from%202017-03-16%2016%3A43%3A42.png?dl=0 >> >> For some reason is the training for each epoch not done for all datapoints?? >> >> What could be wrong? >> >> Here is the code: >> >> http://pastebin.com/raw/itJFm5a6 >> >> Anything that seems off? 
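If the goal is only to shrink the data so the search finishes faster, a random subsample is safer than slicing off the tail. A sketch with made-up sizes; note also that with the default 3-fold cross-validation each individual fit still only sees about two thirds of whatever is passed in, which is why the per-epoch sample counts look small:

import numpy as np
from sklearn.model_selection import train_test_split

input_train = np.random.rand(50000, 54)    # placeholders for the real arrays
output_train = np.random.rand(50000)

# Keep a random, representative 1000-row subsample for the grid search.
X_small, _, y_small, _ = train_test_split(
    input_train, output_train, train_size=1000, random_state=7)

# grid.fit(X_small, y_small) then runs the search on the subsample only.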
>> _______________________________________________ >> scikit-learn mailing list >> scikit-learn at python.org >> https://mail.python.org/mailman/listinfo/scikit-learn > _______________________________________________ > scikit-learn mailing list > scikit-learn at python.org > https://mail.python.org/mailman/listinfo/scikit-learn -------------- next part -------------- An HTML attachment was scrubbed... URL: From karandesai281196 at gmail.com Thu Mar 16 12:49:49 2017 From: karandesai281196 at gmail.com (Karan Desai) Date: Thu, 16 Mar 2017 16:49:49 +0000 Subject: [scikit-learn] [GSoC 2017] First Draft, request for suggestions - Improve Online Learning of Linear Models. In-Reply-To: References: Message-ID: > The problem with callbacks is that for callbacks on each iteration to be feasible, they need to be cython functions.> Otherwise they will be too slow. You could do python callbacks, but they could not be called at every iteration, and so > they wouldn't be suitable to implement something like adagrad or adam. We can implement some plug and play callbacks in Cython and pass a list of strings in constructor of a linear model, deciding which callbacks to execute. How does that sound Andreas ? I can really help with more thoughts about the idea. Regards,Karan. On Wed, Mar 15, 2017 8:18 PM, Andreas Mueller t3kcit at gmail.com wrote: On 03/15/2017 04:48 AM, Karan Desai wrote: > 4. About a tool to anneal learning rate: I suggest a new approach to > look at this - as a callback. I searched through the documentation and > I could not find this way of handling tidbits during training of > models. We should be able to provide a callback to the constructor of > a linear model which can do any dedicated job after every epoch, be it > learning rate annealing, saving model checkpoint, getting custom > verbose output, or as creative as uploading data to server for real > time plots on any website. There has been some effort on doing adagrad but it was ultimately discontinued, I think. There was quite a bit of complexity to handle. The problem with callbacks is that for callbacks on each iteration to be feasible, they need to be cython functions. Otherwise they will be too slow. You could do python callbacks, but they could not be called at every iteration, and so they wouldn't be suitable to implement something like adagrad or adam. Best, Andy _______________________________________________ scikit-learn mailing list scikit-learn at python.org https://mail.python.org/mailman/listinfo/scikit-learn -------------- next part -------------- An HTML attachment was scrubbed... URL: From noflaco at gmail.com Thu Mar 16 12:51:54 2017 From: noflaco at gmail.com (Carlton Banks) Date: Thu, 16 Mar 2017 17:51:54 +0100 Subject: [scikit-learn] Is something wrong with this gridsearchCV? In-Reply-To: <19F76B1C-3011-4FD0-A356-C31638AEB85E@gmail.com> References: <19F76B1C-3011-4FD0-A356-C31638AEB85E@gmail.com> Message-ID: Ohh.. actually the data size cannot be wrong.. input_train and output_train are both lists? which i then only take a part of ? and then make then to a np.array? So that should not be incorrect. > Den 16. mar. 2017 kl. 17.33 skrev Carlton Banks : > > I am running this on a super computer, so yes I am running a few training sessions. > I guess i will look at the verbose, and the adjust the training data size. > >> Den 16. mar. 2017 kl. 17.30 skrev Julio Antonio Soto de Vicente >: >> >> IMO this has nothing to do with GridSearchCV itself... 
>> >> It rather looks like different (verbose) keras models are being trained simultaneously, and therefore "collapsing" your stdout. >> >> I recommend setting Keras verbosity level to 3, in order to avoid printing the progress bars during GridSearchCV (which can be misleading). >> >> -- >> Julio >> >> El 16 mar 2017, a las 16:50, Carlton Banks > escribi?: >> >>> I am currently using grid search to optimize my keras model? >>> >>> Something seemed a bit off during the training? >>> >>> https://www.dropbox.com/s/da0ztv2kqtkrfpu/Screenshot%20from%202017-03-16%2016%3A43%3A42.png?dl=0 >>> >>> For some reason is the training for each epoch not done for all datapoints?? >>> >>> What could be wrong? >>> >>> Here is the code: >>> >>> http://pastebin.com/raw/itJFm5a6 >>> >>> Anything that seems off? >>> _______________________________________________ >>> scikit-learn mailing list >>> scikit-learn at python.org >>> https://mail.python.org/mailman/listinfo/scikit-learn >> _______________________________________________ >> scikit-learn mailing list >> scikit-learn at python.org >> https://mail.python.org/mailman/listinfo/scikit-learn > -------------- next part -------------- An HTML attachment was scrubbed... URL: From noflaco at gmail.com Thu Mar 16 12:59:25 2017 From: noflaco at gmail.com (Carlton Banks) Date: Thu, 16 Mar 2017 17:59:25 +0100 Subject: [scikit-learn] Is something wrong with this gridsearchCV? In-Reply-To: References: <19F76B1C-3011-4FD0-A356-C31638AEB85E@gmail.com> Message-ID: <478ED313-68AC-4036-B11E-9B7D06A714DC@gmail.com> I haven?t a verbosity level in the code?? but set it to 3 as suggested by Julio? It did not seem to work.. https://www.dropbox.com/s/nr5rattzts0wuvd/Screenshot%20from%202017-03-16%2017%3A56%3A26.png?dl=0 > Den 16. mar. 2017 kl. 17.51 skrev Carlton Banks : > > Ohh.. actually the data size cannot be wrong.. > input_train and output_train are both lists? which i then only take a part of ? and then make then to a np.array? > > So that should not be incorrect. > >> Den 16. mar. 2017 kl. 17.33 skrev Carlton Banks >: >> >> I am running this on a super computer, so yes I am running a few training sessions. >> I guess i will look at the verbose, and the adjust the training data size. >> >>> Den 16. mar. 2017 kl. 17.30 skrev Julio Antonio Soto de Vicente >: >>> >>> IMO this has nothing to do with GridSearchCV itself... >>> >>> It rather looks like different (verbose) keras models are being trained simultaneously, and therefore "collapsing" your stdout. >>> >>> I recommend setting Keras verbosity level to 3, in order to avoid printing the progress bars during GridSearchCV (which can be misleading). >>> >>> -- >>> Julio >>> >>> El 16 mar 2017, a las 16:50, Carlton Banks > escribi?: >>> >>>> I am currently using grid search to optimize my keras model? >>>> >>>> Something seemed a bit off during the training? >>>> >>>> https://www.dropbox.com/s/da0ztv2kqtkrfpu/Screenshot%20from%202017-03-16%2016%3A43%3A42.png?dl=0 >>>> >>>> For some reason is the training for each epoch not done for all datapoints?? >>>> >>>> What could be wrong? >>>> >>>> Here is the code: >>>> >>>> http://pastebin.com/raw/itJFm5a6 >>>> >>>> Anything that seems off? 
>>>> _______________________________________________ >>>> scikit-learn mailing list >>>> scikit-learn at python.org >>>> https://mail.python.org/mailman/listinfo/scikit-learn >>> _______________________________________________ >>> scikit-learn mailing list >>> scikit-learn at python.org >>> https://mail.python.org/mailman/listinfo/scikit-learn >> > -------------- next part -------------- An HTML attachment was scrubbed... URL: From karandesai281196 at gmail.com Thu Mar 16 13:04:47 2017 From: karandesai281196 at gmail.com (Karan Desai) Date: Thu, 16 Mar 2017 17:04:47 +0000 Subject: [scikit-learn] [GSoC 2017] First Draft, request for suggestions - Improve Online Learning of Linear Models. In-Reply-To: References: Message-ID: <056fcfb0-afec-5aaa-ff43-7c26b3e6ef1f@mixmax.com> Oh, and I forgot to mention. Some of the easily doable callbacks include: 1. Verbose Logs (maybe progress bars ? Saw an issue earlier) 2. Model Checkpoints 3. Early Stopping 4. Learning Rate annealing As a second alternative, we can boil everything down and simply define learning rate strategies, like linear, polynomial or exponential decreasing after fixed amount of epochs.For a very naive alternative, we can even end up allowing the user to provide a list having length equal to max_iter?if it is specified. But this doesn't sound too appetizing to me. Karan. -------------- next part -------------- An HTML attachment was scrubbed... URL: From julio at esbet.es Thu Mar 16 13:05:42 2017 From: julio at esbet.es (Julio Antonio Soto de Vicente) Date: Thu, 16 Mar 2017 18:05:42 +0100 Subject: [scikit-learn] Is something wrong with this gridsearchCV? In-Reply-To: <478ED313-68AC-4036-B11E-9B7D06A714DC@gmail.com> References: <19F76B1C-3011-4FD0-A356-C31638AEB85E@gmail.com> <478ED313-68AC-4036-B11E-9B7D06A714DC@gmail.com> Message-ID: Your code is perfectly fine. You are training 10 networks in parallel (since you have n_jobs=10), so each network started training in its own, and outputing its progress independently. Given enough amount of time, you will see that all 10 networks will eventually get to epoch number 2, and 10 messages of epoch #2 will be printed out. -- Julio > El 16 mar 2017, a las 17:59, Carlton Banks escribi?: > > I haven?t a verbosity level in the code?? but set it to 3 as suggested by Julio? It did not seem to work.. > > https://www.dropbox.com/s/nr5rattzts0wuvd/Screenshot%20from%202017-03-16%2017%3A56%3A26.png?dl=0 > >> Den 16. mar. 2017 kl. 17.51 skrev Carlton Banks : >> >> Ohh.. actually the data size cannot be wrong.. >> input_train and output_train are both lists? which i then only take a part of ? and then make then to a np.array? >> >> So that should not be incorrect. >> >>> Den 16. mar. 2017 kl. 17.33 skrev Carlton Banks : >>> >>> I am running this on a super computer, so yes I am running a few training sessions. >>> I guess i will look at the verbose, and the adjust the training data size. >>> >>>> Den 16. mar. 2017 kl. 17.30 skrev Julio Antonio Soto de Vicente : >>>> >>>> IMO this has nothing to do with GridSearchCV itself... >>>> >>>> It rather looks like different (verbose) keras models are being trained simultaneously, and therefore "collapsing" your stdout. >>>> >>>> I recommend setting Keras verbosity level to 3, in order to avoid printing the progress bars during GridSearchCV (which can be misleading). >>>> >>>> -- >>>> Julio >>>> >>>>> El 16 mar 2017, a las 16:50, Carlton Banks escribi?: >>>>> >>>>> I am currently using grid search to optimize my keras model? 
>>>>> >>>>> Something seemed a bit off during the training? >>>>> >>>>> https://www.dropbox.com/s/da0ztv2kqtkrfpu/Screenshot%20from%202017-03-16%2016%3A43%3A42.png?dl=0 >>>>> >>>>> For some reason is the training for each epoch not done for all datapoints?? >>>>> >>>>> What could be wrong? >>>>> >>>>> Here is the code: >>>>> >>>>> http://pastebin.com/raw/itJFm5a6 >>>>> >>>>> Anything that seems off? >>>>> _______________________________________________ >>>>> scikit-learn mailing list >>>>> scikit-learn at python.org >>>>> https://mail.python.org/mailman/listinfo/scikit-learn >>>> _______________________________________________ >>>> scikit-learn mailing list >>>> scikit-learn at python.org >>>> https://mail.python.org/mailman/listinfo/scikit-learn >>> >> > > _______________________________________________ > scikit-learn mailing list > scikit-learn at python.org > https://mail.python.org/mailman/listinfo/scikit-learn -------------- next part -------------- An HTML attachment was scrubbed... URL: From noflaco at gmail.com Thu Mar 16 13:08:29 2017 From: noflaco at gmail.com (Carlton Banks) Date: Thu, 16 Mar 2017 18:08:29 +0100 Subject: [scikit-learn] Is something wrong with this gridsearchCV? In-Reply-To: References: <19F76B1C-3011-4FD0-A356-C31638AEB85E@gmail.com> <478ED313-68AC-4036-B11E-9B7D06A714DC@gmail.com> Message-ID: <9612A7D3-8F63-48AB-8B9E-CBC070330A4A@gmail.com> ahh.. makes sense.. but would have hoped i could parelize it as i have so many cores to run on.. > Den 16. mar. 2017 kl. 18.05 skrev Julio Antonio Soto de Vicente : > > Your code is perfectly fine. > > You are training 10 networks in parallel (since you have n_jobs=10), so each network started training in its own, and outputing its progress independently. > > Given enough amount of time, you will see that all 10 networks will eventually get to epoch number 2, and 10 messages of epoch #2 will be printed out. > > -- > Julio > > El 16 mar 2017, a las 17:59, Carlton Banks > escribi?: > >> I haven?t a verbosity level in the code?? but set it to 3 as suggested by Julio? It did not seem to work.. >> >> https://www.dropbox.com/s/nr5rattzts0wuvd/Screenshot%20from%202017-03-16%2017%3A56%3A26.png?dl=0 >> >>> Den 16. mar. 2017 kl. 17.51 skrev Carlton Banks >: >>> >>> Ohh.. actually the data size cannot be wrong.. >>> input_train and output_train are both lists? which i then only take a part of ? and then make then to a np.array? >>> >>> So that should not be incorrect. >>> >>>> Den 16. mar. 2017 kl. 17.33 skrev Carlton Banks >: >>>> >>>> I am running this on a super computer, so yes I am running a few training sessions. >>>> I guess i will look at the verbose, and the adjust the training data size. >>>> >>>>> Den 16. mar. 2017 kl. 17.30 skrev Julio Antonio Soto de Vicente >: >>>>> >>>>> IMO this has nothing to do with GridSearchCV itself... >>>>> >>>>> It rather looks like different (verbose) keras models are being trained simultaneously, and therefore "collapsing" your stdout. >>>>> >>>>> I recommend setting Keras verbosity level to 3, in order to avoid printing the progress bars during GridSearchCV (which can be misleading). >>>>> >>>>> -- >>>>> Julio >>>>> >>>>> El 16 mar 2017, a las 16:50, Carlton Banks > escribi?: >>>>> >>>>>> I am currently using grid search to optimize my keras model? >>>>>> >>>>>> Something seemed a bit off during the training? 
>>>>>> >>>>>> https://www.dropbox.com/s/da0ztv2kqtkrfpu/Screenshot%20from%202017-03-16%2016%3A43%3A42.png?dl=0 >>>>>> >>>>>> For some reason is the training for each epoch not done for all datapoints?? >>>>>> >>>>>> What could be wrong? >>>>>> >>>>>> Here is the code: >>>>>> >>>>>> http://pastebin.com/raw/itJFm5a6 >>>>>> >>>>>> Anything that seems off? >>>>>> _______________________________________________ >>>>>> scikit-learn mailing list >>>>>> scikit-learn at python.org >>>>>> https://mail.python.org/mailman/listinfo/scikit-learn >>>>> _______________________________________________ >>>>> scikit-learn mailing list >>>>> scikit-learn at python.org >>>>> https://mail.python.org/mailman/listinfo/scikit-learn >>>> >>> >> >> _______________________________________________ >> scikit-learn mailing list >> scikit-learn at python.org >> https://mail.python.org/mailman/listinfo/scikit-learn > _______________________________________________ > scikit-learn mailing list > scikit-learn at python.org > https://mail.python.org/mailman/listinfo/scikit-learn -------------- next part -------------- An HTML attachment was scrubbed... URL: From julio at esbet.es Thu Mar 16 14:09:51 2017 From: julio at esbet.es (Julio Antonio Soto de Vicente) Date: Thu, 16 Mar 2017 19:09:51 +0100 Subject: [scikit-learn] Is something wrong with this gridsearchCV? In-Reply-To: <9612A7D3-8F63-48AB-8B9E-CBC070330A4A@gmail.com> References: <19F76B1C-3011-4FD0-A356-C31638AEB85E@gmail.com> <478ED313-68AC-4036-B11E-9B7D06A714DC@gmail.com> <9612A7D3-8F63-48AB-8B9E-CBC070330A4A@gmail.com> Message-ID: You totally can. In fact, there's a tradeoff between how many cpu cores each network uses and the number of parallel networks you can train. By default, any Keras network will use "as many cores it makes sense for the depth of your network and the amount of data you have" (both over Theano and Tensorflow). Given you are using an convnet, chances are that each network will decide to use, probably, 5 or more. Unless, the n_jobs in your GridSearchCV is "quite high" (high depends on the number of cpu cores on your machine). If your machine has 10 cores and n_jobs=10, each Keras network will use 1 core. If n_jobs=2, each network will use 5 cores, and so on. -- Julio > El 16 mar 2017, a las 18:08, Carlton Banks escribi?: > > ahh.. makes sense.. but would have hoped i could parelize it as i have so many cores to run on.. >> Den 16. mar. 2017 kl. 18.05 skrev Julio Antonio Soto de Vicente : >> >> Your code is perfectly fine. >> >> You are training 10 networks in parallel (since you have n_jobs=10), so each network started training in its own, and outputing its progress independently. >> >> Given enough amount of time, you will see that all 10 networks will eventually get to epoch number 2, and 10 messages of epoch #2 will be printed out. >> >> -- >> Julio >> >>> El 16 mar 2017, a las 17:59, Carlton Banks escribi?: >>> >>> I haven?t a verbosity level in the code?? but set it to 3 as suggested by Julio? It did not seem to work.. >>> >>> https://www.dropbox.com/s/nr5rattzts0wuvd/Screenshot%20from%202017-03-16%2017%3A56%3A26.png?dl=0 >>> >>>> Den 16. mar. 2017 kl. 17.51 skrev Carlton Banks : >>>> >>>> Ohh.. actually the data size cannot be wrong.. >>>> input_train and output_train are both lists? which i then only take a part of ? and then make then to a np.array? >>>> >>>> So that should not be incorrect. >>>> >>>>> Den 16. mar. 2017 kl. 
17.33 skrev Carlton Banks : >>>>> >>>>> I am running this on a super computer, so yes I am running a few training sessions. >>>>> I guess i will look at the verbose, and the adjust the training data size. >>>>> >>>>>> Den 16. mar. 2017 kl. 17.30 skrev Julio Antonio Soto de Vicente : >>>>>> >>>>>> IMO this has nothing to do with GridSearchCV itself... >>>>>> >>>>>> It rather looks like different (verbose) keras models are being trained simultaneously, and therefore "collapsing" your stdout. >>>>>> >>>>>> I recommend setting Keras verbosity level to 3, in order to avoid printing the progress bars during GridSearchCV (which can be misleading). >>>>>> >>>>>> -- >>>>>> Julio >>>>>> >>>>>>> El 16 mar 2017, a las 16:50, Carlton Banks escribi?: >>>>>>> >>>>>>> I am currently using grid search to optimize my keras model? >>>>>>> >>>>>>> Something seemed a bit off during the training? >>>>>>> >>>>>>> https://www.dropbox.com/s/da0ztv2kqtkrfpu/Screenshot%20from%202017-03-16%2016%3A43%3A42.png?dl=0 >>>>>>> >>>>>>> For some reason is the training for each epoch not done for all datapoints?? >>>>>>> >>>>>>> What could be wrong? >>>>>>> >>>>>>> Here is the code: >>>>>>> >>>>>>> http://pastebin.com/raw/itJFm5a6 >>>>>>> >>>>>>> Anything that seems off? >>>>>>> _______________________________________________ >>>>>>> scikit-learn mailing list >>>>>>> scikit-learn at python.org >>>>>>> https://mail.python.org/mailman/listinfo/scikit-learn >>>>>> _______________________________________________ >>>>>> scikit-learn mailing list >>>>>> scikit-learn at python.org >>>>>> https://mail.python.org/mailman/listinfo/scikit-learn >>>>> >>>> >>> >>> _______________________________________________ >>> scikit-learn mailing list >>> scikit-learn at python.org >>> https://mail.python.org/mailman/listinfo/scikit-learn >> _______________________________________________ >> scikit-learn mailing list >> scikit-learn at python.org >> https://mail.python.org/mailman/listinfo/scikit-learn > > _______________________________________________ > scikit-learn mailing list > scikit-learn at python.org > https://mail.python.org/mailman/listinfo/scikit-learn -------------- next part -------------- An HTML attachment was scrubbed... URL: From noflaco at gmail.com Fri Mar 17 17:46:56 2017 From: noflaco at gmail.com (Carlton Banks) Date: Fri, 17 Mar 2017 22:46:56 +0100 Subject: [scikit-learn] Intermediate results using gridsearchCV? Message-ID: <127AEC21-D123-4FD5-876E-AB74D10C66FA@gmail.com> Is it possible to receive intermediate the intermediate result of a gridsearchcv? instead getting the final result? From noflaco at gmail.com Fri Mar 17 23:36:02 2017 From: noflaco at gmail.com (Carlton Banks) Date: Sat, 18 Mar 2017 04:36:02 +0100 Subject: [scikit-learn] (no subject) Message-ID: <12CE352D-D9FC-4144-BD1D-D19AC182BA74@gmail.com> I am currently struggling with getting good results with my CNN in which i decided to optimize parameter using grid search. I am currently trying to use scikit-learn GridSearchCV. 
# Imports implied by the snippet (Keras 1.x-era API; ReduceLROnPlateau,
# EarlyStopping and CSVLogger from keras.callbacks are only needed for the
# commented-out fit further down):
import numpy as np
from keras.models import Sequential
from keras.layers import ZeroPadding2D, Convolution2D, MaxPooling2D, Flatten, Dense
from keras.wrappers.scikit_learn import KerasRegressor
from sklearn.model_selection import GridSearchCV


def create_model(init_mode='uniform', activation_mode='linear',
                 optimizer_mode='adam', activation_mode_conv='linear'):
    # Convolutional front end on (6, 3, 3) inputs, padded so the 3x3
    # convolutions have room to slide.
    model = Sequential()
    model.add(ZeroPadding2D((6, 4), input_shape=(6, 3, 3)))
    model.add(Convolution2D(32, 3, 3, activation=activation_mode_conv))
    print model.output_shape
    model.add(Convolution2D(32, 3, 3, activation=activation_mode_conv))
    print model.output_shape
    model.add(MaxPooling2D(pool_size=(2, 2), strides=(2, 1)))
    print model.output_shape
    model.add(Convolution2D(64, 3, 3, activation=activation_mode_conv))
    print model.output_shape
    model.add(Convolution2D(64, 3, 3, activation=activation_mode_conv))
    print model.output_shape
    model.add(MaxPooling2D(pool_size=(2, 2), strides=(2, 1)))
    model.add(Flatten())
    print model.output_shape
    # Dense head. Keras infers the input size of every layer after the first
    # from the previous layer, so the explicit input_dim values below are
    # effectively ignored.
    model.add(Dense(output_dim=32, input_dim=64, init=init_mode, activation=activation_mode))
    model.add(Dense(output_dim=13, input_dim=50, init=init_mode, activation=activation_mode))
    model.add(Dense(output_dim=1, input_dim=13, init=init_mode, activation=activation_mode))
    model.add(Dense(output_dim=1, init=init_mode, activation=activation_mode))
    # print model.summary()
    model.compile(loss='mean_squared_error', optimizer=optimizer_mode)
    return model

# reduce_lr = ReduceLROnPlateau(monitor='val_loss', factor=0.01, patience=3, verbose=1,
#                               mode='auto', epsilon=0.1, cooldown=0, min_lr=0.000000000000000001)
# stop = EarlyStopping(monitor='val_loss', min_delta=0, patience=5, verbose=1, mode='auto')
# log = csv_logger = CSVLogger('training_' + str(i) + '.csv')
# print "Model Train"
# hist_current = model.fit(np.array(data_train_input),
#                          np.array(data_train_output),
#                          shuffle=False,
#                          validation_data=(np.array(data_test_input), np.array(data_test_output)),
#                          validation_split=0.1,
#                          nb_epoch=150000,
#                          verbose=1,
#                          callbacks=[reduce_lr, log, stop])
# print()
# print model.summary()
# print "Model stored"
# model.save(spectogram_path + "Model" + str(feature) + ".h5")
# model.save_weights(spectogram_path + "Model" + str(feature) + "_weights.h5")
# del model

## Make it work for other feature ranges
## Add the CNN part and test it
## Try with gabor kernels as suggested by the other paper..
input_train, input_test, output_train, output_test = model(
    0,
    train_input_data_interweawed_normalized[:-(len(train_input_data_interweawed_normalized) - 1000)],
    output_data_train[:-(len(output_data_train) - 1000)],
    test_input_data_interweawed_normalized[:-(len(test_input_data_interweawed_normalized) - 1000)],
    output_data_test[:-(len(output_data_test) - 1000)])

# Free the intermediate datasets that are no longer needed.
del test_input_data
del test_name
del test_input_data_normalized
del test_name_normalized
del test_input_data_interweawed
del test_name_interweawed
del test_input_data_interweawed_normalized
del test_name_interweawed_normalized
del train_input_data
del train_name
del train_input_data_normalized
del train_name_normalized
del train_input_data_interweawed
del train_name_interweawed
del train_input_data_interweawed_normalized
del train_name_interweawed_normalized

seed = 7
np.random.seed(seed)

print "Regressor"
model = KerasRegressor(build_fn=create_model, verbose=10)

init_mode_list = ['uniform', 'lecun_uniform', 'normal', 'zero', 'glorot_normal', 'glorot_uniform', 'he_normal', 'he_uniform']
activation_mode_list = ['softmax', 'softplus', 'softsign', 'relu', 'tanh', 'sigmoid', 'hard_sigmoid', 'linear']
activation_mode_list_conv = ['softplus', 'softsign', 'relu', 'tanh', 'sigmoid', 'hard_sigmoid', 'linear']
optimizer_mode_list = ['SGD', 'RMSprop', 'Adagrad', 'Adadelta', 'Adam', 'Adamax', 'Nadam']
batch_size_list = [10, 20, 40, 60, 80, 100]
epochs = [10, 50, 100]

# Note: this grid has 8 * 8 * 7 * 7 * 6 * 3 = 56448 parameter combinations,
# each refit on every cross-validation fold.
param_grid = dict(init_mode=init_mode_list,
                  batch_size=batch_size_list,
                  nb_epoch=epochs,
                  activation_mode=activation_mode_list,
                  optimizer_mode=optimizer_mode_list,
                  activation_mode_conv=activation_mode_list_conv)
grid = GridSearchCV(estimator=model, param_grid=param_grid, n_jobs=1)

print "Grid fit"
grid_result = grid.fit(np.asarray(input_train), np.array(output_train))

# summarize results
print("Best: %f using %s" % (grid_result.best_score_, grid_result.best_params_))
means = grid_result.cv_results_['mean_test_score']
stds = grid_result.cv_results_['std_test_score']
params = grid_result.cv_results_['params']
for mean, stdev, param in zip(means, stds, params):
    print("%f (%f) with: %r" % (mean, stdev, param))

This runs, but the problem is that it only provides a result at the end. I ran the code once, but then it crashed with this error message:

cannot allocate memory for thread-local data: ABORT

I am not sure what could cause this problem?

From b113053 at iiit-bh.ac.in  Sat Mar 18 00:50:10 2017
From: b113053 at iiit-bh.ac.in (Afzal Ansari)
Date: Sat, 18 Mar 2017 10:20:10 +0530
Subject: [scikit-learn] Regarding Adaboost classifier
Message-ID: 

Hello Developers!
I am currently working on feature extraction method which is based on Haar features for image classification. I am unable to find pure implementation of adaboost classifier algorithm on the internet even on scikit learn web. I need to train the classifier using adaboost classifier to obtain Haar features from image dataset.
Please help me regarding this code. Reply soon.

Thanks in advance.
-------------- next part --------------
An HTML attachment was scrubbed...
URL: From b113053 at iiit-bh.ac.in Sun Mar 19 01:21:27 2017 From: b113053 at iiit-bh.ac.in (Afzal Ansari) Date: Sun, 19 Mar 2017 10:51:27 +0530 Subject: [scikit-learn] Regarding Adaboost classifier In-Reply-To: <20170318173343.5615696.68043.144630@gmail.com> References: <20170318173343.5615696.68043.144630@gmail.com> Message-ID: Hello Sir, I want to classify images containing negative and positive samples using Adaboost classifier. So how can I do that classification? Please help me regarding this. Thanks. On Sat, Mar 18, 2017 at 11:03 PM, Francois Dion wrote: > You need to provide more details on exactly what you need. I'll take a > stab at it: > > Are you trying to replicate OpenCV cascade training? > If so, what they call DAB is Scikit learn adaboostclassifier ( > http://scikit-learn.org/stable/modules/generated/sklearn.ensemble. > AdaBoostClassifier.html)? with algorithm=SAMME. > RAB is SAMME.R. > > > ?Francois > > > Sent from my BlackBerry 10 Darkphone > *From: *Afzal Ansari > *Sent: *Saturday, March 18, 2017 00:51 > *To: *scikit-learn at python.org > *Reply To: *Scikit-learn user and developer mailing list > *Subject: *[scikit-learn] Regarding Adaboost classifier > > Hello Developers! > I am currently working on feature extraction method which is based on > Haar features for image classification. I am unable to find pure > implementation of adaboost classifier algorithm on the internet even on > scikit learn web. I need to train the classifier using adaboost classifier > to obtain Haar features from image dataset. > Please help me regarding this code. Reply soon. > > Thanks in advance. > > > _______________________________________________ > scikit-learn mailing list > scikit-learn at python.org > https://mail.python.org/mailman/listinfo/scikit-learn > > -------------- next part -------------- An HTML attachment was scrubbed... URL: From jmschreiber91 at gmail.com Sun Mar 19 02:19:26 2017 From: jmschreiber91 at gmail.com (Jacob Schreiber) Date: Sat, 18 Mar 2017 23:19:26 -0700 Subject: [scikit-learn] Regarding Adaboost classifier In-Reply-To: References: <20170318173343.5615696.68043.144630@gmail.com> Message-ID: You really need to provide more details with what exactly you're stuck with. If you've extracted useful features from some image into a matrix X with binary labels y you can just do `clf.fit(X, y)` to train the classifier. On Sat, Mar 18, 2017 at 10:21 PM, Afzal Ansari wrote: > Hello Sir, > I want to classify images containing negative and positive samples using > Adaboost classifier. So how can I do that classification? Please help me > regarding this. > > Thanks. > > On Sat, Mar 18, 2017 at 11:03 PM, Francois Dion > wrote: > >> You need to provide more details on exactly what you need. I'll take a >> stab at it: >> >> Are you trying to replicate OpenCV cascade training? >> If so, what they call DAB is Scikit learn adaboostclassifier ( >> http://scikit-learn.org/stable/modules/generated/sklearn. >> ensemble.AdaBoostClassifier.html)? with algorithm=SAMME. >> RAB is SAMME.R. >> >> >> ?Francois >> >> >> Sent from my BlackBerry 10 Darkphone >> *From: *Afzal Ansari >> *Sent: *Saturday, March 18, 2017 00:51 >> *To: *scikit-learn at python.org >> *Reply To: *Scikit-learn user and developer mailing list >> *Subject: *[scikit-learn] Regarding Adaboost classifier >> >> Hello Developers! >> I am currently working on feature extraction method which is based on >> Haar features for image classification. 
I am unable to find pure >> implementation of adaboost classifier algorithm on the internet even on >> scikit learn web. I need to train the classifier using adaboost classifier >> to obtain Haar features from image dataset. >> Please help me regarding this code. Reply soon. >> >> Thanks in advance. >> >> >> _______________________________________________ >> scikit-learn mailing list >> scikit-learn at python.org >> https://mail.python.org/mailman/listinfo/scikit-learn >> >> > > _______________________________________________ > scikit-learn mailing list > scikit-learn at python.org > https://mail.python.org/mailman/listinfo/scikit-learn > > -------------- next part -------------- An HTML attachment was scrubbed... URL: From b113053 at iiit-bh.ac.in Sun Mar 19 02:57:27 2017 From: b113053 at iiit-bh.ac.in (Afzal Ansari) Date: Sun, 19 Mar 2017 12:27:27 +0530 Subject: [scikit-learn] Regarding Adaboost classifier In-Reply-To: References: <20170318173343.5615696.68043.144630@gmail.com> Message-ID: Thank you for your response. First I want to extract useful features from images so as to get n_features. So can you suggest any method to extract features from image(24*24) dataset? Then I can possibly train the classifier. Thanks. On Sun, Mar 19, 2017 at 11:49 AM, Jacob Schreiber wrote: > You really need to provide more details with what exactly you're stuck > with. If you've extracted useful features from some image into a matrix X > with binary labels y you can just do `clf.fit(X, y)` to train the > classifier. > > On Sat, Mar 18, 2017 at 10:21 PM, Afzal Ansari > wrote: > >> Hello Sir, >> I want to classify images containing negative and positive samples using >> Adaboost classifier. So how can I do that classification? Please help me >> regarding this. >> >> Thanks. >> >> On Sat, Mar 18, 2017 at 11:03 PM, Francois Dion >> wrote: >> >>> You need to provide more details on exactly what you need. I'll take a >>> stab at it: >>> >>> Are you trying to replicate OpenCV cascade training? >>> If so, what they call DAB is Scikit learn adaboostclassifier ( >>> http://scikit-learn.org/stable/modules/generated/sklearn.en >>> semble.AdaBoostClassifier.html)? with algorithm=SAMME. >>> RAB is SAMME.R. >>> >>> >>> ?Francois >>> >>> >>> Sent from my BlackBerry 10 Darkphone >>> *From: *Afzal Ansari >>> *Sent: *Saturday, March 18, 2017 00:51 >>> *To: *scikit-learn at python.org >>> *Reply To: *Scikit-learn user and developer mailing list >>> *Subject: *[scikit-learn] Regarding Adaboost classifier >>> >>> Hello Developers! >>> I am currently working on feature extraction method which is based on >>> Haar features for image classification. I am unable to find pure >>> implementation of adaboost classifier algorithm on the internet even on >>> scikit learn web. I need to train the classifier using adaboost classifier >>> to obtain Haar features from image dataset. >>> Please help me regarding this code. Reply soon. >>> >>> Thanks in advance. 
>>> >>> >>> _______________________________________________ >>> scikit-learn mailing list >>> scikit-learn at python.org >>> https://mail.python.org/mailman/listinfo/scikit-learn >>> >>> >> >> _______________________________________________ >> scikit-learn mailing list >> scikit-learn at python.org >> https://mail.python.org/mailman/listinfo/scikit-learn >> >> > > _______________________________________________ > scikit-learn mailing list > scikit-learn at python.org > https://mail.python.org/mailman/listinfo/scikit-learn > > -------------- next part -------------- An HTML attachment was scrubbed... URL: From g.lemaitre58 at gmail.com Sun Mar 19 06:16:43 2017 From: g.lemaitre58 at gmail.com (=?UTF-8?Q?Guillaume_Lema=C3=AEtre?=) Date: Sun, 19 Mar 2017 11:16:43 +0100 Subject: [scikit-learn] Regarding Adaboost classifier In-Reply-To: References: <20170318173343.5615696.68043.144630@gmail.com> Message-ID: I want just to recap a few things: > I need to train the classifier using adaboost classifier to obtain Haar features from image dataset > So can you suggest any method to extract features from image(24*24) datase You just mentioned what was your requirement regarding the feature to extract -> Haar features. My feeling is that you want to reimplement the paper of Viola and Jones for face detection. So you could check with the folks of scikit-image if they have something related -> https://github.com/scikit-image/scikit-image/pull/1444 You could also check opencv which offer functions, classe, and helper -> http://docs.opencv.org/trunk/d7/d8b/tutorial_py_face_detection.html / http://docs.opencv.org/2.4/modules/objdetect/doc/cascade_classification.html At the end, sklearn can help you with the AdaBoostClassifier, ranking of the features, and the evaluation of the pipeline. On 19 March 2017 at 07:57, Afzal Ansari wrote: > Thank you for your response. First I want to extract useful features from > images so as to get n_features. So can you suggest any method to extract > features from image(24*24) dataset? Then I can possibly train the > classifier. > > Thanks. > > On Sun, Mar 19, 2017 at 11:49 AM, Jacob Schreiber > wrote: > >> You really need to provide more details with what exactly you're stuck >> with. If you've extracted useful features from some image into a matrix X >> with binary labels y you can just do `clf.fit(X, y)` to train the >> classifier. >> >> On Sat, Mar 18, 2017 at 10:21 PM, Afzal Ansari >> wrote: >> >>> Hello Sir, >>> I want to classify images containing negative and positive samples >>> using Adaboost classifier. So how can I do that classification? Please help >>> me regarding this. >>> >>> Thanks. >>> >>> On Sat, Mar 18, 2017 at 11:03 PM, Francois Dion >> > wrote: >>> >>>> You need to provide more details on exactly what you need. I'll take a >>>> stab at it: >>>> >>>> Are you trying to replicate OpenCV cascade training? >>>> If so, what they call DAB is Scikit learn adaboostclassifier ( >>>> http://scikit-learn.org/stable/modules/generated/sklearn.en >>>> semble.AdaBoostClassifier.html)? with algorithm=SAMME. >>>> RAB is SAMME.R. >>>> >>>> >>>> ?Francois >>>> >>>> >>>> Sent from my BlackBerry 10 Darkphone >>>> *From: *Afzal Ansari >>>> *Sent: *Saturday, March 18, 2017 00:51 >>>> *To: *scikit-learn at python.org >>>> *Reply To: *Scikit-learn user and developer mailing list >>>> *Subject: *[scikit-learn] Regarding Adaboost classifier >>>> >>>> Hello Developers! 
>>>> I am currently working on feature extraction method which is based on >>>> Haar features for image classification. I am unable to find pure >>>> implementation of adaboost classifier algorithm on the internet even on >>>> scikit learn web. I need to train the classifier using adaboost classifier >>>> to obtain Haar features from image dataset. >>>> Please help me regarding this code. Reply soon. >>>> >>>> Thanks in advance. >>>> >>>> >>>> _______________________________________________ >>>> scikit-learn mailing list >>>> scikit-learn at python.org >>>> https://mail.python.org/mailman/listinfo/scikit-learn >>>> >>>> >>> >>> _______________________________________________ >>> scikit-learn mailing list >>> scikit-learn at python.org >>> https://mail.python.org/mailman/listinfo/scikit-learn >>> >>> >> >> _______________________________________________ >> scikit-learn mailing list >> scikit-learn at python.org >> https://mail.python.org/mailman/listinfo/scikit-learn >> >> > > _______________________________________________ > scikit-learn mailing list > scikit-learn at python.org > https://mail.python.org/mailman/listinfo/scikit-learn > > -- Guillaume Lemaitre INRIA Saclay - Parietal team Center for Data Science Paris-Saclay https://glemaitre.github.io/ -------------- next part -------------- An HTML attachment was scrubbed... URL: From joel.nothman at gmail.com Sun Mar 19 06:46:20 2017 From: joel.nothman at gmail.com (Joel Nothman) Date: Sun, 19 Mar 2017 21:46:20 +1100 Subject: [scikit-learn] Intermediate results using gridsearchCV? In-Reply-To: <127AEC21-D123-4FD5-876E-AB74D10C66FA@gmail.com> References: <127AEC21-D123-4FD5-876E-AB74D10C66FA@gmail.com> Message-ID: Not sure what you mean. Have you used cv_results_ On 18 March 2017 at 08:46, Carlton Banks wrote: > Is it possible to receive intermediate the intermediate result of a > gridsearchcv? > > instead getting the final result? > > > > _______________________________________________ > scikit-learn mailing list > scikit-learn at python.org > https://mail.python.org/mailman/listinfo/scikit-learn > -------------- next part -------------- An HTML attachment was scrubbed... URL: From vaggi.federico at gmail.com Sun Mar 19 06:49:30 2017 From: vaggi.federico at gmail.com (federico vaggi) Date: Sun, 19 Mar 2017 10:49:30 +0000 Subject: [scikit-learn] Intermediate results using gridsearchCV? In-Reply-To: References: <127AEC21-D123-4FD5-876E-AB74D10C66FA@gmail.com> Message-ID: I imagine he is suggesting to have an iterator that yields results while it's running, instead of only getting the result at the end of the run. On Sun, 19 Mar 2017 at 11:46 Joel Nothman wrote: > Not sure what you mean. Have you used cv_results_ > > On 18 March 2017 at 08:46, Carlton Banks wrote: > > Is it possible to receive intermediate the intermediate result of a > gridsearchcv? > > instead getting the final result? > > > > _______________________________________________ > scikit-learn mailing list > scikit-learn at python.org > https://mail.python.org/mailman/listinfo/scikit-learn > > > _______________________________________________ > scikit-learn mailing list > scikit-learn at python.org > https://mail.python.org/mailman/listinfo/scikit-learn > -------------- next part -------------- An HTML attachment was scrubbed... URL: From joel.nothman at gmail.com Sun Mar 19 07:14:45 2017 From: joel.nothman at gmail.com (Joel Nothman) Date: Sun, 19 Mar 2017 22:14:45 +1100 Subject: [scikit-learn] Intermediate results using gridsearchCV? 
In-Reply-To: References: <127AEC21-D123-4FD5-876E-AB74D10C66FA@gmail.com> Message-ID: Best bet for that at the moment is write a wrapper or mixin for your base estimator. On 19 March 2017 at 21:49, federico vaggi wrote: > I imagine he is suggesting to have an iterator that yields results while > it's running, instead of only getting the result at the end of the run. > > On Sun, 19 Mar 2017 at 11:46 Joel Nothman wrote: > >> Not sure what you mean. Have you used cv_results_ >> >> On 18 March 2017 at 08:46, Carlton Banks wrote: >> >> Is it possible to receive intermediate the intermediate result of a >> gridsearchcv? >> >> instead getting the final result? >> >> >> >> _______________________________________________ >> scikit-learn mailing list >> scikit-learn at python.org >> https://mail.python.org/mailman/listinfo/scikit-learn >> >> >> _______________________________________________ >> scikit-learn mailing list >> scikit-learn at python.org >> https://mail.python.org/mailman/listinfo/scikit-learn >> > > _______________________________________________ > scikit-learn mailing list > scikit-learn at python.org > https://mail.python.org/mailman/listinfo/scikit-learn > > -------------- next part -------------- An HTML attachment was scrubbed... URL: From b113053 at iiit-bh.ac.in Sun Mar 19 09:19:08 2017 From: b113053 at iiit-bh.ac.in (Afzal Ansari) Date: Sun, 19 Mar 2017 18:49:08 +0530 Subject: [scikit-learn] Regarding Adaboost classifier In-Reply-To: References: <20170318173343.5615696.68043.144630@gmail.com> Message-ID: Thank you for your quick kind response. You got what I exactly want to know. Now I can expect my pre-processing methods are to be done. And also I have got clear now from this sklearn can help you with the AdaBoostClassifier, ranking of the features, and the evaluation of the pipeline. On Sun, Mar 19, 2017 at 3:46 PM, Guillaume Lema?tre wrote: > I want just to recap a few things: > > > I need to train the classifier using adaboost classifier to obtain Haar > features from image dataset > > So can you suggest any method to extract features from image(24*24) > datase > > You just mentioned what was your requirement regarding the feature to > extract -> Haar features. > My feeling is that you want to reimplement the paper of Viola and Jones > for face detection. > > So you could check with the folks of scikit-image if they have something > related -> https://github.com/scikit-image/scikit-image/pull/1444 > You could also check opencv which offer functions, classe, and helper -> > http://docs.opencv.org/trunk/d7/d8b/tutorial_py_face_detection.html / > http://docs.opencv.org/2.4/modules/objdetect/doc/cascade_ > classification.html > > At the end, sklearn can help you with the AdaBoostClassifier, ranking of > the features, and the evaluation of the pipeline. > > > On 19 March 2017 at 07:57, Afzal Ansari wrote: > >> Thank you for your response. First I want to extract useful features from >> images so as to get n_features. So can you suggest any method to extract >> features from image(24*24) dataset? Then I can possibly train the >> classifier. >> >> Thanks. >> >> On Sun, Mar 19, 2017 at 11:49 AM, Jacob Schreiber < >> jmschreiber91 at gmail.com> wrote: >> >>> You really need to provide more details with what exactly you're stuck >>> with. If you've extracted useful features from some image into a matrix X >>> with binary labels y you can just do `clf.fit(X, y)` to train the >>> classifier. 
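For reference, a minimal sketch of that `clf.fit(X, y)` step with scikit-learn's AdaBoostClassifier, assuming the Haar-like features have already been extracted into a matrix X with binary labels y (all names, shapes and numbers below are placeholders, not code from this thread):

    import numpy as np
    from sklearn.ensemble import AdaBoostClassifier
    from sklearn.model_selection import cross_val_score

    # X: Haar-like features extracted from 24x24 windows, y: 1 = face, 0 = non-face
    X = np.random.rand(200, 500)           # placeholder for the real feature matrix
    y = np.random.randint(0, 2, size=200)  # placeholder labels

    # algorithm='SAMME.R' corresponds to "RAB" and 'SAMME' to "DAB" in the OpenCV terms above
    clf = AdaBoostClassifier(n_estimators=200, algorithm='SAMME.R')

    # evaluate the whole pipeline with cross-validation before the final fit
    print(cross_val_score(clf, X, y, cv=5).mean())

    clf.fit(X, y)
    # feature_importances_ gives a ranking of the Haar features used by the ensemble
    print(np.argsort(clf.feature_importances_)[::-1][:20])

This covers only the boosting and feature-ranking part; the attentional cascade from Viola and Jones would still have to be built on top of it.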
>>> >>> On Sat, Mar 18, 2017 at 10:21 PM, Afzal Ansari >>> wrote: >>> >>>> Hello Sir, >>>> I want to classify images containing negative and positive samples >>>> using Adaboost classifier. So how can I do that classification? Please help >>>> me regarding this. >>>> >>>> Thanks. >>>> >>>> On Sat, Mar 18, 2017 at 11:03 PM, Francois Dion < >>>> francois.dion at gmail.com> wrote: >>>> >>>>> You need to provide more details on exactly what you need. I'll take a >>>>> stab at it: >>>>> >>>>> Are you trying to replicate OpenCV cascade training? >>>>> If so, what they call DAB is Scikit learn adaboostclassifier ( >>>>> http://scikit-learn.org/stable/modules/generated/sklearn.en >>>>> semble.AdaBoostClassifier.html)? with algorithm=SAMME. >>>>> RAB is SAMME.R. >>>>> >>>>> >>>>> ?Francois >>>>> >>>>> >>>>> Sent from my BlackBerry 10 Darkphone >>>>> *From: *Afzal Ansari >>>>> *Sent: *Saturday, March 18, 2017 00:51 >>>>> *To: *scikit-learn at python.org >>>>> *Reply To: *Scikit-learn user and developer mailing list >>>>> *Subject: *[scikit-learn] Regarding Adaboost classifier >>>>> >>>>> Hello Developers! >>>>> I am currently working on feature extraction method which is based on >>>>> Haar features for image classification. I am unable to find pure >>>>> implementation of adaboost classifier algorithm on the internet even on >>>>> scikit learn web. I need to train the classifier using adaboost classifier >>>>> to obtain Haar features from image dataset. >>>>> Please help me regarding this code. Reply soon. >>>>> >>>>> Thanks in advance. >>>>> >>>>> >>>>> _______________________________________________ >>>>> scikit-learn mailing list >>>>> scikit-learn at python.org >>>>> https://mail.python.org/mailman/listinfo/scikit-learn >>>>> >>>>> >>>> >>>> _______________________________________________ >>>> scikit-learn mailing list >>>> scikit-learn at python.org >>>> https://mail.python.org/mailman/listinfo/scikit-learn >>>> >>>> >>> >>> _______________________________________________ >>> scikit-learn mailing list >>> scikit-learn at python.org >>> https://mail.python.org/mailman/listinfo/scikit-learn >>> >>> >> >> _______________________________________________ >> scikit-learn mailing list >> scikit-learn at python.org >> https://mail.python.org/mailman/listinfo/scikit-learn >> >> > > > -- > Guillaume Lemaitre > INRIA Saclay - Parietal team > Center for Data Science Paris-Saclay > https://glemaitre.github.io/ > > _______________________________________________ > scikit-learn mailing list > scikit-learn at python.org > https://mail.python.org/mailman/listinfo/scikit-learn > > -------------- next part -------------- An HTML attachment was scrubbed... URL: From tevang3 at gmail.com Sun Mar 19 15:47:36 2017 From: tevang3 at gmail.com (Thomas Evangelidis) Date: Sun, 19 Mar 2017 20:47:36 +0100 Subject: [scikit-learn] recommended feature selection method to train an MLPRegressor Message-ID: Which of the following methods would you recommend to select good features (<=50) from a set of 534 features in order to train a MLPregressor? Please take into account that the datasets I use for training are small. http://scikit-learn.org/stable/modules/feature_selection.html And please don't tell me to use a neural network that supports the dropout or any other algorithm for feature elimination. This is not applicable in my case because I want to know the best 50 features in order to append them to other types of feature that I am confident that are important. ?cheers Thomas? 
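A minimal sketch of two options from that page, on placeholder data: univariate selection is cheap but only captures simple feature-target dependence, while the greedy sequential search discussed in the replies is model-aware but slow (the mlxtend call assumes a recent mlxtend version):

    import numpy as np
    from sklearn.feature_selection import SelectKBest, mutual_info_regression
    from sklearn.neural_network import MLPRegressor

    X = np.random.rand(100, 534)  # placeholder: small dataset, 534 features
    y = np.random.rand(100)

    # Option 1: cheap univariate screening down to 50 features
    selector = SelectKBest(mutual_info_regression, k=50)
    X_reduced = selector.fit_transform(X, y)
    kept = selector.get_support(indices=True)  # indices of the 50 selected features

    # Option 2: greedy forward selection with mlxtend, wrapping the same MLPRegressor
    # (much slower: it refits the network many times)
    from mlxtend.feature_selection import SequentialFeatureSelector

    mlp = MLPRegressor(hidden_layer_sizes=(50,), max_iter=500, random_state=0)
    sfs = SequentialFeatureSelector(mlp, k_features=50, forward=True,
                                    scoring='neg_mean_squared_error', cv=3)
    sfs.fit(X, y)
    print(sfs.k_feature_idx_)  # indices of the 50 features chosen greedily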
-- ====================================================================== Thomas Evangelidis Research Specialist CEITEC - Central European Institute of Technology Masaryk University Kamenice 5/A35/1S081, 62500 Brno, Czech Republic email: tevang at pharm.uoa.gr tevang3 at gmail.com website: https://sites.google.com/site/thomasevangelidishomepage/ -------------- next part -------------- An HTML attachment was scrubbed... URL: From t3kcit at gmail.com Sun Mar 19 18:23:07 2017 From: t3kcit at gmail.com (Andreas Mueller) Date: Sun, 19 Mar 2017 18:23:07 -0400 Subject: [scikit-learn] recommended feature selection method to train an MLPRegressor In-Reply-To: References: Message-ID: <6b490067-962e-02fc-5157-9a487fc1aa83@gmail.com> On 03/19/2017 03:47 PM, Thomas Evangelidis wrote: > Which of the following methods would you recommend to select good > features (<=50) from a set of 534 features in order to train a > MLPregressor? Please take into account that the datasets I use for > training are small. > > http://scikit-learn.org/stable/modules/feature_selection.html > > And please don't tell me to use a neural network that supports the > dropout or any other algorithm for feature elimination. This is not > applicable in my case because I want to know the best 50 features in > order to append them to other types of feature that I am confident > that are important. > You can always use forward or backward selection as implemented in mlxtend if you're patient. As your dataset is small that might work. However, it might be hard tricky to get the MLP to run consistently - though maybe not... -------------- next part -------------- An HTML attachment was scrubbed... URL: From se.raschka at gmail.com Sun Mar 19 19:32:45 2017 From: se.raschka at gmail.com (Sebastian Raschka) Date: Sun, 19 Mar 2017 19:32:45 -0400 Subject: [scikit-learn] recommended feature selection method to train an MLPRegressor In-Reply-To: <6b490067-962e-02fc-5157-9a487fc1aa83@gmail.com> References: <6b490067-962e-02fc-5157-9a487fc1aa83@gmail.com> Message-ID: Hm, that?s tricky. I think the other methods listed on http://scikit-learn.org/stable/modules/feature_selection.html could help regarding a computationally cheap solution, but the problem would be that they probably wouldn?t work that well for an MLP due to the linear assumption. And an exhaustive sampling of all subsets would also be impractical/impossible. For all 50 feature subsets, you already have 73353053308199416032348518540326808282134507009732998441913227684085760 combinations :P. A greedy solution like forward or backward selection would be more feasible, but still very expensive in combination with an MLP. On top of that, you also have to consider that neural networks are generally pretty sensitive to hyperparameter settings. So even if you fix the architecture, you probably still want to check if the learning rate etc. is appropriate for each combination of features (by checking the cost and validation error during training). PS: I wouldn?t dismiss dropout, imho. Especially because your training set is small, it could be even crucial to reduce overfitting. I mean it doesn?t remove features from your dataset but just helps the network to rely on particular combinations of features to be always present during training. Your final network will still process all features and dropout will effectively cause your network to ?use? 
more of those features in your ~50 feature subset compared to no dropout (because otherwise, it may just learn to rely of a subset of these 50 features). > On Mar 19, 2017, at 6:23 PM, Andreas Mueller wrote: > > > > On 03/19/2017 03:47 PM, Thomas Evangelidis wrote: >> Which of the following methods would you recommend to select good features (<=50) from a set of 534 features in order to train a MLPregressor? Please take into account that the datasets I use for training are small. >> >> http://scikit-learn.org/stable/modules/feature_selection.html >> >> And please don't tell me to use a neural network that supports the dropout or any other algorithm for feature elimination. This is not applicable in my case because I want to know the best 50 features in order to append them to other types of feature that I am confident that are important. >> > You can always use forward or backward selection as implemented in mlxtend if you're patient. As your dataset is small that might work. > However, it might be hard tricky to get the MLP to run consistently - though maybe not... > _______________________________________________ > scikit-learn mailing list > scikit-learn at python.org > https://mail.python.org/mailman/listinfo/scikit-learn From coderain1 at gmail.com Mon Mar 20 02:05:59 2017 From: coderain1 at gmail.com (John Doe) Date: Mon, 20 Mar 2017 11:35:59 +0530 Subject: [scikit-learn] Anomaly/Outlier detection based on user access for a large application Message-ID: Hi All, I am trying to solve a problem of finding Anomalies/Outliers using application logs of a large KMS. Please find the details below: *Problem Statement*: Find Anomalies/outliers using application access logs in an un-supervised learning environment. Basic use case is to find any suspicious activity by user/group, that deviates from a trend that the algorithm has learned. *Input Data*: Data would be created from log file that are in the following format: "ts, src_ip, decrypt, user_a, group_b, kms_region, key" Where: *ts* : time of access in epoch Eg: 1489840335 *decrypt* : is one of the various possible actions. *user_a*, *group_a* : are the user and group that did the access *kms_region* : the region in which the key exists *key* : the key that was accessed *Train Set*: This comes under the un-supervised learning and hence we cant have a "normal" training set which the model can learn. *Example of anomalies*: 1. User A suddenly accessing from a different IP: xx.yy 2. No. of access for a given key going up suddenly for a given user,key pair 3. Increased access on a generally quite long weekend 4. Increased access on a Thu (compared to last Thursdays) 5. Unusual sequences of actions for a given user. Eg. read, decrypt, delete in quick succession for all keys for a given user ------------------------ >From our research, we have come up with below list of algorithms that are applied to similar problems: - ARIMA : This might be good for timeseries predicting, but will it also learn to flag anomalies like #3, #4, sequences of actions(#5) etc? - scikit-learn's Novelty and Outlier Detection : Not sure if these will address #3, #4 and #5 use cases above. - Neural Networks - k-nearest neighbors - Clustering-Based Anomaly Detection Techniques: k-Means Clustering etc - Parametric Techniques (See Section 7): This might work well on continuous variables, but will it work on discrete features like, is_weekday etc? Also will it cover cases like #4 and #5 above? 
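For the scikit-learn entry in the list above, a minimal sketch of what a first attempt could look like, using IsolationForest on made-up engineered features (one row per log event; every column and value below is hypothetical):

    import numpy as np
    from sklearn.ensemble import IsolationForest

    rng = np.random.RandomState(0)
    # hypothetical per-event features: [hour_of_day, is_weekend, accesses_last_hour, new_ip_for_user]
    normal = np.column_stack([rng.randint(8, 18, 500), rng.randint(0, 2, 500),
                              rng.poisson(3, 500), np.zeros(500)])
    odd = np.array([[3, 1, 40, 1], [2, 1, 55, 1]])  # night-time bursts from new IPs
    X = np.vstack([normal, odd]).astype(float)

    clf = IsolationForest(n_estimators=100, contamination=0.01, random_state=0)
    clf.fit(X)

    scores = clf.decision_function(X)   # lower score = more anomalous
    print(np.argsort(scores)[:5])       # the two injected events should show up first

This only covers point anomalies on hand-engineered features; the sequence cases (#5) would need features that encode recent action history, or a different model altogether.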
Most of the research I did were on problems that had continuous features and did not consider discrete variables like "Holiday_today?" / succession of events etc.
Any feedback on the algorithm / technique that can be used for above usecases would be highly appreciated. Thanks.

Regards,
John.
-------------- next part --------------
An HTML attachment was scrubbed...
URL:

From mark_stratford at optum.com Mon Mar 20 05:32:43 2017
From: mark_stratford at optum.com (Stratford, Mark A)
Date: Mon, 20 Mar 2017 09:32:43 +0000
Subject: [scikit-learn] Please unsubscribe
Message-ID:

-----Original Message-----
From: scikit-learn [mailto:scikit-learn-bounces+mark_stratford=optum.com at python.org] On Behalf Of scikit-learn-request at python.org
Sent: Monday, March 20, 2017 6:06 AM
To: scikit-learn at python.org
Subject: scikit-learn Digest, Vol 12, Issue 42
From zajac.zygmunt at gmail.com Mon Mar 20 13:45:43 2017
From: zajac.zygmunt at gmail.com (=?UTF-8?Q?Zygmunt_Zaj=c4=85c?=)
Date: Mon, 20 Mar 2017 18:45:43 +0100
Subject: [scikit-learn] A custom loss function for GradientBoostingRegressor
Message-ID: <17f13e76-528d-eb9a-ba4b-b8f6f5aaaf8d@gmail.com>

Hello,

I would like to add a custom loss function for gradient boosting regression. The function is similar to least squares, except that for each example it is OK to either undershoot or overshoot the target - the loss is zero then. There is an additional binary indicator called "under" telling us whether it is OK to undershoot or overshoot. For example:

y   under   p   loss
5     1     4     0
5     0     4     1
5     1     6     1
5     0     6     0

Below is my attempt at implementation. I have three questions:
1. Is it correct?
2. How would you pass "under" to the loss function?
3. Loss functions other than LeastSquaresError() seem to implement _update_terminal_region. Is this necessary in this case, and if so, how to do it?

    def __call__(self, y, pred, sample_weight=None):
        if sample_weight is None:
            pred = pred.ravel()  # pred arrives with shape (n_samples, 1)
            squares = (y - pred) ** 2.0
            # the custom part: no penalty when the error is in the allowed direction
            # ("under" is assumed to be an array aligned with y)
            overshoot_ok = (pred > y) & (under == 0)
            undershoot_ok = (pred < y) & (under == 1)
            squares[overshoot_ok] = 0
            squares[undershoot_ok] = 0
            return np.mean(squares)
        else:
            (...)

    def negative_gradient(self, y, pred, **kargs):
        pred = pred.ravel()
        diffs = y - pred
        overshoot_ok = (pred > y) & (under == 0)
        undershoot_ok = (pred < y) & (under == 1)
        diffs[overshoot_ok] = 0
        diffs[undershoot_ok] = 0
        return diffs

-------------- next part --------------
An HTML attachment was scrubbed...
URL:

From jmschreiber91 at gmail.com Tue Mar 21 18:27:46 2017
From: jmschreiber91 at gmail.com (Jacob Schreiber)
Date: Tue, 21 Mar 2017 15:27:46 -0700
Subject: [scikit-learn] GSoC 2017
Message-ID:

Starting yesterday, students were able to submit their proposals on the GSoC website. Please review this site thoroughly before making a submission. We're eager to hear what prospective students have in mind for a contribution to sklearn.

As we've said before, mentor time is at a premium this year. If you've posted a proposal and we haven't responded, please keep poking us. I know that personally I tend to wake up to between 30-70 emails and have to triage based on my availability, and that Gael likely scoffs at this small number. Things fall through the cracks. If you haven't heard back, that doesn't mean we don't want your submission; please submit or ask for feedback!

A strong factor in determining if you're going to be chosen will be your availability with the code and methods you'd like to work on.
It is less likely that we will take someone unfamiliar with the code base this year, as there is a large starting cost to getting familiar with an intricate code-base. In your application please emphasize your prior experience with either sklearn code, cython code (if applicable for your project) or machine learning code in general. Let us know if you have any other questions. Jacob -------------- next part -------------- An HTML attachment was scrubbed... URL: From jbbrown at kuhp.kyoto-u.ac.jp Wed Mar 22 01:10:27 2017 From: jbbrown at kuhp.kyoto-u.ac.jp (Brown J.B.) Date: Wed, 22 Mar 2017 14:10:27 +0900 Subject: [scikit-learn] Note of appreciation to Scikit-learn team Message-ID: To all organizers, developers, and maintainers involved in the Scikit-learn project, I would like to share a recent article that researchers from MIT, ETH, and Kyoto University (myself) have published about building efficient models for drug discovery and pharmaceutical data mining. In short, it demonstrates through replicate experiment that neither big data nor complex AI such as deep learning are necessary for efficient drug discovery, and that active learning can guide/assist decision making processes in the real world. The paper's success is underpinned by the use of Scikit-learn's RandomForestClassifier implementation combined with other techniques developed in the work. Therefore, it is a by-product of the volunteerism, hard work, and dedication by those involved in scikit-learn. As the senior author of this study, I wish to share my great appreciation for your efforts. While I am strongly limited in time and can barely contribute to this community, I cannot thank all of you enough for your work - it has made an impact. We are working on theoretical extensions of the work now, as well as pushing the technology forward in applied discovery sciences (in agricultural, pharmaceutical, and medical areas). In the theory and real-world applications, scikit-learn is indispensible. We have made the paper open access, and hope that such will inspire this community as well as those in applied sciences. You will see that the open source software community has been listed in the Acknowledgments. Certainly, we would welcome even the most casual of comments about the paper. The paper can be retrieved from here: http://www.future-science.com/doi/abs/10.4155/fmc-2016-0197 With kindest regards and sincere appreciation, J.B. Brown Kyoto University Graduate School of Medicine Junior Associate Professor and Principal Investigator http://statlsi.med.kyoto-u.ac.jp/~jbbrown PS - To those of you involved in the matplotlib, scipy, and numpy projects, your forwarding of this to those projects would be appreciated. They were also critical. -------------- next part -------------- An HTML attachment was scrubbed... URL: From gael.varoquaux at normalesup.org Wed Mar 22 02:57:28 2017 From: gael.varoquaux at normalesup.org (Gael Varoquaux) Date: Wed, 22 Mar 2017 07:57:28 +0100 Subject: [scikit-learn] Note of appreciation to Scikit-learn team In-Reply-To: References: Message-ID: <20170322065728.GG1835179@phare.normalesup.org> I would just like to say: thank you for stepping up and telling the team. It is a simple fact of life that a development team is more likely to hear about failures than success: people ask for help, or report bugs, when there are having problems. So thank you! It's important and very motivational. Scientific innovation like yours is what inspires me. Ga?l On Wed, Mar 22, 2017 at 02:10:27PM +0900, Brown J.B. 
wrote: > To all organizers, developers, and maintainers involved in the Scikit-learn > project, > I would like to share a recent article that researchers from MIT, ETH, and > Kyoto University (myself) have published about building efficient models for > drug discovery and pharmaceutical data mining. > In short, it demonstrates through replicate experiment that neither big data > nor complex AI such as deep learning are necessary for efficient drug > discovery, and that active learning can guide/assist decision making processes > in the real world. > The paper's success is underpinned by the use of Scikit-learn's > RandomForestClassifier implementation combined with other techniques developed > in the work. > Therefore, it is a by-product of the volunteerism, hard work, and dedication by > those involved in scikit-learn. > As the senior author of this study, I wish to share my great appreciation for > your efforts. > While I am strongly limited in time and can barely contribute to this > community, I cannot thank all of you enough for your work - it has made an > impact. > We are working on theoretical extensions of the work now, as well as pushing > the technology forward in applied discovery sciences (in agricultural, > pharmaceutical, and medical areas).? In the theory and real-world applications, > scikit-learn is indispensible. > We have made the paper open access, and hope that such will inspire this > community as well as those in applied sciences. > You will see that the open source software community has been listed in the > Acknowledgments. > Certainly, we would welcome even the most casual of comments about the paper. > The paper can be retrieved from here: > http://www.future-science.com/doi/abs/10.4155/fmc-2016-0197 > With kindest regards and sincere appreciation, > J.B. Brown > Kyoto University Graduate School of Medicine > Junior Associate Professor and Principal Investigator > http://statlsi.med.kyoto-u.ac.jp/~jbbrown > PS - To those of you involved in the matplotlib, scipy, and numpy projects, > your forwarding of this to those projects would be appreciated.? They were also > critical. > _______________________________________________ > scikit-learn mailing list > scikit-learn at python.org > https://mail.python.org/mailman/listinfo/scikit-learn -- Gael Varoquaux Researcher, INRIA Parietal NeuroSpin/CEA Saclay , Bat 145, 91191 Gif-sur-Yvette France Phone: ++ 33-1-69-08-79-68 http://gael-varoquaux.info http://twitter.com/GaelVaroquaux From amanpratik10 at gmail.com Wed Mar 22 03:55:45 2017 From: amanpratik10 at gmail.com (Aman Pratik) Date: Wed, 22 Mar 2017 13:25:45 +0530 Subject: [scikit-learn] GSoC 2017 : "Parallel Decision Tree Building" Message-ID: Hello Developers, This is Aman Pratik. I am currently pursuing my B.Tech from Indian Institute of Technology, Varanasi. After doing some research I have found some material on Decision Trees and Parallelization. Hence, I propose my first draft for the project "Parallel Decision Tree Building" for GSoC 2017. Proposal : First Draft Why me? I have been working in Python for the past 2 years and have good idea about Machine Learning algorithms. I am quite familiar with scikit-learn both as a user and a developer. These are the issues/PRs I have worked/working on for the past few months. 
[MRG+1] Issue#5803 : Regression Test added #8112 [MRG] Issue#6673:Make a wrapper around functions that score an individual feature #8038 [MRG] Issue #7987: Embarrassingly parallel "n_restarts_optimizer" in GaussianProcessRegressor #7997 My GitHub Profile: amanp10 I have worked with parallelization in one of my PR, so I am not new to it. I have used cython a couple of times, though as a beginner. I have not used Decision Tree much but I am familiar with the theory and algorithm. Also, I am familiar with Benchmark tests, Unit tests and other technical knowledge I would require for this project. Meanwhile, I have started my study for the subject and gaining experience with Cython. I am looking forward to guidance from the potential mentors or anyone willing to help. Thank You -------------- next part -------------- An HTML attachment was scrubbed... URL: From shubham.bhardwaj2015 at vit.ac.in Wed Mar 22 09:13:12 2017 From: shubham.bhardwaj2015 at vit.ac.in (SHUBHAM BHARDWAJ 15BCE0704) Date: Wed, 22 Mar 2017 18:43:12 +0530 Subject: [scikit-learn] GSoc, 2017 (proposal idea and intro) .reg In-Reply-To: References: Message-ID: Hello Sir, Added benchmarks, kindly let me know further improvements and that whether if its a good idea to consider the next parts listed in the to-do list of my pr for proposal.Thanks. pr: https://github.com/scikit-learn/scikit-learn/pull/8585 Regards Shubham Bhardwaj On Wed, Mar 15, 2017 at 10:58 PM, SHUBHAM BHARDWAJ 15BCE0704 < shubham.bhardwaj2015 at vit.ac.in> wrote: > Hello Sir, > > Greetings. I have coded a sequential version of Scalable Kmeans++ (#8585) > and have included a test script for testing it in the pr's description. > https://github.com/scikit-learn/scikit-learn/pull/8585. > > Regards > Shubham Bhardwaj > > On Tue, Mar 14, 2017 at 3:59 AM, Shreyas Saligrama chandrakan < > ssaligra at hawk.iit.edu> wrote: > >> Hi, >> >> Is it possible for me to contribute a library to introduce SVM's with >> tree kernel (like current available one in svmlight) which is currently >> missing in scikit-learn? >> >> Best, >> Shreyas >> >> On 5 Mar 2017 11:03 a.m., "Andreas Mueller" wrote: >> >>> There was a PR here: >>> https://github.com/scikit-learn/scikit-learn/pull/5530 >>> >>> but it didn't seem to work. Feel free to convince us otherwise ;) >>> >>> >>> On 03/02/2017 08:23 PM, SHUBHAM BHARDWAJ 15BCE0704 wrote: >>> >>> Hello Sir, >>> Very Sorry for the numbers I saw this written in the comments.I assumed >>> -Given the person who suggested the paper might have taken a look into the >>> number of citations.I will make sure to personally check myself. >>> >>> Regards >>> Shubham Bhardwaj >>> >>> On Fri, Mar 3, 2017 at 6:40 AM, Guillaume Lema?tre < >>> g.lemaitre58 at gmail.com> wrote: >>> >>>> I think that you mean this paper -> Scalable K-Means++ -> 218 citations >>>> >>>> On 3 March 2017 at 02:00, SHUBHAM BHARDWAJ 15BCE0704 < >>>> shubham.bhardwaj2015 at vit.ac.in> wrote: >>>> >>>>> Hello Sir, >>>>> >>>>> Thanks a lot for the reply. Sorry for not being elaborate about what I >>>>> was trying to address. I wanted to implement this [ >>>>> http://ilpubs.stanford.edu:8090/778/1/2006-13.pdf] (1200+citations)- >>>>> mentioned in comments. This pertains to the stalled issue #4357 .Proposal >>>>> idea - implementing a scalable kmeans++. >>>>> >>>>> Regards >>>>> Shubham Bhardwaj >>>>> >>>>> On Fri, Mar 3, 2017 at 12:01 AM, Jacob Schreiber < >>>>> jmschreiber91 at gmail.com> wrote: >>>>> >>>>>> Hi Shubham >>>>>> >>>>>> Thanks for your interest. 
I'm eager to see your contributions to >>>>>> sklearn in the future. However, I'm pretty sure kmeans++ is already >>>>>> implemented: http://scikit-learn.org/stable/modules/generate >>>>>> d/sklearn.cluster.KMeans.html >>>>>> >>>>>> Jacob >>>>>> >>>>>> On Thu, Mar 2, 2017 at 1:07 AM, SHUBHAM BHARDWAJ 15BCE0704 < >>>>>> shubham.bhardwaj2015 at vit.ac.in> wrote: >>>>>> >>>>>>> Hello Sir, >>>>>>> >>>>>>> My introduction : >>>>>>> I am a 2nd year student studying Computer Science and engineering >>>>>>> from VIT, Vellore. I work in Google Developers Group VIT. All my experience >>>>>>> has been about collaborating with a lot of people ,working as a team, >>>>>>> building products and learning along the way. >>>>>>> Since scikit-learn is participating this time I am too planning to >>>>>>> submit a proposal. >>>>>>> >>>>>>> Proposal idea: >>>>>>> I am really interested in implementing kmeans++ algorithm.I was >>>>>>> doing some work on DT but I found this very appealing. Just wanted to know >>>>>>> if it can be a good project idea. >>>>>>> >>>>>>> Regards >>>>>>> Shubham Bhardwaj >>>>>>> >>>>>>> >>>>>>> >>>>>>> _______________________________________________ >>>>>>> scikit-learn mailing list >>>>>>> scikit-learn at python.org >>>>>>> https://mail.python.org/mailman/listinfo/scikit-learn >>>>>>> >>>>>>> >>>>>> >>>>>> _______________________________________________ >>>>>> scikit-learn mailing list >>>>>> scikit-learn at python.org >>>>>> https://mail.python.org/mailman/listinfo/scikit-learn >>>>>> >>>>>> >>>>> >>>>> _______________________________________________ >>>>> scikit-learn mailing list >>>>> scikit-learn at python.org >>>>> https://mail.python.org/mailman/listinfo/scikit-learn >>>>> >>>>> >>>> >>>> >>>> -- >>>> Guillaume Lemaitre >>>> INRIA Saclay - Ile-de-France >>>> Equipe PARIETAL >>>> guillaume.lemaitre at inria.f r --- >>>> https://glemaitre.github.io/ >>>> >>>> _______________________________________________ >>>> scikit-learn mailing list >>>> scikit-learn at python.org >>>> https://mail.python.org/mailman/listinfo/scikit-learn >>>> >>>> >>> >>> >>> _______________________________________________ >>> scikit-learn mailing listscikit-learn at python.orghttps://mail.python.org/mailman/listinfo/scikit-learn >>> >>> >>> >>> _______________________________________________ >>> scikit-learn mailing list >>> scikit-learn at python.org >>> https://mail.python.org/mailman/listinfo/scikit-learn >>> >>> >> _______________________________________________ >> scikit-learn mailing list >> scikit-learn at python.org >> https://mail.python.org/mailman/listinfo/scikit-learn >> >> > -------------- next part -------------- An HTML attachment was scrubbed... URL: From jeff3456 at gmail.com Wed Mar 22 11:37:37 2017 From: jeff3456 at gmail.com (Jeff Lee) Date: Wed, 22 Mar 2017 11:37:37 -0400 Subject: [scikit-learn] Regarding GSoC projects and mentors Message-ID: Hi, My name is Jefferson Lee and I am a computer science student at NYU passionate about machine learning and AI. I was hoping to speak with potential mentors regarding the two suggested projects and how I might research more about the topics to write a very strong proposal. I also wanted to gain learn about some of the stalled projects that might be viable as a GSoC project. Let me know if anyone is available to chat about these topics! Best, Jeff -------------- next part -------------- An HTML attachment was scrubbed... 
URL: From jmschreiber91 at gmail.com Wed Mar 22 17:01:12 2017 From: jmschreiber91 at gmail.com (Jacob Schreiber) Date: Wed, 22 Mar 2017 14:01:12 -0700 Subject: [scikit-learn] Regarding GSoC projects and mentors In-Reply-To: References: Message-ID: Hi Jeff I would be overseeing the parallel decision tree building project, and Gael is overseeing the linear models project. This will end up being fairly fluid, as we're looking for the right combination of mentors and students. Jacob On Wed, Mar 22, 2017 at 8:37 AM, Jeff Lee wrote: > Hi, > > My name is Jefferson Lee and I am a computer science student at NYU > passionate about machine learning and AI. > I was hoping to speak with potential mentors regarding the two suggested > projects > > and how I might research more about the topics to write a very strong > proposal. > > I also wanted to gain learn about some of the stalled projects that might > be viable as a GSoC project. > > Let me know if anyone is available to chat about these topics! > > Best, > Jeff > > _______________________________________________ > scikit-learn mailing list > scikit-learn at python.org > https://mail.python.org/mailman/listinfo/scikit-learn > > -------------- next part -------------- An HTML attachment was scrubbed... URL: From jmschreiber91 at gmail.com Wed Mar 22 17:08:45 2017 From: jmschreiber91 at gmail.com (Jacob Schreiber) Date: Wed, 22 Mar 2017 14:08:45 -0700 Subject: [scikit-learn] GSoC 2017 : "Parallel Decision Tree Building" In-Reply-To: References: Message-ID: Hi Aman Likely the easiest way to parallelize decision tree building is to parallelize the finding of the best split at each node, as it checks every non-constant feature for the best split. Several other approaches focus on how to parallelize tree building in the streaming or distributed cases, which we are not interested in at the moment (though partially fitting decision trees is a good separate project). As I mentioned in the github issue, it is likely easier to focus on this single issue for GSoC as opposed to making it distinct from the multiclass prediction, as this will provide similar speedups either way but be more general. It'd be great if you could add your experience directly to the gist and perhaps links to prior work if you have any of those. Something major missing from this is a proposed timeline. Several projects fail because they are overly ambitious or not well managed time-wise. Showing a timeline will help us manage the project later on, and ensure that you're aware of what the steps of the project will be. Thanks for the effort so far! Let me know when you've made updates. Jacob On Wed, Mar 22, 2017 at 12:55 AM, Aman Pratik wrote: > Hello Developers, > > This is Aman Pratik. I am currently pursuing my B.Tech from Indian > Institute of Technology, Varanasi. After doing some research I have found > some material on Decision Trees and Parallelization. Hence, I propose my > first draft for the project "Parallel Decision Tree Building" for GSoC 2017. > > Proposal : First Draft > > > Why me? > > I have been working in Python for the past 2 years and have good idea > about Machine Learning algorithms. I am quite familiar with scikit-learn > both as a user and a developer. > > These are the issues/PRs I have worked/working on for the past few months. 
> > [MRG+1] Issue#5803 : Regression Test added #8112 > > > [MRG] Issue#6673:Make a wrapper around functions that score an individual > feature #8038 > > [MRG] Issue #7987: Embarrassingly parallel "n_restarts_optimizer" in > GaussianProcessRegressor #7997 > > > My GitHub Profile: amanp10 > > I have worked with parallelization in one of my PR, so I am not new to it. > I have used cython a couple of times, though as a beginner. I have not used > Decision Tree much but I am familiar with the theory and algorithm. Also, I > am familiar with Benchmark tests, Unit tests and other technical knowledge > I would require for this project. > > Meanwhile, I have started my study for the subject and gaining experience > with Cython. I am looking forward to guidance from the potential mentors or > anyone willing to help. > > Thank You > > > _______________________________________________ > scikit-learn mailing list > scikit-learn at python.org > https://mail.python.org/mailman/listinfo/scikit-learn > > -------------- next part -------------- An HTML attachment was scrubbed... URL: From joel.nothman at gmail.com Wed Mar 22 19:53:56 2017 From: joel.nothman at gmail.com (Joel Nothman) Date: Thu, 23 Mar 2017 10:53:56 +1100 Subject: [scikit-learn] Regarding GSoC projects and mentors In-Reply-To: References: Message-ID: Hi Jeff, Given the timeframe, it would be difficult for us to have confidence in your abilities, having not seen your work and thus your understanding of scikit-learn conventions and review process. If you think applying this year is the right way to go, you should try to make contributions ASAP. Cheers, Joel On 23 March 2017 at 02:37, Jeff Lee wrote: > Hi, > > My name is Jefferson Lee and I am a computer science student at NYU > passionate about machine learning and AI. > I was hoping to speak with potential mentors regarding the two suggested > projects > > and how I might research more about the topics to write a very strong > proposal. > > I also wanted to gain learn about some of the stalled projects that might > be viable as a GSoC project. > > Let me know if anyone is available to chat about these topics! > > Best, > Jeff > > _______________________________________________ > scikit-learn mailing list > scikit-learn at python.org > https://mail.python.org/mailman/listinfo/scikit-learn > > -------------- next part -------------- An HTML attachment was scrubbed... URL: From konst.katrioplas at gmail.com Thu Mar 23 04:06:30 2017 From: konst.katrioplas at gmail.com (Konstantinos Katrioplas) Date: Thu, 23 Mar 2017 10:06:30 +0200 Subject: [scikit-learn] GSoC proposal - Improve online learning for linear models In-Reply-To: <85562a68-8c93-b616-cb32-95e8a5025bd2@gmail.com> References: <85562a68-8c93-b616-cb32-95e8a5025bd2@gmail.com> Message-ID: Hello all, Please review my proposal on improving the online learning for linear models: first draft - linear model proposal Please bear in mind that this is a first approach only. I would like your opinion on if this goes into the right direction and how it can be improved. On the tool to set the learning rate in particular, I need your ideas on how it could be implemented. Previously mentioned ideas on a callback function are interesting, but I would need some guidance on implementing that. Although I am interested in the decision trees as well, I feel the linear model is a better start for me as my intention is to keep contributing to scikit-learn after the summer. 
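To make the callback idea concrete, one possible sketch on top of the existing partial_fit API (the callback here is just a plain function invoked from a hand-written training loop; nothing like it currently exists in scikit-learn, and all data below is placeholder):

    import numpy as np
    from sklearn.linear_model import SGDRegressor
    from sklearn.metrics import mean_squared_error

    rng = np.random.RandomState(0)
    X, y = rng.rand(1000, 20), rng.rand(1000)        # placeholder stream
    X_val, y_val = rng.rand(200, 20), rng.rand(200)  # held-out monitoring set
    batches = np.array_split(np.arange(1000), 20)

    def monitor(model, batch_idx):
        # hypothetical callback: track validation loss after each partial_fit call
        print("batch %d: validation MSE %.4f"
              % (batch_idx, mean_squared_error(y_val, model.predict(X_val))))

    sgd = SGDRegressor(learning_rate='invscaling', eta0=0.01, power_t=0.25)
    for i, idx in enumerate(batches):
        sgd.partial_fit(X[idx], y[idx])
        monitor(sgd, i)

How such a callback (and its arguments) could be wired into fit()/partial_fit() itself, rather than into a loop written by hand, is the part the proposal would need to design.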
I have a background in computational physics, however I am much more focused on the computational side than the physics side. Here is my resume . PRs and Issues I have been involved in so far: [MRG] enhance make_blobs to accept lists for samples per cluster [MRG] add random_state in tests estimators Bug in bfgs gradient computation of MLPRegressor with multiple output neurons (I am very curious about this one) github profile: kkatrio I am looking forward to your opinion. Kind regards, Konstantinos Katrioplas -------------- next part -------------- An HTML attachment was scrubbed... URL: From t3kcit at gmail.com Thu Mar 23 12:13:45 2017 From: t3kcit at gmail.com (Andreas Mueller) Date: Thu, 23 Mar 2017 12:13:45 -0400 Subject: [scikit-learn] Note of appreciation to Scikit-learn team In-Reply-To: References: Message-ID: <3078b3f7-f3d1-a45b-c466-805b7c63f545@gmail.com> I want to join Ga?l in thanking you for saying thanks. It's great to see appreciation of the work that the scientific python community does. I don't think I've seen anyone cite scipy in their research work, even though it is the backbone for so many papers. It's important for us that the academic environment recognizes software contributions, because many of us rely on academic funding to do this work. Best, Andy On 03/22/2017 01:10 AM, Brown J.B. wrote: > To all organizers, developers, and maintainers involved in the > Scikit-learn project, > > I would like to share a recent article that researchers from MIT, ETH, > and Kyoto University (myself) have published about building efficient > models for drug discovery and pharmaceutical data mining. > > In short, it demonstrates through replicate experiment that neither > big data nor complex AI such as deep learning are necessary for > efficient drug discovery, and that active learning can guide/assist > decision making processes in the real world. > > The paper's success is underpinned by the use of Scikit-learn's > RandomForestClassifier implementation combined with other techniques > developed in the work. > Therefore, it is a by-product of the volunteerism, hard work, and > dedication by those involved in scikit-learn. > > As the senior author of this study, I wish to share my great > appreciation for your efforts. > While I am strongly limited in time and can barely contribute to this > community, I cannot thank all of you enough for your work - it has > made an impact. > > We are working on theoretical extensions of the work now, as well as > pushing the technology forward in applied discovery sciences (in > agricultural, pharmaceutical, and medical areas). In the theory and > real-world applications, scikit-learn is indispensible. > > We have made the paper open access, and hope that such will inspire > this community as well as those in applied sciences. > You will see that the open source software community has been listed > in the Acknowledgments. > Certainly, we would welcome even the most casual of comments about the > paper. > > The paper can be retrieved from here: > http://www.future-science.com/doi/abs/10.4155/fmc-2016-0197 > > With kindest regards and sincere appreciation, > J.B. Brown > Kyoto University Graduate School of Medicine > Junior Associate Professor and Principal Investigator > http://statlsi.med.kyoto-u.ac.jp/~jbbrown > > > PS - To those of you involved in the matplotlib, scipy, and numpy > projects, your forwarding of this to those projects would be > appreciated. They were also critical. 
> > > _______________________________________________ > scikit-learn mailing list > scikit-learn at python.org > https://mail.python.org/mailman/listinfo/scikit-learn -------------- next part -------------- An HTML attachment was scrubbed... URL: From raga.markely at gmail.com Thu Mar 23 16:05:44 2017 From: raga.markely at gmail.com (Raga Markely) Date: Thu, 23 Mar 2017 16:05:44 -0400 Subject: [scikit-learn] Note of appreciation to Scikit-learn team In-Reply-To: <3078b3f7-f3d1-a45b-c466-805b7c63f545@gmail.com> References: <3078b3f7-f3d1-a45b-c466-805b7c63f545@gmail.com> Message-ID: Will definitely acknowledge scikit-learn, scipy, etc community in papers, posters, talks, etc.. i also saw suggested citations on scikit-learn website.. i will include these as well..if there is anything else that will be helpful, please let us know.. Sincerely hope that all of your contributions (not just the codes, but also tutorial in scipy conference, books & blogs that you have published, etc) will help you in your careers in many different ways.. Best, Raga On Mar 23, 2017 11:15 AM, "Andreas Mueller" wrote: > I want to join Ga?l in thanking you for saying thanks. > It's great to see appreciation of the work that the scientific python > community does. > I don't think I've seen anyone cite scipy in their research work, even > though it is the backbone for so many papers. > It's important for us that the academic environment recognizes software > contributions, because many > of us rely on academic funding to do this work. > > Best, > Andy > > On 03/22/2017 01:10 AM, Brown J.B. wrote: > > To all organizers, developers, and maintainers involved in the > Scikit-learn project, > > I would like to share a recent article that researchers from MIT, ETH, and > Kyoto University (myself) have published about building efficient models > for drug discovery and pharmaceutical data mining. > > In short, it demonstrates through replicate experiment that neither big > data nor complex AI such as deep learning are necessary for efficient drug > discovery, and that active learning can guide/assist decision making > processes in the real world. > > The paper's success is underpinned by the use of Scikit-learn's > RandomForestClassifier implementation combined with other techniques > developed in the work. > Therefore, it is a by-product of the volunteerism, hard work, and > dedication by those involved in scikit-learn. > > As the senior author of this study, I wish to share my great appreciation > for your efforts. > While I am strongly limited in time and can barely contribute to this > community, I cannot thank all of you enough for your work - it has made an > impact. > > We are working on theoretical extensions of the work now, as well as > pushing the technology forward in applied discovery sciences (in > agricultural, pharmaceutical, and medical areas). In the theory and > real-world applications, scikit-learn is indispensible. > > We have made the paper open access, and hope that such will inspire this > community as well as those in applied sciences. > You will see that the open source software community has been listed in > the Acknowledgments. > Certainly, we would welcome even the most casual of comments about the > paper. > > The paper can be retrieved from here: > http://www.future-science.com/doi/abs/10.4155/fmc-2016-0197 > > With kindest regards and sincere appreciation, > J.B. 
Brown > Kyoto University Graduate School of Medicine > Junior Associate Professor and Principal Investigator > http://statlsi.med.kyoto-u.ac.jp/~jbbrown > > PS - To those of you involved in the matplotlib, scipy, and numpy > projects, your forwarding of this to those projects would be appreciated. > They were also critical. > > > _______________________________________________ > scikit-learn mailing listscikit-learn at python.orghttps://mail.python.org/mailman/listinfo/scikit-learn > > > > _______________________________________________ > scikit-learn mailing list > scikit-learn at python.org > https://mail.python.org/mailman/listinfo/scikit-learn > > -------------- next part -------------- An HTML attachment was scrubbed... URL: From ragvrv at gmail.com Fri Mar 24 17:26:28 2017 From: ragvrv at gmail.com (Raghav R V) Date: Fri, 24 Mar 2017 22:26:28 +0100 Subject: [scikit-learn] Preparing a scikit-learn 0.18.2 bugfix release In-Reply-To: References: <20170109151546.GM2802991@phare.normalesup.org> <20170111215115.GO1585067@phare.normalesup.org> Message-ID: Hi, Are we still planning on an early April release for v0.19? Could we start marking "blockers"? On Tue, Feb 21, 2017 at 5:31 PM, Andreas Mueller wrote: > > > On 02/07/2017 09:00 PM, Joel Nothman wrote: > > On 12 January 2017 at 08:51, Gael Varoquaux > wrote: > >> On Thu, Jan 12, 2017 at 08:41:51AM +1100, Joel Nothman wrote: >> > When the two versions deprecation policy was instituted, releases were >> much >> > more frequent... Is that enough of an excuse? >> >> I'd rather say that we can here decide that we are giving a longer grace >> period. >> >> I think that slow deprecations are a good things (see titus's blog post >> here: http://ivory.idyll.org/blog/2017-pof-software-archivability.html ) >> > > Given that 0.18 was a very slow release, and the work for removing > deprecated material from 0.19 has already been done, I don't think we > should revert that. I agree that we can delay the deprecation deadline for > 0.20 and 0.21. > > In terms of release schedule, are we aiming for RC in early-mid March, > assuming Andy's above prognostications are correct and he is able to review > in a bigger way in a week or so? > > Sometimes I wonder how Amazon ever gave me a job in forecasting.... > Spring break is March 13-17th ;) > > _______________________________________________ > scikit-learn mailing list > scikit-learn at python.org > https://mail.python.org/mailman/listinfo/scikit-learn > > -- Raghav RV https://github.com/raghavrv -------------- next part -------------- An HTML attachment was scrubbed... URL: From t3kcit at gmail.com Sat Mar 25 15:54:37 2017 From: t3kcit at gmail.com (Andreas Mueller) Date: Sat, 25 Mar 2017 15:54:37 -0400 Subject: [scikit-learn] Preparing a scikit-learn 0.18.2 bugfix release In-Reply-To: References: <20170109151546.GM2802991@phare.normalesup.org> <20170111215115.GO1585067@phare.normalesup.org> Message-ID: <69bce798-a914-8233-bb2c-e7dedbf20641@gmail.com> I have no bandwidth to help. I will be able to help starting May 7th. On 03/24/2017 05:26 PM, Raghav R V wrote: > Hi, > > Are we still planning on an early April release for v0.19? Could we > start marking "blockers"? 
> > > > On Tue, Feb 21, 2017 at 5:31 PM, Andreas Mueller > wrote: > > > > On 02/07/2017 09:00 PM, Joel Nothman wrote: >> On 12 January 2017 at 08:51, Gael Varoquaux >> > > wrote: >> >> On Thu, Jan 12, 2017 at 08:41:51AM +1100, Joel Nothman wrote: >> > When the two versions deprecation policy was instituted, >> releases were much >> > more frequent... Is that enough of an excuse? >> >> I'd rather say that we can here decide that we are giving a >> longer grace >> period. >> >> I think that slow deprecations are a good things (see titus's >> blog post >> here: >> http://ivory.idyll.org/blog/2017-pof-software-archivability.html >> >> ) >> >> Given that 0.18 was a very slow release, and the work for >> removing deprecated material from 0.19 has already been done, I >> don't think we should revert that. I agree that we can delay the >> deprecation deadline for 0.20 and 0.21. >> >> In terms of release schedule, are we aiming for RC in early-mid >> March, assuming Andy's above prognostications are correct and he >> is able to review in a bigger way in a week or so? >> > Sometimes I wonder how Amazon ever gave me a job in forecasting.... > Spring break is March 13-17th ;) > > _______________________________________________ > scikit-learn mailing list > scikit-learn at python.org > https://mail.python.org/mailman/listinfo/scikit-learn > > > > > > -- > Raghav RV > https://github.com/raghavrv > > > > _______________________________________________ > scikit-learn mailing list > scikit-learn at python.org > https://mail.python.org/mailman/listinfo/scikit-learn -------------- next part -------------- An HTML attachment was scrubbed... URL: From joel.nothman at gmail.com Sat Mar 25 21:32:05 2017 From: joel.nothman at gmail.com (Joel Nothman) Date: Sun, 26 Mar 2017 12:32:05 +1100 Subject: [scikit-learn] Preparing a scikit-learn 0.18.2 bugfix release In-Reply-To: <69bce798-a914-8233-bb2c-e7dedbf20641@gmail.com> References: <20170109151546.GM2802991@phare.normalesup.org> <20170111215115.GO1585067@phare.normalesup.org> <69bce798-a914-8233-bb2c-e7dedbf20641@gmail.com> Message-ID: Yes, it's a pity that this has had to be delayed due to dev unavailability, but I don't think we can risk a release without some more quality assurance. My teaching atm, among other bits of life, is also impacting on any free time, but even if I find more time, I've already given my support to many of the PRs currently marked MRG+1 (have I been too profligate with my approvals?!). Is it worth waiting as long as until the June sprint, but promising to close the release before end of June? Or else promising a release for end of May and using the sprint to identify priorities for future releases? I think for the sake of the contributors, we should make sure that many of the things that are mostly reviewed get merged before release. For the sake of the users, we should make sure that as many bugs are fixed as possible; apart from some wonderful work from Lo?c, I feel bug review has not been receiving as much attention as it should. Perhaps Olivier's suggestion of 0.18.2 was good after all. :\ On 26 March 2017 at 06:54, Andreas Mueller wrote: > I have no bandwidth to help. I will be able to help starting May 7th. > > > On 03/24/2017 05:26 PM, Raghav R V wrote: > > Hi, > > Are we still planning on an early April release for v0.19? Could we start > marking "blockers"? 
> > > > On Tue, Feb 21, 2017 at 5:31 PM, Andreas Mueller wrote: > >> >> >> On 02/07/2017 09:00 PM, Joel Nothman wrote: >> >> On 12 January 2017 at 08:51, Gael Varoquaux < >> gael.varoquaux at normalesup.org> wrote: >> >>> On Thu, Jan 12, 2017 at 08:41:51AM +1100, Joel Nothman wrote: >>> > When the two versions deprecation policy was instituted, releases were >>> much >>> > more frequent... Is that enough of an excuse? >>> >>> I'd rather say that we can here decide that we are giving a longer grace >>> period. >>> >>> I think that slow deprecations are a good things (see titus's blog post >>> here: http://ivory.idyll.org/blog/2017-pof-software-archivability.html ) >>> >> >> Given that 0.18 was a very slow release, and the work for removing >> deprecated material from 0.19 has already been done, I don't think we >> should revert that. I agree that we can delay the deprecation deadline for >> 0.20 and 0.21. >> >> In terms of release schedule, are we aiming for RC in early-mid March, >> assuming Andy's above prognostications are correct and he is able to review >> in a bigger way in a week or so? >> >> Sometimes I wonder how Amazon ever gave me a job in forecasting.... >> Spring break is March 13-17th ;) >> >> _______________________________________________ >> scikit-learn mailing list >> scikit-learn at python.org >> https://mail.python.org/mailman/listinfo/scikit-learn >> >> > > > -- > Raghav RV > https://github.com/raghavrv > > > > _______________________________________________ > scikit-learn mailing listscikit-learn at python.orghttps://mail.python.org/mailman/listinfo/scikit-learn > > > > _______________________________________________ > scikit-learn mailing list > scikit-learn at python.org > https://mail.python.org/mailman/listinfo/scikit-learn > > -------------- next part -------------- An HTML attachment was scrubbed... URL: From amanpratik10 at gmail.com Sun Mar 26 13:31:43 2017 From: amanpratik10 at gmail.com (Aman Pratik) Date: Sun, 26 Mar 2017 23:01:43 +0530 Subject: [scikit-learn] GSoC 2017 : "Parallel Decision Tree Building" In-Reply-To: References: Message-ID: Hello Jacob, This is my second draft for the proposal, Proposal : Second Draft It is incomplete in some places, related to detailing etc. I will need little more time for that. Meanwhile, I await your feedback and guidance. Thank You On 23 March 2017 at 02:38, Jacob Schreiber wrote: > Hi Aman > > Likely the easiest way to parallelize decision tree building is to > parallelize the finding of the best split at each node, as it checks every > non-constant feature for the best split. Several other approaches focus on > how to parallelize tree building in the streaming or distributed cases, > which we are not interested in at the moment (though partially fitting > decision trees is a good separate project). > > As I mentioned in the github issue, it is likely easier to focus on this > single issue for GSoC as opposed to making it distinct from the multiclass > prediction, as this will provide similar speedups either way but be more > general. > > It'd be great if you could add your experience directly to the gist and > perhaps links to prior work if you have any of those. > > Something major missing from this is a proposed timeline. Several projects > fail because they are overly ambitious or not well managed time-wise. > Showing a timeline will help us manage the project later on, and ensure > that you're aware of what the steps of the project will be. > > Thanks for the effort so far! Let me know when you've made updates. 
> > Jacob > > On Wed, Mar 22, 2017 at 12:55 AM, Aman Pratik > wrote: > >> Hello Developers, >> >> This is Aman Pratik. I am currently pursuing my B.Tech from Indian >> Institute of Technology, Varanasi. After doing some research I have found >> some material on Decision Trees and Parallelization. Hence, I propose my >> first draft for the project "Parallel Decision Tree Building" for GSoC 2017. >> >> Proposal : First Draft >> >> >> Why me? >> >> I have been working in Python for the past 2 years and have good idea >> about Machine Learning algorithms. I am quite familiar with scikit-learn >> both as a user and a developer. >> >> These are the issues/PRs I have worked/working on for the past few months. >> >> [MRG+1] Issue#5803 : Regression Test added #8112 >> >> >> [MRG] Issue#6673:Make a wrapper around functions that score an individual >> feature #8038 >> >> [MRG] Issue #7987: Embarrassingly parallel "n_restarts_optimizer" in >> GaussianProcessRegressor #7997 >> >> >> My GitHub Profile: amanp10 >> >> I have worked with parallelization in one of my PR, so I am not new to >> it. I have used cython a couple of times, though as a beginner. I have not >> used Decision Tree much but I am familiar with the theory and algorithm. >> Also, I am familiar with Benchmark tests, Unit tests and other technical >> knowledge I would require for this project. >> >> Meanwhile, I have started my study for the subject and gaining experience >> with Cython. I am looking forward to guidance from the potential mentors or >> anyone willing to help. >> >> Thank You >> >> >> _______________________________________________ >> scikit-learn mailing list >> scikit-learn at python.org >> https://mail.python.org/mailman/listinfo/scikit-learn >> >> > > _______________________________________________ > scikit-learn mailing list > scikit-learn at python.org > https://mail.python.org/mailman/listinfo/scikit-learn > > -------------- next part -------------- An HTML attachment was scrubbed... URL: From t3kcit at gmail.com Sun Mar 26 18:32:05 2017 From: t3kcit at gmail.com (Andreas Mueller) Date: Sun, 26 Mar 2017 18:32:05 -0400 Subject: [scikit-learn] Preparing a scikit-learn 0.18.2 bugfix release In-Reply-To: References: <20170109151546.GM2802991@phare.normalesup.org> <20170111215115.GO1585067@phare.normalesup.org> <69bce798-a914-8233-bb2c-e7dedbf20641@gmail.com> Message-ID: I would like to release in may, before the sprint. That is, if we are happy with where the codebase is at then. If someone feels like they have the time end energy to create 0.18.2, and we have enough reviewers to ensure quality, I'm not opposed. I just won't be able to be of any help. On 03/25/2017 09:32 PM, Joel Nothman wrote: > Yes, it's a pity that this has had to be delayed due to dev > unavailability, but I don't think we can risk a release without some > more quality assurance. My teaching atm, among other bits of life, is > also impacting on any free time, but even if I find more time, I've > already given my support to many of the PRs currently marked MRG+1 > (have > I been too profligate with my approvals?!). > > Is it worth waiting as long as until the June sprint, but promising to > close the release before end of June? Or else promising a release for > end of May and using the sprint to identify priorities for future > releases? > > I think for the sake of the contributors, we should make sure that > many of the things that are mostly reviewed get merged before release. 
> For the sake of the users, we should make sure that as many bugs are > fixed as possible; apart from some wonderful work from Lo?c, I feel > bug review has not been receiving as much attention as it should. > > Perhaps Olivier's suggestion of 0.18.2 was good after all. :\ > > On 26 March 2017 at 06:54, Andreas Mueller > wrote: > > I have no bandwidth to help. I will be able to help starting May 7th. > > > On 03/24/2017 05:26 PM, Raghav R V wrote: >> Hi, >> >> Are we still planning on an early April release for v0.19? Could >> we start marking "blockers"? >> >> >> >> On Tue, Feb 21, 2017 at 5:31 PM, Andreas Mueller >> > wrote: >> >> >> >> On 02/07/2017 09:00 PM, Joel Nothman wrote: >>> On 12 January 2017 at 08:51, Gael Varoquaux >>> >> > wrote: >>> >>> On Thu, Jan 12, 2017 at 08:41:51AM +1100, Joel Nothman >>> wrote: >>> > When the two versions deprecation policy was >>> instituted, releases were much >>> > more frequent... Is that enough of an excuse? >>> >>> I'd rather say that we can here decide that we are >>> giving a longer grace >>> period. >>> >>> I think that slow deprecations are a good things (see >>> titus's blog post >>> here: >>> http://ivory.idyll.org/blog/2017-pof-software-archivability.html >>> >>> ) >>> >>> Given that 0.18 was a very slow release, and the work for >>> removing deprecated material from 0.19 has already been >>> done, I don't think we should revert that. I agree that we >>> can delay the deprecation deadline for 0.20 and 0.21. >>> >>> In terms of release schedule, are we aiming for RC in >>> early-mid March, assuming Andy's above prognostications are >>> correct and he is able to review in a bigger way in a week >>> or so? >>> >> Sometimes I wonder how Amazon ever gave me a job in >> forecasting.... >> Spring break is March 13-17th ;) >> >> _______________________________________________ >> scikit-learn mailing list >> scikit-learn at python.org >> https://mail.python.org/mailman/listinfo/scikit-learn >> >> >> >> >> >> -- >> Raghav RV >> https://github.com/raghavrv >> >> >> >> _______________________________________________ >> scikit-learn mailing list >> scikit-learn at python.org >> https://mail.python.org/mailman/listinfo/scikit-learn >> > _______________________________________________ scikit-learn > mailing list scikit-learn at python.org > > https://mail.python.org/mailman/listinfo/scikit-learn > > > _______________________________________________ > scikit-learn mailing list > scikit-learn at python.org > https://mail.python.org/mailman/listinfo/scikit-learn -------------- next part -------------- An HTML attachment was scrubbed... URL: From jmschreiber91 at gmail.com Sun Mar 26 23:33:33 2017 From: jmschreiber91 at gmail.com (Jacob Schreiber) Date: Sun, 26 Mar 2017 20:33:33 -0700 Subject: [scikit-learn] GSoC 2017 : "Parallel Decision Tree Building" In-Reply-To: References: Message-ID: Hi Aman Thanks for the updates, it looks more complete now. I don't see what the benefit is of considering three different parallelism techniques. I'm not sure how you would do sample parallelism given that you need to sort all of the samples-- maybe a merge sort? That doesn't seem the most efficient manner of parallelization, I'd stick only to parallelism across features as you can get a great deal of efficiency out of doing that. It also makes the problem more managable. I would also focus your application more specifically on what parts of the code you will need to change and less conceptual. 
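For concreteness, a minimal sketch of what the naive feature-parallel schema might look like (the best_split_for_feature helper and its toy impurity are hypothetical stand-ins, not scikit-learn's actual splitter code):

import numpy as np
from joblib import Parallel, delayed

def best_split_for_feature(X, y, feature):
    # Hypothetical per-feature search: find the threshold on one feature
    # that most reduces a toy impurity (weighted child variance). The real
    # splitter works on a shared Criterion object and presorted indices.
    best_gain, best_threshold = -np.inf, None
    for t in np.unique(X[:, feature])[:-1]:
        left, right = y[X[:, feature] <= t], y[X[:, feature] > t]
        gain = -(left.size * left.var() + right.size * right.var())
        if gain > best_gain:
            best_gain, best_threshold = gain, t
    return feature, best_gain, best_threshold

def find_best_split(X, y, n_jobs=4):
    # Naive Parallel()(delayed(...)) schema over the candidate features.
    # backend="threading" only pays off when the worker is Cython code that
    # releases the GIL; for a pure-Python worker like this one the
    # multiprocessing backend is needed, at the cost of copying X and y
    # to every worker process.
    results = Parallel(n_jobs=n_jobs, backend="multiprocessing")(
        delayed(best_split_for_feature)(X, y, f) for f in range(X.shape[1]))
    return max(results, key=lambda r: r[1])

if __name__ == "__main__":
    rng = np.random.RandomState(0)
    X, y = rng.rand(500, 20), rng.rand(500)
    print(find_best_split(X, y))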
There is already a loop to consider features sequentially and identify the best one. The change is basically to parallelize this in the best manner given the other code. However, if the solution were as easy as changing the for loop to a Parallel()( delayed ) type schema we would have done it already. You should specify what the challenges will be, and why it isn't just as simple as that. Specifically focus on what goes on in the criterion class to make it more difficult. I also checked out your gaussian process parallelization. It looked like it wasn't speeding anything up because you were using a threading backend for a python function. You can only use the threading backend with a cython function where you also release the GIL, otherwise it won't help. Have you tried using the multiprocessing backend? That would likely be easier. Jacob On Sun, Mar 26, 2017 at 10:31 AM, Aman Pratik wrote: > Hello Jacob, > This is my second draft for the proposal, > > Proposal : Second Draft > > > It is incomplete in some places, related to detailing etc. I will need > little more time for that. Meanwhile, I await your feedback and guidance. > > Thank You > > > > On 23 March 2017 at 02:38, Jacob Schreiber > wrote: > >> Hi Aman >> >> Likely the easiest way to parallelize decision tree building is to >> parallelize the finding of the best split at each node, as it checks every >> non-constant feature for the best split. Several other approaches focus on >> how to parallelize tree building in the streaming or distributed cases, >> which we are not interested in at the moment (though partially fitting >> decision trees is a good separate project). >> >> As I mentioned in the github issue, it is likely easier to focus on this >> single issue for GSoC as opposed to making it distinct from the multiclass >> prediction, as this will provide similar speedups either way but be more >> general. >> >> It'd be great if you could add your experience directly to the gist and >> perhaps links to prior work if you have any of those. >> >> Something major missing from this is a proposed timeline. Several >> projects fail because they are overly ambitious or not well managed >> time-wise. Showing a timeline will help us manage the project later on, and >> ensure that you're aware of what the steps of the project will be. >> >> Thanks for the effort so far! Let me know when you've made updates. >> >> Jacob >> >> On Wed, Mar 22, 2017 at 12:55 AM, Aman Pratik >> wrote: >> >>> Hello Developers, >>> >>> This is Aman Pratik. I am currently pursuing my B.Tech from Indian >>> Institute of Technology, Varanasi. After doing some research I have found >>> some material on Decision Trees and Parallelization. Hence, I propose my >>> first draft for the project "Parallel Decision Tree Building" for GSoC 2017. >>> >>> Proposal : First Draft >>> >>> >>> Why me? >>> >>> I have been working in Python for the past 2 years and have good idea >>> about Machine Learning algorithms. I am quite familiar with scikit-learn >>> both as a user and a developer. >>> >>> These are the issues/PRs I have worked/working on for the past few >>> months. >>> >>> [MRG+1] Issue#5803 : Regression Test added #8112 >>> >>> >>> [MRG] Issue#6673:Make a wrapper around functions that score an >>> individual feature #8038 >>> >>> >>> [MRG] Issue #7987: Embarrassingly parallel "n_restarts_optimizer" in >>> GaussianProcessRegressor #7997 >>> >>> >>> My GitHub Profile: amanp10 >>> >>> I have worked with parallelization in one of my PR, so I am not new to >>> it. 
I have used cython a couple of times, though as a beginner. I have not >>> used Decision Tree much but I am familiar with the theory and algorithm. >>> Also, I am familiar with Benchmark tests, Unit tests and other technical >>> knowledge I would require for this project. >>> >>> Meanwhile, I have started my study for the subject and gaining >>> experience with Cython. I am looking forward to guidance from the potential >>> mentors or anyone willing to help. >>> >>> Thank You >>> >>> >>> _______________________________________________ >>> scikit-learn mailing list >>> scikit-learn at python.org >>> https://mail.python.org/mailman/listinfo/scikit-learn >>> >>> >> >> _______________________________________________ >> scikit-learn mailing list >> scikit-learn at python.org >> https://mail.python.org/mailman/listinfo/scikit-learn >> >> > > _______________________________________________ > scikit-learn mailing list > scikit-learn at python.org > https://mail.python.org/mailman/listinfo/scikit-learn > > -------------- next part -------------- An HTML attachment was scrubbed... URL: From on2k17nm at gmail.com Mon Mar 27 00:03:51 2017 From: on2k17nm at gmail.com (Onkar Mahajan) Date: Mon, 27 Mar 2017 09:33:51 +0530 Subject: [scikit-learn] Create sample dataset with specified range and regression coefficients Message-ID: I would like to create a random dataset for Linear regression with specific regression coefficients (reg.coef_, reg.intercept_ ) and with data in specific range of values (person age - 0 to 100 as x - axis and y-axis Net worth 0$ to 5,00,000$). I used sklearn.datasets.make_regression() but I did not find anything in here that gives me control over range of samples and regression coefficients (I might be missing something, please correct me if mistaken) Thanks, Onkar -------------- next part -------------- An HTML attachment was scrubbed... URL: From amanpratik10 at gmail.com Mon Mar 27 01:44:42 2017 From: amanpratik10 at gmail.com (Aman Pratik) Date: Mon, 27 Mar 2017 11:14:42 +0530 Subject: [scikit-learn] GSoC 2017 : "Parallel Decision Tree Building" In-Reply-To: References: Message-ID: I will be occupied with my tests for a couple of days, will get back with the changes as soon as possible. In the Gaussian Process parallelization there was an error while using the multiprocessing backend, which couldn't be solved by simple changes in the code. Hence we had to drop the idea for the time being. On 27 March 2017 at 09:03, Jacob Schreiber wrote: > Hi Aman > > Thanks for the updates, it looks more complete now. > > I don't see what the benefit is of considering three different parallelism > techniques. I'm not sure how you would do sample parallelism given that you > need to sort all of the samples-- maybe a merge sort? That doesn't seem the > most efficient manner of parallelization, I'd stick only to parallelism > across features as you can get a great deal of efficiency out of doing > that. It also makes the problem more managable. > > I would also focus your application more specifically on what parts of the > code you will need to change and less conceptual. There is already a loop > to consider features sequentially and identify the best one. The change is > basically to parallelize this in the best manner given the other code. > However, if the solution were as easy as changing the for loop to a > Parallel()( delayed ) type schema we would have done it already. You should > specify what the challenges will be, and why it isn't just as simple as > that. 
Specifically focus on what goes on in the criterion class to make it > more difficult. > > I also checked out your gaussian process parallelization. It looked like > it wasn't speeding anything up because you were using a threading backend > for a python function. You can only use the threading backend with a cython > function where you also release the GIL, otherwise it won't help. Have you > tried using the multiprocessing backend? That would likely be easier. > > Jacob > > On Sun, Mar 26, 2017 at 10:31 AM, Aman Pratik > wrote: > >> Hello Jacob, >> This is my second draft for the proposal, >> >> Proposal : Second Draft >> >> >> It is incomplete in some places, related to detailing etc. I will need >> little more time for that. Meanwhile, I await your feedback and guidance. >> >> Thank You >> >> >> >> On 23 March 2017 at 02:38, Jacob Schreiber >> wrote: >> >>> Hi Aman >>> >>> Likely the easiest way to parallelize decision tree building is to >>> parallelize the finding of the best split at each node, as it checks every >>> non-constant feature for the best split. Several other approaches focus on >>> how to parallelize tree building in the streaming or distributed cases, >>> which we are not interested in at the moment (though partially fitting >>> decision trees is a good separate project). >>> >>> As I mentioned in the github issue, it is likely easier to focus on this >>> single issue for GSoC as opposed to making it distinct from the multiclass >>> prediction, as this will provide similar speedups either way but be more >>> general. >>> >>> It'd be great if you could add your experience directly to the gist and >>> perhaps links to prior work if you have any of those. >>> >>> Something major missing from this is a proposed timeline. Several >>> projects fail because they are overly ambitious or not well managed >>> time-wise. Showing a timeline will help us manage the project later on, and >>> ensure that you're aware of what the steps of the project will be. >>> >>> Thanks for the effort so far! Let me know when you've made updates. >>> >>> Jacob >>> >>> On Wed, Mar 22, 2017 at 12:55 AM, Aman Pratik >>> wrote: >>> >>>> Hello Developers, >>>> >>>> This is Aman Pratik. I am currently pursuing my B.Tech from Indian >>>> Institute of Technology, Varanasi. After doing some research I have found >>>> some material on Decision Trees and Parallelization. Hence, I propose my >>>> first draft for the project "Parallel Decision Tree Building" for GSoC 2017. >>>> >>>> Proposal : First Draft >>>> >>>> >>>> Why me? >>>> >>>> I have been working in Python for the past 2 years and have good idea >>>> about Machine Learning algorithms. I am quite familiar with scikit-learn >>>> both as a user and a developer. >>>> >>>> These are the issues/PRs I have worked/working on for the past few >>>> months. >>>> >>>> [MRG+1] Issue#5803 : Regression Test added #8112 >>>> >>>> >>>> [MRG] Issue#6673:Make a wrapper around functions that score an >>>> individual feature #8038 >>>> >>>> >>>> [MRG] Issue #7987: Embarrassingly parallel "n_restarts_optimizer" in >>>> GaussianProcessRegressor #7997 >>>> >>>> >>>> My GitHub Profile: amanp10 >>>> >>>> I have worked with parallelization in one of my PR, so I am not new to >>>> it. I have used cython a couple of times, though as a beginner. I have not >>>> used Decision Tree much but I am familiar with the theory and algorithm. >>>> Also, I am familiar with Benchmark tests, Unit tests and other technical >>>> knowledge I would require for this project. 
>>>> >>>> Meanwhile, I have started my study for the subject and gaining >>>> experience with Cython. I am looking forward to guidance from the potential >>>> mentors or anyone willing to help. >>>> >>>> Thank You >>>> >>>> >>>> _______________________________________________ >>>> scikit-learn mailing list >>>> scikit-learn at python.org >>>> https://mail.python.org/mailman/listinfo/scikit-learn >>>> >>>> >>> >>> _______________________________________________ >>> scikit-learn mailing list >>> scikit-learn at python.org >>> https://mail.python.org/mailman/listinfo/scikit-learn >>> >>> >> >> _______________________________________________ >> scikit-learn mailing list >> scikit-learn at python.org >> https://mail.python.org/mailman/listinfo/scikit-learn >> >> > > _______________________________________________ > scikit-learn mailing list > scikit-learn at python.org > https://mail.python.org/mailman/listinfo/scikit-learn > > -------------- next part -------------- An HTML attachment was scrubbed... URL: From t3kcit at gmail.com Mon Mar 27 10:57:57 2017 From: t3kcit at gmail.com (Andreas Mueller) Date: Mon, 27 Mar 2017 10:57:57 -0400 Subject: [scikit-learn] Create sample dataset with specified range and regression coefficients In-Reply-To: References: Message-ID: <21cbe617-47ac-1381-772d-54423a24f278@gmail.com> Yes, make_regression is for quickly making a random task. Doing what you want should be about three lines of numpy, why do you need a function for it? On 03/27/2017 12:03 AM, Onkar Mahajan wrote: > I would like to create a random dataset for Linear regression with > specific regression coefficients (reg.coef_, reg.intercept_ ) and with > data in specific range of values (person age - 0 to 100 as x - axis > and y-axis Net worth 0$ to 5,00,000$). I used > sklearn.datasets.make_regression() but I did not find anything in here > that gives me control over range of samples and regression > coefficients (I might be missing something, please correct me if mistaken) > > Thanks, > Onkar > > > _______________________________________________ > scikit-learn mailing list > scikit-learn at python.org > https://mail.python.org/mailman/listinfo/scikit-learn -------------- next part -------------- An HTML attachment was scrubbed... URL: From henriquecsj at gmail.com Mon Mar 27 11:50:32 2017 From: henriquecsj at gmail.com (Henrique C. S. Junior) Date: Mon, 27 Mar 2017 12:50:32 -0300 Subject: [scikit-learn] Using Scikit-Learn to predict magnetism in chemical systems Message-ID: I'm a chemist with some rudimentary programming skills (getting started with python) and in the middle of the year I'll be starting a Ph.D. project that uses computers to describe magnetism in molecular systems. Most of the time I get my results after several simulations and experiments, so, I know that one of the hardest tasks in molecular magnetism is to predict the nature of magnetic interactions. That's why I'll try to tackle this problem with Machine Learning (because such interactions are dependent, basically, of distances, angles and number of unpaired electrons). 
The idea is to feed the computer with a large training set (with number of unpaired electrons, XYZ coordinates of each molecule and experimental magnetic couplings) and see if it can predict the magnetic couplings (J(AB)) of new systems: (see example in the attached image) Can Scikit-Learn handle the task, knowing that the matrix used to represent atomic coordinates will probably have a different number of atoms (because some molecules have more atoms than others)? Or is this a job better suited for another software/approach? ? -- *Henrique C. S. Junior* Industrial Chemist - UFRRJ M. Sc. Inorganic Chemistry - UFRRJ Data Processing Center - PMP Visite o Mundo Qu?mico -------------- next part -------------- An HTML attachment was scrubbed... URL: -------------- next part -------------- A non-text attachment was scrubbed... Name: 2017-03-21.png Type: image/png Size: 10127 bytes Desc: not available URL: From rdslater at gmail.com Mon Mar 27 13:25:29 2017 From: rdslater at gmail.com (Robert Slater) Date: Mon, 27 Mar 2017 12:25:29 -0500 Subject: [scikit-learn] Using Scikit-Learn to predict magnetism in chemical systems In-Reply-To: References: Message-ID: You definitely can use some of the tools in sci-kit learn for supervised machine learning. The real trick will be how well your training system is representative of your future predictions. All of the various regression algorithms would be of some value and you make even consider an ensemble to help generalize. There will be some important questions to answer--what kind of loss function do you want to look at? I assumed regression (continuous response) but it could also classify--paramagnetic, diamagnetic, ferromagnetic, etc... Another task to think about might be dimension reduction. There is no guarantee you will get fantastic results--every problem is unique and much will depend on exactly what you want out of the solution--it may be that we get '10%' accuracy at best--for some systems that is quite good, others it is horrible. If you'd like to talk specifics, feel free to contact me at this email. I have a background in magnetism (PhD in magnetic multilayers--i was physics, but as you are probably aware chemisty and physics blend in this area) and have a fairly good knowledge of sci-kit learn and machine learning. On Mon, Mar 27, 2017 at 10:50 AM, Henrique C. S. Junior < henriquecsj at gmail.com> wrote: > I'm a chemist with some rudimentary programming skills (getting started > with python) and in the middle of the year I'll be starting a Ph.D. project > that uses computers to describe magnetism in molecular systems. > > Most of the time I get my results after several simulations and > experiments, so, I know that one of the hardest tasks in molecular > magnetism is to predict the nature of magnetic interactions. That's why > I'll try to tackle this problem with Machine Learning (because such > interactions are dependent, basically, of distances, angles and number of > unpaired electrons). The idea is to feed the computer with a large training > set (with number of unpaired electrons, XYZ coordinates of each molecule > and experimental magnetic couplings) and see if it can predict the magnetic > couplings (J(AB)) of new systems: > (see example in the attached image) > > Can Scikit-Learn handle the task, knowing that the matrix used to > represent atomic coordinates will probably have a different number of atoms > (because some molecules have more atoms than others)? Or is this a job > better suited for another software/approach? ? 
> > > -- > *Henrique C. S. Junior* > Industrial Chemist - UFRRJ > M. Sc. Inorganic Chemistry - UFRRJ > Data Processing Center - PMP > Visite o Mundo Qu?mico > > _______________________________________________ > scikit-learn mailing list > scikit-learn at python.org > https://mail.python.org/mailman/listinfo/scikit-learn > > -------------- next part -------------- An HTML attachment was scrubbed... URL: From konst.katrioplas at gmail.com Mon Mar 27 13:43:47 2017 From: konst.katrioplas at gmail.com (Konstantinos Katrioplas) Date: Mon, 27 Mar 2017 20:43:47 +0300 Subject: [scikit-learn] GSoC proposal - linear model Message-ID: Dear all, here is a draft of my proposal on improving online learning for linear models with softmax and AdaGrad. I look forward to your feedback, Konstantinos -------------- next part -------------- An HTML attachment was scrubbed... URL: From henriquecsj at gmail.com Mon Mar 27 13:46:08 2017 From: henriquecsj at gmail.com (Henrique C. S. Junior) Date: Mon, 27 Mar 2017 14:46:08 -0300 Subject: [scikit-learn] Using Scikit-Learn to predict magnetism in chemical systems In-Reply-To: References: Message-ID: Dear Robert, thank you. Yes, I'd like to talk about some specifics on the project. Thank you again. On Mon, Mar 27, 2017 at 2:25 PM, Robert Slater wrote: > You definitely can use some of the tools in sci-kit learn for supervised > machine learning. The real trick will be how well your training system is > representative of your future predictions. All of the various regression > algorithms would be of some value and you make even consider an ensemble to > help generalize. There will be some important questions to answer--what > kind of loss function do you want to look at? I assumed regression > (continuous response) but it could also classify--paramagnetic, > diamagnetic, ferromagnetic, etc... > > Another task to think about might be dimension reduction. > There is no guarantee you will get fantastic results--every problem is > unique and much will depend on exactly what you want out of the > solution--it may be that we get '10%' accuracy at best--for some systems > that is quite good, others it is horrible. > > If you'd like to talk specifics, feel free to contact me at this email. I > have a background in magnetism (PhD in magnetic multilayers--i was physics, > but as you are probably aware chemisty and physics blend in this area) and > have a fairly good knowledge of sci-kit learn and machine learning. > > > > On Mon, Mar 27, 2017 at 10:50 AM, Henrique C. S. Junior < > henriquecsj at gmail.com> wrote: > >> I'm a chemist with some rudimentary programming skills (getting started >> with python) and in the middle of the year I'll be starting a Ph.D. project >> that uses computers to describe magnetism in molecular systems. >> >> Most of the time I get my results after several simulations and >> experiments, so, I know that one of the hardest tasks in molecular >> magnetism is to predict the nature of magnetic interactions. That's why >> I'll try to tackle this problem with Machine Learning (because such >> interactions are dependent, basically, of distances, angles and number of >> unpaired electrons). 
The idea is to feed the computer with a large training >> set (with number of unpaired electrons, XYZ coordinates of each molecule >> and experimental magnetic couplings) and see if it can predict the magnetic >> couplings (J(AB)) of new systems: >> (see example in the attached image) >> >> Can Scikit-Learn handle the task, knowing that the matrix used to >> represent atomic coordinates will probably have a different number of atoms >> (because some molecules have more atoms than others)? Or is this a job >> better suited for another software/approach? ? >> >> >> -- >> *Henrique C. S. Junior* >> Industrial Chemist - UFRRJ >> M. Sc. Inorganic Chemistry - UFRRJ >> Data Processing Center - PMP >> Visite o Mundo Qu?mico >> >> _______________________________________________ >> scikit-learn mailing list >> scikit-learn at python.org >> https://mail.python.org/mailman/listinfo/scikit-learn >> >> > > _______________________________________________ > scikit-learn mailing list > scikit-learn at python.org > https://mail.python.org/mailman/listinfo/scikit-learn > > -- *Henrique C. S. Junior* Industrial Chemist - UFRRJ M. Sc. Inorganic Chemistry - UFRRJ Data Processing Center - PMP Visite o Mundo Qu?mico -------------- next part -------------- An HTML attachment was scrubbed... URL: From tommaso.costanzo01 at gmail.com Mon Mar 27 15:15:57 2017 From: tommaso.costanzo01 at gmail.com (Tommaso Costanzo) Date: Mon, 27 Mar 2017 15:15:57 -0400 Subject: [scikit-learn] Using Scikit-Learn to predict magnetism in chemical systems In-Reply-To: References: Message-ID: Dear Henrique, I agree with Robert on the use of a supervised algorithm and I would also suggest you to try a semisupervised one if you have trouble in labeling your data. Moreover, as a chemist I think that the input you are thinking to use is not the in the best form for machine learning because you are trying to predict coupling J(AB) but in the future space you have only coordinates (XYZ). What I suggest is to generate the pair of atoms externally and then use a matrix of the form (Mx3), where M are the pairs of atoms you want to predict your J and 3 are the features of the two atoms (distance, angle, unpaired electrons). For a supervised approach you will need a training set where the J is know so your training data will be of the form Mx4 and the fourth feature will be the J you know. Hope that this is clear, if not I will be happy to help more Sincerely Tommaso 2017-03-27 13:46 GMT-04:00 Henrique C. S. Junior : > Dear Robert, thank you. Yes, I'd like to talk about some specifics on the > project. > Thank you again. > > On Mon, Mar 27, 2017 at 2:25 PM, Robert Slater wrote: > >> You definitely can use some of the tools in sci-kit learn for supervised >> machine learning. The real trick will be how well your training system is >> representative of your future predictions. All of the various regression >> algorithms would be of some value and you make even consider an ensemble to >> help generalize. There will be some important questions to answer--what >> kind of loss function do you want to look at? I assumed regression >> (continuous response) but it could also classify--paramagnetic, >> diamagnetic, ferromagnetic, etc... >> >> Another task to think about might be dimension reduction. >> There is no guarantee you will get fantastic results--every problem is >> unique and much will depend on exactly what you want out of the >> solution--it may be that we get '10%' accuracy at best--for some systems >> that is quite good, others it is horrible. 
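A rough sketch of the pre-processing suggested above, combined with a simple scikit-learn regressor (the pairing of centres, the choice of a bridging atom, and the toy numbers below are assumptions made only for illustration):

import numpy as np
from sklearn.ensemble import RandomForestRegressor

def distance(a, b):
    # Euclidean distance between two atoms given as XYZ coordinates
    return np.linalg.norm(np.asarray(a) - np.asarray(b))

def angle(a, b, c):
    # angle in degrees at atom b formed by atoms a-b-c
    u = np.asarray(a) - np.asarray(b)
    v = np.asarray(c) - np.asarray(b)
    cosang = np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v))
    return np.degrees(np.arccos(np.clip(cosang, -1.0, 1.0)))

# toy training pairs: coordinates of centre A, a bridging atom L, centre B,
# the number of unpaired electrons, and the measured coupling J(AB)
pairs = [
    ((0.0, 0.0, 0.0), (1.0, 0.5, 0.0), (2.0, 0.0, 0.0), 1, -12.3),
    ((0.0, 0.0, 0.0), (1.1, 0.4, 0.1), (2.1, 0.1, 0.0), 2, 4.7),
    ((0.0, 0.0, 0.0), (1.2, 0.6, 0.0), (2.4, 0.2, 0.1), 1, -8.9),
]

# (M, 3) feature matrix: A-B distance, A-L-B bridging angle, unpaired electrons
X = np.array([[distance(a, b), angle(a, l, b), n] for a, l, b, n, _ in pairs])
# (M,) target vector with the known J values for supervised training
y = np.array([p[-1] for p in pairs])

model = RandomForestRegressor(n_estimators=100, random_state=0).fit(X, y)
# predict J for a new pair described by the same three features
print(model.predict([[2.2, 120.0, 1]]))

Cross-validating over the pairs with known J, and comparing against a simple linear model, would then show whether these three features carry enough information.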
>> >> If you'd like to talk specifics, feel free to contact me at this email. >> I have a background in magnetism (PhD in magnetic multilayers--i was >> physics, but as you are probably aware chemisty and physics blend in this >> area) and have a fairly good knowledge of sci-kit learn and machine >> learning. >> >> >> >> On Mon, Mar 27, 2017 at 10:50 AM, Henrique C. S. Junior < >> henriquecsj at gmail.com> wrote: >> >>> I'm a chemist with some rudimentary programming skills (getting started >>> with python) and in the middle of the year I'll be starting a Ph.D. project >>> that uses computers to describe magnetism in molecular systems. >>> >>> Most of the time I get my results after several simulations and >>> experiments, so, I know that one of the hardest tasks in molecular >>> magnetism is to predict the nature of magnetic interactions. That's why >>> I'll try to tackle this problem with Machine Learning (because such >>> interactions are dependent, basically, of distances, angles and number of >>> unpaired electrons). The idea is to feed the computer with a large training >>> set (with number of unpaired electrons, XYZ coordinates of each molecule >>> and experimental magnetic couplings) and see if it can predict the magnetic >>> couplings (J(AB)) of new systems: >>> (see example in the attached image) >>> >>> Can Scikit-Learn handle the task, knowing that the matrix used to >>> represent atomic coordinates will probably have a different number of atoms >>> (because some molecules have more atoms than others)? Or is this a job >>> better suited for another software/approach? ? >>> >>> >>> -- >>> *Henrique C. S. Junior* >>> Industrial Chemist - UFRRJ >>> M. Sc. Inorganic Chemistry - UFRRJ >>> Data Processing Center - PMP >>> Visite o Mundo Qu?mico >>> >>> _______________________________________________ >>> scikit-learn mailing list >>> scikit-learn at python.org >>> https://mail.python.org/mailman/listinfo/scikit-learn >>> >>> >> >> _______________________________________________ >> scikit-learn mailing list >> scikit-learn at python.org >> https://mail.python.org/mailman/listinfo/scikit-learn >> >> > > > -- > *Henrique C. S. Junior* > Industrial Chemist - UFRRJ > M. Sc. Inorganic Chemistry - UFRRJ > Data Processing Center - PMP > Visite o Mundo Qu?mico > > _______________________________________________ > scikit-learn mailing list > scikit-learn at python.org > https://mail.python.org/mailman/listinfo/scikit-learn > > -- Please do NOT send Microsoft Office Attachments: http://www.gnu.org/philosophy/no-word-attachments.html -------------- next part -------------- An HTML attachment was scrubbed... URL: From henriquecsj at gmail.com Mon Mar 27 18:44:24 2017 From: henriquecsj at gmail.com (Henrique C. S. Junior) Date: Mon, 27 Mar 2017 19:44:24 -0300 Subject: [scikit-learn] Using Scikit-Learn to predict magnetism in chemical systems In-Reply-To: References: Message-ID: Dear Tommaso, thank you for your kind reply. I know I have a lot to study before actually starting any code and that's why any suggestion is so valuable. So, you're suggesting that a simplification of the system using only the paramagnetic centers can be a good approach? (I'm not sure if I understood it correctly). My main idea was, at first, try to represent the systems as realistically as possible (using coordinates). 
I know that the software will not know what a bond is or what an intermolecular interaction is but, let's say, after including 1000s of examples in the training, I was expecting that (as an example) finding a C 0.000 and an H at 1.000 should start to "make sense" because it leads to an experimental trend. And I totally agree that my way to represent the system is not the better. Thank you so much for all the help. On Mon, Mar 27, 2017 at 4:15 PM, Tommaso Costanzo < tommaso.costanzo01 at gmail.com> wrote: > Dear Henrique, > > > I agree with Robert on the use of a supervised algorithm and I would also > suggest you to try a semisupervised one if you have trouble in labeling > your data. > > > Moreover, as a chemist I think that the input you are thinking to use is > not the in the best form for machine learning because you are trying to > predict coupling J(AB) but in the future space you have only coordinates > (XYZ). What I suggest is to generate the pair of atoms externally and then > use a matrix of the form (Mx3), where M are the pairs of atoms you want to > predict your J and 3 are the features of the two atoms (distance, angle, > unpaired electrons). For a supervised approach you will need a training set > where the J is know so your training data will be of the form Mx4 and the > fourth feature will be the J you know. > > Hope that this is clear, if not I will be happy to help more > > > Sincerely > > Tommaso > > 2017-03-27 13:46 GMT-04:00 Henrique C. S. Junior : > >> Dear Robert, thank you. Yes, I'd like to talk about some specifics on the >> project. >> Thank you again. >> >> On Mon, Mar 27, 2017 at 2:25 PM, Robert Slater >> wrote: >> >>> You definitely can use some of the tools in sci-kit learn for supervised >>> machine learning. The real trick will be how well your training system is >>> representative of your future predictions. All of the various regression >>> algorithms would be of some value and you make even consider an ensemble to >>> help generalize. There will be some important questions to answer--what >>> kind of loss function do you want to look at? I assumed regression >>> (continuous response) but it could also classify--paramagnetic, >>> diamagnetic, ferromagnetic, etc... >>> >>> Another task to think about might be dimension reduction. >>> There is no guarantee you will get fantastic results--every problem is >>> unique and much will depend on exactly what you want out of the >>> solution--it may be that we get '10%' accuracy at best--for some systems >>> that is quite good, others it is horrible. >>> >>> If you'd like to talk specifics, feel free to contact me at this email. >>> I have a background in magnetism (PhD in magnetic multilayers--i was >>> physics, but as you are probably aware chemisty and physics blend in this >>> area) and have a fairly good knowledge of sci-kit learn and machine >>> learning. >>> >>> >>> >>> On Mon, Mar 27, 2017 at 10:50 AM, Henrique C. S. Junior < >>> henriquecsj at gmail.com> wrote: >>> >>>> I'm a chemist with some rudimentary programming skills (getting started >>>> with python) and in the middle of the year I'll be starting a Ph.D. project >>>> that uses computers to describe magnetism in molecular systems. >>>> >>>> Most of the time I get my results after several simulations and >>>> experiments, so, I know that one of the hardest tasks in molecular >>>> magnetism is to predict the nature of magnetic interactions. 
That's why >>>> I'll try to tackle this problem with Machine Learning (because such >>>> interactions are dependent, basically, of distances, angles and number of >>>> unpaired electrons). The idea is to feed the computer with a large training >>>> set (with number of unpaired electrons, XYZ coordinates of each molecule >>>> and experimental magnetic couplings) and see if it can predict the magnetic >>>> couplings (J(AB)) of new systems: >>>> (see example in the attached image) >>>> >>>> Can Scikit-Learn handle the task, knowing that the matrix used to >>>> represent atomic coordinates will probably have a different number of atoms >>>> (because some molecules have more atoms than others)? Or is this a job >>>> better suited for another software/approach? ? >>>> >>>> >>>> -- >>>> *Henrique C. S. Junior* >>>> Industrial Chemist - UFRRJ >>>> M. Sc. Inorganic Chemistry - UFRRJ >>>> Data Processing Center - PMP >>>> Visite o Mundo Qu?mico >>>> >>>> _______________________________________________ >>>> scikit-learn mailing list >>>> scikit-learn at python.org >>>> https://mail.python.org/mailman/listinfo/scikit-learn >>>> >>>> >>> >>> _______________________________________________ >>> scikit-learn mailing list >>> scikit-learn at python.org >>> https://mail.python.org/mailman/listinfo/scikit-learn >>> >>> >> >> >> -- >> *Henrique C. S. Junior* >> Industrial Chemist - UFRRJ >> M. Sc. Inorganic Chemistry - UFRRJ >> Data Processing Center - PMP >> Visite o Mundo Qu?mico >> >> _______________________________________________ >> scikit-learn mailing list >> scikit-learn at python.org >> https://mail.python.org/mailman/listinfo/scikit-learn >> >> > > > -- > Please do NOT send Microsoft Office Attachments: > http://www.gnu.org/philosophy/no-word-attachments.html > > _______________________________________________ > scikit-learn mailing list > scikit-learn at python.org > https://mail.python.org/mailman/listinfo/scikit-learn > > -- *Henrique C. S. Junior* Industrial Chemist - UFRRJ M. Sc. Inorganic Chemistry - UFRRJ Data Processing Center - PMP Visite o Mundo Qu?mico -------------- next part -------------- An HTML attachment was scrubbed... URL: From tommaso.costanzo01 at gmail.com Mon Mar 27 19:35:21 2017 From: tommaso.costanzo01 at gmail.com (Tommaso Costanzo) Date: Mon, 27 Mar 2017 19:35:21 -0400 Subject: [scikit-learn] Using Scikit-Learn to predict magnetism in chemical systems In-Reply-To: References: Message-ID: Dear Henrique, I am sorry for the poor email I wrote before. What I was saying is simply the fact that if you are trying to use the coordinates as "features" from an .xyz file then by machine learning you will learn at wich coordinate certain atoms will occur so you can only make prediction on the coordinate. However, if I correctly understood, the "features" representing the coupling J are distance, angle, and electron number. Definitely this properties can be derived from the XYZ file format from simple geometric calculations and the number of electrons will depend from the type of atom. So, what I was trying to say is that instead of using the XYZ file as input for scikit-learn, I was suggesting to do the calculation of angle, distances, electrons' number in advance (with other software(s) or directly in python) and use the new calculated matrix as input for scikit-learn. In this case the machine will learn how J(AB) varies as a function of angle, distance, number of electrons. For example distance angle n el. 1 90 1 1 90 1 2 90 1 .... ... ... 
If you are using a supervised learning you will have to add a 4th column ( in reality a separate column vector) with your J(AB) on which you can train your model and then predict the unknown samples For example distance angle n el. J(AB) 1 90 1 1 1 90 1 1 2 90 1 0.5 .... ... ... ... Now if you train the model on the second matrix, and then you try to predict the first one you should expect a results like: 1 1 0.5 Of course in this case the "features" are perfectly equal, hence the example is completely unrealistic. However, I hope that it will help to understand what I was explaining in the previous email. If you want you can directly contact me at this email, and I hope that you got additional hints from Robert, that he seems to be even more knowledgeable than me. Sincerely Tommaso 2017-03-27 18:44 GMT-04:00 Henrique C. S. Junior : > Dear Tommaso, thank you for your kind reply. > I know I have a lot to study before actually starting any code and that's > why any suggestion is so valuable. > So, you're suggesting that a simplification of the system using only the > paramagnetic centers can be a good approach? (I'm not sure if I understood > it correctly). > My main idea was, at first, try to represent the systems as realistically > as possible (using coordinates). I know that the software will not know > what a bond is or what an intermolecular interaction is but, let's say, > after including 1000s of examples in the training, I was expecting that (as > an example) finding a C 0.000 and an H at 1.000 should start to "make > sense" because it leads to an experimental trend. And I totally agree that > my way to represent the system is not the better. > > Thank you so much for all the help. > > On Mon, Mar 27, 2017 at 4:15 PM, Tommaso Costanzo < > tommaso.costanzo01 at gmail.com> wrote: > >> Dear Henrique, >> >> >> I agree with Robert on the use of a supervised algorithm and I would also >> suggest you to try a semisupervised one if you have trouble in labeling >> your data. >> >> >> Moreover, as a chemist I think that the input you are thinking to use is >> not the in the best form for machine learning because you are trying to >> predict coupling J(AB) but in the future space you have only coordinates >> (XYZ). What I suggest is to generate the pair of atoms externally and then >> use a matrix of the form (Mx3), where M are the pairs of atoms you want to >> predict your J and 3 are the features of the two atoms (distance, angle, >> unpaired electrons). For a supervised approach you will need a training set >> where the J is know so your training data will be of the form Mx4 and the >> fourth feature will be the J you know. >> >> Hope that this is clear, if not I will be happy to help more >> >> >> Sincerely >> >> Tommaso >> >> 2017-03-27 13:46 GMT-04:00 Henrique C. S. Junior : >> >>> Dear Robert, thank you. Yes, I'd like to talk about some specifics on >>> the project. >>> Thank you again. >>> >>> On Mon, Mar 27, 2017 at 2:25 PM, Robert Slater >>> wrote: >>> >>>> You definitely can use some of the tools in sci-kit learn for >>>> supervised machine learning. The real trick will be how well your training >>>> system is representative of your future predictions. All of the various >>>> regression algorithms would be of some value and you make even consider an >>>> ensemble to help generalize. There will be some important questions to >>>> answer--what kind of loss function do you want to look at? 
I assumed >>>> regression (continuous response) but it could also classify--paramagnetic, >>>> diamagnetic, ferromagnetic, etc... >>>> >>>> Another task to think about might be dimension reduction. >>>> There is no guarantee you will get fantastic results--every problem is >>>> unique and much will depend on exactly what you want out of the >>>> solution--it may be that we get '10%' accuracy at best--for some systems >>>> that is quite good, others it is horrible. >>>> >>>> If you'd like to talk specifics, feel free to contact me at this >>>> email. I have a background in magnetism (PhD in magnetic multilayers--i >>>> was physics, but as you are probably aware chemisty and physics blend in >>>> this area) and have a fairly good knowledge of sci-kit learn and machine >>>> learning. >>>> >>>> >>>> >>>> On Mon, Mar 27, 2017 at 10:50 AM, Henrique C. S. Junior < >>>> henriquecsj at gmail.com> wrote: >>>> >>>>> I'm a chemist with some rudimentary programming skills (getting >>>>> started with python) and in the middle of the year I'll be starting a Ph.D. >>>>> project that uses computers to describe magnetism in molecular systems. >>>>> >>>>> Most of the time I get my results after several simulations and >>>>> experiments, so, I know that one of the hardest tasks in molecular >>>>> magnetism is to predict the nature of magnetic interactions. That's why >>>>> I'll try to tackle this problem with Machine Learning (because such >>>>> interactions are dependent, basically, of distances, angles and number of >>>>> unpaired electrons). The idea is to feed the computer with a large training >>>>> set (with number of unpaired electrons, XYZ coordinates of each molecule >>>>> and experimental magnetic couplings) and see if it can predict the magnetic >>>>> couplings (J(AB)) of new systems: >>>>> (see example in the attached image) >>>>> >>>>> Can Scikit-Learn handle the task, knowing that the matrix used to >>>>> represent atomic coordinates will probably have a different number of atoms >>>>> (because some molecules have more atoms than others)? Or is this a job >>>>> better suited for another software/approach? ? >>>>> >>>>> >>>>> -- >>>>> *Henrique C. S. Junior* >>>>> Industrial Chemist - UFRRJ >>>>> M. Sc. Inorganic Chemistry - UFRRJ >>>>> Data Processing Center - PMP >>>>> Visite o Mundo Qu?mico >>>>> >>>>> _______________________________________________ >>>>> scikit-learn mailing list >>>>> scikit-learn at python.org >>>>> https://mail.python.org/mailman/listinfo/scikit-learn >>>>> >>>>> >>>> >>>> _______________________________________________ >>>> scikit-learn mailing list >>>> scikit-learn at python.org >>>> https://mail.python.org/mailman/listinfo/scikit-learn >>>> >>>> >>> >>> >>> -- >>> *Henrique C. S. Junior* >>> Industrial Chemist - UFRRJ >>> M. Sc. Inorganic Chemistry - UFRRJ >>> Data Processing Center - PMP >>> Visite o Mundo Qu?mico >>> >>> _______________________________________________ >>> scikit-learn mailing list >>> scikit-learn at python.org >>> https://mail.python.org/mailman/listinfo/scikit-learn >>> >>> >> >> >> -- >> Please do NOT send Microsoft Office Attachments: >> http://www.gnu.org/philosophy/no-word-attachments.html >> >> _______________________________________________ >> scikit-learn mailing list >> scikit-learn at python.org >> https://mail.python.org/mailman/listinfo/scikit-learn >> >> > > > -- > *Henrique C. S. Junior* > Industrial Chemist - UFRRJ > M. Sc. 
Inorganic Chemistry - UFRRJ > Data Processing Center - PMP > Visite o Mundo Qu?mico > > _______________________________________________ > scikit-learn mailing list > scikit-learn at python.org > https://mail.python.org/mailman/listinfo/scikit-learn > > -- Please do NOT send Microsoft Office Attachments: http://www.gnu.org/philosophy/no-word-attachments.html -------------- next part -------------- An HTML attachment was scrubbed... URL: From ross at cgl.ucsf.edu Tue Mar 28 01:12:22 2017 From: ross at cgl.ucsf.edu (Bill Ross) Date: Mon, 27 Mar 2017 22:12:22 -0700 Subject: [scikit-learn] Using Scikit-Learn to predict magnetism in chemical systems In-Reply-To: References: Message-ID: <95b33ee9-cf78-e7a0-775a-140ca5f03e17@cgl.ucsf.edu> Image processing deals with xy coordinates by (as I understand) training with multiple permutations of the raw data, in the form of translations and rotations in the 2d space. If training with 3d data, there would be that much more translating and rotating to do, in order to divorce the learning from the incidentals. Bill On 3/27/17 4:35 PM, Tommaso Costanzo wrote: > Dear Henrique, > I am sorry for the poor email I wrote before. What I was saying is > simply the fact that if you are trying to use the coordinates as > "features" from an .xyz file then by machine learning you will learn > at wich coordinate certain atoms will occur so you can only make > prediction on the coordinate. However, if I correctly understood, the > "features" representing the coupling J are distance, angle, and > electron number. Definitely this properties can be derived from the > XYZ file format from simple geometric calculations and the number of > electrons will depend from the type of atom. So, what I was trying to > say is that instead of using the XYZ file as input for scikit-learn, I > was suggesting to do the calculation of angle, distances, electrons' > number in advance (with other software(s) or directly in python) and > use the new calculated matrix as input for scikit-learn. In this case > the machine will learn how J(AB) varies as a function of angle, > distance, number of electrons. > For example > > distance angle n el. > 1 90 1 > 1 90 1 > 2 90 1 > .... ... ... > > If you are using a supervised learning you will have to add a 4th > column ( in reality a separate column vector) with your J(AB) on which > you can train your model and then predict the unknown samples > > For example > distance angle n el. J(AB) > 1 90 1 1 > 1 90 1 1 > 2 90 1 0.5 > .... ... ... ... > > Now if you train the model on the second matrix, and then you try to > predict the first one you should expect a results like: > > 1 > 1 > 0.5 > > Of course in this case the "features" are perfectly equal, hence the > example is completely unrealistic. However, I hope that it will help > to understand what I was explaining in the previous email. > If you want you can directly contact me at this email, and I hope that > you got additional hints from Robert, that he seems to be even more > knowledgeable than me. > > Sincerely > Tommaso > > > > 2017-03-27 18:44 GMT-04:00 Henrique C. S. Junior > >: > > Dear Tommaso, thank you for your kind reply. > I know I have a lot to study before actually starting any code and > that's why any suggestion is so valuable. > So, you're suggesting that a simplification of the system using > only the paramagnetic centers can be a good approach? (I'm not > sure if I understood it correctly). 
> My main idea was, at first, try to represent the systems as > realistically as possible (using coordinates). I know that the > software will not know what a bond is or what an intermolecular > interaction is but, let's say, after including 1000s of examples > in the training, I was expecting that (as an example) finding a C > 0.000 and an H at 1.000 should start to "make sense" because it > leads to an experimental trend. And I totally agree that my way to > represent the system is not the better. > > Thank you so much for all the help. > > On Mon, Mar 27, 2017 at 4:15 PM, Tommaso Costanzo > > wrote: > > Dear Henrique, > > > I agree with Robert on the use of a supervised algorithm and I > would also suggest you to try a semisupervised one if you have > trouble in labeling your data. > > > Moreover, as a chemist I think that the input you are thinking > to use is not the in the best form for machine learning > because you are trying to predict coupling J(AB) but in the > future space you have only coordinates (XYZ). What I suggest > is to generate the pair of atoms externally and then use a > matrix of the form (Mx3), where M are the pairs of atoms you > want to predict your J and 3 are the features of the two atoms > (distance, angle, unpaired electrons). For a supervised > approach you will need a training set where the J is know so > your training data will be of the form Mx4 and the fourth > feature will be the J you know. > > Hope that this is clear, if not I will be happy to help more > > > Sincerely > > Tommaso > > > 2017-03-27 13:46 GMT-04:00 Henrique C. S. Junior > >: > > Dear Robert, thank you. Yes, I'd like to talk about some > specifics on the project. > Thank you again. > > On Mon, Mar 27, 2017 at 2:25 PM, Robert Slater > > wrote: > > You definitely can use some of the tools in sci-kit > learn for supervised machine learning. The real trick > will be how well your training system is > representative of your future predictions. All of the > various regression algorithms would be of some value > and you make even consider an ensemble to help > generalize. There will be some important questions to > answer--what kind of loss function do you want to look > at? I assumed regression (continuous response) but it > could also classify--paramagnetic, diamagnetic, > ferromagnetic, etc... > > Another task to think about might be dimension reduction. > There is no guarantee you will get fantastic > results--every problem is unique and much will depend > on exactly what you want out of the solution--it may > be that we get '10%' accuracy at best--for some > systems that is quite good, others it is horrible. > > If you'd like to talk specifics, feel free to contact > me at this email. I have a background in magnetism > (PhD in magnetic multilayers--i was physics, but as > you are probably aware chemisty and physics blend in > this area) and have a fairly good knowledge of sci-kit > learn and machine learning. > > > > On Mon, Mar 27, 2017 at 10:50 AM, Henrique C. S. > Junior > wrote: > > I'm a chemist with some rudimentary programming > skills (getting started with python) and in the > middle of the year I'll be starting a Ph.D. > project that uses computers to describe magnetism > in molecular systems. > > Most of the time I get my results after several > simulations and experiments, so, I know that one > of the hardest tasks in molecular magnetism is to > predict the nature of magnetic interactions. 
> That's why I'll try to tackle this problem with > Machine Learning (because such interactions are > dependent, basically, of distances, angles and > number of unpaired electrons). The idea is to feed > the computer with a large training set (with > number of unpaired electrons, XYZ coordinates of > each molecule and experimental magnetic couplings) > and see if it can predict the magnetic couplings > (J(AB)) of new systems: > > (see example in the attached image) > > Can Scikit-Learn handle the task, knowing that the > matrix used to represent atomic coordinates will > probably have a different number of atoms (because > some molecules have more atoms than others)? Or is > this a job better suited for another > software/approach? ? > > > -- > *Henrique C. S. Junior* > Industrial Chemist - UFRRJ > M. Sc. Inorganic Chemistry - UFRRJ > Data Processing Center - PMP > Visite o Mundo Qu?mico > > _______________________________________________ > scikit-learn mailing list > scikit-learn at python.org > > https://mail.python.org/mailman/listinfo/scikit-learn > > > > > _______________________________________________ > scikit-learn mailing list > scikit-learn at python.org > https://mail.python.org/mailman/listinfo/scikit-learn > > > > > > -- > *Henrique C. S. Junior* > Industrial Chemist - UFRRJ > M. Sc. Inorganic Chemistry - UFRRJ > Data Processing Center - PMP > Visite o Mundo Qu?mico > > _______________________________________________ > scikit-learn mailing list > scikit-learn at python.org > https://mail.python.org/mailman/listinfo/scikit-learn > > > > > > -- > Please do NOT send Microsoft Office Attachments: > http://www.gnu.org/philosophy/no-word-attachments.html > > > _______________________________________________ > scikit-learn mailing list > scikit-learn at python.org > https://mail.python.org/mailman/listinfo/scikit-learn > > > > > > -- > *Henrique C. S. Junior* > Industrial Chemist - UFRRJ > M. Sc. Inorganic Chemistry - UFRRJ > Data Processing Center - PMP > Visite o Mundo Qu?mico > > _______________________________________________ > scikit-learn mailing list > scikit-learn at python.org > https://mail.python.org/mailman/listinfo/scikit-learn > > > > > > -- > Please do NOT send Microsoft Office Attachments: > http://www.gnu.org/philosophy/no-word-attachments.html > > > _______________________________________________ > scikit-learn mailing list > scikit-learn at python.org > https://mail.python.org/mailman/listinfo/scikit-learn -------------- next part -------------- An HTML attachment was scrubbed... URL: From henriquecsj at gmail.com Tue Mar 28 12:48:14 2017 From: henriquecsj at gmail.com (Henrique C. S. Junior) Date: Tue, 28 Mar 2017 13:48:14 -0300 Subject: [scikit-learn] Using Scikit-Learn to predict magnetism in chemical systems In-Reply-To: <95b33ee9-cf78-e7a0-775a-140ca5f03e17@cgl.ucsf.edu> References: <95b33ee9-cf78-e7a0-775a-140ca5f03e17@cgl.ucsf.edu> Message-ID: @Tommaso, this is something like Internal Coordinates[1], right? @Bill, thanks for the hint, I'll definitely take a look at this. [1] - https://en.wikipedia.org/wiki/Z-matrix_(chemistry) On Tue, Mar 28, 2017 at 2:12 AM, Bill Ross wrote: > Image processing deals with xy coordinates by (as I understand) training > with multiple permutations of the raw data, in the form of translations and > rotations in the 2d space. If training with 3d data, there would be that > much more translating and rotating to do, in order to divorce the learning > from the incidentals. 
> > Bill > > On 3/27/17 4:35 PM, Tommaso Costanzo wrote: > > Dear Henrique, > I am sorry for the poor email I wrote before. What I was saying is simply > the fact that if you are trying to use the coordinates as "features" from > an .xyz file then by machine learning you will learn at wich coordinate > certain atoms will occur so you can only make prediction on the coordinate. > However, if I correctly understood, the "features" representing the > coupling J are distance, angle, and electron number. Definitely this > properties can be derived from the XYZ file format from simple geometric > calculations and the number of electrons will depend from the type of atom. > So, what I was trying to say is that instead of using the XYZ file as input > for scikit-learn, I was suggesting to do the calculation of angle, > distances, electrons' number in advance (with other software(s) or directly > in python) and use the new calculated matrix as input for scikit-learn. In > this case the machine will learn how J(AB) varies as a function of angle, > distance, number of electrons. > For example > > distance angle n el. > 1 90 1 > 1 90 1 > 2 90 1 > .... ... ... > > If you are using a supervised learning you will have to add a 4th column ( > in reality a separate column vector) with your J(AB) on which you can train > your model and then predict the unknown samples > > For example > distance angle n el. J(AB) > 1 90 1 1 > 1 90 1 1 > 2 90 1 0.5 > .... ... ... ... > > Now if you train the model on the second matrix, and then you try to > predict the first one you should expect a results like: > > 1 > 1 > 0.5 > > Of course in this case the "features" are perfectly equal, hence the > example is completely unrealistic. However, I hope that it will help to > understand what I was explaining in the previous email. > If you want you can directly contact me at this email, and I hope that you > got additional hints from Robert, that he seems to be even more > knowledgeable than me. > > Sincerely > Tommaso > > > > 2017-03-27 18:44 GMT-04:00 Henrique C. S. Junior : > >> Dear Tommaso, thank you for your kind reply. >> I know I have a lot to study before actually starting any code and that's >> why any suggestion is so valuable. >> So, you're suggesting that a simplification of the system using only the >> paramagnetic centers can be a good approach? (I'm not sure if I understood >> it correctly). >> My main idea was, at first, try to represent the systems as realistically >> as possible (using coordinates). I know that the software will not know >> what a bond is or what an intermolecular interaction is but, let's say, >> after including 1000s of examples in the training, I was expecting that (as >> an example) finding a C 0.000 and an H at 1.000 should start to "make >> sense" because it leads to an experimental trend. And I totally agree that >> my way to represent the system is not the better. >> >> Thank you so much for all the help. >> >> On Mon, Mar 27, 2017 at 4:15 PM, Tommaso Costanzo < >> tommaso.costanzo01 at gmail.com> wrote: >> >>> Dear Henrique, >>> >>> >>> I agree with Robert on the use of a supervised algorithm and I would >>> also suggest you to try a semisupervised one if you have trouble in >>> labeling your data. >>> >>> >>> Moreover, as a chemist I think that the input you are thinking to use is >>> not the in the best form for machine learning because you are trying to >>> predict coupling J(AB) but in the future space you have only coordinates >>> (XYZ). 
What I suggest is to generate the pair of atoms externally and then >>> use a matrix of the form (Mx3), where M are the pairs of atoms you want to >>> predict your J and 3 are the features of the two atoms (distance, angle, >>> unpaired electrons). For a supervised approach you will need a training set >>> where the J is know so your training data will be of the form Mx4 and the >>> fourth feature will be the J you know. >>> >>> Hope that this is clear, if not I will be happy to help more >>> >>> >>> Sincerely >>> >>> Tommaso >>> >>> 2017-03-27 13:46 GMT-04:00 Henrique C. S. Junior >>> : >>> >>>> Dear Robert, thank you. Yes, I'd like to talk about some specifics on >>>> the project. >>>> Thank you again. >>>> >>>> On Mon, Mar 27, 2017 at 2:25 PM, Robert Slater >>>> wrote: >>>> >>>>> You definitely can use some of the tools in sci-kit learn for >>>>> supervised machine learning. The real trick will be how well your training >>>>> system is representative of your future predictions. All of the various >>>>> regression algorithms would be of some value and you make even consider an >>>>> ensemble to help generalize. There will be some important questions to >>>>> answer--what kind of loss function do you want to look at? I assumed >>>>> regression (continuous response) but it could also classify--paramagnetic, >>>>> diamagnetic, ferromagnetic, etc... >>>>> >>>>> Another task to think about might be dimension reduction. >>>>> There is no guarantee you will get fantastic results--every problem is >>>>> unique and much will depend on exactly what you want out of the >>>>> solution--it may be that we get '10%' accuracy at best--for some systems >>>>> that is quite good, others it is horrible. >>>>> >>>>> If you'd like to talk specifics, feel free to contact me at this >>>>> email. I have a background in magnetism (PhD in magnetic multilayers--i >>>>> was physics, but as you are probably aware chemisty and physics blend in >>>>> this area) and have a fairly good knowledge of sci-kit learn and machine >>>>> learning. >>>>> >>>>> >>>>> >>>>> On Mon, Mar 27, 2017 at 10:50 AM, Henrique C. S. Junior < >>>>> henriquecsj at gmail.com> wrote: >>>>> >>>>>> I'm a chemist with some rudimentary programming skills (getting >>>>>> started with python) and in the middle of the year I'll be starting a Ph.D. >>>>>> project that uses computers to describe magnetism in molecular systems. >>>>>> >>>>>> Most of the time I get my results after several simulations and >>>>>> experiments, so, I know that one of the hardest tasks in molecular >>>>>> magnetism is to predict the nature of magnetic interactions. That's why >>>>>> I'll try to tackle this problem with Machine Learning (because such >>>>>> interactions are dependent, basically, of distances, angles and number of >>>>>> unpaired electrons). The idea is to feed the computer with a large training >>>>>> set (with number of unpaired electrons, XYZ coordinates of each molecule >>>>>> and experimental magnetic couplings) and see if it can predict the magnetic >>>>>> couplings (J(AB)) of new systems: >>>>>> (see example in the attached image) >>>>>> >>>>>> Can Scikit-Learn handle the task, knowing that the matrix used to >>>>>> represent atomic coordinates will probably have a different number of atoms >>>>>> (because some molecules have more atoms than others)? Or is this a job >>>>>> better suited for another software/approach? ? >>>>>> >>>>>> >>>>>> -- >>>>>> *Henrique C. S. Junior* >>>>>> Industrial Chemist - UFRRJ >>>>>> M. Sc. 
Inorganic Chemistry - UFRRJ >>>>>> Data Processing Center - PMP >>>>>> Visite o Mundo Qu?mico >>>>>> >>>>>> _______________________________________________ >>>>>> scikit-learn mailing list >>>>>> scikit-learn at python.org >>>>>> https://mail.python.org/mailman/listinfo/scikit-learn >>>>>> >>>>>> >>>>> >>>>> _______________________________________________ >>>>> scikit-learn mailing list >>>>> scikit-learn at python.org >>>>> https://mail.python.org/mailman/listinfo/scikit-learn >>>>> >>>>> >>>> >>>> >>>> -- >>>> *Henrique C. S. Junior* >>>> Industrial Chemist - UFRRJ >>>> M. Sc. Inorganic Chemistry - UFRRJ >>>> Data Processing Center - PMP >>>> Visite o Mundo Qu?mico >>>> >>>> _______________________________________________ >>>> scikit-learn mailing list >>>> scikit-learn at python.org >>>> https://mail.python.org/mailman/listinfo/scikit-learn >>>> >>>> >>> >>> >>> -- >>> Please do NOT send Microsoft Office Attachments: >>> http://www.gnu.org/philosophy/no-word-attachments.html >>> >>> _______________________________________________ >>> scikit-learn mailing list >>> scikit-learn at python.org >>> https://mail.python.org/mailman/listinfo/scikit-learn >>> >>> >> >> >> -- >> *Henrique C. S. Junior* >> Industrial Chemist - UFRRJ >> M. Sc. Inorganic Chemistry - UFRRJ >> Data Processing Center - PMP >> Visite o Mundo Qu?mico >> >> _______________________________________________ >> scikit-learn mailing list >> scikit-learn at python.org >> https://mail.python.org/mailman/listinfo/scikit-learn >> >> > > > -- > Please do NOT send Microsoft Office Attachments: > http://www.gnu.org/philosophy/no-word-attachments.html > > > _______________________________________________ > scikit-learn mailing listscikit-learn at python.orghttps://mail.python.org/mailman/listinfo/scikit-learn > > > > _______________________________________________ > scikit-learn mailing list > scikit-learn at python.org > https://mail.python.org/mailman/listinfo/scikit-learn > > -- *Henrique C. S. Junior* Industrial Chemist - UFRRJ M. Sc. Inorganic Chemistry - UFRRJ Data Processing Center - PMP Visite o Mundo Qu?mico -------------- next part -------------- An HTML attachment was scrubbed... URL: From ross at cgl.ucsf.edu Tue Mar 28 13:07:57 2017 From: ross at cgl.ucsf.edu (Bill Ross) Date: Tue, 28 Mar 2017 10:07:57 -0700 Subject: [scikit-learn] Using Scikit-Learn to predict magnetism in chemical systems In-Reply-To: References: <95b33ee9-cf78-e7a0-775a-140ca5f03e17@cgl.ucsf.edu> Message-ID: <1d5a4898-3644-dca7-4df3-79959c8b0d0b@cgl.ucsf.edu> I think I saw it in the Deep Learning book: http://www.deeplearningbook.org/ Bill On 3/28/17 9:48 AM, Henrique C. S. Junior wrote: > @Tommaso, this is something like Internal Coordinates[1], right? > @Bill, thanks for the hint, I'll definitely take a look at this. > > [1] - https://en.wikipedia.org/wiki/Z-matrix_(chemistry) > > > On Tue, Mar 28, 2017 at 2:12 AM, Bill Ross > wrote: > > Image processing deals with xy coordinates by (as I understand) > training with multiple permutations of the raw data, in the form > of translations and rotations in the 2d space. If training with 3d > data, there would be that much more translating and rotating to > do, in order to divorce the learning from the incidentals. > > Bill > > > On 3/27/17 4:35 PM, Tommaso Costanzo wrote: >> Dear Henrique, >> I am sorry for the poor email I wrote before. 
What I was saying >> is simply the fact that if you are trying to use the coordinates >> as "features" from an .xyz file then by machine learning you will >> learn at wich coordinate certain atoms will occur so you can only >> make prediction on the coordinate. However, if I correctly >> understood, the "features" representing the coupling J are >> distance, angle, and electron number. Definitely this properties >> can be derived from the XYZ file format from simple geometric >> calculations and the number of electrons will depend from the >> type of atom. So, what I was trying to say is that instead of >> using the XYZ file as input for scikit-learn, I was suggesting to >> do the calculation of angle, distances, electrons' number in >> advance (with other software(s) or directly in python) and use >> the new calculated matrix as input for scikit-learn. In this case >> the machine will learn how J(AB) varies as a function of angle, >> distance, number of electrons. >> For example >> >> distance angle n el. >> 1 90 1 >> 1 90 1 >> 2 90 1 >> .... ... ... >> >> If you are using a supervised learning you will have to add a 4th >> column ( in reality a separate column vector) with your J(AB) on >> which you can train your model and then predict the unknown samples >> >> For example >> distance angle n el. J(AB) >> 1 90 1 1 >> 1 90 1 1 >> 2 90 1 0.5 >> .... ... ... ... >> >> Now if you train the model on the second matrix, and then you try >> to predict the first one you should expect a results like: >> >> 1 >> 1 >> 0.5 >> >> Of course in this case the "features" are perfectly equal, hence >> the example is completely unrealistic. However, I hope that it >> will help to understand what I was explaining in the previous email. >> If you want you can directly contact me at this email, and I hope >> that you got additional hints from Robert, that he seems to be >> even more knowledgeable than me. >> >> Sincerely >> Tommaso >> >> >> >> 2017-03-27 18:44 GMT-04:00 Henrique C. S. Junior >> >: >> >> Dear Tommaso, thank you for your kind reply. >> I know I have a lot to study before actually starting any >> code and that's why any suggestion is so valuable. >> So, you're suggesting that a simplification of the system >> using only the paramagnetic centers can be a good approach? >> (I'm not sure if I understood it correctly). >> My main idea was, at first, try to represent the systems as >> realistically as possible (using coordinates). I know that >> the software will not know what a bond is or what an >> intermolecular interaction is but, let's say, after including >> 1000s of examples in the training, I was expecting that (as >> an example) finding a C 0.000 and an H at 1.000 should start >> to "make sense" because it leads to an experimental trend. >> And I totally agree that my way to represent the system is >> not the better. >> >> Thank you so much for all the help. >> >> On Mon, Mar 27, 2017 at 4:15 PM, Tommaso Costanzo >> > > wrote: >> >> Dear Henrique, >> >> >> I agree with Robert on the use of a supervised algorithm >> and I would also suggest you to try a semisupervised one >> if you have trouble in labeling your data. >> >> >> Moreover, as a chemist I think that the input you are >> thinking to use is not the in the best form for machine >> learning because you are trying to predict coupling J(AB) >> but in the future space you have only coordinates (XYZ). 
>> What I suggest is to generate the pair of atoms >> externally and then use a matrix of the form (Mx3), where >> M are the pairs of atoms you want to predict your J and 3 >> are the features of the two atoms (distance, angle, >> unpaired electrons). For a supervised approach you will >> need a training set where the J is know so your training >> data will be of the form Mx4 and the fourth feature will >> be the J you know. >> >> Hope that this is clear, if not I will be happy to help more >> >> >> Sincerely >> >> Tommaso >> >> >> 2017-03-27 13:46 GMT-04:00 Henrique C. S. Junior >> >: >> >> Dear Robert, thank you. Yes, I'd like to talk about >> some specifics on the project. >> Thank you again. >> >> On Mon, Mar 27, 2017 at 2:25 PM, Robert Slater >> > wrote: >> >> You definitely can use some of the tools in >> sci-kit learn for supervised machine learning. >> The real trick will be how well your training >> system is representative of your future >> predictions. All of the various regression >> algorithms would be of some value and you make >> even consider an ensemble to help generalize. >> There will be some important questions to >> answer--what kind of loss function do you want to >> look at? I assumed regression (continuous >> response) but it could also >> classify--paramagnetic, diamagnetic, >> ferromagnetic, etc... >> >> Another task to think about might be dimension >> reduction. >> There is no guarantee you will get fantastic >> results--every problem is unique and much will >> depend on exactly what you want out of the >> solution--it may be that we get '10%' accuracy at >> best--for some systems that is quite good, others >> it is horrible. >> >> If you'd like to talk specifics, feel free to >> contact me at this email. I have a background in >> magnetism (PhD in magnetic multilayers--i was >> physics, but as you are probably aware chemisty >> and physics blend in this area) and have a fairly >> good knowledge of sci-kit learn and machine >> learning. >> >> >> >> On Mon, Mar 27, 2017 at 10:50 AM, Henrique C. S. >> Junior > > wrote: >> >> I'm a chemist with some rudimentary >> programming skills (getting started with >> python) and in the middle of the year I'll be >> starting a Ph.D. project that uses computers >> to describe magnetism in molecular systems. >> >> Most of the time I get my results after >> several simulations and experiments, so, I >> know that one of the hardest tasks in >> molecular magnetism is to predict the nature >> of magnetic interactions. That's why I'll try >> to tackle this problem with Machine Learning >> (because such interactions are dependent, >> basically, of distances, angles and number of >> unpaired electrons). The idea is to feed the >> computer with a large training set (with >> number of unpaired electrons, XYZ coordinates >> of each molecule and experimental magnetic >> couplings) and see if it can predict the >> magnetic couplings (J(AB)) of new systems: >> >> (see example in the attached image) >> >> Can Scikit-Learn handle the task, knowing >> that the matrix used to represent atomic >> coordinates will probably have a different >> number of atoms (because some molecules have >> more atoms than others)? Or is this a job >> better suited for another software/approach? ? >> >> >> -- >> *Henrique C. S. Junior* >> Industrial Chemist - UFRRJ >> M. Sc. 
>> Inorganic Chemistry - UFRRJ >> Data Processing Center - PMP >> Visite o Mundo Químico >> >> _______________________________________________ >> scikit-learn mailing list >> scikit-learn at python.org >> https://mail.python.org/mailman/listinfo/scikit-learn >> >> >> -- >> Please do NOT send Microsoft Office Attachments: >> http://www.gnu.org/philosophy/no-word-attachments.html >> >> _______________________________________________ >> scikit-learn mailing list >> scikit-learn at python.org >> https://mail.python.org/mailman/listinfo/scikit-learn >> > > -- > *Henrique C. S. Junior* > Industrial Chemist - UFRRJ > M. Sc. Inorganic Chemistry - UFRRJ > Data Processing Center - PMP > Visite o Mundo Químico > > _______________________________________________ > scikit-learn mailing list > scikit-learn at python.org > https://mail.python.org/mailman/listinfo/scikit-learn -------------- next part -------------- An HTML attachment was scrubbed... URL: From tommaso.costanzo01 at gmail.com Tue Mar 28 14:57:52 2017 From: tommaso.costanzo01 at gmail.com (Tommaso Costanzo) Date: Tue, 28 Mar 2017 14:57:52 -0400 Subject: [scikit-learn] Using Scikit-Learn to predict magnetism in chemical systems In-Reply-To: References: <95b33ee9-cf78-e7a0-775a-140ca5f03e17@cgl.ucsf.edu> Message-ID: Dear Henrique, Yes, my previous representation looks like a Z-matrix format (BTW, in scikit-learn you will need the same number of columns on every line, so you will need to fill in the first line somehow). However, I will take this email as an opportunity to stress that you do not have to stick to a specific file format: the features/columns (the 2nd index of a 2D matrix) have to represent the properties/parameters that directly affect what you are trying to predict. In fact, you can even add more features to your columns than the ones previously cited, and you will probably describe the system better. Just for the sake of a better explanation, off the top of my head you could use: bond length, atom type, number of unpaired electrons, total number of electrons, dihedral angle of the two atoms, number of atoms between the pair (e.g.
if you have Mn--O--Mn there is an oxygen between the two Mn that you might want to account for when looking at the coupling) and so on and so forth. The number of parameters you will have to use will solely depend on your system and what you need to describe, but it will not affect the scikit-learn routines in any case. Basically, any 2D matrix of numbers will work in scikit-learn, but whether those numbers carry physical meaning depends on what they represent. Let me know if it makes more sense. (A minimal, purely illustrative scikit-learn sketch of this pair-feature workflow follows just after this message.) Sincerely Tommaso On Mar 28, 2017 12:51 PM, "Henrique C. S. Junior" wrote: @Tommaso, this is something like Internal Coordinates[1], right? @Bill, thanks for the hint, I'll definitely take a look at this. [1] - https://en.wikipedia.org/wiki/Z-matrix_(chemistry) On Tue, Mar 28, 2017 at 2:12 AM, Bill Ross wrote: > Image processing deals with xy coordinates by (as I understand) training > with multiple permutations of the raw data, in the form of translations and > rotations in the 2d space. If training with 3d data, there would be that > much more translating and rotating to do, in order to divorce the learning > from the incidentals. > > Bill > > On 3/27/17 4:35 PM, Tommaso Costanzo wrote: > > Dear Henrique, > I am sorry for the poor email I wrote before. What I was saying is simply > the fact that if you are trying to use the coordinates as "features" from > an .xyz file then by machine learning you will learn at wich coordinate > certain atoms will occur so you can only make prediction on the coordinate. > However, if I correctly understood, the "features" representing the > coupling J are distance, angle, and electron number. Definitely this > properties can be derived from the XYZ file format from simple geometric > calculations and the number of electrons will depend from the type of atom. > So, what I was trying to say is that instead of using the XYZ file as input > for scikit-learn, I was suggesting to do the calculation of angle, > distances, electrons' number in advance (with other software(s) or directly > in python) and use the new calculated matrix as input for scikit-learn. In > this case the machine will learn how J(AB) varies as a function of angle, > distance, number of electrons. > For example > > distance angle n el. > 1 90 1 > 1 90 1 > 2 90 1 > .... ... ... > > If you are using a supervised learning you will have to add a 4th column ( > in reality a separate column vector) with your J(AB) on which you can train > your model and then predict the unknown samples > > For example > distance angle n el. J(AB) > 1 90 1 1 > 1 90 1 1 > 2 90 1 0.5 > .... ... ... ... > > Now if you train the model on the second matrix, and then you try to > predict the first one you should expect a results like: > > 1 > 1 > 0.5 > > Of course in this case the "features" are perfectly equal, hence the > example is completely unrealistic. However, I hope that it will help to > understand what I was explaining in the previous email. > If you want you can directly contact me at this email, and I hope that you > got additional hints from Robert, that he seems to be even more > knowledgeable than me. > > Sincerely > Tommaso > > > > 2017-03-27 18:44 GMT-04:00 Henrique C. S. Junior : > >> Dear Tommaso, thank you for your kind reply. >> I know I have a lot to study before actually starting any code and that's >> why any suggestion is so valuable. >> So, you're suggesting that a simplification of the system using only the >> paramagnetic centers can be a good approach? (I'm not sure if I understood >> it correctly).
>> My main idea was, at first, try to represent the systems as realistically >> as possible (using coordinates). I know that the software will not know >> what a bond is or what an intermolecular interaction is but, let's say, >> after including 1000s of examples in the training, I was expecting that (as >> an example) finding a C 0.000 and an H at 1.000 should start to "make >> sense" because it leads to an experimental trend. And I totally agree that >> my way to represent the system is not the better. >> >> Thank you so much for all the help. >> >> On Mon, Mar 27, 2017 at 4:15 PM, Tommaso Costanzo < >> tommaso.costanzo01 at gmail.com> wrote: >> >>> Dear Henrique, >>> >>> >>> I agree with Robert on the use of a supervised algorithm and I would >>> also suggest you to try a semisupervised one if you have trouble in >>> labeling your data. >>> >>> >>> Moreover, as a chemist I think that the input you are thinking to use is >>> not the in the best form for machine learning because you are trying to >>> predict coupling J(AB) but in the future space you have only coordinates >>> (XYZ). What I suggest is to generate the pair of atoms externally and then >>> use a matrix of the form (Mx3), where M are the pairs of atoms you want to >>> predict your J and 3 are the features of the two atoms (distance, angle, >>> unpaired electrons). For a supervised approach you will need a training set >>> where the J is know so your training data will be of the form Mx4 and the >>> fourth feature will be the J you know. >>> >>> Hope that this is clear, if not I will be happy to help more >>> >>> >>> Sincerely >>> >>> Tommaso >>> >>> 2017-03-27 13:46 GMT-04:00 Henrique C. S. Junior >>> : >>> >>>> Dear Robert, thank you. Yes, I'd like to talk about some specifics on >>>> the project. >>>> Thank you again. >>>> >>>> On Mon, Mar 27, 2017 at 2:25 PM, Robert Slater >>>> wrote: >>>> >>>>> You definitely can use some of the tools in sci-kit learn for >>>>> supervised machine learning. The real trick will be how well your training >>>>> system is representative of your future predictions. All of the various >>>>> regression algorithms would be of some value and you make even consider an >>>>> ensemble to help generalize. There will be some important questions to >>>>> answer--what kind of loss function do you want to look at? I assumed >>>>> regression (continuous response) but it could also classify--paramagnetic, >>>>> diamagnetic, ferromagnetic, etc... >>>>> >>>>> Another task to think about might be dimension reduction. >>>>> There is no guarantee you will get fantastic results--every problem is >>>>> unique and much will depend on exactly what you want out of the >>>>> solution--it may be that we get '10%' accuracy at best--for some systems >>>>> that is quite good, others it is horrible. >>>>> >>>>> If you'd like to talk specifics, feel free to contact me at this >>>>> email. I have a background in magnetism (PhD in magnetic multilayers--i >>>>> was physics, but as you are probably aware chemisty and physics blend in >>>>> this area) and have a fairly good knowledge of sci-kit learn and machine >>>>> learning. >>>>> >>>>> >>>>> >>>>> On Mon, Mar 27, 2017 at 10:50 AM, Henrique C. S. Junior < >>>>> henriquecsj at gmail.com> wrote: >>>>> >>>>>> I'm a chemist with some rudimentary programming skills (getting >>>>>> started with python) and in the middle of the year I'll be starting a Ph.D. >>>>>> project that uses computers to describe magnetism in molecular systems. 
>>>>>> >>>>>> Most of the time I get my results after several simulations and >>>>>> experiments, so, I know that one of the hardest tasks in molecular >>>>>> magnetism is to predict the nature of magnetic interactions. That's why >>>>>> I'll try to tackle this problem with Machine Learning (because such >>>>>> interactions are dependent, basically, of distances, angles and number of >>>>>> unpaired electrons). The idea is to feed the computer with a large training >>>>>> set (with number of unpaired electrons, XYZ coordinates of each molecule >>>>>> and experimental magnetic couplings) and see if it can predict the magnetic >>>>>> couplings (J(AB)) of new systems: >>>>>> (see example in the attached image) >>>>>> >>>>>> Can Scikit-Learn handle the task, knowing that the matrix used to >>>>>> represent atomic coordinates will probably have a different number of atoms >>>>>> (because some molecules have more atoms than others)? Or is this a job >>>>>> better suited for another software/approach? ? >>>>>> >>>>>> >>>>>> -- >>>>>> *Henrique C. S. Junior* >>>>>> Industrial Chemist - UFRRJ >>>>>> M. Sc. Inorganic Chemistry - UFRRJ >>>>>> Data Processing Center - PMP >>>>>> Visite o Mundo Qu?mico >>>>>> >>>>>> _______________________________________________ >>>>>> scikit-learn mailing list >>>>>> scikit-learn at python.org >>>>>> https://mail.python.org/mailman/listinfo/scikit-learn >>>>>> >>>>>> >>>>> >>>>> _______________________________________________ >>>>> scikit-learn mailing list >>>>> scikit-learn at python.org >>>>> https://mail.python.org/mailman/listinfo/scikit-learn >>>>> >>>>> >>>> >>>> >>>> -- >>>> *Henrique C. S. Junior* >>>> Industrial Chemist - UFRRJ >>>> M. Sc. Inorganic Chemistry - UFRRJ >>>> Data Processing Center - PMP >>>> Visite o Mundo Qu?mico >>>> >>>> _______________________________________________ >>>> scikit-learn mailing list >>>> scikit-learn at python.org >>>> https://mail.python.org/mailman/listinfo/scikit-learn >>>> >>>> >>> >>> >>> -- >>> Please do NOT send Microsoft Office Attachments: >>> http://www.gnu.org/philosophy/no-word-attachments.html >>> >>> _______________________________________________ >>> scikit-learn mailing list >>> scikit-learn at python.org >>> https://mail.python.org/mailman/listinfo/scikit-learn >>> >>> >> >> >> -- >> *Henrique C. S. Junior* >> Industrial Chemist - UFRRJ >> M. Sc. Inorganic Chemistry - UFRRJ >> Data Processing Center - PMP >> Visite o Mundo Qu?mico >> >> _______________________________________________ >> scikit-learn mailing list >> scikit-learn at python.org >> https://mail.python.org/mailman/listinfo/scikit-learn >> >> > > > -- > Please do NOT send Microsoft Office Attachments: > http://www.gnu.org/philosophy/no-word-attachments.html > > > _______________________________________________ > scikit-learn mailing listscikit-learn at python.orghttps://mail.python.org/mailman/listinfo/scikit-learn > > > > _______________________________________________ > scikit-learn mailing list > scikit-learn at python.org > https://mail.python.org/mailman/listinfo/scikit-learn > > -- *Henrique C. S. Junior* Industrial Chemist - UFRRJ M. Sc. Inorganic Chemistry - UFRRJ Data Processing Center - PMP Visite o Mundo Qu?mico _______________________________________________ scikit-learn mailing list scikit-learn at python.org https://mail.python.org/mailman/listinfo/scikit-learn -------------- next part -------------- An HTML attachment was scrubbed... 
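A minimal, purely illustrative sketch of the pair-feature workflow described in this thread: one row per magnetic pair A-B, with distance, bridging angle and number of unpaired electrons as features, plus the experimentally known J(AB) for the training rows. The numbers, column choices and the choice of RandomForestRegressor below are made-up assumptions, not code from the thread; any scikit-learn regressor (e.g. SVR) could be dropped in instead:

    import numpy as np
    from sklearn.ensemble import RandomForestRegressor

    # One row per magnetic pair A-B: [distance, bridging angle (deg), unpaired electrons]
    X_train = np.array([
        [2.0,  90.0, 1],
        [2.0,  95.0, 1],
        [2.1, 100.0, 2],
        [2.3, 120.0, 2],
        [2.4, 130.0, 1],
        [2.5, 140.0, 2],
    ])
    # Experimentally determined coupling constants J(AB) for the training pairs (arbitrary units)
    y_train = np.array([10.0, 8.5, -2.0, -15.0, -20.0, -30.0])

    model = RandomForestRegressor(n_estimators=200, random_state=0)
    model.fit(X_train, y_train)

    # Predict J(AB) for a new, unlabelled pair described by the same three features
    X_new = np.array([[2.2, 110.0, 2]])
    print(model.predict(X_new))

With real data the feature matrix would be computed from the structures as discussed above, and the model and its hyperparameters would be selected by cross-validation.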
URL: From ahowe42 at gmail.com Wed Mar 29 03:21:20 2017 From: ahowe42 at gmail.com (Andrew Howe) Date: Wed, 29 Mar 2017 10:21:20 +0300 Subject: [scikit-learn] decision trees Message-ID: Is one-hot encoding still the most accurate way to pass categorical variables to decision trees in scikit-learn (i.e. without causing spurious ordering/interpolation)? Thanks. Andrew <~~~~~~~~~~~~~~~~~~~~~~~~~~~> J. Andrew Howe, PhD www.andrewhowe.com http://www.linkedin.com/in/ahowe42 https://www.researchgate.net/profile/John_Howe12/ I live to learn, so I can learn to live. - me <~~~~~~~~~~~~~~~~~~~~~~~~~~~> -------------- next part -------------- An HTML attachment was scrubbed... URL: From olivier.grisel at ensta.org Wed Mar 29 03:32:39 2017 From: olivier.grisel at ensta.org (Olivier Grisel) Date: Wed, 29 Mar 2017 09:32:39 +0200 Subject: [scikit-learn] decision trees In-Reply-To: References: Message-ID: For large enough models (e.g. random forests or gradient boosted trees ensembles) I would definitely recommend arbitrary integer coding for the categorical variables. Try both, use cross-validation and see for yourself. -- Olivier From ahowe42 at gmail.com Wed Mar 29 03:38:11 2017 From: ahowe42 at gmail.com (Andrew Howe) Date: Wed, 29 Mar 2017 10:38:11 +0300 Subject: [scikit-learn] decision trees In-Reply-To: References: Message-ID: My question is more along the lines of will the DT classifier falsely infer an ordering? <~~~~~~~~~~~~~~~~~~~~~~~~~~~> J. Andrew Howe, PhD www.andrewhowe.com http://www.linkedin.com/in/ahowe42 https://www.researchgate.net/profile/John_Howe12/ I live to learn, so I can learn to live. - me <~~~~~~~~~~~~~~~~~~~~~~~~~~~> On Wed, Mar 29, 2017 at 10:32 AM, Olivier Grisel wrote: > For large enough models (e.g. random forests or gradient boosted trees > ensembles) I would definitely recommend arbitrary integer coding for > the categorical variables. > > Try both, use cross-validation and see for yourself. > > -- > Olivier > _______________________________________________ > scikit-learn mailing list > scikit-learn at python.org > https://mail.python.org/mailman/listinfo/scikit-learn > -------------- next part -------------- An HTML attachment was scrubbed... URL: From bdholt1 at gmail.com Wed Mar 29 04:52:11 2017 From: bdholt1 at gmail.com (Brian Holt) Date: Wed, 29 Mar 2017 09:52:11 +0100 Subject: [scikit-learn] decision trees In-Reply-To: References: Message-ID: >From a theoretical point of view, yes you should one-hot-encode your categorical variables if you don't want any ordering to be implied. Brian On 29 Mar 2017 08:40, "Andrew Howe" wrote: > My question is more along the lines of will the DT classifier falsely > infer an ordering? > > <~~~~~~~~~~~~~~~~~~~~~~~~~~~> > J. Andrew Howe, PhD > www.andrewhowe.com > http://www.linkedin.com/in/ahowe42 > https://www.researchgate.net/profile/John_Howe12/ > I live to learn, so I can learn to live. - me > <~~~~~~~~~~~~~~~~~~~~~~~~~~~> > > On Wed, Mar 29, 2017 at 10:32 AM, Olivier Grisel > wrote: > >> For large enough models (e.g. random forests or gradient boosted trees >> ensembles) I would definitely recommend arbitrary integer coding for >> the categorical variables. >> >> Try both, use cross-validation and see for yourself. 
>> >> -- >> Olivier >> _______________________________________________ >> scikit-learn mailing list >> scikit-learn at python.org >> https://mail.python.org/mailman/listinfo/scikit-learn >> > > > _______________________________________________ > scikit-learn mailing list > scikit-learn at python.org > https://mail.python.org/mailman/listinfo/scikit-learn > > -------------- next part -------------- An HTML attachment was scrubbed... URL: From olivier.grisel at ensta.org Wed Mar 29 05:56:38 2017 From: olivier.grisel at ensta.org (Olivier Grisel) Date: Wed, 29 Mar 2017 11:56:38 +0200 Subject: [scikit-learn] decision trees In-Reply-To: References: Message-ID: Integer coding will indeed make the DT assume an arbitrary ordering while one-hot encoding does not force the tree model to make that assumption. However in practice when the depth of the trees is not too limited (or if you use a large enough ensemble of trees), the model will have enough flexibility to introduce as many splits as necessary to isolate individual categories of the integer-coded feature, and therefore the arbitrary ordering assumption is not a problem. On the other hand using one-hot encoding can introduce a detrimental inductive bias on random forests: random forest uses uniform random feature sampling when deciding which feature to split on (e.g. pick the best split out of 25% of the features selected at random). Let's consider the following example: assume you have a heterogeneously typed dataset with 99 numeric features and 1 categorical feature with categorical cardinality 1000 (1000 possible values for that feature): - the RF will have one chance in 100 to pick each feature (categorical or numerical) as a candidate for the next split when using integer coding, - the RF will have 0.1% chance of picking each numerical feature and 99% chance to select a candidate feature split on a category of the unique categorical feature when using one-hot encoding. The inductive bias of one-hot encoding on RFs can therefore completely break the feature balancing. The feature encoding will also impact the inductive bias with respect to the importance of the depth of the trees, even when feature splits are selected fully deterministically. Finally, one-hot encoding features with large categorical cardinalities will be much slower than when using naive integer coding. TL;DR: naive theoretical analysis based only on the ordering assumption can be misleading. The inductive biases of each feature encoding are more complex to evaluate. Use cross-validation to decide which is the best on your problem. Don't ignore computational considerations (CPU and memory usage). -- Olivier From ahowe42 at gmail.com Wed Mar 29 06:46:46 2017 From: ahowe42 at gmail.com (Andrew Howe) Date: Wed, 29 Mar 2017 13:46:46 +0300 Subject: [scikit-learn] decision trees In-Reply-To: References: Message-ID: Thanks very much for the thorough answer. I didn't think about the inductive bias issue with my forests. I'll evaluate both sets of coding for my unordered categoricals. Andrew <~~~~~~~~~~~~~~~~~~~~~~~~~~~> J. Andrew Howe, PhD www.andrewhowe.com http://www.linkedin.com/in/ahowe42 https://www.researchgate.net/profile/John_Howe12/ I live to learn, so I can learn to live. - me <~~~~~~~~~~~~~~~~~~~~~~~~~~~> On Wed, Mar 29, 2017 at 12:56 PM, Olivier Grisel wrote: > Integer coding will indeed make the DT assume an arbitrary ordering > while one-hot encoding does not force the tree model to make that > assumption.
> > However in practice when the depth of the trees is not too limited (or > if you use a large enough ensemble of trees), the model will have > enough flexibility to introduce as many splits as necessary to isolate > individual categories in the integer and therefore the arbitrary > ordering assumption is not a problem. > > On the other hand using one-hot encoding can introduce a detrimental > inductive bias on random forests: random forest uses uniform random > feature sampling when deciding which feature to split on (e.g. pick > the best split out of 25% of the features selected at random). > > Let's consider the following example: assume you have an > heterogeneously typed dataset with 99 numeric features and 1 > categorical feature with categorical cardinality 1000 (1000 possible > values for that features): > > - the RF will have one chance in 100 to pick each feature (categorical > or numerical) as a candidate for the next split when using integer > coding, > - the RF will have 0.1% chance of picking each numerical feature and > 99% chance to select a candidate feature split on a category of the > unique categorical feature when using one-hot encoding. > > The inductive bias of one-encoding on RFs can therefore completely > break the feature balancing. The feature encoding will also impact the > inductive bias with respect the importance of the depth of the trees, > even when feature splits are selected fully deterministically. > > Finally one-hot encoding features with large categorical cardinalities > will be much slower then when using naive integer coding. > > TL;DNR: naive theoretical analysis based only on the ordering > assumption can be misleading. Inductive biases of each feature > encoding are more complex to evaluate. Use cross-validation to decide > which is the best on your problem. Don't ignore computational > considerations (CPU and memory usage). > > -- > Olivier > _______________________________________________ > scikit-learn mailing list > scikit-learn at python.org > https://mail.python.org/mailman/listinfo/scikit-learn > -------------- next part -------------- An HTML attachment was scrubbed... URL: From vaggi.federico at gmail.com Wed Mar 29 06:50:51 2017 From: vaggi.federico at gmail.com (federico vaggi) Date: Wed, 29 Mar 2017 10:50:51 +0000 Subject: [scikit-learn] decision trees In-Reply-To: References: Message-ID: That's a really good point. Do you know of any systematic studies about the two different encodings? Finally: wasn't there a PR for RF to accept categorical variables as inputs? On Wed, 29 Mar 2017 at 11:57, Olivier Grisel wrote: > Integer coding will indeed make the DT assume an arbitrary ordering > while one-hot encoding does not force the tree model to make that > assumption. > > However in practice when the depth of the trees is not too limited (or > if you use a large enough ensemble of trees), the model will have > enough flexibility to introduce as many splits as necessary to isolate > individual categories in the integer and therefore the arbitrary > ordering assumption is not a problem. > > On the other hand using one-hot encoding can introduce a detrimental > inductive bias on random forests: random forest uses uniform random > feature sampling when deciding which feature to split on (e.g. pick > the best split out of 25% of the features selected at random). 
> > Let's consider the following example: assume you have an > heterogeneously typed dataset with 99 numeric features and 1 > categorical feature with categorical cardinality 1000 (1000 possible > values for that features): > > - the RF will have one chance in 100 to pick each feature (categorical > or numerical) as a candidate for the next split when using integer > coding, > - the RF will have 0.1% chance of picking each numerical feature and > 99% chance to select a candidate feature split on a category of the > unique categorical feature when using one-hot encoding. > > The inductive bias of one-encoding on RFs can therefore completely > break the feature balancing. The feature encoding will also impact the > inductive bias with respect the importance of the depth of the trees, > even when feature splits are selected fully deterministically. > > Finally one-hot encoding features with large categorical cardinalities > will be much slower then when using naive integer coding. > > TL;DNR: naive theoretical analysis based only on the ordering > assumption can be misleading. Inductive biases of each feature > encoding are more complex to evaluate. Use cross-validation to decide > which is the best on your problem. Don't ignore computational > considerations (CPU and memory usage). > > -- > Olivier > _______________________________________________ > scikit-learn mailing list > scikit-learn at python.org > https://mail.python.org/mailman/listinfo/scikit-learn > -------------- next part -------------- An HTML attachment was scrubbed... URL: From drraph at gmail.com Wed Mar 29 06:57:59 2017 From: drraph at gmail.com (Raphael C) Date: Wed, 29 Mar 2017 11:57:59 +0100 Subject: [scikit-learn] decision trees In-Reply-To: References: Message-ID: There is https://github.com/scikit-learn/scikit-learn/pull/4899 . It looks like it is waiting for review? Raphael On 29 March 2017 at 11:50, federico vaggi wrote: > That's a really good point. Do you know of any systematic studies about the > two different encodings? > > Finally: wasn't there a PR for RF to accept categorical variables as inputs? > > On Wed, 29 Mar 2017 at 11:57, Olivier Grisel > wrote: >> >> Integer coding will indeed make the DT assume an arbitrary ordering >> while one-hot encoding does not force the tree model to make that >> assumption. >> >> However in practice when the depth of the trees is not too limited (or >> if you use a large enough ensemble of trees), the model will have >> enough flexibility to introduce as many splits as necessary to isolate >> individual categories in the integer and therefore the arbitrary >> ordering assumption is not a problem. >> >> On the other hand using one-hot encoding can introduce a detrimental >> inductive bias on random forests: random forest uses uniform random >> feature sampling when deciding which feature to split on (e.g. pick >> the best split out of 25% of the features selected at random). >> >> Let's consider the following example: assume you have an >> heterogeneously typed dataset with 99 numeric features and 1 >> categorical feature with categorical cardinality 1000 (1000 possible >> values for that features): >> >> - the RF will have one chance in 100 to pick each feature (categorical >> or numerical) as a candidate for the next split when using integer >> coding, >> - the RF will have 0.1% chance of picking each numerical feature and >> 99% chance to select a candidate feature split on a category of the >> unique categorical feature when using one-hot encoding. 
>> >> The inductive bias of one-encoding on RFs can therefore completely >> break the feature balancing. The feature encoding will also impact the >> inductive bias with respect the importance of the depth of the trees, >> even when feature splits are selected fully deterministically. >> >> Finally one-hot encoding features with large categorical cardinalities >> will be much slower then when using naive integer coding. >> >> TL;DNR: naive theoretical analysis based only on the ordering >> assumption can be misleading. Inductive biases of each feature >> encoding are more complex to evaluate. Use cross-validation to decide >> which is the best on your problem. Don't ignore computational >> considerations (CPU and memory usage). >> >> -- >> Olivier >> _______________________________________________ >> scikit-learn mailing list >> scikit-learn at python.org >> https://mail.python.org/mailman/listinfo/scikit-learn > > > _______________________________________________ > scikit-learn mailing list > scikit-learn at python.org > https://mail.python.org/mailman/listinfo/scikit-learn > From t3kcit at gmail.com Wed Mar 29 10:30:21 2017 From: t3kcit at gmail.com (Andreas Mueller) Date: Wed, 29 Mar 2017 10:30:21 -0400 Subject: [scikit-learn] decision trees In-Reply-To: References: Message-ID: <3e0d8739-c674-b085-deff-d9a20d5bc696@gmail.com> I'd argue that's why we should implement conditional inference trees ;) On 03/29/2017 05:56 AM, Olivier Grisel wrote: > Integer coding will indeed make the DT assume an arbitrary ordering > while one-hot encoding does not force the tree model to make that > assumption. > > However in practice when the depth of the trees is not too limited (or > if you use a large enough ensemble of trees), the model will have > enough flexibility to introduce as many splits as necessary to isolate > individual categories in the integer and therefore the arbitrary > ordering assumption is not a problem. > > On the other hand using one-hot encoding can introduce a detrimental > inductive bias on random forests: random forest uses uniform random > feature sampling when deciding which feature to split on (e.g. pick > the best split out of 25% of the features selected at random). > > Let's consider the following example: assume you have an > heterogeneously typed dataset with 99 numeric features and 1 > categorical feature with categorical cardinality 1000 (1000 possible > values for that features): > > - the RF will have one chance in 100 to pick each feature (categorical > or numerical) as a candidate for the next split when using integer > coding, > - the RF will have 0.1% chance of picking each numerical feature and > 99% chance to select a candidate feature split on a category of the > unique categorical feature when using one-hot encoding. > > The inductive bias of one-encoding on RFs can therefore completely > break the feature balancing. The feature encoding will also impact the > inductive bias with respect the importance of the depth of the trees, > even when feature splits are selected fully deterministically. > > Finally one-hot encoding features with large categorical cardinalities > will be much slower then when using naive integer coding. > > TL;DNR: naive theoretical analysis based only on the ordering > assumption can be misleading. Inductive biases of each feature > encoding are more complex to evaluate. Use cross-validation to decide > which is the best on your problem. Don't ignore computational > considerations (CPU and memory usage). 
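A concrete way to follow the "try both encodings and cross-validate" advice given in this thread; the synthetic data, the toy target and the 100-tree forest below are illustrative assumptions, not a benchmark:

    import numpy as np
    from sklearn.ensemble import RandomForestClassifier
    from sklearn.model_selection import cross_val_score
    from sklearn.preprocessing import OneHotEncoder

    rng = np.random.RandomState(0)
    n = 500
    X_num = rng.randn(n, 3)                    # three numeric features
    X_cat = rng.randint(0, 10, size=(n, 1))    # one categorical feature, integer coded
    y = (X_cat[:, 0] > 4).astype(int)          # toy target driven by the category

    clf = RandomForestClassifier(n_estimators=100, random_state=0)

    # Variant 1: naive integer coding of the categorical column
    X_int = np.hstack([X_num, X_cat])
    print("integer coding:", cross_val_score(clf, X_int, y, cv=5).mean())

    # Variant 2: one-hot encoding of the categorical column
    X_ohe = np.hstack([X_num, OneHotEncoder(sparse=False).fit_transform(X_cat)])
    print("one-hot coding:", cross_val_score(clf, X_ohe, y, cv=5).mean())

On a real problem, the cross-validated scores and the fit/predict times of the two variants are what would guide the choice.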
> From julio at esbet.es Wed Mar 29 13:04:40 2017 From: julio at esbet.es (Julio Antonio Soto de Vicente) Date: Wed, 29 Mar 2017 18:04:40 +0100 Subject: [scikit-learn] decision trees In-Reply-To: <3e0d8739-c674-b085-deff-d9a20d5bc696@gmail.com> References: <3e0d8739-c674-b085-deff-d9a20d5bc696@gmail.com> Message-ID: IMO CART can handle categorical features just as good as CITrees, as long as we slightly change sklearn's implementation... -- Julio > El 29 mar 2017, a las 15:30, Andreas Mueller escribi?: > > I'd argue that's why we should implement conditional inference trees ;) > >> On 03/29/2017 05:56 AM, Olivier Grisel wrote: >> Integer coding will indeed make the DT assume an arbitrary ordering >> while one-hot encoding does not force the tree model to make that >> assumption. >> >> However in practice when the depth of the trees is not too limited (or >> if you use a large enough ensemble of trees), the model will have >> enough flexibility to introduce as many splits as necessary to isolate >> individual categories in the integer and therefore the arbitrary >> ordering assumption is not a problem. >> >> On the other hand using one-hot encoding can introduce a detrimental >> inductive bias on random forests: random forest uses uniform random >> feature sampling when deciding which feature to split on (e.g. pick >> the best split out of 25% of the features selected at random). >> >> Let's consider the following example: assume you have an >> heterogeneously typed dataset with 99 numeric features and 1 >> categorical feature with categorical cardinality 1000 (1000 possible >> values for that features): >> >> - the RF will have one chance in 100 to pick each feature (categorical >> or numerical) as a candidate for the next split when using integer >> coding, >> - the RF will have 0.1% chance of picking each numerical feature and >> 99% chance to select a candidate feature split on a category of the >> unique categorical feature when using one-hot encoding. >> >> The inductive bias of one-encoding on RFs can therefore completely >> break the feature balancing. The feature encoding will also impact the >> inductive bias with respect the importance of the depth of the trees, >> even when feature splits are selected fully deterministically. >> >> Finally one-hot encoding features with large categorical cardinalities >> will be much slower then when using naive integer coding. >> >> TL;DNR: naive theoretical analysis based only on the ordering >> assumption can be misleading. Inductive biases of each feature >> encoding are more complex to evaluate. Use cross-validation to decide >> which is the best on your problem. Don't ignore computational >> considerations (CPU and memory usage). >> > > _______________________________________________ > scikit-learn mailing list > scikit-learn at python.org > https://mail.python.org/mailman/listinfo/scikit-learn From jni.soma at gmail.com Wed Mar 29 14:22:10 2017 From: jni.soma at gmail.com (Juan Nunez-Iglesias) Date: Wed, 29 Mar 2017 18:22:10 +0000 Subject: [scikit-learn] Announcement: scikit-image 0.13.0 Message-ID: We're happy to (finally) announce the release of scikit-image v0.13.0! Special thanks to our many contributors for making it possible! This release is the result of over a year of work, with over 200 pull requests by 82 contributors. Linux and macOS wheels are available now on PyPI , together with a source distribution. Use "pip install -U scikit-image" to get the latest version! 
Packages on conda-forge, Windows wheels, and Debian packages should be available within the next few days. scikit-image is an image processing toolbox for SciPy that includes algorithms for segmentation, geometric transformations, color space manipulation, analysis, filtering, morphology, feature detection, and more. For more information, examples, and documentation, please visit our website: http://scikit-image.org and our gallery of examples http://scikit-image.org/docs/dev/auto_examples/ Highlights ---------- - Improved n-dimensional image support. This release adds nD support to: * ``regionprops`` computation for centroids (#2083) * ``segmentation.clear_border`` (#2087) * Hessian matrix (#2194) - In addition, the following new functions support nD images: * new wavelet denoising function, ``restoration.denoise_wavelet`` (#1833, #2190, #2238, #2240, #2241, #2242, #2462) * new thresholding functions, ``filters.threshold_sauvola`` and ``filters.threshold_niblack`` (#2266, #2441) * new local maximum, local minimum, hmaxima, hminima functions (#2449) - Grey level co-occurrence matrix (GLCM) now works with uint16 images - ``filters.try_all_threshold`` to rapidly see output of various thresholding methods - Frangi and Hessian filters (2D only) (#2153) - New *compact watershed* algorithm in ``segmentation.watershed`` (#2211) - New *shape index* algorithm in ``feature.shape_index`` (#2312) New functions and features -------------------------- - Add threshold minimum algorithm (#2104) - Implement mean and triangle thresholding (#2126) - Add Frangi and Hessian filters (#2153) - add bbox_area to region properties (#2187) - colorconv: Add rgba2rgb() (#2181) - Lewiner marching cubes algorithm (#2052) - image inversion (#2199) - wavelet denoising (from #1833) (#2190) - routine to estimate the noise standard deviation from an image (#1837) - Add compact watershed and clean up existing watershed (#2211) - Added the missing 'grey2rgb' function. (#2316) - Shape index (#2312) - Fundamental and essential matrix 8-point algorithm (#1357) - Add YUV, YIQ, YPbPr, YCbCr colorspaces - Detection of local extrema from morphology (#2449) - shannon entropy (#2416) Documentation improvements -------------------------- - add details about github SSH keys in contributing page (#2073) - Add example for felzenszwalb image segmentation (#2096) - Sphinx gallery for example gallery (#2078) - Improved region boundary RAG docs (#2106) - Add gallery Lucy-Richardson deconvolution algorithm (#2376) - Gallery: Use Horse to illustrate Convex Hull (#2431) - Add working with OpenCV in user guide (#2519) Code improvements ----------------- - Remove lena image from test suite (#1985) - Remove duplicate mean calculation in skimage.feature.match_template (#1980) - Add nD support to clear_border (#2087) - Add uint16 images support for co-occurrence matrix (#2095) - Add default parameters for Gaussian and median filters (#2151) - try_all to choose the best threshold algorithm (#2110) - Add support for multichannel in Felzenszwalb segmentation (#2134) - Improved SimilarityTransform, new EuclideanTransform class (#2044) - ENH: Speed up Hessian matrix computation (#2194) - add n-dimensional support to denoise_wavelet (#2242) - Speedup ``inpaint_biharmonic`` (#2234) - Update hessian matrix code to include order kwarg (#2327) - Handle cases for label2rgb where input labels are negative and/or nonconsecutive (#2370) - Added watershed_line parameter (#2393) API Changes ----------- - Remove deprecated ``filter`` module. Use ``filters`` instead. 
(#2023)
- Remove ``skimage.filters.canny`` links. Use ``feature.canny`` instead. (#2024)
- Removed Python 2.6 support and related checks (#2033)
- Remove deprecated {h/v}sobel, {h/v}prewitt, {h/v}scharr, roberts_{positive/negative} filters (#2159)
- Remove deprecated ``_mode_deprecations`` (#2156)
- Remove deprecated None defaults in ``rescale_intensity`` (#2161)
- Parameters ``ntiles_x`` and ``ntiles_y`` have been removed from ``exposure.equalize_adapthist``
- The minimum NumPy version is now 1.11, and the minimum SciPy version is now 0.17

Deprecations
------------
- clip_negative will be set to false by default in version 0.15 (func: dtype_limits) (#2228)
- Deprecate "dynamic_range" in favor of "data_range" (#2384)
- The default value of the ``circle`` argument to ``radon`` and ``iradon`` transforms will be ``True`` in 0.15 (#2235)
- The default value of ``multichannel`` for ``denoise_bilateral`` and ``denoise_nl_means`` will be ``False`` in 0.15
- The default value of ``block_norm`` in ``feature.hog`` will be L2-Hysteresis in 0.15.
- The ``threshold_adaptive`` function is deprecated. Use ``threshold_local`` instead.
- The default value of ``mode`` in ``transform.swirl``, ``resize``, and ``rescale`` will be "reflect" in 0.15.

For a complete list of contributors and pull requests merged in this release, please see our release notes online:
https://github.com/scikit-image/scikit-image/blob/master/doc/release/release_0.13.rst

Please spread the word, including on Twitter!

Enjoy!

Juan.
-------------- next part --------------
An HTML attachment was scrubbed...
URL:
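As a quick illustration of a few of the new functions above, a minimal sketch (assuming scikit-image 0.13 and the bundled ``camera`` test image; the ``window_size``, ``k``, and ``sigma`` values are arbitrary placeholders, not recommendations) could look like:

import matplotlib.pyplot as plt
from skimage import data, img_as_float
from skimage.filters import try_all_threshold, threshold_niblack
from skimage.restoration import denoise_wavelet

image = img_as_float(data.camera())

# try_all_threshold shows the output of every global thresholding method at once
fig, ax = try_all_threshold(image, figsize=(8, 6), verbose=False)

# new local (Niblack) thresholding and n-dimensional wavelet denoising
binary = image > threshold_niblack(image, window_size=25, k=0.8)
denoised = denoise_wavelet(image, sigma=0.1)

plt.show()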
From jmschreiber91 at gmail.com Thu Mar 30 00:45:41 2017
From: jmschreiber91 at gmail.com (Jacob Schreiber)
Date: Wed, 29 Mar 2017 21:45:41 -0700
Subject: [scikit-learn] GSoC proposal - linear model
In-Reply-To: References: Message-ID:
Hi Konstantinos

I likely won't be a mentor for the linear models project, but I looked over your proposal and have a few suggestions. In general it was a good write-up!

1. You should include some equations in the write-up, basically the softmax loss (which I think is a more common term than multinomial logistic loss) and the AdaGrad update.
2. You may want to indicate which files in the codebase you'll be modifying, or whether you'll be adding a new file. That will show us you're familiar with our existing code.
3. You should give more time for the Cython implementation of these methods. It's not that easy to do, especially if you don't have background experience. You can easily lose a day or two to a dumb memory error that has nothing to do with whether you understand the equations.
4. You might also want to implement Adam if time permits. It's another popular optimizer. I'm not sure how popular it is in linear models, but I've seen it used effectively, and once you have AdaGrad it should be easier to implement a second optimizer.

Good luck!
Jacob

On Mon, Mar 27, 2017 at 10:43 AM, Konstantinos Katrioplas < konst.katrioplas at gmail.com> wrote:

> Dear all,
>
> here is a draft of my proposal
> on
> improving online learning for linear models with softmax and AdaGrad.
> I look forward to your feedback,
> Konstantinos
>
> _______________________________________________
> scikit-learn mailing list
> scikit-learn at python.org
> https://mail.python.org/mailman/listinfo/scikit-learn
>
>
-------------- next part --------------
An HTML attachment was scrubbed...
URL:
From shuchi.23 at gmail.com Thu Mar 30 04:51:54 2017
From: shuchi.23 at gmail.com (Shuchi Mala)
Date: Thu, 30 Mar 2017 14:21:54 +0530
Subject: [scikit-learn] urgent help in scikit-learn
Message-ID:
Hi everyone,

I have data with the following attributes: (Latitude, Longitude). Now I am performing clustering using DBSCAN on my data, and I have the following doubts:

1. How can I add data to the data set of the package?
2. How can I calculate the Rand index for my data?
3. How do I use the make_blobs command for my data?

A sample of my data is:
Latitude Longitude
37.76901 -122.429299
37.76904 -122.42913
37.76878 -122.429092
37.7763 -122.424249
37.77627 -122.424657

With Best Regards,
Shuchi Mala
Research Scholar
Department of Civil Engineering
MNIT Jaipur
-------------- next part --------------
An HTML attachment was scrubbed...
URL:
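The replies later in this thread cover the individual pieces; pulled together, a minimal end-to-end sketch (the file name, eps, min_samples, and the ground-truth column are placeholders, not something taken from this thread) would be:

import numpy as np
import pandas as pd
from sklearn.cluster import DBSCAN
from sklearn.metrics import adjusted_rand_score

df = pd.read_csv("your_data.txt", delimiter=r"\s+")       # columns: Latitude, Longitude
coords = np.radians(df[["Latitude", "Longitude"]].values)  # haversine expects radians

# eps is in radians: e.g. 100 m / Earth radius (~6371 km) ~= 1.6e-5
db = DBSCAN(eps=100 / 6371000.0, min_samples=5, metric="haversine")
labels_pred = db.fit_predict(coords)

# The adjusted Rand index only makes sense if ground-truth labels exist
# for the same points (here a hypothetical "true_label" column).
if "true_label" in df.columns:
    print(adjusted_rand_score(df["true_label"].values, labels_pred))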
From yizhengz at andrew.cmu.edu Thu Mar 30 08:24:45 2017
From: yizhengz at andrew.cmu.edu (Yizheng Zhao)
Date: Thu, 30 Mar 2017 05:24:45 -0700
Subject: [scikit-learn] GSoC 2017 Proposal: Improve online learning for linear models
Message-ID: <5D8BCADF-6292-4B76-9F4C-619A7DD5F548@andrew.cmu.edu>
Hi developers,

I am excited to have the opportunity to work with you! I am Yizheng Zhao, a graduate student at Carnegie Mellon University majoring in Software Engineering; I received my Bachelor's degree in Math from Jilin University in 2016. I love Python and machine learning, and that is why I want to make my own contribution to the community. I have 2 years of experience developing with Python and I am quite familiar with scikit-learn as a user. In college, I learned several machine learning algorithms and their mathematical derivations. I believe my strong math background and coding skills will let me do this well.

Here is my proposal: https://github.com/YizhengZHAO/scikit-learn/wiki/GSoC-2017-:-Improve-online-learning-for-linear-models

BTW, could you please give me more explanation about "A tool to set the learning rate on a few epochs"?

I am happy to get suggestions from the community.

Sincerely,
Yizheng
-------------- next part --------------
An HTML attachment was scrubbed...
URL:
From se.raschka at gmail.com Thu Mar 30 10:04:19 2017
From: se.raschka at gmail.com (Sebastian Raschka)
Date: Thu, 30 Mar 2017 10:04:19 -0400
Subject: [scikit-learn] urgent help in scikit-learn
In-Reply-To: References: Message-ID:
Hi, Shuchi,

> 1. How can I add data to the data set of the package?

You don't need to add your dataset to the dataset module to run your analysis. A convenient way to load it into a numpy array would be via pandas. E.g.,

import pandas as pd
df = pd.read_csv('your_data.txt', delimiter=r"\s+")
X = df.values

> 2. How I can calculate Rand index for my data?

After you ran the clustering, you can use the 'adjusted_rand_score' function, e.g., see
http://scikit-learn.org/stable/modules/clustering.html#adjusted-rand-score

> 3. How to use make_blobs command for my data?

The make_blobs command is just a utility function to create toy datasets; you wouldn't need it in your case since you already have 'real' data.

Best,
Sebastian

> On Mar 30, 2017, at 4:51 AM, Shuchi Mala wrote:
>
> Hi everyone,
>
> I have the data with following attributes: (Latitude, Longitude). Now I am performing clustering using DBSCAN for my data. I have following doubts:
>
> 1. How can I add data to the data set of the package?
> 2. How I can calculate Rand index for my data?
> 3. How to use make_blobs command for my data?
>
> Sample of my data is :
> Latitude Longitude
> 37.76901 -122.429299
> 37.76904 -122.42913
> 37.76878 -122.429092
> 37.7763 -122.424249
> 37.77627 -122.424657
>
>
> With Best Regards,
> Shuchi Mala
> Research Scholar
> Department of Civil Engineering
> MNIT Jaipur
>
> _______________________________________________
> scikit-learn mailing list
> scikit-learn at python.org
> https://mail.python.org/mailman/listinfo/scikit-learn
From shane.grigsby at colorado.edu Thu Mar 30 11:08:17 2017
From: shane.grigsby at colorado.edu (Shane Grigsby)
Date: Thu, 30 Mar 2017 09:08:17 -0600
Subject: [scikit-learn] urgent help in scikit-learn
In-Reply-To: References: Message-ID: <20170330150817.iu32sdchhadruk26@cu-vpn-colorado-edu-198.11.30.203.int.colorado.edu>
Since you're using lat / long coords, you'll also want to convert them to radians and specify 'haversine' as your distance metric; i.e.
: coords = np.vstack([lats.ravel(),longs.ravel()]).T coords *= np.pi / 180. # to radians ...and: db = DBSCAN(eps=0.3, min_samples=10, metric='haversine') # replace eps and min_samples as appropriate db.fit(coords) Cheers, Shane On 03/30, Sebastian Raschka wrote: >Hi, Shuchi, > >> 1. How can I add data to the data set of the package? > >You don?t need to add your dataset to the dataset module to run your analysis. A convenient way to load it into a numpy array would be via pandas. E.g., > >import pandas as pd >df = pd.read_csv(?your_data.txt', delimiter=r"\s+?) >X = df.values > >> 2. How I can calculate Rand index for my data? > >After you ran the clustering, you can use the ?adjusted_rand_score? function, e.g., see >http://scikit-learn.org/stable/modules/clustering.html#adjusted-rand-score > >> 3. How to use make_blobs command for my data? > >The make_blobs command is just a utility function to create toydatasets, you wouldn?t need it in your case since you already have ?real? data. > >Best, >Sebastian > > >> On Mar 30, 2017, at 4:51 AM, Shuchi Mala wrote: >> >> Hi everyone, >> >> I have the data with following attributes: (Latitude, Longitude). Now I am performing clustering using DBSCAN for my data. I have following doubts: >> >> 1. How can I add data to the data set of the package? >> 2. How I can calculate Rand index for my data? >> 3. How to use make_blobs command for my data? >> >> Sample of my data is : >> Latitude Longitude >> 37.76901 -122.429299 >> 37.76904 -122.42913 >> 37.76878 -122.429092 >> 37.7763 -122.424249 >> 37.77627 -122.424657 >> >> >> With Best Regards, >> Shuchi Mala >> Research Scholar >> Department of Civil Engineering >> MNIT Jaipur >> >> _______________________________________________ >> scikit-learn mailing list >> scikit-learn at python.org >> https://mail.python.org/mailman/listinfo/scikit-learn > >_______________________________________________ >scikit-learn mailing list >scikit-learn at python.org >https://mail.python.org/mailman/listinfo/scikit-learn -- *PhD candidate & Research Assistant* *Cooperative Institute for Research in Environmental Sciences (CIRES)* *University of Colorado at Boulder* From shuchi.23 at gmail.com Fri Mar 31 00:02:32 2017 From: shuchi.23 at gmail.com (Shuchi Mala) Date: Fri, 31 Mar 2017 09:32:32 +0530 Subject: [scikit-learn] urgent help in scikit-learn In-Reply-To: <20170330150817.iu32sdchhadruk26@cu-vpn-colorado-edu-198.11.30.203.int.colorado.edu> References: <20170330150817.iu32sdchhadruk26@cu-vpn-colorado-edu-198.11.30.203.int.colorado.edu> Message-ID: Thank you so much for your quick reply. I have one more doubt. The below statement is used to calculate rand score. metrics.adjusted_rand_score(labels_true, labels_pred) In my case what will be labels_true and labels_pred and how I will calculate labels_pred? With Best Regards, Shuchi Mala Research Scholar Department of Civil Engineering MNIT Jaipur On Thu, Mar 30, 2017 at 8:38 PM, Shane Grigsby wrote: > Since you're using lat / long coords, you'll also want to convert them to > radians and specify 'haversine' as your distance metric; i.e. : > > coords = np.vstack([lats.ravel(),longs.ravel()]).T > coords *= np.pi / 180. # to radians > > ...and: > > db = DBSCAN(eps=0.3, min_samples=10, metric='haversine') > # replace eps and min_samples as appropriate > db.fit(coords) > > Cheers, > Shane > > > On 03/30, Sebastian Raschka wrote: > >> Hi, Shuchi, >> >> 1. How can I add data to the data set of the package? 
>>> >> >> You don?t need to add your dataset to the dataset module to run your >> analysis. A convenient way to load it into a numpy array would be via >> pandas. E.g., >> >> import pandas as pd >> df = pd.read_csv(?your_data.txt', delimiter=r"\s+?) >> X = df.values >> >> 2. How I can calculate Rand index for my data? >>> >> >> After you ran the clustering, you can use the ?adjusted_rand_score? >> function, e.g., see >> http://scikit-learn.org/stable/modules/clustering.html# >> adjusted-rand-score >> >> 3. How to use make_blobs command for my data? >>> >> >> The make_blobs command is just a utility function to create toydatasets, >> you wouldn?t need it in your case since you already have ?real? data. >> >> Best, >> Sebastian >> >> >> On Mar 30, 2017, at 4:51 AM, Shuchi Mala wrote: >>> >>> Hi everyone, >>> >>> I have the data with following attributes: (Latitude, Longitude). Now I >>> am performing clustering using DBSCAN for my data. I have following doubts: >>> >>> 1. How can I add data to the data set of the package? >>> 2. How I can calculate Rand index for my data? >>> 3. How to use make_blobs command for my data? >>> >>> Sample of my data is : >>> Latitude Longitude >>> 37.76901 -122.429299 >>> 37.76904 -122.42913 >>> 37.76878 -122.429092 >>> 37.7763 -122.424249 >>> 37.77627 -122.424657 >>> >>> >>> With Best Regards, >>> Shuchi Mala >>> Research Scholar >>> Department of Civil Engineering >>> MNIT Jaipur >>> >>> _______________________________________________ >>> scikit-learn mailing list >>> scikit-learn at python.org >>> https://mail.python.org/mailman/listinfo/scikit-learn >>> >> >> _______________________________________________ >> scikit-learn mailing list >> scikit-learn at python.org >> https://mail.python.org/mailman/listinfo/scikit-learn >> > > -- > *PhD candidate & Research Assistant* > *Cooperative Institute for Research in Environmental Sciences (CIRES)* > *University of Colorado at Boulder* > > _______________________________________________ > scikit-learn mailing list > scikit-learn at python.org > https://mail.python.org/mailman/listinfo/scikit-learn > -------------- next part -------------- An HTML attachment was scrubbed... URL: From konst.katrioplas at gmail.com Fri Mar 31 03:44:06 2017 From: konst.katrioplas at gmail.com (Konstantinos Katrioplas) Date: Fri, 31 Mar 2017 10:44:06 +0300 Subject: [scikit-learn] GSoC proposal - linear model In-Reply-To: References: Message-ID: <006f0564-4fa9-47d1-5a82-40c138740808@gmail.com> Hello Jacob, Thanks a lot for your suggestions! I updated my proposal . I will add some minor details later today. Regarding the codebase, I am thinking about editing linear_model/sgd_fast.pyx for softmax and adding a new linear_model/sgd_opt.pyx perhaps for AdaGrad and Adam, I don't know if you agree on that. I admit I am a total beginner in Cython but I have time until June to practice. If there is real interest in the project and time to mentor my proposal please let me know. Ideally I would prefer not to leave it till the last day. Kind regards, Konstantinos On 30/03/2017 07:45 ??, Jacob Schreiber wrote: > Hi Konstantinos > > I likely won't be a mentor for the linear models project, but I looked > over your proposal and have a few suggestions. In general it was a > good write up! > > 1. You should include some equations in the write up, basically the > softmax loss (which I think is a more common term than multinomial > logistic loss) and the AdaGrad update. > 2. 
You may want to indicate which files in the codebase you'll be
> modifying, or if you'll be adding a new file. That will show us you're
> familiar with our existing code.
> 3. You should give more time for the cython implementation of these
> methods. It's not that easy to do, especially if you don't have
> background experience. You can easily lose a day or two from a dumb
> memory error that has nothing to do with if you understand the equations.
> 4. You might want to also implement ADAM if time permits. It's another
> optimizer that is popular. I'm not sure how popular it is in linear
> models but I've seen it used effectively, and once you get AdaGrad it
> should be easier to implement a second optimizer.
>
> Good luck!
> Jacob
>
> On Mon, Mar 27, 2017 at 10:43 AM, Konstantinos Katrioplas
> > wrote:
>
> Dear all,
>
> here is a draft of my proposal
> on
> improving online learning for linear models with softmax and AdaGrad.
>
> I look forward to your feedback,
> Konstantinos
>
> _______________________________________________
> scikit-learn mailing list
> scikit-learn at python.org
> https://mail.python.org/mailman/listinfo/scikit-learn
>
>
>
> _______________________________________________
> scikit-learn mailing list
> scikit-learn at python.org
> https://mail.python.org/mailman/listinfo/scikit-learn
-------------- next part --------------
An HTML attachment was scrubbed...
URL:
From josecarlos.gomez at upf.edu Fri Mar 31 04:29:14 2017
From: josecarlos.gomez at upf.edu (GOMEZ TAMAYO, JOSE CARLOS)
Date: Fri, 31 Mar 2017 10:29:14 +0200
Subject: [scikit-learn] Data type returned by PLSR different from other estimators
Message-ID:
Hi there,

I have recently run into a problem when dealing with PLSR (and other cross-decomposition methods) prediction output. Unlike other estimators, which return a flat numpy array containing the predictions, PLSR returns a list of single-value lists containing the predictions. I do not know why it is done this way (perhaps there is a reason unknown to me), but the fact that the estimator returns a different data type should be noted in the documentation. It is easily solved by modifying the source code or just by overriding the method.

Cheers,
Jose Carlos Gómez
-------------- next part --------------
An HTML attachment was scrubbed...
URL:
From se.raschka at gmail.com Fri Mar 31 10:47:55 2017
From: se.raschka at gmail.com (Sebastian Raschka)
Date: Fri, 31 Mar 2017 10:47:55 -0400
Subject: [scikit-learn] urgent help in scikit-learn
In-Reply-To: References: <20170330150817.iu32sdchhadruk26@cu-vpn-colorado-edu-198.11.30.203.int.colorado.edu>
Message-ID: <293EEA4E-2D51-4151-9A1F-D57CF628A71C@gmail.com>
Hi, Shuchi,

regarding labels_true: you'd only be able to compute the Rand index adjusted for chance if you have the ground truth labels of the training examples in your dataset. The second parameter, labels_pred, takes in the predicted cluster labels (indices) that you got from the clustering. E.g.,

dbscn = DBSCAN()
labels_pred = dbscn.fit_predict(X)

Best,
Sebastian

> On Mar 31, 2017, at 12:02 AM, Shuchi Mala wrote:
>
> Thank you so much for your quick reply. I have one more doubt. The below statement is used to calculate rand score.
>
> metrics.adjusted_rand_score(labels_true, labels_pred)
> In my case what will be labels_true and labels_pred and how I will calculate labels_pred?
> > With Best Regards, > Shuchi Mala > Research Scholar > Department of Civil Engineering > MNIT Jaipur > > > On Thu, Mar 30, 2017 at 8:38 PM, Shane Grigsby wrote: > Since you're using lat / long coords, you'll also want to convert them to radians and specify 'haversine' as your distance metric; i.e. : > > coords = np.vstack([lats.ravel(),longs.ravel()]).T > coords *= np.pi / 180. # to radians > > ...and: > > db = DBSCAN(eps=0.3, min_samples=10, metric='haversine') > # replace eps and min_samples as appropriate > db.fit(coords) > > Cheers, > Shane > > > On 03/30, Sebastian Raschka wrote: > Hi, Shuchi, > > 1. How can I add data to the data set of the package? > > You don?t need to add your dataset to the dataset module to run your analysis. A convenient way to load it into a numpy array would be via pandas. E.g., > > import pandas as pd > df = pd.read_csv(?your_data.txt', delimiter=r"\s+?) > X = df.values > > 2. How I can calculate Rand index for my data? > > After you ran the clustering, you can use the ?adjusted_rand_score? function, e.g., see > http://scikit-learn.org/stable/modules/clustering.html#adjusted-rand-score > > 3. How to use make_blobs command for my data? > > The make_blobs command is just a utility function to create toydatasets, you wouldn?t need it in your case since you already have ?real? data. > > Best, > Sebastian > > > On Mar 30, 2017, at 4:51 AM, Shuchi Mala wrote: > > Hi everyone, > > I have the data with following attributes: (Latitude, Longitude). Now I am performing clustering using DBSCAN for my data. I have following doubts: > > 1. How can I add data to the data set of the package? > 2. How I can calculate Rand index for my data? > 3. How to use make_blobs command for my data? > > Sample of my data is : > Latitude Longitude > 37.76901 -122.429299 > 37.76904 -122.42913 > 37.76878 -122.429092 > 37.7763 -122.424249 > 37.77627 -122.424657 > > > With Best Regards, > Shuchi Mala > Research Scholar > Department of Civil Engineering > MNIT Jaipur > > _______________________________________________ > scikit-learn mailing list > scikit-learn at python.org > https://mail.python.org/mailman/listinfo/scikit-learn > > _______________________________________________ > scikit-learn mailing list > scikit-learn at python.org > https://mail.python.org/mailman/listinfo/scikit-learn > > -- > *PhD candidate & Research Assistant* > *Cooperative Institute for Research in Environmental Sciences (CIRES)* > *University of Colorado at Boulder* > > _______________________________________________ > scikit-learn mailing list > scikit-learn at python.org > https://mail.python.org/mailman/listinfo/scikit-learn > > _______________________________________________ > scikit-learn mailing list > scikit-learn at python.org > https://mail.python.org/mailman/listinfo/scikit-learn From jmschreiber91 at gmail.com Fri Mar 31 19:19:26 2017 From: jmschreiber91 at gmail.com (Jacob Schreiber) Date: Fri, 31 Mar 2017 23:19:26 +0000 Subject: [scikit-learn] GSoC proposal - linear model In-Reply-To: <006f0564-4fa9-47d1-5a82-40c138740808@gmail.com> References: <006f0564-4fa9-47d1-5a82-40c138740808@gmail.com> Message-ID: Hi Konstantinos Thanks for the changes. You should go ahead and submit if you're happy with the proposal, it's unlikely that the decision will come down to details. Jacob On Fri, Mar 31, 2017 at 12:44 AM Konstantinos Katrioplas < konst.katrioplas at gmail.com> wrote: > Hello Jacob, > > Thanks a lot for your suggestions! I updated my proposal > . I > will add some minor details later today. 
> > Regarding the codebase, I am thinking about editing > linear_model/sgd_fast.pyx for softmax and adding a new > linear_model/sgd_opt.pyx perhaps for AdaGrad and Adam, I don't know if you > agree on that. > > I admit I am a total beginner in Cython but I have time until June to > practice. > If there is real interest in the project and time to mentor my proposal > please let me know. Ideally I would prefer not to leave it till the last > day. > > Kind regards, > Konstantinos > > > > > On 30/03/2017 07:45 ??, Jacob Schreiber wrote: > > Hi Konstantinos > > I likely won't be a mentor for the linear models project, but I looked > over your proposal and have a few suggestions. In general it was a good > write up! > > 1. You should include some equations in the write up, basically the > softmax loss (which I think is a more common term than multinomial logistic > loss) and the AdaGrad update. > 2. You may want to indicate which files in the codebase you'll be > modifying, or if you'll be adding a new file. That will show us you're > familiar with our existing code. > 3. You should give more time for the cython implementation of these > methods. It's not that easy to do, especially if you don't have background > experience. You can easily lose a day or two from a dumb memory error that > has nothing to do with if you understand the equations. > 4. You might want to also implement ADAM if time permits. It's another > optimizer that is popular. I'm not sure how popular it is in linear models > but I've seen it used effectively, and once you get AdaGrad it should be > easier to implement a second optimizer. > > Good luck! > Jacob > > On Mon, Mar 27, 2017 at 10:43 AM, Konstantinos Katrioplas < > konst.katrioplas at gmail.com> wrote: > > Dear all, > > here is a draft of my proposal > on > improving online learning for linear models with softmax and AdaGrad. > I look forward to your feedback, > Konstantinos > > _______________________________________________ > scikit-learn mailing list > scikit-learn at python.org > https://mail.python.org/mailman/listinfo/scikit-learn > > > > > _______________________________________________ > scikit-learn mailing listscikit-learn at python.orghttps://mail.python.org/mailman/listinfo/scikit-learn > > > _______________________________________________ > scikit-learn mailing list > scikit-learn at python.org > https://mail.python.org/mailman/listinfo/scikit-learn > -------------- next part -------------- An HTML attachment was scrubbed... URL:
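For reference, the two pieces Jacob asks for in point 1 would, in standard notation (a sketch only; the regularization term and the symbols used are illustrative and not taken from either proposal), look like this.

Softmax (multinomial logistic) model and loss over K classes with weights W = [w_1, ..., w_K]:

    p(y = k | x) = \frac{\exp(w_k^\top x + b_k)}{\sum_{j=1}^{K} \exp(w_j^\top x + b_j)}

    L(W) = -\frac{1}{n} \sum_{i=1}^{n} \log p(y = y_i | x_i) + \frac{\alpha}{2} \lVert W \rVert^2

AdaGrad update, with a per-coordinate accumulator of squared gradients:

    G_t = G_{t-1} + g_t \odot g_t

    w_{t+1} = w_t - \frac{\eta}{\sqrt{G_t} + \varepsilon} \odot g_t

where g_t is the gradient of the loss at step t, \eta is the base learning rate, and \varepsilon is a small constant for numerical stability.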