From mail at sebastianraschka.com Sat Sep 1 00:44:55 2018
From: mail at sebastianraschka.com (Sebastian Raschka)
Date: Fri, 31 Aug 2018 23:44:55 -0500
Subject: [scikit-learn] ANN Scikit-learn 0.20rc1 release candidate available
In-Reply-To: <5a532836-7f9f-b548-2ad3-d0b8a40b3011@gmail.com>
References: <5a532836-7f9f-b548-2ad3-d0b8a40b3011@gmail.com>
Message-ID: <1DD7A599-E12E-427E-9457-45A5BD559C0F@sebastianraschka.com>

That's awesome! Congrats and thanks everyone for all the work that went into this!

Just finished reading through the What's New docs... Wow, that took a while -- in a positive sense ;). It's a huge release with lots of important fixes. It's great to see that you prioritized the maintenance and improvement of scikit-learn as a fundamental ML library, rather than adding useful yet "niche" features.

Cheers,
Sebastian

> On Aug 31, 2018, at 8:26 PM, Andreas Mueller wrote:
>
> Hey Folks!
>
> I'm happy to announce that the scikit-learn 0.20 release candidate 1 is now available via conda-forge and pip.
> Please help us by testing this release candidate so we can make sure the final release will go seamlessly!
>
> You can install the release candidate from conda-forge using
>
> conda install scikit-learn=0.20rc1 -c conda-forge/label/rc -c conda-forge
>
> (please take into account that if you're otherwise using the default conda channel, this will pull in some other dependencies from conda-forge).
>
> You can install the release candidate via pip using
>
> pip install --pre scikit-learn
>
> The documentation for 0.20 is available at
>
> http://scikit-learn.org/0.20/
>
> and will move to http://scikit-learn.org/ upon final release.
> You can find the release notes with all new features and changes here:
>
> http://scikit-learn.org/0.20/whats_new.html#version-0-20
>
> Thank you for your help in testing the RC and thank you to everybody that made the release possible!
>
> All the best,
>
> Andy


From gael.varoquaux at normalesup.org Sat Sep 1 02:06:25 2018
From: gael.varoquaux at normalesup.org (Gael Varoquaux)
Date: Sat, 1 Sep 2018 08:06:25 +0200
Subject: [scikit-learn] ANN Scikit-learn 0.20rc1 release candidate available
In-Reply-To: <5a532836-7f9f-b548-2ad3-d0b8a40b3011@gmail.com>
References: <5a532836-7f9f-b548-2ad3-d0b8a40b3011@gmail.com>
Message-ID: <20180901060625.pqhzzxw6q3vbtheq@phare.normalesup.org>

Thanks to everybody involved! This is big!

Gaël

On Fri, Aug 31, 2018 at 09:26:39PM -0400, Andreas Mueller wrote:
> Hey Folks!
> I'm happy to announce that the scikit-learn 0.20 release candidate 1 is now available via conda-forge and pip.
[...]

--
Gael Varoquaux
Senior Researcher, INRIA Parietal
NeuroSpin/CEA Saclay, Bat 145, 91191 Gif-sur-Yvette France
Phone: ++ 33-1-69-08-79-68
http://gael-varoquaux.info            http://twitter.com/GaelVaroquaux


From shokyco at gmail.com Mon Sep 3 07:17:30 2018
From: shokyco at gmail.com (Shuki Cohen)
Date: Mon, 3 Sep 2018 14:17:30 +0300
Subject: [scikit-learn] Contribute to Scikit-learn
In-Reply-To:
References:
Message-ID:

On Mon, Sep 3, 2018 at 1:21 PM Shuki Cohen wrote:
> Hi all,
>
> A friend of mine and I found a lack of feature selection functionality in Scikit-learn, and we thought we would contribute in order to address this need. More specifically, we want to add:
> 1. Sequential Forward Selection algorithm
> 2. Multivariate Feature Selection
> to the Scikit-learn code base, and this mail is to ask whether such a project has a good chance of being added to the next version.
>
> Thanks in advance
> Shuki & Yaniv


From olivertomic at zoho.com Mon Sep 3 07:32:49 2018
From: olivertomic at zoho.com (Oliver Tomic)
Date: Mon, 03 Sep 2018 13:32:49 +0200
Subject: [scikit-learn] Contribute to Scikit-learn
In-Reply-To:
References:
Message-ID: <1659f34e8bf.1258a2d4332775.8694562337074779743@zoho.com>

Hi Shuki and Yaniv,

the sequential forward selection algorithm is already implemented in the mlxtend python package, which is complementary to scikit-learn:
https://rasbt.github.io/mlxtend/user_guide/feature_selection/SequentialFeatureSelector/

best wishes
Oliver

---- On Mon, 03 Sep 2018 13:17:30 +0200 Shuki Cohen wrote ----
> Hi all,
> A friend of mine and I found a lack of feature selection functionality in Scikit-learn, and we thought we would contribute in order to address this need.
[...]


From g.lemaitre58 at gmail.com Mon Sep 3 08:50:19 2018
From: g.lemaitre58 at gmail.com (Guillaume Lemaître)
Date: Mon, 3 Sep 2018 14:50:19 +0200
Subject: [scikit-learn] Contribute to Scikit-learn
In-Reply-To: <1659f34e8bf.1258a2d4332775.8694562337074779743@zoho.com>
References: <1659f34e8bf.1258a2d4332775.8694562337074779743@zoho.com>
Message-ID:

I would add that Sequential Forward Selection is on the way to being ported to scikit-learn by Sebastian (@rasbt):

https://github.com/scikit-learn/scikit-learn/pull/8684

However, I am sure that Sebastian would be grateful if you wish to take over the PR and move it forward.
But Sebastian is probably going to comment himself ;)

Cheers,

On Mon, 3 Sep 2018 at 13:35, Oliver Tomic wrote:
> Hi Shuki and Yaniv,
> the sequential forward selection algorithm is already implemented in the mlxtend python package, which is complementary to scikit-learn.
[...]

--
Guillaume Lemaitre
INRIA Saclay - Parietal team
Center for Data Science Paris-Saclay
https://glemaitre.github.io/


From mail at sebastianraschka.com Mon Sep 3 12:43:22 2018
From: mail at sebastianraschka.com (Sebastian Raschka)
Date: Mon, 3 Sep 2018 11:43:22 -0500
Subject: [scikit-learn] Contribute to Scikit-learn
In-Reply-To:
References:
Message-ID:

Hi all,

first of all, I think that having more feature selection capabilities in scikit-learn would be nice -- especially an algorithm from the wrapper category that also considers dependence/interaction between features.

Regarding the SequentialFeatureSelection class... We actually decided to simplify this a little bit (compared to the mlxtend variant) and only include the "simple" or "regular" forward and backward selection, and not the floating variants. So, we probably don't want to go overboard and have too many comprehensive algos in a core package such as sklearn, but focus on the main ones, whereas we can delegate others (e.g., genetic algorithms, which may implementation-wise rely on an external GP package?) to contrib projects.

Anyway, regarding the PR... I didn't mean to drag it on for that long, but between the PR and reviews, other things always came up, and I never got around to adding the docs -- I actually forgot about it at some point. I think the current state is that the implementation is more or less okay and just needs some polishing. What's primarily missing, though, are the docs and more comprehensive unit tests. This is something I can do in the next few days or weeks (now that I am aware of it again), but I also wouldn't mind if someone else works on it. So, let me know if you would like to work on the PR; otherwise, I will make a note for next weekend to look into adding the docs. In any case, I would appreciate feedback regarding the current implementation.
Best,
Sebastian

> On Sep 3, 2018, at 7:50 AM, Guillaume Lemaître wrote:
>
> I would add that Sequential Forward Selection is on the way to being ported to scikit-learn by Sebastian (@rasbt):
> https://github.com/scikit-learn/scikit-learn/pull/8684
[...]


From touqir at ualberta.ca Tue Sep 4 12:55:54 2018
From: touqir at ualberta.ca (Touqir Sajed)
Date: Tue, 4 Sep 2018 10:55:54 -0600
Subject: [scikit-learn] Optimization algorithms in scikit-learn
Message-ID:

Hi,

I have been looking for stochastic optimization algorithms in scikit-learn that are faster than SGD, and so far I have come across Adam and momentum. Are there other methods implemented in scikit-learn? Particularly, the variance reduction methods such as SVRG (https://papers.nips.cc/paper/4937-accelerating-stochastic-gradient-descent-using-predictive-variance-reduction.pdf)? These variance reduction methods are the current state of the art in terms of convergence speed while maintaining a runtime complexity of order n -- the number of features. If they are not implemented yet, I think it would be really great to implement them (I am happy to do so), since nowadays working on large datasets (where L-BFGS may not be practical) is the norm, and the improvements are definitely worth it.

Cheers,
Touqir

--
Computing Science Master's student at University of Alberta, Canada, specializing in Machine Learning. Website: https://ca.linkedin.com/in/touqir-sajed-6a95b1126
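For reference, the SVRG update proposed in the linked paper can be sketched in a few lines of numpy on a least-squares objective. The function below is purely illustrative -- its name and setup are made up here, and it is not part of scikit-learn or the paper's code:

---------------------------
import numpy as np

def svrg_least_squares(X, y, lr=0.01, n_epochs=20, seed=0):
    """Minimal SVRG sketch for the objective 0.5 * ||Xw - y||^2 / n."""
    rng = np.random.RandomState(seed)
    n_samples, n_features = X.shape
    w = np.zeros(n_features)
    for _ in range(n_epochs):
        w_snapshot = w.copy()
        # Full gradient at the snapshot, computed once per epoch.
        full_grad = X.T.dot(X.dot(w_snapshot) - y) / n_samples
        for _ in range(n_samples):
            i = rng.randint(n_samples)
            grad_i = X[i] * (X[i].dot(w) - y[i])                    # stochastic gradient at w
            grad_i_snapshot = X[i] * (X[i].dot(w_snapshot) - y[i])  # same sample, at the snapshot
            # Variance-reduced step: only the snapshot and its full gradient
            # are stored, so memory stays O(n_features), unlike SAG/SAGA.
            w -= lr * (grad_i - grad_i_snapshot + full_grad)
    return w
---------------------------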
From touqir at ualberta.ca Tue Sep 4 13:23:37 2018
From: touqir at ualberta.ca (Touqir Sajed)
Date: Tue, 4 Sep 2018 11:23:37 -0600
Subject: [scikit-learn] Multi Armed Bandit Algorithms in Scikit-learn
Message-ID:

Hi,

This email is intended to initiate a discussion on whether it is worth adding Multi-Armed Bandit (MAB) algorithms to Scikit-learn. For those of you who have not heard of MAB algorithms: they are the simplest form of decision-making algorithm, applicable whenever labeled data is not given beforehand and the objective is to try out different decisions whenever a sample is seen and learn which decision is best in the long run. They are the simplest form of Reinforcement Learning algorithms. While they are not applicable to every decision-making task, they naturally fit a number of problem settings where they are more sample-efficient and simpler than the more advanced RL algorithms. For a number of applications, see: https://www.quora.com/In-what-kind-of-real-life-situations-can-we-use-a-multi-arm-bandit-algorithm. If you want to know more about their usage, how they work or their advantages, feel free to let me know!

I do feel that MAB algorithms should be a part of Scikit-learn, since many of the interesting problems that we face regarding learning are about decision making. There are quite a few GitHub repos with MAB implementations, but their coverage is extremely limited, and I do not know of any dedicated library on MABs. Companies like Yahoo, Microsoft and Google use MABs for ad recommendation and search engine optimization, but their code is not made public.

Cheers,
Touqir

--
Computing Science Master's student at University of Alberta, Canada, specializing in Machine Learning. Website: https://ca.linkedin.com/in/touqir-sajed-6a95b1126


From touqir at ualberta.ca Tue Sep 4 13:25:47 2018
From: touqir at ualberta.ca (Touqir Sajed)
Date: Tue, 4 Sep 2018 11:25:47 -0600
Subject: [scikit-learn] Multi Armed Bandit Algorithms in Scikit-learn
In-Reply-To:
References:
Message-ID:

The corrected link: https://www.quora.com/In-what-kind-of-real-life-situations-can-we-use-a-multi-arm-bandit-algorithm

On Tue, Sep 4, 2018 at 11:23 AM Touqir Sajed wrote:
> Hi,
> This email is intended to initiate a discussion on whether it is worth adding Multi-Armed Bandit (MAB) algorithms to Scikit-learn.
[...]
--
Computing Science Master's student at University of Alberta, Canada, specializing in Machine Learning. Website: https://ca.linkedin.com/in/touqir-sajed-6a95b1126


From t3kcit at gmail.com Tue Sep 4 13:44:31 2018
From: t3kcit at gmail.com (Andreas Mueller)
Date: Tue, 4 Sep 2018 13:44:31 -0400
Subject: [scikit-learn] Optimization algorithms in scikit-learn
In-Reply-To:
References:
Message-ID: <0e8f2f86-1856-46d3-59a8-4f1140dc1962@gmail.com>

Hi Touqir.
We don't usually implement general purpose optimizers in scikit-learn, in particular because usually different optimizers apply to different kinds of problems.
For linear models we have SAG and SAGA, for neural nets we have adam.
I don't think the authors claim to be faster than SAG, so I'm not sure what the motivation would be for using their method.

Best,
Andy

On 09/04/2018 12:55 PM, Touqir Sajed wrote:
> Hi,
> I have been looking for stochastic optimization algorithms in scikit-learn that are faster than SGD, and so far I have come across Adam and momentum.
[...]


From t3kcit at gmail.com Tue Sep 4 13:47:13 2018
From: t3kcit at gmail.com (Andreas Mueller)
Date: Tue, 4 Sep 2018 13:47:13 -0400
Subject: [scikit-learn] Multi Armed Bandit Algorithms in Scikit-learn
In-Reply-To:
References:
Message-ID:

See
http://scikit-learn.org/dev/faq.html#what-are-the-inclusion-criteria-for-new-algorithms
and
http://scikit-learn.org/dev/faq.html#why-is-there-no-support-for-deep-or-reinforcement-learning-will-there-be-support-for-deep-or-reinforcement-learning-in-scikit-learn

Bandit algorithms require a fundamentally different kind of interface than what's in scikit-learn right now, as they are sequential decision-making algorithms.
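To make the interface difference concrete: a toy epsilon-greedy bandit (a hypothetical sketch, not a scikit-learn API or any planned one) has to interleave choosing an arm and learning from the observed reward, one decision at a time, rather than exposing a one-shot fit(X, y)/predict(X) pair:

---------------------------
import numpy as np

class EpsilonGreedyBandit:
    """Toy epsilon-greedy multi-armed bandit (illustrative sketch only)."""

    def __init__(self, n_arms, eps=0.1, seed=0):
        self.eps = eps
        self.rng = np.random.RandomState(seed)
        self.counts = np.zeros(n_arms)
        self.values = np.zeros(n_arms)  # running mean reward per arm

    def select_arm(self):
        if self.rng.rand() < self.eps:
            return self.rng.randint(len(self.counts))  # explore
        return int(np.argmax(self.values))             # exploit

    def update(self, arm, reward):
        self.counts[arm] += 1
        # Incremental update of the mean reward for this arm.
        self.values[arm] += (reward - self.values[arm]) / self.counts[arm]

# Decisions and feedback alternate one sample at a time, which is why a
# one-shot fit/predict interface does not apply:
true_means = [0.2, 0.5, 0.8]
bandit = EpsilonGreedyBandit(n_arms=3)
reward_rng = np.random.RandomState(1)
for _ in range(1000):
    arm = bandit.select_arm()
    bandit.update(arm, reward_rng.binomial(1, true_means[arm]))
print(bandit.values)  # estimates should roughly approach true_means
---------------------------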
On 09/04/2018 01:23 PM, Touqir Sajed wrote:
> Hi,
> This email is intended to initiate a discussion on whether it is worth adding Multi-Armed Bandit (MAB) algorithms to Scikit-learn.
[...]


From touqir at ualberta.ca Tue Sep 4 14:45:09 2018
From: touqir at ualberta.ca (Touqir Sajed)
Date: Tue, 4 Sep 2018 12:45:09 -0600
Subject: [scikit-learn] Optimization algorithms in scikit-learn
In-Reply-To:
References:
Message-ID:

Hi Andreas,

Is there a particular reason why there is no general purpose optimization module? Most of the optimizers (at least the first-order methods) are general purpose, since you just need to feed them the gradient. In some special cases, you probably need a problem-specific formulation for better performance. The advantage of SVRG is that you don't need to store the gradients, which costs storage of order number_of_weights*number_of_samples and is the main problem with SAG and SAGA. Thus, for most neural network models (and even non-NN models), using SAG and SAGA is infeasible on personal computers.

SVRG is not popular in the deep learning community, but it should be noted that SVRG is different from Adam, since it does not tune the step size. Just to clarify, SVRG can be faster than Adam, since it decreases the variance to achieve a convergence rate similar to full-batch methods while being computationally cheap like SGD/Adam. However, one can combine both methods to obtain an even faster algorithm.

Cheers,
Touqir

On Tue, Sep 4, 2018 at 11:46 AM Andreas Mueller wrote:
> Hi Touqir.
> We don't usually implement general purpose optimizers in scikit-learn, in particular because usually different optimizers apply to different kinds of problems.
[...]
--
Computing Science Master's student at University of Alberta, Canada, specializing in Machine Learning. Website: https://ca.linkedin.com/in/touqir-sajed-6a95b1126


From gael.varoquaux at normalesup.org Tue Sep 4 14:53:40 2018
From: gael.varoquaux at normalesup.org (Gael Varoquaux)
Date: Tue, 4 Sep 2018 20:53:40 +0200
Subject: [scikit-learn] Optimization algorithms in scikit-learn
In-Reply-To:
References:
Message-ID: <20180904185340.ldeat4plsgol3j2u@phare.normalesup.org>

This is out of the scope of scikit-learn, which is a toolkit meant to be used for easier machine learning. Optimization is a component of machine learning, but not one that is readily usable by itself.

Gaël

On Tue, Sep 04, 2018 at 12:45:09PM -0600, Touqir Sajed wrote:
> Hi Andreas,
> Is there a particular reason why there is no general purpose optimization module?
[...]
--
Gael Varoquaux
Senior Researcher, INRIA Parietal
NeuroSpin/CEA Saclay, Bat 145, 91191 Gif-sur-Yvette France
Phone: ++ 33-1-69-08-79-68
http://gael-varoquaux.info            http://twitter.com/GaelVaroquaux


From mcasl at unileon.es Thu Sep 6 07:13:58 2018
From: mcasl at unileon.es (Manuel CASTEJÓN LIMAS)
Date: Thu, 6 Sep 2018 13:13:58 +0200
Subject: [scikit-learn] CircleCI
Message-ID:

Dear all,

The contrib projects template hints that authors should use TravisCI, CircleCI and Appveyor. Now that CircleCI has moved to version 2, is there any idea on what to do about it? Will the template be updated? Is it OK if we use only CircleCI?

What do you, core devs, suggest about that?

Best wishes
Manuel


From g.lemaitre58 at gmail.com Thu Sep 6 08:22:22 2018
From: g.lemaitre58 at gmail.com (Guillaume Lemaître)
Date: Thu, 6 Sep 2018 14:22:22 +0200
Subject: [scikit-learn] CircleCI
In-Reply-To:
References:
Message-ID:

Hi Manuel,

Basically, you are free to take any initiative with your CIs, as long as the project is tested cross-platform. Using the different CI services available allows you to speed up the testing. In scikit-learn, we use Travis for Linux checking, Appveyor for Windows, and CircleCI for building the documentation. You could use a single CI service for all of those; however, I am not sure that you have Windows support apart from Appveyor.

I think that we should update the scikit-learn-contrib template with the new template for CircleCI 2.

Cheers,

On Thu, 6 Sep 2018 at 13:16, Manuel CASTEJÓN LIMAS via scikit-learn wrote:
> Dear all,
> The contrib projects template hints that authors should use TravisCI, CircleCI and Appveyor.
[...]
--
Guillaume Lemaitre
INRIA Saclay - Parietal team
Center for Data Science Paris-Saclay
https://glemaitre.github.io/


From manuel.castejon at gmail.com Mon Sep 10 19:02:13 2018
From: manuel.castejon at gmail.com (Manuel Castejón Limas)
Date: Tue, 11 Sep 2018 01:02:13 +0200
Subject: [scikit-learn] CircleCI
In-Reply-To:
References:
Message-ID:

Thank you for your feedback, Guillaume!

I'm still fighting with the CircleCI configuration, but it seems it will be possible to test the Windows versions as well, given that Windows containers are available nowadays. I'll give it a try and then share the experience. From what I've seen in the scikit-learn build_tools scripts, there is room for speeding things up by integrating the miniconda installation into the Docker image instead of performing it on every build job.

Best
Manuel

On Thu, 6 Sep 2018 at 14:24, Guillaume Lemaître wrote:
> Hi Manuel,
> Basically, you are free to take any initiative with your CIs, as long as the project is tested cross-platform.
[...]


From pete.soderling at hakkalabs.co Wed Sep 12 08:50:21 2018
From: pete.soderling at hakkalabs.co (Pete Soderling)
Date: Wed, 12 Sep 2018 17:50:21 +0500
Subject: [scikit-learn] Sharing Data Projects knowledge
Message-ID:

Allow me to introduce you to DataEngConf, a new event that helps technical leaders understand data tools, platforms and algorithms like never before. I recently looked through past SciKit Learn submissions and was intrigued by the similarities between your group and our event.

I'm the founder of the no-bullshit data event for deeply technical professionals, hosted annually in SF, NYC & Barcelona, with many excellent speakers each year. DataEngConf is not the standard watered-down "big data" conference you might have attended previously -- instead we are built and run by software engineers, developers and data geeks.
We have teams from across Europe, the US and Asia participating this year, from companies like Google, BBVA, Apache Foundation, UBER, Schibsted, LINE, Badi, LetGo, Eurecat, Satellogic, Fishtown, Datadog, Intermix, GO-JEK, Databricks, Cabify, Yara and many more.

I am so excited about SciKit Learn and what this community can bring to our event that I would like to offer all of you a 30% discount code, Community30; you can purchase tickets here.

Hope to hear from you.

Cheers,
-pete


From uri at goren4u.com Thu Sep 13 16:21:10 2018
From: uri at goren4u.com (Uri Goren)
Date: Thu, 13 Sep 2018 23:21:10 +0300
Subject: [scikit-learn] sklearn Pipeline decorators
Message-ID:

Hi,

sklearn Pipelines are awesome; I use them all the time for everything. I've been writing a lot of custom transformers lately, and since most of my transformers require no fitting (e.g. replacing all numbers with a "{NUM}" token), I started using transformers as decorators.

See snippet (example usage at the end):
https://github.com/urigoren/decorators4DS/blob/master/decorators4DS/sklearn_dec.py

Do you think this kind of addition should be a part of sklearn?

--
Uri Goren,
Phone: +972-507-649-650
EMail: uri at goren4u.com
Linkedin: il.linkedin.com/in/ugoren/


From olivier.grisel at ensta.org Thu Sep 13 17:44:50 2018
From: olivier.grisel at ensta.org (Olivier Grisel)
Date: Thu, 13 Sep 2018 23:44:50 +0200
Subject: [scikit-learn] sklearn Pipeline decorators
In-Reply-To:
References:
Message-ID:

That's a cool trick, but I am worried it would render our API too "frameworky" for my taste. I think the FunctionTransformer is enough:

http://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.FunctionTransformer.html


From gael.varoquaux at normalesup.org Mon Sep 17 10:24:57 2018
From: gael.varoquaux at normalesup.org (Gael Varoquaux)
Date: Mon, 17 Sep 2018 16:24:57 +0200
Subject: [scikit-learn] Announcing scikit-learn at Inria foundation
Message-ID: <20180917142457.nyleear4yvqwegkq@phare.normalesup.org>

Hi scikit-learn community,

I am very happy to announce a foundation to support scikit-learn at Inria:
https://scikit-learn.fondation-inria.fr

In practice, this gives us a legal vehicle to receive money from private entities, and not only research money. This money will provide stable jobs for some people working on scikit-learn here at Inria, and allow us to grow the team and target more ambitious features, as well as better quality and hopefully more frequent releases.

I have written a blog post about the motivations and the vision behind this new development:
http://gael-varoquaux.info/programming/a-foundation-for-scikit-learn-at-inria.html

Thank you all for being part of the scikit-learn adventure. I am very excited about the new prospects that this is bringing us,

Gaël


From daniel.saxton at gmail.com Mon Sep 17 20:26:37 2018
From: daniel.saxton at gmail.com (Daniel Saxton)
Date: Mon, 17 Sep 2018 19:26:37 -0500
Subject: [scikit-learn] Bootstrapping in sklearn
Message-ID:

Hi all,

As everyone knows, sklearn is excellent for building predictive models, but an area where I believe there is still work to be done is in coming up with measurements of the inherent uncertainties in those models.
(That there is an appetite for this is, I believe, evidenced by the rise in popularity of probabilistic programming.) We can, for example, easily find point estimates for coefficients of linear models in sklearn, but making inferences from those point estimates is not possible without measurements of probable error.

To address this and other problems I authored a package called resample, which implements the bootstrap and other randomization-based procedures, with the goal of performing largely nonparametric statistical inference on a wide class of problems. The package is built entirely on numpy and scipy and so already integrates fairly well with sklearn (there is a tutorial which, among other things, shows applications using the Boston housing data: https://github.com/dsaxton/resample/blob/master/doc/resample.ipynb).

Might there be interest in including something like this as an sklearn-contrib package? Essentially we are taking what is already in sklearn.utils.resample and extending it to include other forms of the bootstrap (e.g., balanced, parametric, stratified and/or smoothed), algorithms for computing automatic confidence intervals, and procedures for doing nonparametric, randomization-based hypothesis testing. Here is the GitHub page: https://github.com/dsaxton/resample

Of course, I would also greatly appreciate any input that others might have on ways that this package could be made more useful.

Thanks,
Daniel


From olivier.grisel at ensta.org Tue Sep 18 03:41:30 2018
From: olivier.grisel at ensta.org (Olivier Grisel)
Date: Tue, 18 Sep 2018 09:41:30 +0200
Subject: [scikit-learn] Bootstrapping in sklearn
In-Reply-To:
References:
Message-ID:

This looks like a very useful project.

There is also scikits-bootstrap [1].
Personally I prefer the flat package namespace of resample (I am not a fan of the 'scikits' namespace package), but I still think it would be great to contact the author to see whether he would be interested in joining efforts.

What both projects currently lack is good sphinx-based documentation that explains, in a couple of paragraphs with examples, what the different non-parametric inference methods are and what the pros and cons of each are (sample complexity, computational complexity, kinds of inference, bias, theoretical asymptotic results, practical discrepancies observed in the finite-sample setting, assumptions made on the distribution of the data...). Ideally the doc would also reference examples (using sphinx-gallery) that highlight the behavior of the tools in both nominal and pathological cases.

[1] https://github.com/cgevans/scikits-bootstrap

--
Olivier


From daniel.saxton at gmail.com Tue Sep 18 07:35:31 2018
From: daniel.saxton at gmail.com (Daniel Saxton)
Date: Tue, 18 Sep 2018 06:35:31 -0500
Subject: [scikit-learn] Bootstrapping in sklearn
In-Reply-To:
References:
Message-ID:

Great, I went ahead and contacted Constantine. Documentation was actually the next thing that I wanted to work on, so hopefully he and I can put something together.

Thanks for the help.

On Tue, Sep 18, 2018 at 2:42 AM Olivier Grisel wrote:
> This looks like a very useful project.
[...]


From jbbrown at kuhp.kyoto-u.ac.jp Tue Sep 18 09:46:37 2018
From: jbbrown at kuhp.kyoto-u.ac.jp (Brown J.B.)
Date: Tue, 18 Sep 2018 22:46:37 +0900
Subject: [scikit-learn] Bootstrapping in sklearn
In-Reply-To:
References:
Message-ID:

Resampling is a very interesting and important contribution which relates very closely to my primary research in applied ML for chemical development. I'd be very interested in contributing documentation and learning new things along the way, but I might be perceived as slow because I am juggling many projects and responsibilities. (I failed once before at timely reviewing of a PR for multi-metric optimization for 0.19.) If that is still acceptable, please let me know, and I'm happy to try to help.

J.B.

On Tue, 18 Sep 2018 at 20:37, Daniel Saxton wrote:
> Great, I went ahead and contacted Constantine. Documentation was actually the next thing that I wanted to work on, so hopefully he and I can put something together.
[...]
From daniel.saxton at gmail.com Tue Sep 18 10:23:11 2018
From: daniel.saxton at gmail.com (Daniel Saxton)
Date: Tue, 18 Sep 2018 09:23:11 -0500
Subject: [scikit-learn] Bootstrapping in sklearn
In-Reply-To:
References:
Message-ID:

J.B.,

Any help would certainly be welcome, no matter how slow. I appreciate the interest.

Thanks,
Daniel

On Tue, Sep 18, 2018, 8:47 AM Brown J.B. via scikit-learn wrote:
> Resampling is a very interesting and important contribution which relates very closely to my primary research in applied ML for chemical development.
[...]
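To make the kind of procedure under discussion concrete, here is a plain-numpy sketch of a percentile bootstrap confidence interval for the mean. The function name is made up for illustration and is not the resample API:

---------------------------
import numpy as np

def bootstrap_ci_mean(x, n_boot=10000, alpha=0.05, seed=0):
    """Percentile bootstrap confidence interval for the mean (sketch)."""
    rng = np.random.RandomState(seed)
    n = len(x)
    boot_means = np.empty(n_boot)
    for b in range(n_boot):
        # Resample the data with replacement and recompute the statistic.
        boot_means[b] = rng.choice(x, size=n, replace=True).mean()
    lower = np.percentile(boot_means, 100 * alpha / 2)
    upper = np.percentile(boot_means, 100 * (1 - alpha / 2))
    return lower, upper

x = np.random.RandomState(42).exponential(size=100)
print(bootstrap_ci_mean(x))  # an interval that should cover the true mean of 1
---------------------------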
From emmanuelarias30 at gmail.com Tue Sep 18 12:33:35 2018
From: emmanuelarias30 at gmail.com (eamanu15)
Date: Tue, 18 Sep 2018 13:33:35 -0300
Subject: [scikit-learn] Bootstrapping in sklearn
Message-ID:

Hello!

> Any help would certainly be welcome, no matter how slow. I appreciate the interest.

That sounds interesting! If you need help, let me know! I would be happy to help.

Regards!
Emmanuel


From luiz.gh at gmail.com Wed Sep 19 10:40:52 2018
From: luiz.gh at gmail.com (Luiz Gustavo Hafemann)
Date: Wed, 19 Sep 2018 10:40:52 -0400
Subject: [scikit-learn] Issues with clone for ensemble of classifiers
Message-ID:

Hello,

I am one of the developers of a library for Dynamic Ensemble Selection (DES) methods (the library is called DESlib), and we are currently working to get the library fully compatible with scikit-learn (to submit it to scikit-learn-contrib). We have "check_estimator" working for most of the classes, but now I am having problems making the classes compatible with GridSearch / other CV functions.

One of the main use cases of this library is to facilitate research in this field, and this led to a design decision that the base classifiers are fit by the user: the DES methods receive a pool of base classifiers that were already fit (this allows users to compare many DES techniques with the same base classifiers). This is creating an issue with GridSearch, since the clone method (defined in sklearn.base) is not cloning the classes as we would like. It does a shallow (non-deep) copy of the parameters, but we would like the pool of base classifiers to be deep-copied.

I analyzed this issue and I could not find a solution that does not require changes to the scikit-learn code. Here is the sequence of steps that causes the problem:

1. GridSearchCV calls "clone" on the DES estimator (link)
2. The clone function calls the "get_params" function of the DES estimator (link, line 60). We don't re-implement this function, so it gets all the parameters, including the pool of classifiers (at this point, they are still "fitted")
3. The clone function then clones each parameter with safe=False (line 62). When cloning the pool of classifiers, the result is a pool that is not "fitted" anymore.

The problem is that, to my knowledge, there is no way for my classifier to inform "clone" that a parameter should always be deep-copied. I see that other ensemble methods in sklearn always fit the base classifiers within the "fit" method of the ensemble, so this problem does not happen there. I would like to know if there is a solution for this problem while having the base classifiers fitted elsewhere.
Here is a short piece of code that reproduces the issue:

---------------------------
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.base import BaseEstimator, ClassifierMixin
from sklearn.ensemble import BaggingClassifier
from sklearn.datasets import load_iris


class MyClassifier(BaseEstimator, ClassifierMixin):
    def __init__(self, base_classifiers, k):
        self.base_classifiers = base_classifiers  # Base classifiers that are already trained
        self.k = k  # Simulate a parameter that we want to do a grid search on

    def fit(self, X_dsel, y_dsel):
        pass  # Here we would fit any parameters for the dynamic selection method, not the base classifiers

    def predict(self, X):
        return self.base_classifiers.predict(X)  # In practice the methods would do something with the predictions of each classifier


X, y = load_iris(return_X_y=True)
X_train, X_dsel, y_train, y_dsel = train_test_split(X, y, test_size=0.5)

base_classifiers = BaggingClassifier()
base_classifiers.fit(X_train, y_train)

clf = MyClassifier(base_classifiers, k=1)

params = {'k': [1, 3, 5, 7]}
grid = GridSearchCV(clf, params)

grid.fit(X_dsel, y_dsel)  # Raises an error that the bagging classifiers are not fitted
---------------------------

Btw, here is the branch that we are using to make the library compatible with sklearn: https://github.com/Menelau/DESlib/tree/sklearn-estimators. The failing test related to this issue is in https://github.com/Menelau/DESlib/blob/sklearn-estimators/deslib/tests/test_des_integration.py#L36

Thanks in advance for any help on this case,

Luiz Gustavo Hafemann


From g.lemaitre58 at gmail.com Wed Sep 19 11:31:25 2018
From: g.lemaitre58 at gmail.com (Guillaume Lemaître)
Date: Wed, 19 Sep 2018 17:31:25 +0200
Subject: [scikit-learn] Issues with clone for ensemble of classifiers
In-Reply-To:
References:
Message-ID:

You don't have anywhere in your class MyClassifier where you are calling base_classifier.fit(...); therefore, when calling base_classifier.predict(...), it will let you know that you did not fit it.

On Wed, 19 Sep 2018 at 16:43, Luiz Gustavo Hafemann wrote:
> Hello,
> I am one of the developers of a library for Dynamic Ensemble Selection (DES) methods (the library is called DESlib), and we are currently working to get the library fully compatible with scikit-learn.
[...]
We don't re-implement this function, so it gets all the parameters, including the pool of classifiers (at this point, they are still "fitted") > The clone function then clones each parameter with safe=False (line 62). When cloning the pool of classifiers, the result is a pool that is not "fitted" anymore. > > The problem is that, to my knowledge, there is no way for my classifier to inform "clone" that a parameter should be always deep copied. I see that other ensemble methods in sklearn always fit the base classifiers within the "fit" method of the ensemble, so this problem does not happen there. I would like to know if there is a solution for this problem while having the base classifiers fitted elsewhere. > > Here is a short code that reproduces the issue: > > --------------------------- > > from sklearn.model_selection import GridSearchCV, train_test_split > from sklearn.base import BaseEstimator, ClassifierMixin > from sklearn.ensemble import BaggingClassifier > from sklearn.datasets import load_iris > > > class MyClassifier(BaseEstimator, ClassifierMixin): > def __init__(self, base_classifiers, k): > self.base_classifiers = base_classifiers # Base classifiers that are already trained > self.k = k # Simulate a parameter that we want to do a grid search on > > def fit(self, X_dsel, y_dsel): > pass # Here we would fit any parameters for the Dynamic selection method, not the base classifiers > > def predict(self, X): > return self.base_classifiers.predict(X) # In practice the methods would do something with the predictions of each classifier > > > X, y = load_iris(return_X_y=True) > X_train, X_dsel, y_train, y_dsel = train_test_split(X, y, test_size=0.5) > > base_classifiers = BaggingClassifier() > base_classifiers.fit(X_train, y_train) > > clf = MyClassifier(base_classifiers, k=1) > > params = {'k': [1, 3, 5, 7]} > grid = GridSearchCV(clf, params) > > grid.fit(X_dsel, y_dsel) # Raises error that the bagging classifiers are not fitted > > --------------------------- > > Btw, here is the branch that we are using to make the library compatible with sklearn: https://github.com/Menelau/DESlib/tree/sklearn-estimators. The failing test related to this issue is in https://github.com/Menelau/DESlib/blob/sklearn-estimators/deslib/tests/test_des_integration.py#L36 > > Thanks in advance for any help on this case, > > Luiz Gustavo Hafemann > > _______________________________________________ > scikit-learn mailing list > scikit-learn at python.org > https://mail.python.org/mailman/listinfo/scikit-learn -- Guillaume Lemaitre INRIA Saclay - Parietal team Center for Data Science Paris-Saclay https://glemaitre.github.io/ From g.lemaitre58 at gmail.com Wed Sep 19 11:34:46 2018 From: g.lemaitre58 at gmail.com (=?UTF-8?Q?Guillaume_Lema=C3=AEtre?=) Date: Wed, 19 Sep 2018 17:34:46 +0200 Subject: [scikit-learn] Issues with clone for ensemble of classifiers In-Reply-To: References: Message-ID: Ups I misread your comment. I don't think that we have currently a mechanism to avoid cloning classifier internally. On Wed, 19 Sep 2018 at 17:31, Guillaume Lema?tre wrote: > > You don't have anywhere in your class MyClassifier where you are > calling base_classifier.fit(...) therefore when calling > base_classifier.predict(...) it will let you know that you did not fit > it. 
From g.lemaitre58 at gmail.com  Wed Sep 19 11:38:46 2018
From: g.lemaitre58 at gmail.com (=?UTF-8?Q?Guillaume_Lema=C3=AEtre?=)
Date: Wed, 19 Sep 2018 17:38:46 +0200
Subject: [scikit-learn] Issues with clone for ensemble of classifiers
In-Reply-To:
References:
Message-ID:

However, there are some open issues around freezing a fitted classifier. You can refer to:

https://github.com/scikit-learn/scikit-learn/issues/8370

with the associated discussion.

On Wed, 19 Sep 2018 at 17:34, Guillaume Lemaître wrote:
>
> Oops, I misread your comment. I don't think that we currently have a mechanism to avoid cloning classifiers internally.

--
Guillaume Lemaitre
INRIA Saclay - Parietal team
Center for Data Science Paris-Saclay
https://glemaitre.github.io/

From luiz.gh at gmail.com  Wed Sep 19 13:10:39 2018
From: luiz.gh at gmail.com (Luiz Gustavo Hafemann)
Date: Wed, 19 Sep 2018 13:10:39 -0400
Subject: [scikit-learn] Issues with clone for ensemble of classifiers
In-Reply-To:
References:
Message-ID:

Guillaume - thank you for the comments. Indeed, an approach to "freeze" a fitted classifier would solve our problem. The GitHub issue seems to have been inactive for a while, but I will check if anyone else is working on it.

Luiz Gustavo

On Wed, Sep 19, 2018 at 12:02 PM wrote:
> However, there are some open issues around freezing a fitted classifier. You can refer to:
>
> https://github.com/scikit-learn/scikit-learn/issues/8370
>
> with the associated discussion.
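
Until a proper freezing mechanism exists, one possible workaround - a sketch only, where PrefitPool is a hypothetical name and not a scikit-learn or DESlib API - is to hide the fitted pool inside a plain holder object: since the holder does not implement get_params, clone(value, safe=False) falls back to copy.deepcopy, so the fitted state survives cloning.

---------------------------

from sklearn.base import BaseEstimator, ClassifierMixin
from sklearn.datasets import load_iris
from sklearn.ensemble import BaggingClassifier
from sklearn.model_selection import GridSearchCV, train_test_split


class PrefitPool:
    # Deliberately NOT a scikit-learn estimator: because this holder has
    # no get_params, clone(value, safe=False) deep-copies it, and the
    # fitted pool inside it is preserved.
    def __init__(self, pool):
        self.pool = pool


class MyClassifier(BaseEstimator, ClassifierMixin):
    def __init__(self, base_classifiers, k):
        self.base_classifiers = base_classifiers  # a PrefitPool instance
        self.k = k

    def fit(self, X_dsel, y_dsel):
        return self  # fit selection parameters only, not the pool

    def predict(self, X):
        return self.base_classifiers.pool.predict(X)


X, y = load_iris(return_X_y=True)
X_train, X_dsel, y_train, y_dsel = train_test_split(X, y, test_size=0.5)

pool = BaggingClassifier().fit(X_train, y_train)

grid = GridSearchCV(MyClassifier(PrefitPool(pool), k=1), {'k': [1, 3, 5, 7]})
grid.fit(X_dsel, y_dsel)  # no "not fitted" error this time

---------------------------

With this, the grid search from the reproduction script above runs without raising an error, at the cost of deep-copying the pool for every parameter setting.
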
From g.lemaitre58 at gmail.com  Wed Sep 19 18:17:59 2018
From: g.lemaitre58 at gmail.com (=?ISO-8859-1?Q?Guillaume_Lema=EEtre?=)
Date: Thu, 20 Sep 2018 00:19:59 +0200
Subject: [scikit-learn] Issues with clone for ensemble of classifiers
In-Reply-To:
Message-ID:

Actually I don't see anything mentioning it in the road map currently. Should it be added?

Sent from my phone - sorry to be brief and potential misspell.

From daniel.saxton at gmail.com  Thu Sep 20 09:01:36 2018
From: daniel.saxton at gmail.com (Daniel Saxton)
Date: Thu, 20 Sep 2018 08:01:36 -0500
Subject: [scikit-learn] Bootstrapping in sklearn
In-Reply-To:
References:
Message-ID:

Olivier,

I got in touch with Constantine from the scikits-bootstrap package and he's interested in merging the two projects. If we were to get some documentation together, do you feel that there is potential for inclusion as an sklearn-contrib package? I believe we would have most of the other requirements (testing, continuous integration, etc.), but is there anything else that you feel is missing?

Thanks,
Daniel

On Tue, Sep 18, 2018 at 2:42 AM Olivier Grisel wrote:
> This looks like a very useful project.
>
> There is also scikits-bootstrap [1]. Personally I prefer the flat package namespace of resample (I am not a fan of the 'scikits' namespace package), but I still think it would be great to contact the author to know if he would be interested in joining efforts.
>
> What is currently lacking from both projects is good sphinx-based documentation that explains in a couple of paragraphs, with examples, what the different non-parametric inference methods are and what the pros and cons of each of them are (sample complexity, computational complexity, kinds of inference, bias, theoretical asymptotic results, practical discrepancies observed in the finite sample setting, assumptions made on the distribution of the data...). Ideally the doc would also have references to examples (using sphinx-gallery) that highlight the behavior of the tools in both nominal and pathological cases.
>
> [1] https://github.com/cgevans/scikits-bootstrap
>
> --
> Olivier
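
As a minimal illustration of the kind of example such documentation could carry - a percentile bootstrap confidence interval for a linear model coefficient, assuming nothing beyond scikit-learn and NumPy (the data here is synthetic):

---------------------------

import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.RandomState(0)
X = rng.normal(size=(200, 1))
y = 3.0 * X.ravel() + rng.normal(size=200)

# Percentile bootstrap: refit on rows resampled with replacement and
# take empirical quantiles of the coefficient of interest.
slopes = []
for _ in range(1000):
    idx = rng.randint(0, len(X), size=len(X))
    slopes.append(LinearRegression().fit(X[idx], y[idx]).coef_[0])

low, high = np.percentile(slopes, [2.5, 97.5])
print("95% CI for the slope: [%.3f, %.3f]" % (low, high))

---------------------------
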
From olivier.grisel at ensta.org  Thu Sep 20 12:54:04 2018
From: olivier.grisel at ensta.org (Olivier Grisel)
Date: Thu, 20 Sep 2018 18:54:04 +0200
Subject: [scikit-learn] Bootstrapping in sklearn
In-Reply-To:
References:
Message-ID:

I believe it would fit in sklearn-contrib even if it's more for statistical inference rather than machine learning style prediction.

Others might disagree.

Anyway, joining efforts to improve documentation, CI, testing and so on is always a good thing for your future users.

--
Olivier

From daniel.saxton at gmail.com  Sun Sep 23 14:33:49 2018
From: daniel.saxton at gmail.com (Daniel Saxton)
Date: Sun, 23 Sep 2018 13:33:49 -0500
Subject: [scikit-learn] Bootstrapping in sklearn
In-Reply-To:
References:
Message-ID:

Thanks, Olivier. We will try adding examples to show how it can be used in conjunction with sklearn to generate confidence intervals on linear model parameters, as well as prediction intervals for other classes of models.

On Thu, Sep 20, 2018, 11:55 AM Olivier Grisel wrote:
> I believe it would fit in sklearn-contrib even if it's more for statistical inference rather than machine learning style prediction.

From t3kcit at gmail.com  Wed Sep 26 12:28:53 2018
From: t3kcit at gmail.com (Andreas Mueller)
Date: Wed, 26 Sep 2018 12:28:53 -0400
Subject: [scikit-learn] Issues with clone for ensemble of classifiers
In-Reply-To:
References:
Message-ID: <7e4d0bcb-f412-ab3c-1282-9a5662eb01b4@gmail.com>

Yes, I actually mentioned that on the roadmap thread. It should definitely be added.

On 09/19/2018 06:17 PM, Guillaume Lemaître wrote:
> Actually I don't see anything mentioning it in the road map currently. Should it be added?

From t3kcit at gmail.com  Wed Sep 26 14:55:57 2018
From: t3kcit at gmail.com (Andreas Mueller)
Date: Wed, 26 Sep 2018 14:55:57 -0400
Subject: [scikit-learn] [ANN] Scikit-learn 0.20.0
Message-ID: <446defc2-55ee-90cd-a897-895fc4930fd8@gmail.com>

Hey everybody!
I'm happy to (finally) announce scikit-learn 0.20.0.
This release is dedicated to the memory of Raghav Rajagopalan.

You can upgrade now with pip or conda!

There are many important additions and updates, and you can find the full release notes here:
http://scikit-learn.org/stable/whats_new.html#version-0-20

My personal highlights are the ColumnTransformer and the changes to OneHotEncoder, but there's so much more!

An important note is that this is the last version to support Python 2.7, and the next release will require Python 3.5.

A big thank you to everybody who contributed, and special thanks to Joel!

All the best,
Andy

From bertrand.thirion at inria.fr  Wed Sep 26 15:39:17 2018
From: bertrand.thirion at inria.fr (bthirion)
Date: Wed, 26 Sep 2018 21:39:17 +0200
Subject: [scikit-learn] [ANN] Scikit-learn 0.20.0
In-Reply-To: <446defc2-55ee-90cd-a897-895fc4930fd8@gmail.com>
References: <446defc2-55ee-90cd-a897-895fc4930fd8@gmail.com>
Message-ID:

Congratulations!

Bertrand

On 26/09/2018 20:55, Andreas Mueller wrote:
> I'm happy to (finally) announce scikit-learn 0.20.0.
> This release is dedicated to the memory of Raghav Rajagopalan.

From raga.markely at gmail.com  Wed Sep 26 15:54:40 2018
From: raga.markely at gmail.com (Raga Markely)
Date: Wed, 26 Sep 2018 15:54:40 -0400
Subject: [scikit-learn] [ANN] Scikit-learn 0.20.0
In-Reply-To: <446defc2-55ee-90cd-a897-895fc4930fd8@gmail.com>
References: <446defc2-55ee-90cd-a897-895fc4930fd8@gmail.com>
Message-ID:

Congratulations! Thank you very much for everyone's hard work!

Raga

From joel.nothman at gmail.com  Wed Sep 26 16:49:52 2018
From: joel.nothman at gmail.com (Joel Nothman)
Date: Thu, 27 Sep 2018 06:49:52 +1000
Subject: [scikit-learn] [ANN] Scikit-learn 0.20.0
In-Reply-To: <446defc2-55ee-90cd-a897-895fc4930fd8@gmail.com>
References: <446defc2-55ee-90cd-a897-895fc4930fd8@gmail.com>
Message-ID:

Wow. It's finally out!! Thank you to the cast of thousands, and also to some individuals for real dedication and insight!

Yet there's so much more still in the pipeline. If we're clever about things, we'll make the next release cycle shorter and the release more manageable.

From t3kcit at gmail.com  Wed Sep 26 16:56:44 2018
From: t3kcit at gmail.com (Andreas Mueller)
Date: Wed, 26 Sep 2018 16:56:44 -0400
Subject: [scikit-learn] [ANN] Scikit-learn 0.20.0
In-Reply-To:
References: <446defc2-55ee-90cd-a897-895fc4930fd8@gmail.com>
Message-ID:

On 09/26/2018 04:49 PM, Joel Nothman wrote:
> Yet there's so much more still in the pipeline. If we're clever about things, we'll make the next release cycle shorter and the release more manageable.

There's always so much more :) And yes, we should strive to cut down our release cycle (significantly). Let's see if we manage.

From joel.nothman at gmail.com  Wed Sep 26 16:59:47 2018
From: joel.nothman at gmail.com (Joel Nothman)
Date: Thu, 27 Sep 2018 06:59:47 +1000
Subject: [scikit-learn] [ANN] Scikit-learn 0.20.0
In-Reply-To:
References: <446defc2-55ee-90cd-a897-895fc4930fd8@gmail.com>
Message-ID:

And for those interested in what's in the pipeline, we are trying to draft a roadmap...
https://github.com/scikit-learn/scikit-learn/wiki/Draft-Roadmap-2018

But there are no doubt many features that are absent there too.

From gael.varoquaux at normalesup.org  Wed Sep 26 17:44:29 2018
From: gael.varoquaux at normalesup.org (Gael Varoquaux)
Date: Wed, 26 Sep 2018 23:44:29 +0200
Subject: [scikit-learn] [ANN] Scikit-learn 0.20.0
In-Reply-To: <446defc2-55ee-90cd-a897-895fc4930fd8@gmail.com>
References: <446defc2-55ee-90cd-a897-895fc4930fd8@gmail.com>
Message-ID: <20180926214429.xqjta3tolnvlm2fx@phare.normalesup.org>

Hurray, thanks to everybody; in particular to those who did the hard work of ironing out the last issues and releasing.

Gaël

--
Gael Varoquaux
Senior Researcher, INRIA Parietal
NeuroSpin/CEA Saclay, Bat 145, 91191 Gif-sur-Yvette France
Phone: ++ 33-1-69-08-79-68
http://gael-varoquaux.info            http://twitter.com/GaelVaroquaux

From denis.engemann at gmail.com  Thu Sep 27 01:48:33 2018
From: denis.engemann at gmail.com (Denis-Alexander Engemann)
Date: Thu, 27 Sep 2018 07:48:33 +0200
Subject: [scikit-learn] [ANN] Scikit-learn 0.20.0
In-Reply-To: <20180926214429.xqjta3tolnvlm2fx@phare.normalesup.org>
References: <446defc2-55ee-90cd-a897-895fc4930fd8@gmail.com> <20180926214429.xqjta3tolnvlm2fx@phare.normalesup.org>
Message-ID:

This is wonderful news! Congrats everyone. I can't wait to check out the game-changing column transformer!

Denis

From ntbaovn at gmail.com  Thu Sep 27 01:59:04 2018
From: ntbaovn at gmail.com (Aiden Nguyen)
Date: Thu, 27 Sep 2018 12:59:04 +0700
Subject: [scikit-learn] [ANN] Scikit-learn 0.20.0
In-Reply-To:
References: <446defc2-55ee-90cd-a897-895fc4930fd8@gmail.com> <20180926214429.xqjta3tolnvlm2fx@phare.normalesup.org>
Message-ID:

Congrats to the whole team!

Aiden Nguyen

--
Nguyen Thien Bao, PhD
Director and Founder, HBB Tech, Vietnam

From olivier.grisel at ensta.org  Thu Sep 27 04:27:11 2018
From: olivier.grisel at ensta.org (Olivier Grisel)
Date: Thu, 27 Sep 2018 10:27:11 +0200
Subject: [scikit-learn] [ANN] Scikit-learn 0.20.0
In-Reply-To:
References: <446defc2-55ee-90cd-a897-895fc4930fd8@gmail.com> <20180926214429.xqjta3tolnvlm2fx@phare.normalesup.org>
Message-ID:

Joy!

From olivier.grisel at ensta.org  Thu Sep 27 04:29:03 2018
From: olivier.grisel at ensta.org (Olivier Grisel)
Date: Thu, 27 Sep 2018 10:29:03 +0200
Subject: [scikit-learn] [ANN] Scikit-learn 0.20.0
In-Reply-To:
References: <446defc2-55ee-90cd-a897-895fc4930fd8@gmail.com>
Message-ID:

On Wed, 26 Sep 2018 at 23:02, Joel Nothman wrote:
> And for those interested in what's in the pipeline, we are trying to draft a roadmap...
> https://github.com/scikit-learn/scikit-learn/wiki/Draft-Roadmap-2018
>
> But there are no doubt many features that are absent there too.

Indeed, it would be great to get some feedback on this roadmap from heavy scikit-learn users: which points do you think are the most important? What is missing from this roadmap?

Feel free to reply to this thread.

--
Olivier

From t3kcit at gmail.com  Thu Sep 27 13:29:22 2018
From: t3kcit at gmail.com (Andreas Mueller)
Date: Thu, 27 Sep 2018 13:29:22 -0400
Subject: [scikit-learn] [ANN] Scikit-learn 0.20.0
In-Reply-To:
References: <446defc2-55ee-90cd-a897-895fc4930fd8@gmail.com>
Message-ID: <97cdc854-03d5-dd83-9e6e-e5dc6bc78e75@gmail.com>

I think we should work on the formatting, make sure it's complete, link it to issues/PRs, and then make this into a public document on the website and request feedback.

Right now it's a bit in a format that is understandable for core developers, but some of the things are not clear to the average audience. Linking the issues/PRs will help that a bit, but we might also want to add a sentence to each point in the roadmap.

I had some issues with the formatting; I'll try to fix that later. Any volunteers for adding the frozen estimator (or has someone added that already?).

Cheers,
Andy

On 09/27/2018 04:29 AM, Olivier Grisel wrote:
> Indeed, it would be great to get some feedback on this roadmap from heavy scikit-learn users: which points do you think are the most important? What is missing from this roadmap?
>
> Feel free to reply to this thread.

From jlopez at ende.cc  Thu Sep 27 19:22:07 2018
From: jlopez at ende.cc (=?UTF-8?Q?Javier_L=C3=B3pez?=)
Date: Fri, 28 Sep 2018 00:22:07 +0100
Subject: [scikit-learn] [ANN] Scikit-learn 0.20.0
In-Reply-To: <97cdc854-03d5-dd83-9e6e-e5dc6bc78e75@gmail.com>
References: <446defc2-55ee-90cd-a897-895fc4930fd8@gmail.com> <97cdc854-03d5-dd83-9e6e-e5dc6bc78e75@gmail.com>
Message-ID:

First of all, congratulations on the release, great work, everyone!

I think model serialization should be a priority. Particularly, I think that (whenever practical) there should be a way of serializing estimators (either unfitted or fitted) in a text-readable format, preferably JSON or PMML/PFA (or several others).

Obviously for some models it is not practical (e.g. random forests with thousands of trees), but for simpler situations I believe it would provide a great tool for model sharing without the dangers of pickling and the versioning hell.

I am (painfully) aware that when rebuilding a model on a different setup it might yield different results; in my company we address that by saving, together with the serialized model, a reasonably small validation dataset together with its predictions; upon deserializing, we check that the rebuilt model reproduces the predictions within some acceptable range.

About the new release, I am particularly happy about the joblib update, as it has been a major source of pain for me over the last year. On that note, I think it would be a good idea to stop vendoring joblib and list it as a dependency instead; wheels, pip and conda are mature enough to handle the situation nowadays.

Last, but not least, it would be great to relax the checks concerning NaNs at prediction time and allow, for instance, an estimator to yield NaNs if any features are NaN; we face that situation when working with ensembles, where a few of the submodels might not get enough features available, but the rest do.

Off the top of my head, that's all; keep up the fantastic work!
J

On Thu, Sep 27, 2018 at 6:31 PM Andreas Mueller wrote:
> I think we should work on the formatting, make sure it's complete, link it to issues/PRs, and then make this into a public document on the website and request feedback.

From mail at sebastianraschka.com  Thu Sep 27 19:37:15 2018
From: mail at sebastianraschka.com (Sebastian Raschka)
Date: Thu, 27 Sep 2018 18:37:15 -0500
Subject: [scikit-learn] [ANN] Scikit-learn 0.20.0
In-Reply-To:
References: <446defc2-55ee-90cd-a897-895fc4930fd8@gmail.com> <97cdc854-03d5-dd83-9e6e-e5dc6bc78e75@gmail.com>
Message-ID: <9E87E48D-0E1F-46A1-8FF6-28B0449E65D2@sebastianraschka.com>

Congrats everyone, this is awesome!!!

I just started teaching an ML course this semester and introduced scikit-learn this week -- it was great timing to demonstrate how well maintained the library is and to praise all the efforts that go into it :).

> I think model serialization should be a priority.

While this could potentially be a bit inefficient for large non-parametric models, I think serialization into a text-readable format has some advantages for real-world use cases. E.g., sharing models in applications (pickle is a bit problematic because of security issues), but also as supplementary material in archives accompanying research articles, etc. (esp. in cases where datasets cannot be shared in their original form due to copyright or other concerns).

Chris Emmery, Chris Wagner and I toyed around with JSON a while back (https://cmry.github.io/notes/serialize), and it could be feasible -- but yeah, it will involve some work, especially with testing things thoroughly for all kinds of estimators. Maybe this could somehow be automated, though, in a grid-search kind of way with a build matrix for estimators and parameters once a general framework has been developed.

> On Sep 27, 2018, at 6:22 PM, Javier López wrote:
>
> I think model serialization should be a priority. Particularly, I think that (whenever practical) there should be a way of serializing estimators (either unfitted or fitted) in a text-readable format, preferably JSON or PMML/PFA (or several others).
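
For a flavor of the simplest possible text-readable round-trip - hyper-parameters only, not fitted state; to_json and from_json are made-up helper names, not a scikit-learn API:

---------------------------

import json

from sklearn.linear_model import LogisticRegression

def to_json(estimator):
    # Serializes constructor parameters only; fitted attributes
    # (coef_, intercept_, ...) are not captured by get_params().
    return json.dumps(estimator.get_params())

def from_json(cls, payload):
    return cls(**json.loads(payload))

clf = LogisticRegression(C=0.5)
restored = from_json(LogisticRegression, to_json(clf))
print(restored.C)  # 0.5

---------------------------
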
From gael.varoquaux at normalesup.org  Fri Sep 28 03:37:41 2018
From: gael.varoquaux at normalesup.org (Gael Varoquaux)
Date: Fri, 28 Sep 2018 09:37:41 +0200
Subject: [scikit-learn] Full time jobs to work on scikit-learn in Paris
Message-ID: <20180928073741.vxfwcrnckcagdrsh@phare.normalesup.org>

Dear list,

I am very happy to announce that the Inria foundation is looking to hire two people to work on scikit-learn in France:

* One Community and Operation Officer:
  https://scikit-learn.fondation-inria.fr/job_coo/
  We need a good mix of communication, organizational, and technical skills to help the team and the community work best together.

* One Performance and Quality Engineer:
  https://scikit-learn.fondation-inria.fr/en/job_performance/
  We need someone who cares about tests, continuous integration, and performance, to help make scikit-learn faster while guaranteeing that it stays as solid as it is.

Please forward this announcement to anyone who might be interested.
Best,

Gaël

From jlopez at ende.cc  Fri Sep 28 05:47:52 2018
From: jlopez at ende.cc (=?UTF-8?Q?Javier_L=C3=B3pez?=)
Date: Fri, 28 Sep 2018 10:47:52 +0100
Subject: [scikit-learn] [ANN] Scikit-learn 0.20.0
In-Reply-To: <9E87E48D-0E1F-46A1-8FF6-28B0449E65D2@sebastianraschka.com>
References: <446defc2-55ee-90cd-a897-895fc4930fd8@gmail.com> <97cdc854-03d5-dd83-9e6e-e5dc6bc78e75@gmail.com> <9E87E48D-0E1F-46A1-8FF6-28B0449E65D2@sebastianraschka.com>
Message-ID:

On Fri, Sep 28, 2018 at 1:03 AM Sebastian Raschka wrote:
> Chris Emmery, Chris Wagner and I toyed around with JSON a while back (https://cmry.github.io/notes/serialize), and it could be feasible

I came across your notes a while back; they were really useful! I hacked a variation of it that didn't need to know the model class in advance:

https://gist.github.com/jlopezpena/2cdd09c56afda5964990d5cf278bfd31

but it is VERY hackish, and it doesn't work with complex models with nested components. (At work we use a further variation of this that also works on pipelines and some specific nested stuff, like `mlxtend`'s `SequentialFeatureSelector`.)

> but yeah, it will involve some work, especially with testing things thoroughly for all kinds of estimators.

I considered making this serialization into an external project, but I think this would be much easier if estimators provided a dunder method `__serialize__` (or whatever) that would handle the idiosyncrasies of each particular family; I don't believe there will be a "one-size-fits-all" solution for this problem. This approach would also make it possible to work on it incrementally, raising a default `NotImplementedError` for estimators that haven't been addressed yet.

In the long run, I also believe that the "proper" way to do this is to allow dumping entire processes into PFA: http://dmg.org/pfa/docs/motivation/

From olivier.grisel at ensta.org  Fri Sep 28 10:30:10 2018
From: olivier.grisel at ensta.org (Olivier Grisel)
Date: Fri, 28 Sep 2018 16:30:10 +0200
Subject: [scikit-learn] [ANN] Scikit-learn 0.20.0
In-Reply-To: <9E87E48D-0E1F-46A1-8FF6-28B0449E65D2@sebastianraschka.com>
References: <446defc2-55ee-90cd-a897-895fc4930fd8@gmail.com> <97cdc854-03d5-dd83-9e6e-e5dc6bc78e75@gmail.com> <9E87E48D-0E1F-46A1-8FF6-28B0449E65D2@sebastianraschka.com>
Message-ID:

> I think model serialization should be a priority.

There is also the ONNX specification that is gaining industrial adoption and that already includes open source exporters for several families of scikit-learn models:

https://github.com/onnx/onnxmltools

--
Olivier

From mcasl at unileon.es  Fri Sep 28 10:46:04 2018
From: mcasl at unileon.es (=?UTF-8?Q?Manuel_CASTEJ=C3=93N_LIMAS?=)
Date: Fri, 28 Sep 2018 16:46:04 +0200
Subject: [scikit-learn] [ANN] Scikit-learn 0.20.0
In-Reply-To: <446defc2-55ee-90cd-a897-895fc4930fd8@gmail.com>
References: <446defc2-55ee-90cd-a897-895fc4930fd8@gmail.com>
Message-ID:

A huge, huge thank you, developers! Keep up the good work!

On Wed, 26 Sep 2018 at 20:57, Andreas Mueller wrote:
> Hey everybody!
> I'm happy to (finally) announce scikit-learn 0.20.0.

From mail at sebastianraschka.com  Fri Sep 28 12:10:50 2018
From: mail at sebastianraschka.com (Sebastian Raschka)
Date: Fri, 28 Sep 2018 11:10:50 -0500
Subject: [scikit-learn] [ANN] Scikit-learn 0.20.0
In-Reply-To:
References: <446defc2-55ee-90cd-a897-895fc4930fd8@gmail.com> <97cdc854-03d5-dd83-9e6e-e5dc6bc78e75@gmail.com> <9E87E48D-0E1F-46A1-8FF6-28B0449E65D2@sebastianraschka.com>
Message-ID:

> > I think model serialization should be a priority.
>
> There is also the ONNX specification that is gaining industrial adoption and that already includes open source exporters for several families of scikit-learn models:
>
> https://github.com/onnx/onnxmltools

Didn't know about that. This is really nice! What do you think about referring to it under http://scikit-learn.org/stable/modules/model_persistence.html to make people aware that this option exists? Would be happy to add a PR.

Best,
Sebastian

From t3kcit at gmail.com  Fri Sep 28 13:38:39 2018
From: t3kcit at gmail.com (Andreas Mueller)
Date: Fri, 28 Sep 2018 13:38:39 -0400
Subject: [scikit-learn] [ANN] Scikit-learn 0.20.0
In-Reply-To:
References: <446defc2-55ee-90cd-a897-895fc4930fd8@gmail.com> <97cdc854-03d5-dd83-9e6e-e5dc6bc78e75@gmail.com> <9E87E48D-0E1F-46A1-8FF6-28B0449E65D2@sebastianraschka.com>
Message-ID: <96edd381-2352-f183-486a-b86e395a78f6@gmail.com>

I don't think an open source runtime has been announced yet (or they didn't email me like they promised lol). I'm quite excited about this as well.

Javier:
The problem is not so much storing the "model" but storing how to make predictions. Different versions could act differently on the same data structure - and the data structure could change. Both happen in scikit-learn.
From t3kcit at gmail.com Fri Sep 28 13:38:39 2018
From: t3kcit at gmail.com (Andreas Mueller)
Date: Fri, 28 Sep 2018 13:38:39 -0400
Subject: [scikit-learn] [ANN] Scikit-learn 0.20.0
In-Reply-To:
References: <446defc2-55ee-90cd-a897-895fc4930fd8@gmail.com> <97cdc854-03d5-dd83-9e6e-e5dc6bc78e75@gmail.com> <9E87E48D-0E1F-46A1-8FF6-28B0449E65D2@sebastianraschka.com>
Message-ID: <96edd381-2352-f183-486a-b86e395a78f6@gmail.com>

On 09/28/2018 12:10 PM, Sebastian Raschka wrote:
>>>> I think model serialization should be a priority.
>>> There is also the ONNX specification that is gaining industrial
>>> adoption and that already includes open source exporters for several
>>> families of scikit-learn models:
>>>
>>> https://github.com/onnx/onnxmltools
>>
>> Didn't know about that. This is really nice! What do you think about
>> referring to it under
>> http://scikit-learn.org/stable/modules/model_persistence.html to make
>> people aware that this option exists?
>> Would be happy to add a PR.

I don't think an open source runtime has been announced yet (or they didn't
email me like they promised lol). I'm quite excited about this as well.

Javier:
The problem is not so much storing the "model" but storing how to make
predictions. Different versions could act differently on the same data
structure - and the data structure could change. Both happen in
scikit-learn. So if you want to make sure the right thing happens across
versions, you either need to provide serialization and deserialization for
every version and conversion between those, or you need to provide a way to
store the prediction function, which basically means you need a
Turing-complete language (that's what ONNX does).

We basically said doing the first is not feasible within scikit-learn given
our current amount of resources, and no-one has even tried doing it outside
of scikit-learn (which would be possible). Implementing a complete
prediction serialization language (the second option) is definitely outside
the scope of sklearn.

From t3kcit at gmail.com Fri Sep 28 13:41:13 2018
From: t3kcit at gmail.com (Andreas Mueller)
Date: Fri, 28 Sep 2018 13:41:13 -0400
Subject: [scikit-learn] [ANN] Scikit-learn 0.20.0
In-Reply-To: <96edd381-2352-f183-486a-b86e395a78f6@gmail.com>
References: <446defc2-55ee-90cd-a897-895fc4930fd8@gmail.com> <97cdc854-03d5-dd83-9e6e-e5dc6bc78e75@gmail.com> <9E87E48D-0E1F-46A1-8FF6-28B0449E65D2@sebastianraschka.com> <96edd381-2352-f183-486a-b86e395a78f6@gmail.com>
Message-ID: <4cfbb327-7489-70ff-8fa3-a21079ec0068@gmail.com>

On 09/28/2018 01:38 PM, Andreas Mueller wrote:
>
> On 09/28/2018 12:10 PM, Sebastian Raschka wrote:
>>>> I think model serialization should be a priority.
>>> There is also the ONNX specification that is gaining industrial
>>> adoption and that already includes open source exporters for several
>>> families of scikit-learn models:
>>>
>>> https://github.com/onnx/onnxmltools
>>
>> Didn't know about that. This is really nice! What do you think about
>> referring to it under
>> http://scikit-learn.org/stable/modules/model_persistence.html to make
>> people aware that this option exists?
>> Would be happy to add a PR.
>
> I don't think an open source runtime has been announced yet (or they
> didn't email me like they promised lol).
> I'm quite excited about this as well.
>
> Javier:
> The problem is not so much storing the "model" but storing how to make
> predictions. Different versions could act differently
> on the same data structure - and the data structure could change. Both
> happen in scikit-learn.
> So if you want to make sure the right thing happens across versions,
> you either need to provide serialization and deserialization for
> every version and conversion between those, or you need to provide a
> way to store the prediction function,
> which basically means you need a Turing-complete language (that's what
> ONNX does).
>
> We basically said doing the first is not feasible within scikit-learn
> given our current amount of resources, and no-one
> has even tried doing it outside of scikit-learn (which would be
> possible).
> Implementing a complete prediction serialization language (the second
> option) is definitely outside the scope of sklearn.

Maybe we should add to the FAQ why serialization is hard?

From mcasl at unileon.es Fri Sep 28 14:34:43 2018
From: mcasl at unileon.es (Manuel CASTEJÓN LIMAS)
Date: Fri, 28 Sep 2018 20:34:43 +0200
Subject: [scikit-learn] [ANN] Scikit-learn 0.20.0
In-Reply-To: <4cfbb327-7489-70ff-8fa3-a21079ec0068@gmail.com>
References: <446defc2-55ee-90cd-a897-895fc4930fd8@gmail.com> <97cdc854-03d5-dd83-9e6e-e5dc6bc78e75@gmail.com> <9E87E48D-0E1F-46A1-8FF6-28B0449E65D2@sebastianraschka.com> <96edd381-2352-f183-486a-b86e395a78f6@gmail.com> <4cfbb327-7489-70ff-8fa3-a21079ec0068@gmail.com>
Message-ID:

How about a Docker-based approach? Just thinking out loud.

Best
Manuel

On Fri, Sep 28,
2018, 19:43 Andreas Mueller wrote:

> On 09/28/2018 01:38 PM, Andreas Mueller wrote:
> >
> > On 09/28/2018 12:10 PM, Sebastian Raschka wrote:
> >>>> I think model serialization should be a priority.
> >>> There is also the ONNX specification that is gaining industrial
> >>> adoption and that already includes open source exporters for several
> >>> families of scikit-learn models:
> >>>
> >>> https://github.com/onnx/onnxmltools
> >>
> >> Didn't know about that. This is really nice! What do you think about
> >> referring to it under
> >> http://scikit-learn.org/stable/modules/model_persistence.html to make
> >> people aware that this option exists?
> >> Would be happy to add a PR.
> >
> > I don't think an open source runtime has been announced yet (or they
> > didn't email me like they promised lol).
> > I'm quite excited about this as well.
> >
> > Javier:
> > The problem is not so much storing the "model" but storing how to make
> > predictions. Different versions could act differently
> > on the same data structure - and the data structure could change. Both
> > happen in scikit-learn.
> > So if you want to make sure the right thing happens across versions,
> > you either need to provide serialization and deserialization for
> > every version and conversion between those, or you need to provide a
> > way to store the prediction function,
> > which basically means you need a Turing-complete language (that's what
> > ONNX does).
> >
> > We basically said doing the first is not feasible within scikit-learn
> > given our current amount of resources, and no-one
> > has even tried doing it outside of scikit-learn (which would be
> > possible).
> > Implementing a complete prediction serialization language (the second
> > option) is definitely outside the scope of sklearn.
>
> Maybe we should add to the FAQ why serialization is hard?
> _______________________________________________
> scikit-learn mailing list
> scikit-learn at python.org
> https://mail.python.org/mailman/listinfo/scikit-learn

From jlopez at ende.cc Fri Sep 28 15:20:04 2018
From: jlopez at ende.cc (Javier López)
Date: Fri, 28 Sep 2018 20:20:04 +0100
Subject: [scikit-learn] [ANN] Scikit-learn 0.20.0
In-Reply-To:
References: <446defc2-55ee-90cd-a897-895fc4930fd8@gmail.com> <97cdc854-03d5-dd83-9e6e-e5dc6bc78e75@gmail.com> <9E87E48D-0E1F-46A1-8FF6-28B0449E65D2@sebastianraschka.com> <96edd381-2352-f183-486a-b86e395a78f6@gmail.com>
Message-ID:

On Fri, Sep 28, 2018 at 6:41 PM Andreas Mueller wrote:

> Javier:
> The problem is not so much storing the "model" but storing how to make
> predictions. Different versions could act differently
> on the same data structure - and the data structure could change. Both
> happen in scikit-learn.
> So if you want to make sure the right thing happens across versions, you
> either need to provide serialization and deserialization for
> every version and conversion between those, or you need to provide a way
> to store the prediction function,
> which basically means you need a Turing-complete language (that's what
> ONNX does).

I understand the difficulty of the situation, but an approximate solution to
that is saving the predictions from a large enough validation set.
If the predictions from the newly created model are "close enough" to the
old ones, we deem the deserialized model to be the same and move forward; if
there are serious discrepancies, then we dive deep to see what's going on,
and if needed refit the offending submodels with the newer version.

Since we only want to compare the predictions here, we don't need a ground
truth, and thus the validation set doesn't even need to be a real dataset:
it can consist of synthetic datapoints created via SMOTE, Caruana's MUNGE
algorithm, or any other method, and can be made arbitrarily large in
advance.

This method has worked reasonably well for us in practice; we deal with
ensembles containing hundreds or thousands of models, and this technique
saves us from having to refit the many models that don't change very often;
and if something changes a lot, we want to know either way, to ascertain
what was amiss (either with the old version or with the new one).

The situation I am proposing is not worse than what we have right now, which
is to save a pickle and then hope that it can be read later on; sometimes it
can, sometimes it cannot, depending on what changed. Stuff unrelated to the
models themselves, such as changes in the joblib dump method, has broken
several of our pickle files in the past. What I would like to have is a
text-based representation of the fitted model that can always be read,
stored in a database, or sent over the wire through simple methods.

J

From t3kcit at gmail.com Fri Sep 28 15:42:48 2018
From: t3kcit at gmail.com (Andreas Mueller)
Date: Fri, 28 Sep 2018 15:42:48 -0400
Subject: [scikit-learn] [ANN] Scikit-learn 0.20.0
In-Reply-To:
References: <446defc2-55ee-90cd-a897-895fc4930fd8@gmail.com> <97cdc854-03d5-dd83-9e6e-e5dc6bc78e75@gmail.com> <9E87E48D-0E1F-46A1-8FF6-28B0449E65D2@sebastianraschka.com> <96edd381-2352-f183-486a-b86e395a78f6@gmail.com>
Message-ID:

On 09/28/2018 03:20 PM, Javier López wrote:
> I understand the difficulty of the situation, but an approximate
> solution to that is saving the predictions from a large enough
> validation set. If the predictions from the newly created model are
> "close enough" to the old ones, we deem the deserialized model to be
> the same and move forward; if there are serious discrepancies, then we
> dive deep to see what's going on, and if needed refit the offending
> submodels with the newer version.

Basically what you're saying is that you're fine with versioning the models
and having the model break loudly if anything changes. That's not actually
what most people want. They want to be able to make predictions with a given
model forever into the future.

Your use-case is similar, but if retraining the model is not an issue, why
don't you want to retrain every time scikit-learn releases a new version?

We're now storing the version of scikit-learn that was used in the pickle
and warn if you're trying to load with a different version. That's basically
a stricter test than what you wanted. Yes, there are false positives, but
given that this release took a year, this doesn't seem that big an issue?
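The validation-set check Javier describes can be sketched in a few lines;
the helper name and the tolerances below are illustrative, not a
scikit-learn API, and it assumes the reference predictions were stored on a
fixed (possibly synthetic) validation set when the model was first fitted:

    import numpy as np

    def predictions_match(model, X_val, ref_preds, rtol=1e-5, atol=1e-8):
        # Compare a re-loaded model's predictions on a fixed validation
        # set against the stored reference predictions.
        return np.allclose(model.predict(X_val), ref_preds,
                           rtol=rtol, atol=atol)

A failed check flags the submodel for a closer look, or for a refit with the
newer version, as described in the exchange above.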
From jlopez at ende.cc Fri Sep 28 16:45:16 2018
From: jlopez at ende.cc (Javier López)
Date: Fri, 28 Sep 2018 21:45:16 +0100
Subject: [scikit-learn] [ANN] Scikit-learn 0.20.0
In-Reply-To:
References: <446defc2-55ee-90cd-a897-895fc4930fd8@gmail.com> <97cdc854-03d5-dd83-9e6e-e5dc6bc78e75@gmail.com> <9E87E48D-0E1F-46A1-8FF6-28B0449E65D2@sebastianraschka.com> <96edd381-2352-f183-486a-b86e395a78f6@gmail.com>
Message-ID:

On Fri, Sep 28, 2018 at 8:46 PM Andreas Mueller wrote:

> Basically what you're saying is that you're fine with versioning the
> models and having the model break loudly if anything changes.
> That's not actually what most people want. They want to be able to make
> predictions with a given model forever into the future.

Are we talking about "(the new version of) the old model can still make
predictions" or "the old model makes exactly the same predictions as
before"? I'd like the first to hold; I don't care that much about the
second.

> Your use-case is similar, but if retraining the model is not an issue,
> why don't you want to retrain every time scikit-learn releases a new
> version?

Thousands of models. I don't want to retrain ALL of them unless needed.

> We're now storing the version of scikit-learn that was used in the
> pickle and warn if you're trying to load with a different version.

This is not the whole truth. Yes, you store the sklearn version on the
pickle and raise a warning; I am mostly ok with that, but the pickles are
brittle and oftentimes they stop loading when other versions of other stuff
change. I am not talking about "Warning: wrong version", but rather
"Unpickling error: expected bytes, found tuple" errors that prevent the file
from loading entirely.

> That's basically a stricter test than what you wanted. Yes, there are
> false positives, but given that this release took a year,
> this doesn't seem that big an issue?

1. Things in the current state break when something else changes, not only
sklearn.
2. Sharing pickles is a bad practice due to a number of reasons.
3. We might want to explore model parameters without having to load the
entire runtime.

Also, in order to retrain the model we need to keep the whole model
description with parameters. This needs to be saved somewhere, which in the
current state would force us to keep two files: one with the parameters (in
a text format, to avoid the "non-loading" problems from above) and the pkl
with the fitted model. My proposal would keep both in a single file.

As mentioned in previous emails, we already have our own solution that
kind-of-works for our needs, but we have to do a few hackish things to keep
things running. If sklearn estimators simply included a text serialization
method (similar in spirit to the one used for `__display__` or `__repr__`)
it would make things easier. But I understand that not everyone's needs are
the same, so if you guys don't consider this type of thing a priority, we
can live with that :) I mostly mentioned it since "Backwards-compatible
de/serialization of some estimators" is listed in the roadmap as a desirable
goal for version 1.0, and feedback on such roadmap was requested.

J
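As a bare-bones illustration of the kind of text serialization being asked
for, consider the sketch below; `to_json` is a hypothetical helper, not a
scikit-learn API, and it assumes every fitted attribute is a NumPy array or
a plain Python value:

    import json

    import numpy as np
    from sklearn.linear_model import LogisticRegression

    def to_json(estimator):
        # By convention, fitted attributes end with a trailing underscore.
        attrs = {k: (v.tolist() if isinstance(v, np.ndarray) else v)
                 for k, v in vars(estimator).items() if k.endswith("_")}
        return json.dumps({"class": type(estimator).__name__,
                           "params": estimator.get_params(),
                           "attrs": attrs})

    clf = LogisticRegression().fit([[0.0], [1.0]], [0, 1])
    print(to_json(clf))

A real version would also have to handle nested estimators and non-array
attributes, which is where a per-estimator hook like the `__serialize__`
dunder suggested earlier in the thread would come in.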
From t3kcit at gmail.com Fri Sep 28 17:17:54 2018
From: t3kcit at gmail.com (Andreas Mueller)
Date: Fri, 28 Sep 2018 17:17:54 -0400
Subject: [scikit-learn] [ANN] Scikit-learn 0.20.0
In-Reply-To:
References: <446defc2-55ee-90cd-a897-895fc4930fd8@gmail.com> <97cdc854-03d5-dd83-9e6e-e5dc6bc78e75@gmail.com> <9E87E48D-0E1F-46A1-8FF6-28B0449E65D2@sebastianraschka.com> <96edd381-2352-f183-486a-b86e395a78f6@gmail.com>
Message-ID: <55553cf0-4080-d2dd-1879-27e06505751b@gmail.com>

On 09/28/2018 04:45 PM, Javier López wrote:
> On Fri, Sep 28, 2018 at 8:46 PM Andreas Mueller wrote:
>
> > Basically what you're saying is that you're fine with versioning the
> > models and having the model break loudly if anything changes.
> > That's not actually what most people want. They want to be able to
> > make predictions with a given model forever into the future.
>
> Are we talking about "(the new version of) the old model can still
> make predictions" or "the old model makes exactly the same predictions
> as before"? I'd like the first to hold; I don't care that much about
> the second.

The second.

> > We're now storing the version of scikit-learn that was used in the
> > pickle and warn if you're trying to load with a different version.
>
> This is not the whole truth. Yes, you store the sklearn version on the
> pickle and raise a warning; I am mostly ok with that, but the pickles
> are brittle and oftentimes they stop loading when other versions of
> other stuff change. I am not talking about "Warning: wrong version",
> but rather "Unpickling error: expected bytes, found tuple" errors that
> prevent the file from loading entirely.

Can you give examples of that? That shouldn't really happen afaik.

> > That's basically a stricter test than what you wanted. Yes, there are
> > false positives, but given that this release took a year,
> > this doesn't seem that big an issue?
>
> 1. Things in the current state break when something else changes, not
> only sklearn.
> 2. Sharing pickles is a bad practice due to a number of reasons.
> 3. We might want to explore model parameters without having to load
> the entire runtime.

I agree, it would be great to have something other than pickle, but as I
said, the usual request is "I want a way for a model to make the same
predictions in the future". If you have a way to do that with a text-based
format that doesn't require writing lots of version converters I'd be very
happy.

Generally, what you want is not to store the model but to store the
prediction function, and have separate runtimes for training and prediction.
It might not be possible to represent a model from a previous version of
scikit-learn in a newer version.

From awnystrom at gmail.com Sat Sep 29 02:34:38 2018
From: awnystrom at gmail.com (Andrew Nystrom)
Date: Fri, 28 Sep 2018 23:34:38 -0700
Subject: [scikit-learn] Explaining pull request #12197 - Fast PolynomialFeatures on CSR matrices
Message-ID:

Hi scikit-learn community.

I'm making a pull request to add a method that allows polynomial features to
be computed on compressed sparse row (CSR) matrices directly. It takes
advantage of data sparsity, only taking products of nonzero features, so the
work scales with the density raised to the power of the degree of the
expansion, and the speedup over the dense method grows accordingly. The
method is laid out in this work.
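The core of the trick can be shown with a toy snippet: for each CSR row,
only the nonzero entries are ever paired up, so the per-row work depends on
the number of nonzeros rather than on the total number of columns (an
illustration of the idea only, not the Cython implementation in the PR):

    from itertools import combinations_with_replacement

    def row_degree2(indices, data):
        # Degree-2 terms for a single CSR row: every product of two
        # nonzero entries, keyed by the column pair that produced it.
        return {(i, j): di * dj
                for (i, di), (j, dj)
                in combinations_with_replacement(zip(indices, data), 2)}

    print(row_degree2([0, 3, 7], [1.0, 2.0, 0.5]))
    # {(0, 0): 1.0, (0, 3): 2.0, (0, 7): 0.5, (3, 3): 4.0, ...}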
This yields impressive speedups, as seen in this plot:

[image: blippity.png]

This is for degree = 2, and the effects become even more pronounced for
degree = 3. All the lines for the dense method overlap each other because
the dense method doesn't leverage sparsity. Also, notice that the CSR
algorithm approaches the time of the dense algorithm as the density
approaches 1.0. There's even a slight speedup for the fully dense case,
which I attribute to the fact that the code's in Cython.

One might wonder why I didn't include the times for compressed sparse column
(CSC) matrices, which PolynomialFeatures does support. The reason is that
this is very, very slow; in fact, it's slower than just passing a dense
matrix. I ran a single trial with a 100 x 500 matrix with a density of 0.5.
For degree=2, the CSC algorithm took 56.88 seconds while the CSR algorithm
took 0.1363 seconds. I tried degree=3 as well, but gave up after waiting
what seemed like 20 minutes for the CSC method; the CSR method finished in
35.38 seconds.

The only caveat is that the algorithm requires a function to be derived for
each degree. I've done this, as laid out in the paper, for degrees 2 and 3.
I don't think people will typically want higher degrees due to the explosion
of the size of the feature space, but I did lay out the pattern for deriving
the functions, just in case.

Test results:

[image: Screen Shot 2018-09-28 at 10.34.19 PM.png]
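From the user's side, usage would look like ordinary PolynomialFeatures
with CSR input; a small sketch, assuming a scikit-learn build that includes
the CSR fast path from this PR:

    import scipy.sparse as sp
    from sklearn.preprocessing import PolynomialFeatures

    # A random sparse matrix; with the PR, the expansion stays sparse and
    # only products of nonzero entries are computed.
    X = sp.random(1000, 100, density=0.1, format="csr", random_state=0)
    Xp = PolynomialFeatures(degree=2, include_bias=False).fit_transform(X)
    print(Xp.shape)  # (1000, 5150): 100 linear terms + 5050 products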