From 2jasonsanchez at gmail.com Mon Oct 1 21:11:52 2018 From: 2jasonsanchez at gmail.com (Jason Sanchez) Date: Mon, 1 Oct 2018 18:11:52 -0700 Subject: [scikit-learn] scikit-learn Digest, Vol 30, Issue 25 In-Reply-To: References: Message-ID: The current roadmap is amazing. One feature that would be exciting is better support for multilayer stacking with caching and the ability to add models to already trained layers. I saw this history: https://github.com/scikit-learn/scikit-learn/pull/8960 This library is very close: * API is somewhat awkward, but otherwise good. Does not cache intermediate steps. https://wolpert.readthedocs.io/en/latest/index.html These solutions seem to allow only two layers: * https://github.com/scikit-learn/scikit-learn/issues/4816#issuecomment-217817717 * https://rasbt.github.io/mlxtend/user_guide/classifier/StackingClassifier/ * https://github.com/scikit-learn/scikit-learn/pull/6674 The people who put these other libraries together have made an incredibly welcome effort to solve a real need and it would be amazing to see a payoff for their effort in the form of an addition of stacking to scikit-learn's core library. As another data point, I attached a simple implementation I put together to illustrate what I think are core needs of this feature. Feel free to browse the code. Here is the short list: * Infinite layers (or at least 3 ;) ) * Choice of CV or OOB for each model * Ability to add a new model to a layer after the stacked ensemble has been trained and refit the pipeline such that only models that must be retrained are retrained (i.e. train the added model and retrain all models in higher layers) * All standard scikit-learn pipeline goodness (introspection, grid search, serializability, etc) Thanks all! This library is making a real difference for good in the lives of many people. Jason On Fri, Sep 28, 2018 at 11:35 AM wrote: > Send scikit-learn mailing list submissions to > scikit-learn at python.org > > To subscribe or unsubscribe via the World Wide Web, visit > https://mail.python.org/mailman/listinfo/scikit-learn > or, via email, send a message with subject or body 'help' to > scikit-learn-request at python.org > > You can reach the person managing the list at > scikit-learn-owner at python.org > > When replying, please edit your Subject line so it is more specific > than "Re: Contents of scikit-learn digest..." > > > Today's Topics: > > 1. Re: [ANN] Scikit-learn 0.20.0 (Sebastian Raschka) > 2. Re: [ANN] Scikit-learn 0.20.0 (Andreas Mueller) > 3. Re: [ANN] Scikit-learn 0.20.0 (Andreas Mueller) > 4. Re: [ANN] Scikit-learn 0.20.0 (Manuel CASTEJ?N LIMAS) > > > ---------------------------------------------------------------------- > > Message: 1 > Date: Fri, 28 Sep 2018 11:10:50 -0500 > From: Sebastian Raschka > To: Scikit-learn mailing list > Subject: Re: [scikit-learn] [ANN] Scikit-learn 0.20.0 > Message-ID: > > Content-Type: text/plain; charset=us-ascii > > > > > > I think model serialization should be a priority. > > > > There is also the ONNX specification that is gaining industrial adoption > and that already includes open source exporters for several families of > scikit-learn models: > > > > https://github.com/onnx/onnxmltools > > > Didn't know about that. This is really nice! What do you think about > referring to it under > http://scikit-learn.org/stable/modules/model_persistence.html to make > people aware that this option exists? > Would be happy to add a PR. 
> > Best, > Sebastian > > > > > On Sep 28, 2018, at 9:30 AM, Olivier Grisel > wrote: > > > > > > > I think model serialization should be a priority. > > > > There is also the ONNX specification that is gaining industrial adoption > and that already includes open source exporters for several families of > scikit-learn models: > > > > https://github.com/onnx/onnxmltools > > > > -- > > Olivier > > _______________________________________________ > > scikit-learn mailing list > > scikit-learn at python.org > > https://mail.python.org/mailman/listinfo/scikit-learn > > > > ------------------------------ > > Message: 2 > Date: Fri, 28 Sep 2018 13:38:39 -0400 > From: Andreas Mueller > To: scikit-learn at python.org > Subject: Re: [scikit-learn] [ANN] Scikit-learn 0.20.0 > Message-ID: <96edd381-2352-f183-486a-b86e395a78f6 at gmail.com> > Content-Type: text/plain; charset=utf-8; format=flowed > > > > On 09/28/2018 12:10 PM, Sebastian Raschka wrote: > >>> I think model serialization should be a priority. > >> There is also the ONNX specification that is gaining industrial > adoption and that already includes open source exporters for several > families of scikit-learn models: > >> > >> https://github.com/onnx/onnxmltools > > > > Didn't know about that. This is really nice! What do you think about > referring to it under > http://scikit-learn.org/stable/modules/model_persistence.html to make > people aware that this option exists? > > Would be happy to add a PR. > > > > > I don't think an open source runtime has been announced yet (or they > didn't email me like they promised lol). > I'm quite excited about this as well. > > Javier: > The problem is not so much storing the "model" but storing how to make > predictions. Different versions could act differently > on the same data structure - and the data structure could change. Both > happen in scikit-learn. > So if you want to make sure the right thing happens across versions, you > either need to provide serialization and deserialization for > every version and conversion between those or you need to provide a way > to store the prediction function, > which basically means you need a turing-complete language (that's what > ONNX does). > > We basically said doing the first is not feasible within scikit-learn > given our current amount of resources, and no-one > has even tried doing it outside of scikit-learn (which would be possible). > Implementing a complete prediction serialization language (the second > option) is definitely outside the scope of sklearn. > > > > > ------------------------------ > > Message: 3 > Date: Fri, 28 Sep 2018 13:41:13 -0400 > From: Andreas Mueller > To: scikit-learn at python.org > Subject: Re: [scikit-learn] [ANN] Scikit-learn 0.20.0 > Message-ID: <4cfbb327-7489-70ff-8fa3-a21079ec0068 at gmail.com> > Content-Type: text/plain; charset=utf-8; format=flowed > > > > On 09/28/2018 01:38 PM, Andreas Mueller wrote: > > > > > > On 09/28/2018 12:10 PM, Sebastian Raschka wrote: > >>>> I think model serialization should be a priority. > >>> There is also the ONNX specification that is gaining industrial > >>> adoption and that already includes open source exporters for several > >>> families of scikit-learn models: > >>> > >>> https://github.com/onnx/onnxmltools > >> > >> Didn't know about that. This is really nice! What do you think about > >> referring to it under > >> http://scikit-learn.org/stable/modules/model_persistence.html to make > >> people aware that this option exists? > >> Would be happy to add a PR. 
> >> > >> > > I don't think an open source runtime has been announced yet (or they > > didn't email me like they promised lol). > > I'm quite excited about this as well. > > > > Javier: > > The problem is not so much storing the "model" but storing how to make > > predictions. Different versions could act differently > > on the same data structure - and the data structure could change. Both > > happen in scikit-learn. > > So if you want to make sure the right thing happens across versions, > > you either need to provide serialization and deserialization for > > every version and conversion between those or you need to provide a > > way to store the prediction function, > > which basically means you need a turing-complete language (that's what > > ONNX does). > > > > We basically said doing the first is not feasible within scikit-learn > > given our current amount of resources, and no-one > > has even tried doing it outside of scikit-learn (which would be > > possible). > > Implementing a complete prediction serialization language (the second > > option) is definitely outside the scope of sklearn. > > > > > Maybe we should add to the FAQ why serialization is hard? > > > ------------------------------ > > Message: 4 > Date: Fri, 28 Sep 2018 20:34:43 +0200 > From: Manuel CASTEJ?N LIMAS > To: Scikit-learn user and developer mailing list > > Subject: Re: [scikit-learn] [ANN] Scikit-learn 0.20.0 > Message-ID: > UFntYo02YkR9YwrCjicb8A3cutpN47L4MYZWxeNNYP+1A at mail.gmail.com> > Content-Type: text/plain; charset="utf-8" > > How about a docker based approach? Just thinking out loud > Best > Manuel > > El vie., 28 sept. 2018 19:43, Andreas Mueller escribi?: > > > > > > > On 09/28/2018 01:38 PM, Andreas Mueller wrote: > > > > > > > > > On 09/28/2018 12:10 PM, Sebastian Raschka wrote: > > >>>> I think model serialization should be a priority. > > >>> There is also the ONNX specification that is gaining industrial > > >>> adoption and that already includes open source exporters for several > > >>> families of scikit-learn models: > > >>> > > >>> https://github.com/onnx/onnxmltools > > >> > > >> Didn't know about that. This is really nice! What do you think about > > >> referring to it under > > >> http://scikit-learn.org/stable/modules/model_persistence.html to make > > >> people aware that this option exists? > > >> Would be happy to add a PR. > > >> > > >> > > > I don't think an open source runtime has been announced yet (or they > > > didn't email me like they promised lol). > > > I'm quite excited about this as well. > > > > > > Javier: > > > The problem is not so much storing the "model" but storing how to make > > > predictions. Different versions could act differently > > > on the same data structure - and the data structure could change. Both > > > happen in scikit-learn. > > > So if you want to make sure the right thing happens across versions, > > > you either need to provide serialization and deserialization for > > > every version and conversion between those or you need to provide a > > > way to store the prediction function, > > > which basically means you need a turing-complete language (that's what > > > ONNX does). > > > > > > We basically said doing the first is not feasible within scikit-learn > > > given our current amount of resources, and no-one > > > has even tried doing it outside of scikit-learn (which would be > > > possible). > > > Implementing a complete prediction serialization language (the second > > > option) is definitely outside the scope of sklearn. 
> > > > > > > > Maybe we should add to the FAQ why serialization is hard? > > _______________________________________________ > > scikit-learn mailing list > > scikit-learn at python.org > > https://mail.python.org/mailman/listinfo/scikit-learn > > > -------------- next part -------------- > An HTML attachment was scrubbed... > URL: < > http://mail.python.org/pipermail/scikit-learn/attachments/20180928/f52258e8/attachment.html > > > > ------------------------------ > > Subject: Digest Footer > > _______________________________________________ > scikit-learn mailing list > scikit-learn at python.org > https://mail.python.org/mailman/listinfo/scikit-learn > > > ------------------------------ > > End of scikit-learn Digest, Vol 30, Issue 25 > ******************************************** > -------------- next part -------------- An HTML attachment was scrubbed... URL: -------------- next part -------------- An HTML attachment was scrubbed... URL: -------------- next part -------------- A non-text attachment was scrubbed... Name: Stacking (2).ipynb Type: application/octet-stream Size: 28943 bytes Desc: not available URL: From mcasl at unileon.es Tue Oct 2 04:11:56 2018 From: mcasl at unileon.es (=?UTF-8?Q?Manuel_CASTEJ=C3=93N_LIMAS?=) Date: Tue, 2 Oct 2018 10:11:56 +0200 Subject: [scikit-learn] scikit-learn Digest, Vol 30, Issue 25 In-Reply-To: References: Message-ID: I would propose PipeGraph for stacking, it comes natural and it could help a lot in making things easier for core developers. Disclaimer: I'm coauthor of PipeGraph Manuel Castej?n Limas Escuela de Ingenier?as Industrial, Inform?tica y Aeroespacial Universidad de Le?n Campus de Vegazana sn. 24071. Le?n. Spain. e-mail: manuel.castejon at unileon.es Tel.: +34 987 291 779 Aviso de confidencialidad Confidentiality Notice El mar., 2 oct. 2018 a las 3:13, Jason Sanchez (<2jasonsanchez at gmail.com>) escribi?: > The current roadmap is amazing. One feature that would be exciting is > better support for multilayer stacking with caching and the ability to add > models to already trained layers. > > I saw this history: https://github.com/scikit-learn/scikit-learn/pull/8960 > > This library is very close: > * API is somewhat awkward, but otherwise good. Does not cache intermediate > steps. https://wolpert.readthedocs.io/en/latest/index.html > > These solutions seem to allow only two layers: > * > https://github.com/scikit-learn/scikit-learn/issues/4816#issuecomment-217817717 > * > https://rasbt.github.io/mlxtend/user_guide/classifier/StackingClassifier/ > * https://github.com/scikit-learn/scikit-learn/pull/6674 > > The people who put these other libraries together have made an incredibly > welcome effort to solve a real need and it would be amazing to see a payoff > for their effort in the form of an addition of stacking to scikit-learn's > core library. > > As another data point, I attached a simple implementation I put together > to illustrate what I think are core needs of this feature. Feel free to > browse the code. Here is the short list: > * Infinite layers (or at least 3 ;) ) > * Choice of CV or OOB for each model > * Ability to add a new model to a layer after the stacked ensemble has > been trained and refit the pipeline such that only models that must be > retrained are retrained (i.e. train the added model and retrain all models > in higher layers) > * All standard scikit-learn pipeline goodness (introspection, grid search, > serializability, etc) > > Thanks all! 
This library is making a real difference for good in the lives > of many people. > > Jason > > > On Fri, Sep 28, 2018 at 11:35 AM wrote: > >> [...] > _______________________________________________ > scikit-learn mailing list > scikit-learn at python.org > https://mail.python.org/mailman/listinfo/scikit-learn > -------------- next part -------------- An HTML attachment was scrubbed... 
URL: From alex at garel.org Tue Oct 2 09:28:05 2018 From: alex at garel.org (Alex Garel) Date: Tue, 2 Oct 2018 14:28:05 +0100 Subject: [scikit-learn] [ANN] Scikit-learn 0.20.0 In-Reply-To: References: <446defc2-55ee-90cd-a897-895fc4930fd8@gmail.com> Message-ID: Le 26/09/2018 à 21:59, Joel Nothman a écrit : > And for those interested in what's in the pipeline, we are trying to > draft a > roadmap... https://github.com/scikit-learn/scikit-learn/wiki/Draft-Roadmap-2018 Hello, First of all, thanks for the incredible work on scikit-learn. I found the roadmap quite cool and in line with some of my own concerns. In particular: * "Make it easier for external users to write Scikit-learn-compatible components" - really a great goal for a stable ecosystem * "Passing around information that is not (X, y)" - I have faced it. * "Better interface for interactive development" (wow - very feature - such cool - how many great !) * Improved tracking of fitting (useful for early stopping during hyperparameter search, or simply for testing a model in a notebook) However, here are some aspects that I, modestly, would like to see (maybe for some of them there is already work in progress or an external lib - let me know): * Chunk processing (a kind of streaming-data handling): when dealing with a lot of data, the ability to partial_fit, then use transform on chunks of data is a great help. But it is not well exposed in the current doc and API, and a lot of models do not support it, while they could. Also, Pipeline does not support partial_fit and there is no partial_fit_transform. * While handling "Passing around information that is not (X, y)", is there any plan to have transform able to transform both X and y? This would ease lots of problems like subsampling, resampling, or masking data when it is too incomplete. In my case, for example, while transforming words to vectors, I may end up with sentences full of out-of-vocabulary words, hence some samples I would like to set aside, but I can't because I do not have my hands on y (and introducing it would make me lose the ability to use my precious pipeline). I think Python offers possibilities to handle the API change (for example, we could have a new transform_xy method, and a compatibility transform using it until deprecation). Also, I understand that changing the API is always a big deal. But I think scikit-learn, because of its API, has played a good role in standardizing the Python ML ecosystem, and this is a key contribution. Not dealing with mature new needs and some of the API's initial flaws may disserve the whole community, as new independent and inconsistent APIs will flourish and no other project has the legitimacy of scikit-learn. So courage :-) Also, good integrations with popular frameworks like keras or gensim would be great (but that is the goal of third-party packages, of course). Of course, writing all this, I don't want to sound pedantic. I know I'm not so experienced with scikit-learn (nor did I contribute to it), so take it for what it is. Have a good day! Alex -- Alexandre Garel tel : +33 7 68 52 69 07 / +213 656 11 85 10 skype: alexgarel / ring: ba0435e11af36e32e9b4eb13c19c52fd75c7b4b0 -------------- next part -------------- An HTML attachment was scrubbed... URL: -------------- next part -------------- A non-text attachment was scrubbed... 
Name: signature.asc Type: application/pgp-signature Size: 228 bytes Desc: OpenPGP digital signature URL: From michael_reupold at trimble.com Tue Oct 2 11:35:30 2018 From: michael_reupold at trimble.com (Michael Reupold) Date: Tue, 2 Oct 2018 17:35:30 +0200 Subject: [scikit-learn] Splitting Method on RandomForestClassifier Message-ID: Hello all, I am currently struggling to find information on which specific split method is used in RandomForestClassifier. Is it a random selection? A median? The best of a set of methods? Kind regards Michael Reupold -------------- next part -------------- An HTML attachment was scrubbed... URL: From t3kcit at gmail.com Tue Oct 2 11:46:01 2018 From: t3kcit at gmail.com (Andreas Mueller) Date: Tue, 2 Oct 2018 11:46:01 -0400 Subject: [scikit-learn] [ANN] Scikit-learn 0.20.0 In-Reply-To: References: <446defc2-55ee-90cd-a897-895fc4930fd8@gmail.com> Message-ID: <42b1846e-0552-4d39-51dc-498cc24fd126@gmail.com> Thank you for your feedback, Alex! On 10/02/2018 09:28 AM, Alex Garel wrote: > > * Chunk processing (a kind of streaming-data handling): when dealing > with a lot of data, the ability to partial_fit, then use transform > on chunks of data is a great help. But it is not well exposed in > the current doc and API, > This has been discussed in the past, but it looks like no-one was excited enough about it to add it to the roadmap. This would require quite some additions to the API. Olivier, who has been quite interested in this before, now seems to be more interested in integration with dask, which might achieve the same thing. > > * and a lot of models do not support it, while they could. > Can you give examples of that? > > * Also, Pipeline does not support partial_fit and there is no > partial_fit_transform. > What would you expect those to do? Each step in the pipeline might require passing over the whole dataset multiple times before being able to transform anything. That basically makes the current interface impossible to use with a pipeline. Even if only a single pass over the dataset were required, that wouldn't work with the current interface. If we were handing around generators that allow looping over the whole data, that would work. But it would be unclear how to support a streaming setting. > * While handling "Passing around information that is not (X, y)", is > there any plan to have transform able to transform both X and y? > This would ease lots of problems like subsampling, resampling, or > masking data when it is too incomplete. > An API for subsampling is on the roadmap :) > > -------------- next part -------------- An HTML attachment was scrubbed... URL: From g.lemaitre58 at gmail.com Tue Oct 2 11:49:38 2018 From: g.lemaitre58 at gmail.com (=?UTF-8?Q?Guillaume_Lema=C3=AEtre?=) Date: Tue, 2 Oct 2018 17:49:38 +0200 Subject: [scikit-learn] Splitting Method on RandomForestClassifier In-Reply-To: References: Message-ID: In Random Forest, the best split for each feature is selected. The Extra Randomized Trees will make a random split instead. On Tue, 2 Oct 2018 at 17:43, Michael Reupold wrote: > > Hello all, > I am currently struggling to find information on which specific split method is used in RandomForestClassifier. Is it a random selection? A median? The best of a set of methods? 
> > Kind regards > > Michael Reupold > > _______________________________________________ > scikit-learn mailing list > scikit-learn at python.org > https://mail.python.org/mailman/listinfo/scikit-learn -- Guillaume Lemaitre INRIA Saclay - Parietal team Center for Data Science Paris-Saclay https://glemaitre.github.io/ From t3kcit at gmail.com Tue Oct 2 11:56:06 2018 From: t3kcit at gmail.com (Andreas Mueller) Date: Tue, 2 Oct 2018 11:56:06 -0400 Subject: [scikit-learn] scikit-learn Digest, Vol 30, Issue 25 In-Reply-To: References: Message-ID: <939d940d-ebf1-1a66-d993-301df5a65262@gmail.com> Thank you for your feedback! On 10/01/2018 09:11 PM, Jason Sanchez wrote: > The current roadmap is amazing. One feature that would be exciting is > better support for multilayer stacking with caching and the ability to > add models to already trained layers. > > I saw this history: https://github.com/scikit-learn/scikit-learn/pull/8960 > I think we still want to include something like this. I guess maybe it wasn't thought of as major enough to make the roadmap. The roadmap mostly has API changes and things that impact more than one estimator. This is "just" adding an estimator for the most part. > This library is very close: > * API is somewhat awkward, but otherwise good. Does not cache > intermediate steps. https://wolpert.readthedocs.io/en/latest/index.html If we reuse pipelines, we might get this "for free" to some degree. > > > As another data point, I attached a simple implementation I put > together to illustrate what I think are core needs of this feature. > Feel free to browse the code.?Here is the short list: > * Infinite layers (or at least 3 ;) ) Pretty sure that'll happen > * Choice of CV or OOB for each model This is less likely to happen in an initial version, I think. These two things have traditionally been very separate. We could potentially add to the roadmap to make this easier? (actually I just did) > * Ability to add a new model to a layer after the stacked ensemble has > been trained and refit the pipeline such that only models that must be > retrained are retrained (i.e. train the added model and retrain all > models in higher layers) This is the "freezing estimators" that's on the roadmap. > * All standard scikit-learn pipeline goodness (introspection, grid > search, serializability, etc) > That's a given for anything in sklearn ;) From gael.varoquaux at normalesup.org Tue Oct 2 12:01:41 2018 From: gael.varoquaux at normalesup.org (Gael Varoquaux) Date: Tue, 2 Oct 2018 18:01:41 +0200 Subject: [scikit-learn] [ANN] Scikit-learn 0.20.0 In-Reply-To: References: <97cdc854-03d5-dd83-9e6e-e5dc6bc78e75@gmail.com> <9E87E48D-0E1F-46A1-8FF6-28B0449E65D2@sebastianraschka.com> <96edd381-2352-f183-486a-b86e395a78f6@gmail.com> Message-ID: <20181002160141.tqhhypm423ce4nef@phare.normalesup.org> On Fri, Sep 28, 2018 at 09:45:16PM +0100, Javier L?pez wrote: > This is not the whole truth. Yes, you store the sklearn version on the pickle > and raise a warning; I am mostly ok with that, but the pickles are brittle and > oftentimes they stop loading when other versions of other stuff change. I am > not talking about "Warning: wrong version", but rather "Unpickling error: > expected bytes, found tuple" that prevent the file from loading entirely. > [...] > 1. Things in the current state break when something else changes, not only > sklearn. > 2. Sharing pickles is a bad practice due to a number of reasons. 
The reason that pickles are brittle and that sharing pickles is a bad practice is that pickle uses an implicitly defined data model, which is defined via the internals of objects. The "right" solution is to use an explicit data model. This is for instance what is done with an object database. However, this comes at the cost of making it very hard to change objects. First, all objects must be stored with a schema (or language) that is rich enough to represent them, and yet defined somewhat explicitly (to avoid running into the problems of pickle). Second, if the internal representation of an object changes, there needs to be explicit conversion code to go from one version to the next. Typically, upgrades of websites that use an object database need maintainers to write this conversion code. So, the problems of pickle are not specific to pickle, but rather intrinsic to any generic persistence code [*]. Writing persistence code that does not fall into these problems is very costly in terms of developer time and makes it harder to add new methods or improve existing ones. I am not excited about it. Rather, the good practice is that if you want to deploy models, you deploy them on the exact same environment that you trained them on. The web world is very used to doing that (because they keep falling into these problems), and has developed technology to do this, such as docker containers. I know that it is clunky technology. I don't like it myself, but I don't see a way out of it with our resources. Gaël [*] Back in the day, when I was working on Mayavi, we developed our own persistence code because we were not happy with pickle. It was not pleasant to maintain, and had the same "smell" as pickle. I don't think that it was a great use of our time. From t3kcit at gmail.com Tue Oct 2 12:20:40 2018 From: t3kcit at gmail.com (Andreas Mueller) Date: Tue, 2 Oct 2018 12:20:40 -0400 Subject: [scikit-learn] [ANN] Scikit-learn 0.20.0 In-Reply-To: <20181002160141.tqhhypm423ce4nef@phare.normalesup.org> References: <97cdc854-03d5-dd83-9e6e-e5dc6bc78e75@gmail.com> <9E87E48D-0E1F-46A1-8FF6-28B0449E65D2@sebastianraschka.com> <96edd381-2352-f183-486a-b86e395a78f6@gmail.com> <20181002160141.tqhhypm423ce4nef@phare.normalesup.org> Message-ID: <0160befc-7931-9988-4e0d-7cf27da9ba0c@gmail.com> On 10/02/2018 12:01 PM, Gael Varoquaux wrote: > > So, the problems of pickle are not specific to pickle, but rather > intrinsic to any generic persistence code [*]. Writing persistence code that > does not fall into these problems is very costly in terms of developer time > and makes it harder to add new methods or improve existing ones. I am not > excited about it. > I think having MS, FB, Amazon, IBM, Nvidia, Intel, ... maintain our generic persistence code is a decent deal for us *if* it works out ;) https://onnx.ai/ (MS is providing sklearn to ONNX converters and is extending ONNX to allow for more sklearn estimators to be expressed in ONNX). Containers are a reasonable fallback, though. -------------- next part -------------- An HTML attachment was scrubbed... 
URL: From gael.varoquaux at normalesup.org Tue Oct 2 12:38:01 2018 From: gael.varoquaux at normalesup.org (Gael Varoquaux) Date: Tue, 2 Oct 2018 18:38:01 +0200 Subject: [scikit-learn] [ANN] Scikit-learn 0.20.0 In-Reply-To: <0160befc-7931-9988-4e0d-7cf27da9ba0c@gmail.com> References: <9E87E48D-0E1F-46A1-8FF6-28B0449E65D2@sebastianraschka.com> <96edd381-2352-f183-486a-b86e395a78f6@gmail.com> <20181002160141.tqhhypm423ce4nef@phare.normalesup.org> <0160befc-7931-9988-4e0d-7cf27da9ba0c@gmail.com> Message-ID: <20181002163801.ofmvkgls36pdkhbg@phare.normalesup.org> On Tue, Oct 02, 2018 at 12:20:40PM -0400, Andreas Mueller wrote: > I think having solution is to have MS, FB, Amazon, IBM, Nvidia, intel,... > maintain our generic persistent code is a decent deal for us if it works out ;) > https://onnx.ai/ I'll take that deal! :) +1 for onnx, absolutely! G From mail at sebastianraschka.com Tue Oct 2 14:23:00 2018 From: mail at sebastianraschka.com (Sebastian Raschka) Date: Tue, 2 Oct 2018 13:23:00 -0500 Subject: [scikit-learn] Splitting Method on RandomForestClassifier In-Reply-To: References: Message-ID: This is explained here http://scikit-learn.org/stable/modules/ensemble.html#random-forests: "In addition, when splitting a node during the construction of the tree, the split that is chosen is no longer the best split among all features. Instead, the split that is picked is the best split among a random subset of the features." and the "best split" (in the decision trees) among the random feature subset is based on maximizing information gain or equivalently minimizing child node impurity as described here: http://scikit-learn.org/stable/modules/tree.html#mathematical-formulation Looking at this, I have a question though ... In the docs (http://scikit-learn.org/stable/modules/tree.html#mathematical-formulation) it says "Select the parameters that minimises the impurity" and "Recurse for subsets Q_left and Q_right until the maximum allowable depth is reached" So but this is basically not the whole definition, right? There should be condition that if the weighted average of the child node impurities for any given feature is not smaller than the parent node impurity, the tree growing algorithm would terminate, right? Best, Sebastian > On Oct 2, 2018, at 10:49 AM, Guillaume Lema?tre wrote: > > In Random Forest, the best split for each feature is selected. The > Extra Randomized Trees will make a random split instead. > On Tue, 2 Oct 2018 at 17:43, Michael Reupold > wrote: >> >> Hello all, >> I currently struggle to find information what or which specific split Methods are used on the RandomForestClassifier. Is it a random selection? A median? The best of a set of methods? >> >> Kind regards >> >> Michael Reupold >> >> _______________________________________________ >> scikit-learn mailing list >> scikit-learn at python.org >> https://mail.python.org/mailman/listinfo/scikit-learn > > > > -- > Guillaume Lemaitre > INRIA Saclay - Parietal team > Center for Data Science Paris-Saclay > https://glemaitre.github.io/ > _______________________________________________ > scikit-learn mailing list > scikit-learn at python.org > https://mail.python.org/mailman/listinfo/scikit-learn From g.lemaitre58 at gmail.com Tue Oct 2 15:01:08 2018 From: g.lemaitre58 at gmail.com (=?ISO-8859-1?Q?Guillaume_Lema=EEtre?=) Date: Tue, 02 Oct 2018 21:01:08 +0200 Subject: [scikit-learn] Splitting Method on RandomForestClassifier In-Reply-To: Message-ID: This is driven by the parameter min_impurity_decrease. 
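A minimal sketch of that stopping rule (illustrative only, not from this thread; it uses the iris toy data): a node is split only if the best candidate split achieves a weighted impurity decrease of at least min_impurity_decrease, so a positive threshold makes the tree stop growing earlier.

    from sklearn.datasets import load_iris
    from sklearn.tree import DecisionTreeClassifier

    X, y = load_iris(return_X_y=True)

    # Default behaviour: keep splitting as long as any impurity decrease is possible
    # (subject to the other stopping parameters such as max_depth).
    deep = DecisionTreeClassifier(min_impurity_decrease=0.0, random_state=0).fit(X, y)

    # Require a sizeable weighted impurity decrease before a node may be split.
    shallow = DecisionTreeClassifier(min_impurity_decrease=0.1, random_state=0).fit(X, y)

    # The thresholded tree stops growing earlier and ends up with fewer nodes.
    print(deep.tree_.node_count, shallow.tree_.node_count)

The same criterion applies to each tree inside a RandomForestClassifier, which accepts the same parameter.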
Sent from my phone - sorry to be brief and potential misspell. ? Original Message ? From: mail at sebastianraschka.com Sent: 2 October 2018 20:48 To: scikit-learn at python.org Reply to: scikit-learn at python.org Subject: Re: [scikit-learn] Splitting Method on RandomForestClassifier This is explained here http://scikit-learn.org/stable/modules/ensemble.html#random-forests: "In addition, when splitting a node during the construction of the tree, the split that is chosen is no longer the best split among all features. Instead, the split that is picked is the best split among a random subset of the features." and the "best split" (in the decision trees) among the random feature subset is based on maximizing information gain or equivalently minimizing child node impurity as described here: http://scikit-learn.org/stable/modules/tree.html#mathematical-formulation Looking at this, I have a question though ... In the docs (http://scikit-learn.org/stable/modules/tree.html#mathematical-formulation) it says "Select the parameters that minimises the impurity" and "Recurse for subsets Q_left and Q_right until the maximum allowable depth is reached" So but this is basically not the whole definition, right? There should be condition that if the weighted average of the child node impurities for any given feature is not smaller than the parent node impurity, the tree growing algorithm would terminate, right? Best, Sebastian > On Oct 2, 2018, at 10:49 AM, Guillaume Lema?tre wrote: > > In Random Forest, the best split for each feature is selected. The > Extra Randomized Trees will make a random split instead. > On Tue, 2 Oct 2018 at 17:43, Michael Reupold > wrote: >> >> Hello all, >> I currently struggle to find information what or which specific split Methods are used on the RandomForestClassifier. Is it a random selection? A median? The best of a set of methods? >> >> Kind regards >> >> Michael Reupold >> >> _______________________________________________ >> scikit-learn mailing list >> scikit-learn at python.org >> https://mail.python.org/mailman/listinfo/scikit-learn > > > > -- > Guillaume Lemaitre > INRIA Saclay - Parietal team > Center for Data Science Paris-Saclay > https://glemaitre.github.io/ > _______________________________________________ > scikit-learn mailing list > scikit-learn at python.org > https://mail.python.org/mailman/listinfo/scikit-learn _______________________________________________ scikit-learn mailing list scikit-learn at python.org https://mail.python.org/mailman/listinfo/scikit-learn From alex at garel.org Wed Oct 3 04:18:47 2018 From: alex at garel.org (Alex Garel) Date: Wed, 3 Oct 2018 09:18:47 +0100 Subject: [scikit-learn] [ANN] Scikit-learn 0.20.0 In-Reply-To: <42b1846e-0552-4d39-51dc-498cc24fd126@gmail.com> References: <446defc2-55ee-90cd-a897-895fc4930fd8@gmail.com> <42b1846e-0552-4d39-51dc-498cc24fd126@gmail.com> Message-ID: <693f287d-0afe-3691-a368-b0641601c091@garel.org> Le 02/10/2018 ? 16:46, Andreas Mueller a ?crit?: > Thank you for your feedback Alex! Thanks for answering ! > > On 10/02/2018 09:28 AM, Alex Garel wrote: >> >> * chunk processing (kind of handling streaming data) :? when >> dealing with lot of data, the ability to fit_partial, then use >> transform on chunks of data is of good help. But it's not well >> exposed in current doc and API, >> > This has been discussed in the past, but it looks like no-one was > excited enough about it to add it to the roadmap. > This would require quite some additions to the API. 
Olivier, who has > been quite interested in this before now seems > to be more interested in integration with dask, which might achieve > the same thing. I've tried to use Dask on my side, but for now, though going quite ahead, I didn't suceed completly because (in my specific case) of memory issues (dask default schedulers do not specialize processes on tasks, and I had some memory consuming tasks but I didn't get far enough to write my own scheduler). However I might deal with that later (not writing a scheduler but sharing memory with mmap, in this case). But yes Dask is about the "chunk instead of really streaming" approach (which was my point). >> * and a lot of models do not support it, while they could. >> > Can you give examples of that? Hum I spoke maybe too fast ! Greping the code give me some example at least, and it's true that a DecisionTree does not hold it naturally ! >> * Also pipeline does not support fit_partial and there is not >> fit_transform_partial. >> > What would you expect those to do? Each step in the pipeline might > require passing over the whole dataset multiple times > before being able to transform anything. That basically makes the > current interface impossible to work with the pipeline. > Even if only a single pass of the dataset was required, that wouldn't > work with the current interface. > If we would be handing around generators that allow to loop over the > whole data, that would work. But it would be unclear > how to support a streaming setting. You're right, I didn't think hard enough about it ! BTW I made some test using generators and making fit / transform build pipelines that I consumed latter on (tried with plain iterators and streamz). It did work somehow, with much hacks, but in my specific case, performance where not good enough. (real problem was not framework performance, but my architecture where I realize, that constantly re-generating data instead of doing it once was not fast enough). So finally my points were not so good, but at least I did learn something ;-) Thanks for your time. -- Alexandre Garel tel : +33 7 68 52 69 07 / +213 656 11 85 10 skype: alexgarel / ring: ba0435e11af36e32e9b4eb13c19c52fd75c7b4b0 -------------- next part -------------- An HTML attachment was scrubbed... URL: -------------- next part -------------- A non-text attachment was scrubbed... Name: signature.asc Type: application/pgp-signature Size: 228 bytes Desc: OpenPGP digital signature URL: From jlopez at ende.cc Wed Oct 3 06:49:10 2018 From: jlopez at ende.cc (=?UTF-8?Q?Javier_L=C3=B3pez?=) Date: Wed, 3 Oct 2018 11:49:10 +0100 Subject: [scikit-learn] [ANN] Scikit-learn 0.20.0 In-Reply-To: <20181002160141.tqhhypm423ce4nef@phare.normalesup.org> References: <97cdc854-03d5-dd83-9e6e-e5dc6bc78e75@gmail.com> <9E87E48D-0E1F-46A1-8FF6-28B0449E65D2@sebastianraschka.com> <96edd381-2352-f183-486a-b86e395a78f6@gmail.com> <20181002160141.tqhhypm423ce4nef@phare.normalesup.org> Message-ID: On Tue, Oct 2, 2018 at 5:07 PM Gael Varoquaux wrote: > The reason that pickles are brittle and that sharing pickles is a bad > practice is that pickle use an implicitly defined data model, which is > defined via the internals of objects. > Plus the fact that loading a pickle can execute arbitrary code, and there is no way to know if any malicious code is in there in advance because the contents of the pickle cannot be easily inspected without loading/executing it. > So, the problems of pickle are not specific to pickle, but rather > intrinsic to any generic persistence code [*]. 
Writing persistence code > that > does not fall in these problems is very costly in terms of developer time > and makes it harder to add new methods or improve existing one. I am not > excited about it. > My "text-based serialization" suggestion was nowhere near as ambitious as that, as I have already explained, and wasn't aiming at solving the versioning issues, but rather at having something which is "about as good" as pickle but in a human-readable format. I am not asking for a Turing-complete language to reproduce the prediction function, but rather something simple in the spirit of the output produced by the gist code I linked above, just for the model families where it is reasonable: https://gist.github.com/jlopezpena/2cdd09c56afda5964990d5cf278bfd31 The code I posted mostly works (specific cases of nested models need to be addressed separately, as well as pipelines), and we have been using (a version of) it in production for quite some time. But there are hackish aspects to it that we are not happy with, such as the manual separation of init and fitted parameters by checking if the name ends with "_", having to infer class name and location using "model.__class__.__name__" and "model.__module__", and the wacky use of "__import__". My suggestion was more along the lines of adding some metadata to sklearn estimators so that a code in a similar style would be nicer to write; little things like having a `init_parameters` and `fit_parameters` properties that would return the lists of named parameters, or a `model_info` method that would return data like sklearn version, class name and location, or a package level dictionary pointing at the estimator classes by a string name, like from sklearn.linear_models import LogisticRegression estimator_classes = {"LogisticRegression": LogisticRegression, ...} so that one can load the appropriate class from the string description without calling __import__ or eval; that sort of stuff. I am aware this would not address the common complain of "prefect prediction reproducibility" across versions, but I think we can all agree that this utopia of perfect reproducibility is not feasible. And in the long, long run, I agree that PFA/onnx or whichever similar format that emerges, is the way to go. J -------------- next part -------------- An HTML attachment was scrubbed... URL: From mail at sebastianraschka.com Wed Oct 3 13:14:13 2018 From: mail at sebastianraschka.com (Sebastian Raschka) Date: Wed, 3 Oct 2018 12:14:13 -0500 Subject: [scikit-learn] [ANN] Scikit-learn 0.20.0 In-Reply-To: References: <97cdc854-03d5-dd83-9e6e-e5dc6bc78e75@gmail.com> <9E87E48D-0E1F-46A1-8FF6-28B0449E65D2@sebastianraschka.com> <96edd381-2352-f183-486a-b86e395a78f6@gmail.com> <20181002160141.tqhhypm423ce4nef@phare.normalesup.org> Message-ID: The ONNX-approach sounds most promising, esp. because it will also allow library interoperability but I wonder if this is for parametric models only and not for the nonparametric ones like KNN, tree-based classifiers, etc. All-in-all I can definitely see the appeal for having a way to export sklearn estimators in a text-based format (e.g., via JSON), since it would make sharing code easier. This doesn't even have to be compatible with multiple sklearn versions. A typical use case would be to include these JSON exports as e.g., supplemental files of a research paper for other people to run the models etc. 
(here, one can just specify which sklearn version it would require; of course, one could also share pickle files, by I am personally always hesitant reg. running/trusting other people's pickle files). Unfortunately though, as Gael pointed out, this "feature" would be a huge burden for the devs, and it would probably also negatively impact the development of scikit-learn itself because it imposes another design constraint. However, I do think this sounds like an excellent case for a contrib project. Like scikit-export, scikit-serialize or sth like that. Best, Sebastian > On Oct 3, 2018, at 5:49 AM, Javier L?pez wrote: > > > On Tue, Oct 2, 2018 at 5:07 PM Gael Varoquaux wrote: > The reason that pickles are brittle and that sharing pickles is a bad > practice is that pickle use an implicitly defined data model, which is > defined via the internals of objects. > > Plus the fact that loading a pickle can execute arbitrary code, and there is no way to know > if any malicious code is in there in advance because the contents of the pickle cannot > be easily inspected without loading/executing it. > > So, the problems of pickle are not specific to pickle, but rather > intrinsic to any generic persistence code [*]. Writing persistence code that > does not fall in these problems is very costly in terms of developer time > and makes it harder to add new methods or improve existing one. I am not > excited about it. > > My "text-based serialization" suggestion was nowhere near as ambitious as that, > as I have already explained, and wasn't aiming at solving the versioning issues, but > rather at having something which is "about as good" as pickle but in a human-readable > format. I am not asking for a Turing-complete language to reproduce the prediction > function, but rather something simple in the spirit of the output produced by the gist code I linked above, just for the model families where it is reasonable: > > https://gist.github.com/jlopezpena/2cdd09c56afda5964990d5cf278bfd31 > > The code I posted mostly works (specific cases of nested models need to be addressed > separately, as well as pipelines), and we have been using (a version of) it in production > for quite some time. But there are hackish aspects to it that we are not happy with, > such as the manual separation of init and fitted parameters by checking if the name ends with "_", having to infer class name and location using > "model.__class__.__name__" and "model.__module__", and the wacky use of "__import__". > > My suggestion was more along the lines of adding some metadata to sklearn estimators so > that a code in a similar style would be nicer to write; little things like having a `init_parameters` and `fit_parameters` properties that would return the lists of named parameters, > or a `model_info` method that would return data like sklearn version, class name and location, or a package level dictionary pointing at the estimator classes by a string name, like > > from sklearn.linear_models import LogisticRegression > estimator_classes = {"LogisticRegression": LogisticRegression, ...} > > so that one can load the appropriate class from the string description without calling __import__ or eval; that sort of stuff. > > I am aware this would not address the common complain of "prefect prediction reproducibility" > across versions, but I think we can all agree that this utopia of perfect reproducibility is not > feasible. > > And in the long, long run, I agree that PFA/onnx or whichever similar format that emerges, is > the way to go. 
> > J > _______________________________________________ > scikit-learn mailing list > scikit-learn at python.org > https://mail.python.org/mailman/listinfo/scikit-learn From nick.pentreath at gmail.com Wed Oct 3 15:32:03 2018 From: nick.pentreath at gmail.com (Nick Pentreath) Date: Wed, 3 Oct 2018 23:32:03 +0400 Subject: [scikit-learn] [ANN] Scikit-learn 0.20.0 In-Reply-To: References: <97cdc854-03d5-dd83-9e6e-e5dc6bc78e75@gmail.com> <9E87E48D-0E1F-46A1-8FF6-28B0449E65D2@sebastianraschka.com> <96edd381-2352-f183-486a-b86e395a78f6@gmail.com> <20181002160141.tqhhypm423ce4nef@phare.normalesup.org> Message-ID: For ONNX you may be interested in https://github.com/onnx/onnxmltools - which supports conversion of a few skelarn models to ONNX already. However as far as I am aware, none of the ONNX backends actually support the ONNX-ML extended spec (in open-source at least). So you would not be able to actually do prediction I think... As for PFA, to my current knowledge there is no library that does it yet. Our own Aardpfark project (https://github.com/CODAIT/aardpfark) focuses on SparkML export to PFA for now but would like to add sklearn support in the future. On Wed, 3 Oct 2018 at 20:07 Sebastian Raschka wrote: > The ONNX-approach sounds most promising, esp. because it will also allow > library interoperability but I wonder if this is for parametric models only > and not for the nonparametric ones like KNN, tree-based classifiers, etc. > > All-in-all I can definitely see the appeal for having a way to export > sklearn estimators in a text-based format (e.g., via JSON), since it would > make sharing code easier. This doesn't even have to be compatible with > multiple sklearn versions. A typical use case would be to include these > JSON exports as e.g., supplemental files of a research paper for other > people to run the models etc. (here, one can just specify which sklearn > version it would require; of course, one could also share pickle files, by > I am personally always hesitant reg. running/trusting other people's pickle > files). > > Unfortunately though, as Gael pointed out, this "feature" would be a huge > burden for the devs, and it would probably also negatively impact the > development of scikit-learn itself because it imposes another design > constraint. > > However, I do think this sounds like an excellent case for a contrib > project. Like scikit-export, scikit-serialize or sth like that. > > Best, > Sebastian > > > > > On Oct 3, 2018, at 5:49 AM, Javier L?pez wrote: > > > > > > On Tue, Oct 2, 2018 at 5:07 PM Gael Varoquaux < > gael.varoquaux at normalesup.org> wrote: > > The reason that pickles are brittle and that sharing pickles is a bad > > practice is that pickle use an implicitly defined data model, which is > > defined via the internals of objects. > > > > Plus the fact that loading a pickle can execute arbitrary code, and > there is no way to know > > if any malicious code is in there in advance because the contents of the > pickle cannot > > be easily inspected without loading/executing it. > > > > So, the problems of pickle are not specific to pickle, but rather > > intrinsic to any generic persistence code [*]. Writing persistence code > that > > does not fall in these problems is very costly in terms of developer time > > and makes it harder to add new methods or improve existing one. I am not > > excited about it. 
> > > > My "text-based serialization" suggestion was nowhere near as ambitious > as that, > > as I have already explained, and wasn't aiming at solving the versioning > issues, but > > rather at having something which is "about as good" as pickle but in a > human-readable > > format. I am not asking for a Turing-complete language to reproduce the > prediction > > function, but rather something simple in the spirit of the output > produced by the gist code I linked above, just for the model families where > it is reasonable: > > > > https://gist.github.com/jlopezpena/2cdd09c56afda5964990d5cf278bfd31 > > > > The code I posted mostly works (specific cases of nested models need to > be addressed > > separately, as well as pipelines), and we have been using (a version of) > it in production > > for quite some time. But there are hackish aspects to it that we are not > happy with, > > such as the manual separation of init and fitted parameters by checking > if the name ends with "_", having to infer class name and location using > > "model.__class__.__name__" and "model.__module__", and the wacky use of > "__import__". > > > > My suggestion was more along the lines of adding some metadata to > sklearn estimators so > > that a code in a similar style would be nicer to write; little things > like having a `init_parameters` and `fit_parameters` properties that would > return the lists of named parameters, > > or a `model_info` method that would return data like sklearn version, > class name and location, or a package level dictionary pointing at the > estimator classes by a string name, like > > > > from sklearn.linear_models import LogisticRegression > > estimator_classes = {"LogisticRegression": LogisticRegression, ...} > > > > so that one can load the appropriate class from the string description > without calling __import__ or eval; that sort of stuff. > > > > I am aware this would not address the common complain of "prefect > prediction reproducibility" > > across versions, but I think we can all agree that this utopia of > perfect reproducibility is not > > feasible. > > > > And in the long, long run, I agree that PFA/onnx or whichever similar > format that emerges, is > > the way to go. > > > > J > > _______________________________________________ > > scikit-learn mailing list > > scikit-learn at python.org > > https://mail.python.org/mailman/listinfo/scikit-learn > > _______________________________________________ > scikit-learn mailing list > scikit-learn at python.org > https://mail.python.org/mailman/listinfo/scikit-learn > -------------- next part -------------- An HTML attachment was scrubbed... URL: From t3kcit at gmail.com Thu Oct 4 11:40:02 2018 From: t3kcit at gmail.com (Andreas Mueller) Date: Thu, 4 Oct 2018 11:40:02 -0400 Subject: [scikit-learn] [ANN] Scikit-learn 0.20.0 In-Reply-To: References: <97cdc854-03d5-dd83-9e6e-e5dc6bc78e75@gmail.com> <9E87E48D-0E1F-46A1-8FF6-28B0449E65D2@sebastianraschka.com> <96edd381-2352-f183-486a-b86e395a78f6@gmail.com> <20181002160141.tqhhypm423ce4nef@phare.normalesup.org> Message-ID: On 10/03/2018 03:32 PM, Nick Pentreath wrote: > For ONNX you may be interested in > https://github.com/onnx/onnxmltools?- which supports conversion of a > few skelarn models to ONNX already. > > However as far as I am aware, none of the ONNX backends actually > support the ONNX-ML extended spec (in open-source at least). So you > would not be able to actually do prediction I think... Exactly, that's what I'm waiting for. MS is working on itafaik. 
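For those who want to try the ONNX export path mentioned above, a minimal conversion sketch is shown below. The import locations (`convert_sklearn`, `FloatTensorType`) and the `initial_types` shape convention are assumptions taken from the onnxmltools project documentation, not scikit-learn API, and may differ between releases; and, as noted in the thread, producing the file does not yet give you an open-source runtime for the ONNX-ML operators.

# Hedged sketch of sklearn -> ONNX conversion with onnxmltools.
# Import paths follow the onnxmltools examples at the time of writing and
# are assumptions here -- check that project's README for your version.
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression

from onnxmltools import convert_sklearn
from onnxmltools.convert.common.data_types import FloatTensorType

X, y = load_iris(return_X_y=True)
clf = LogisticRegression(solver="lbfgs").fit(X, y)

# Declare the input signature: one float tensor with 4 features per row
# (the exact shape convention may vary across onnxmltools versions).
onnx_model = convert_sklearn(clf, initial_types=[("input", FloatTensorType([1, 4]))])

with open("logreg_iris.onnx", "wb") as f:
    f.write(onnx_model.SerializeToString())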
> > As for PFA, to my current knowledge there is no library that does it > yet. Our own Aardpfark project > (https://github.com/CODAIT/aardpfark)?focuses on SparkML export to PFA > for now but would like to add sklearn support in the future. > > > On Wed, 3 Oct 2018 at 20:07 Sebastian Raschka > > wrote: > > The ONNX-approach sounds most promising, esp. because it will also > allow library interoperability but I wonder if this is for > parametric models only and not for the nonparametric ones like > KNN, tree-based classifiers, etc. > > All-in-all I can definitely see the appeal for having a way to > export sklearn estimators in a text-based format (e.g., via JSON), > since it would make sharing code easier. This doesn't even have to > be compatible with multiple sklearn versions. A typical use case > would be to include these JSON exports as e.g., supplemental files > of a research paper for other people to run the models etc. (here, > one can just specify which sklearn version it would require; of > course, one could also share pickle files, by I am personally > always hesitant reg. running/trusting other people's pickle files). > > Unfortunately though, as Gael pointed out, this "feature" would be > a huge burden for the devs, and it would probably also negatively > impact the development of scikit-learn itself because it imposes > another design constraint. > > However, I do think this sounds like an excellent case for a > contrib project. Like scikit-export, scikit-serialize or sth like > that. > > Best, > Sebastian > > > > > On Oct 3, 2018, at 5:49 AM, Javier L?pez wrote: > > > > > > On Tue, Oct 2, 2018 at 5:07 PM Gael Varoquaux > > wrote: > > The reason that pickles are brittle and that sharing pickles is > a bad > > practice is that pickle use an implicitly defined data model, > which is > > defined via the internals of objects. > > > > Plus the fact that loading a pickle can execute arbitrary code, > and there is no way to know > > if any malicious code is in there in advance because the > contents of the pickle cannot > > be easily inspected without loading/executing it. > > > > So, the problems of pickle are not specific to pickle, but rather > > intrinsic to any generic persistence code [*]. Writing > persistence code that > > does not fall in these problems is very costly in terms of > developer time > > and makes it harder to add new methods or improve existing one. > I am not > > excited about it. > > > > My "text-based serialization" suggestion was nowhere near as > ambitious as that, > > as I have already explained, and wasn't aiming at solving the > versioning issues, but > > rather at having something which is "about as good" as pickle > but in a human-readable > > format. I am not asking for a Turing-complete language to > reproduce the prediction > > function, but rather something simple in the spirit of the > output produced by the gist code I linked above, just for the > model families where it is reasonable: > > > > https://gist.github.com/jlopezpena/2cdd09c56afda5964990d5cf278bfd31 > > > > The code I posted mostly works (specific cases of nested models > need to be addressed > > separately, as well as pipelines), and we have been using (a > version of) it in production > > for quite some time. 
But there are hackish aspects to it that we > are not happy with, > > such as the manual separation of init and fitted parameters by > checking if the name ends with "_", having to infer class name and > location using > > "model.__class__.__name__" and "model.__module__", and the wacky > use of "__import__". > > > > My suggestion was more along the lines of adding some metadata > to sklearn estimators so > > that a code in a similar style would be nicer to write; little > things like having a `init_parameters` and `fit_parameters` > properties that would return the lists of named parameters, > > or a `model_info` method that would return data like sklearn > version, class name and location, or a package level dictionary > pointing at the estimator classes by a string name, like > > > > from sklearn.linear_models import LogisticRegression > > estimator_classes = {"LogisticRegression": LogisticRegression, ...} > > > > so that one can load the appropriate class from the string > description without calling __import__ or eval; that sort of stuff. > > > > I am aware this would not address the common complain of > "prefect prediction reproducibility" > > across versions, but I think we can all agree that this utopia > of perfect reproducibility is not > > feasible. > > > > And in the long, long run, I agree that PFA/onnx or whichever > similar format that emerges, is > > the way to go. > > > > J > > _______________________________________________ > > scikit-learn mailing list > > scikit-learn at python.org > > https://mail.python.org/mailman/listinfo/scikit-learn > > _______________________________________________ > scikit-learn mailing list > scikit-learn at python.org > https://mail.python.org/mailman/listinfo/scikit-learn > > > > _______________________________________________ > scikit-learn mailing list > scikit-learn at python.org > https://mail.python.org/mailman/listinfo/scikit-learn -------------- next part -------------- An HTML attachment was scrubbed... URL: From kevin at dataschool.io Fri Oct 5 12:00:20 2018 From: kevin at dataschool.io (Kevin Markham) Date: Fri, 5 Oct 2018 12:00:20 -0400 Subject: [scikit-learn] Micro average in classification report Message-ID: Hello all, Congratulations on the release of 0.20! My questions are about the updated classification_report: http://scikit-learn.org/stable/modules/generated/sklearn.metrics.classification_report.html Here is the simple example shown in the documentation (apologies for the formatting): >>> from sklearn.metrics import classification_report >>> y_true = [0, 1, 2, 2, 2] >>> y_pred = [0, 0, 2, 2, 1] >>> target_names = ['class 0', 'class 1', 'class 2'] >>> print(classification_report(y_true, y_pred, target_names=target_names)) precision recall f1-score support class 0 0.50 1.00 0.67 1 class 1 0.00 0.00 0.00 1 class 2 1.00 0.67 0.80 3 micro avg 0.60 0.60 0.60 5 macro avg 0.50 0.56 0.49 5 weighted avg 0.70 0.60 0.61 5 I understand how macro average and weighted average are calculated. My questions are in regard to micro average: 1. From this and other examples, it appears to me that "micro average" is identical to classification accuracy. Is that correct? 2. Is there a reason that micro average is listed three times (under the precision, recall, and f1-score columns)? From my understanding, that 0.60 number is being calculated once but is being displayed three times. 
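Question 1 can be checked numerically on the same toy example; the sketch below simply compares `accuracy_score` against the micro-averaged scores. For plain multiclass input with all labels included they coincide, although, as a later reply in this thread explains, the identity breaks once `labels` restricts the label set or the problem is multilabel.

# Quick numerical check that, for plain multiclass input with all labels
# included, micro-averaged precision/recall/F1 coincide with accuracy.
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score

y_true = [0, 1, 2, 2, 2]
y_pred = [0, 0, 2, 2, 1]

print(accuracy_score(y_true, y_pred))                    # 0.6
print(precision_score(y_true, y_pred, average="micro"))  # 0.6
print(recall_score(y_true, y_pred, average="micro"))     # 0.6
print(f1_score(y_true, y_pred, average="micro"))         # 0.6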
The display implies (at least in my mind) that 0.60 is being calculated from the three precision numbers, and separately calculated from the three recall numbers, and separately calculated from the three f1-score numbers, which seems misleading. 3. The documentation explains micro average as "averaging the total true positives, false negatives and false positives". If my understanding is correct that micro average is the same as accuracy, then why are true negatives any less relevant to the calculation? (Also, I don't mean to be picky, but "true positives" etc. are whole number counts rather than rates, and so it seems odd to say that you are arriving at a rate by averaging counts.) These may be dumb questions arising from my ignorance... my apologies if so! As well, I don't mean for my questions to criticize the excellent work that has been done by all of the scikit-learn contributors - I deeply appreciate your work! Rather, I'm planning to create a video series explaining some of the new features in 0.20, and I want to make sure that I'm accurately explaining these new features. Thanks very much! Kevin -- Kevin Markham Founder, Data School https://www.dataschool.io https://www.youtube.com/dataschool https://www.patreon.com/dataschool -------------- next part -------------- An HTML attachment was scrubbed... URL: From joel.nothman at gmail.com Sun Oct 7 20:25:24 2018 From: joel.nothman at gmail.com (Joel Nothman) Date: Mon, 8 Oct 2018 11:25:24 +1100 Subject: [scikit-learn] Micro average in classification report In-Reply-To: References: Message-ID: A lot of this is discussed in http://scikit-learn.org/dev/modules/model_evaluation.html If you passed only a limited set of labels in, micro average would not necessarily be identical across P/R/F. This allows for a "negative label", often an experimentally uninteresting majority class. Try classification_report(y_true, y_pred, target_names=target_names, labels=[1, 2]) If you passed in multilabel data, micro average would not necessarily be identical across P/R/F. Try classification_report(np.array([[1, 0], [0, 1]]), np.array([[1, 1], [0, 1]])). Perhaps for multiclass with labels=None, we could report this differently. -------------- next part -------------- An HTML attachment was scrubbed... URL: From heinrich.jiang at gmail.com Mon Oct 8 00:02:37 2018 From: heinrich.jiang at gmail.com (Heinrich Jiang) Date: Sun, 7 Oct 2018 21:02:37 -0700 Subject: [scikit-learn] Adding Quickshift clustering algorithm Message-ID: Hello, I'm a researcher at Google Research and I am writing to initiate discussion about adding Quickshift as well as a variant of it as part of scikit-learn's set of clustering algorithms. This somewhat recent algorithm was designed as a faster alternative to Mean Shift and has been used extensively in computer vision (and already part of scikit-image). The method was published independently in these papers [1,2]. [1] has 600 citations and [2] has 1300 citations. [1] Vedaldi, Andrea, and Stefano Soatto. "Quick shift and kernel methods for mode seeking." *European Conference on Computer Vision*. Springer, Berlin, Heidelberg, 2008. [2] Rodriguez, Alex, and Alessandro Laio. "Clustering by fast search and find of density peaks." *Science* 344.6191 (2014): 1492-1496. In addition to Quickshift, I also propose a variant called Quickshift++, which is Quickshift with an additional hyperparameter. 
We showed in [3] that this substantially improved performance over Quickshift as well as other clustering algorithms implemented in sklearn on benchmark datasets. (i.e. Figure 9 in https://arxiv.org/abs/1805.07909) and was published at ICML 2018. [3] Jiang, Heinrich, Jennifer Jang, and Samory Kpotufe. "Quickshift++: Provably Good Initializations for Sample-Based Mean Shift." ICML 2018 We have an implementation here (https://github.com/google/quickshift). Best, Heinrich -------------- next part -------------- An HTML attachment was scrubbed... URL: From joel.nothman at gmail.com Mon Oct 8 03:06:59 2018 From: joel.nothman at gmail.com (Joel Nothman) Date: Mon, 8 Oct 2018 18:06:59 +1100 Subject: [scikit-learn] scikit-learn Digest, Vol 30, Issue 25 In-Reply-To: <939d940d-ebf1-1a66-d993-301df5a65262@gmail.com> References: <939d940d-ebf1-1a66-d993-301df5a65262@gmail.com> Message-ID: Just a note that multiple layers of stacking can be achieved with StackingClassifier using nesting. -------------- next part -------------- An HTML attachment was scrubbed... URL: From mcasl at unileon.es Mon Oct 8 08:24:36 2018 From: mcasl at unileon.es (=?UTF-8?Q?Manuel_CASTEJ=C3=93N_LIMAS?=) Date: Mon, 8 Oct 2018 14:24:36 +0200 Subject: [scikit-learn] scikit-learn Digest, Vol 30, Issue 25 In-Reply-To: References: <939d940d-ebf1-1a66-d993-301df5a65262@gmail.com> Message-ID: Good to know! El lun., 8 oct. 2018 9:08, Joel Nothman escribi?: > Just a note that multiple layers of stacking can be achieved with > StackingClassifier using nesting. > _______________________________________________ > scikit-learn mailing list > scikit-learn at python.org > https://mail.python.org/mailman/listinfo/scikit-learn > -------------- next part -------------- An HTML attachment was scrubbed... URL: From pranavashok at gmail.com Mon Oct 8 14:51:21 2018 From: pranavashok at gmail.com (Pranav Ashok) Date: Mon, 8 Oct 2018 11:51:21 -0700 Subject: [scikit-learn] Understanding sklearn.tree._tree.value object Message-ID: I have a multi-class multi-label decision tree learnt using DecisionTreeClassifier class. The input looks like follows: X = [[2, 51], [3, 20], [5, 30], [7, 1], [20, 46], [25, 25], [45, 70]] Y = [[1,2,3],[1,2,3],[1,2,3],[1,2],[1,2],[1],[1]] I have used MultiLabelBinarizer to convert Y into [[1 1 1] [1 1 1] [1 1 1] [1 1 0] [1 1 0] [1 0 0] [1 0 0]] After training, the _tree.values looks like follows: array([[[7., 0.], [2., 5.], [4., 3.]], [[3., 0.], [0., 3.], [0., 3.]], [[4., 0.], [2., 2.], [4., 0.]], [[2., 0.], [0., 2.], [2., 0.]], [[2., 0.], [2., 0.], [2., 0.]]]) I had the impression that the value array contains for each node, a list of lists [[n_1, y_1], [n_2, y_2], [n_3, y_3]] such that n_i are the number of samples disagreeing with class i and y_i are the number of samples agreeing with class i. But after seeing this output, it does not make sense. For example, the root node has the value [[7,0],[2,5],[4,3]]. According to my interpretation, this would mean 7 samples disagree with class 1; 2 disagree with class 2 and 5 agree with class 2; 4 disagree with class 3 and 3 agree with class 3. which, according to the input dataset is wrong. Could someone please help me understand the semantics of _tree.value for multi-label DTs? -------------- next part -------------- An HTML attachment was scrubbed... 
URL: From adrin.jalali at gmail.com Mon Oct 8 17:31:43 2018 From: adrin.jalali at gmail.com (Adrin) Date: Mon, 8 Oct 2018 23:31:43 +0200 Subject: [scikit-learn] Understanding sklearn.tree._tree.value object In-Reply-To: References: Message-ID: Hi Pranav, The reason you're getting that output is that your first column has a single value (1), and that becomes your "first" class, hence your first value in the rows you're interpreting. To understand it better, you can try to check this code: >>> from sklearn.preprocessing import MultiLabelBinarizer >>> from sklearn.tree import DecisionTreeClassifier >>> >>> X = [[2, 51], [3, 20], [5, 30], [7, 1], [20, 46], [25, 25], [45, 70]] >>> Y = [[2,3],[1,2,3],[1,2,3],[1,2],[1,2],[1],[1]] >>> >>> y = MultiLabelBinarizer().fit_transform(Y) + 40 >>> y[0, 1] = 0 >>> >>> clf = DecisionTreeClassifier().fit(X, y) >>> print(clf.tree_.value) [[[1. 6. 0.] [1. 2. 4.] [4. 3. 0.]] [[1. 2. 0.] [1. 0. 2.] [0. 3. 0.]] [[0. 2. 0.] [0. 0. 2.] [0. 2. 0.]] [[1. 0. 0.] [1. 0. 0.] [0. 1. 0.]] [[0. 4. 0.] [0. 2. 2.] [4. 0. 0.]] [[0. 2. 0.] [0. 0. 2.] [2. 0. 0.]] [[0. 2. 0.] [0. 2. 0.] [2. 0. 0.]]] On Mon, 8 Oct 2018 at 20:53 Pranav Ashok wrote: > I have a multi-class multi-label decision tree learnt using > DecisionTreeClassifier class. The input looks like follows: > > X = [[2, 51], [3, 20], [5, 30], [7, 1], [20, 46], [25, 25], [45, 70]] > Y = [[1,2,3],[1,2,3],[1,2,3],[1,2],[1,2],[1],[1]] > > I have used MultiLabelBinarizer to convert Y into > > [[1 1 1] > [1 1 1] > [1 1 1] > [1 1 0] > [1 1 0] > [1 0 0] > [1 0 0]] > > > After training, the _tree.values looks like follows: > > array([[[7., 0.], > [2., 5.], > [4., 3.]], > > [[3., 0.], > [0., 3.], > [0., 3.]], > > [[4., 0.], > [2., 2.], > [4., 0.]], > > [[2., 0.], > [0., 2.], > [2., 0.]], > > [[2., 0.], > [2., 0.], > [2., 0.]]]) > > I had the impression that the value array contains for each node, a list of lists [[n_1, y_1], [n_2, y_2], [n_3, y_3]] > such that n_i are the number of samples disagreeing with class i and y_i are the number of samples agreeing with > class i. But after seeing this output, it does not make sense. > > For example, the root node has the value [[7,0],[2,5],[4,3]]. According to my interpretation, this would mean > 7 samples disagree with class 1; 2 disagree with class 2 and 5 agree with class 2; 4 disagree with class 3 and 3 agree with class 3. > > which, according to the input dataset is wrong. > > Could someone please help me understand the semantics of _tree.value for multi-label DTs? > > _______________________________________________ > scikit-learn mailing list > scikit-learn at python.org > https://mail.python.org/mailman/listinfo/scikit-learn > -------------- next part -------------- An HTML attachment was scrubbed... URL: From pranavashok at gmail.com Mon Oct 8 23:26:08 2018 From: pranavashok at gmail.com (Pranav Ashok) Date: Mon, 8 Oct 2018 20:26:08 -0700 Subject: [scikit-learn] Understanding sklearn.tree._tree.value object In-Reply-To: References: Message-ID: Hi Adrin, Thanks for the clarification. Is there a right way of letting DecisionTreeClassifier know that the first column can take both 0 or 1, but in the current dataset we are only using 0? For example, we can let MultiLabelBinarizer know that we have three classes by instantiating it like this: MultiLabelBinarizer([1,2,3]). I tried class_weight=[{0: 1, 1: 1}, {0: 1, 1: 1}, {0: 1, 1: 1}] but that doesn't work. 
Thanks, Pranav On Mon, Oct 8, 2018 at 2:32 PM Adrin wrote: > Hi Pranav, > > The reason you're getting that output is that your first column has a > single value (1), and that becomes your "first" class, hence your first > value in the rows you're interpreting. > > To understand it better, you can try to check this code: > > >>> from sklearn.preprocessing import MultiLabelBinarizer > >>> from sklearn.tree import DecisionTreeClassifier > >>> > >>> X = [[2, 51], [3, 20], [5, 30], [7, 1], [20, 46], [25, 25], [45, 70]] > >>> Y = [[2,3],[1,2,3],[1,2,3],[1,2],[1,2],[1],[1]] > >>> > >>> y = MultiLabelBinarizer().fit_transform(Y) + 40 > >>> y[0, 1] = 0 > >>> > >>> clf = DecisionTreeClassifier().fit(X, y) > >>> print(clf.tree_.value) > [[[1. 6. 0.] > [1. 2. 4.] > [4. 3. 0.]] > > [[1. 2. 0.] > [1. 0. 2.] > [0. 3. 0.]] > > [[0. 2. 0.] > [0. 0. 2.] > [0. 2. 0.]] > > [[1. 0. 0.] > [1. 0. 0.] > [0. 1. 0.]] > > [[0. 4. 0.] > [0. 2. 2.] > [4. 0. 0.]] > > [[0. 2. 0.] > [0. 0. 2.] > [2. 0. 0.]] > > [[0. 2. 0.] > [0. 2. 0.] > [2. 0. 0.]]] > > > On Mon, 8 Oct 2018 at 20:53 Pranav Ashok wrote: > >> I have a multi-class multi-label decision tree learnt using >> DecisionTreeClassifier class. The input looks like follows: >> >> X = [[2, 51], [3, 20], [5, 30], [7, 1], [20, 46], [25, 25], [45, 70]] >> Y = [[1,2,3],[1,2,3],[1,2,3],[1,2],[1,2],[1],[1]] >> >> I have used MultiLabelBinarizer to convert Y into >> >> [[1 1 1] >> [1 1 1] >> [1 1 1] >> [1 1 0] >> [1 1 0] >> [1 0 0] >> [1 0 0]] >> >> >> After training, the _tree.values looks like follows: >> >> array([[[7., 0.], >> [2., 5.], >> [4., 3.]], >> >> [[3., 0.], >> [0., 3.], >> [0., 3.]], >> >> [[4., 0.], >> [2., 2.], >> [4., 0.]], >> >> [[2., 0.], >> [0., 2.], >> [2., 0.]], >> >> [[2., 0.], >> [2., 0.], >> [2., 0.]]]) >> >> I had the impression that the value array contains for each node, a list of lists [[n_1, y_1], [n_2, y_2], [n_3, y_3]] >> such that n_i are the number of samples disagreeing with class i and y_i are the number of samples agreeing with >> class i. But after seeing this output, it does not make sense. >> >> For example, the root node has the value [[7,0],[2,5],[4,3]]. According to my interpretation, this would mean >> 7 samples disagree with class 1; 2 disagree with class 2 and 5 agree with class 2; 4 disagree with class 3 and 3 agree with class 3. >> >> which, according to the input dataset is wrong. >> >> Could someone please help me understand the semantics of _tree.value for multi-label DTs? >> >> _______________________________________________ >> scikit-learn mailing list >> scikit-learn at python.org >> https://mail.python.org/mailman/listinfo/scikit-learn >> > _______________________________________________ > scikit-learn mailing list > scikit-learn at python.org > https://mail.python.org/mailman/listinfo/scikit-learn > -------------- next part -------------- An HTML attachment was scrubbed... URL: From adrin.jalali at gmail.com Tue Oct 9 03:43:52 2018 From: adrin.jalali at gmail.com (Adrin) Date: Tue, 9 Oct 2018 09:43:52 +0200 Subject: [scikit-learn] Understanding sklearn.tree._tree.value object In-Reply-To: References: Message-ID: I'm not sure if that would make sense. If during the training, you tell the model there's only one class for a column then the model only knows that. In your case, if all samples belong to class 1 in the training data, then as far as the model is concerned, all samples belong to class 1. 
If you want to interpret the results, you can combine the infor you get from `clf.tree_.value` with `clf.classes_`, and then you should be fine. On Tue, 9 Oct 2018 at 05:27 Pranav Ashok wrote: > Hi Adrin, > > Thanks for the clarification. Is there a right way of letting > DecisionTreeClassifier know that the first column can take both 0 or 1, but > in the current dataset we are only using 0? > > For example, we can let MultiLabelBinarizer know that we have three > classes by instantiating it like this: MultiLabelBinarizer([1,2,3]). > > I tried class_weight=[{0: 1, 1: 1}, {0: 1, 1: 1}, {0: 1, 1: 1}] but that > doesn't work. > > Thanks, > Pranav > > On Mon, Oct 8, 2018 at 2:32 PM Adrin wrote: > >> Hi Pranav, >> >> The reason you're getting that output is that your first column has a >> single value (1), and that becomes your "first" class, hence your first >> value in the rows you're interpreting. >> >> To understand it better, you can try to check this code: >> >> >>> from sklearn.preprocessing import MultiLabelBinarizer >> >>> from sklearn.tree import DecisionTreeClassifier >> >>> >> >>> X = [[2, 51], [3, 20], [5, 30], [7, 1], [20, 46], [25, 25], [45, 70]] >> >>> Y = [[2,3],[1,2,3],[1,2,3],[1,2],[1,2],[1],[1]] >> >>> >> >>> y = MultiLabelBinarizer().fit_transform(Y) + 40 >> >>> y[0, 1] = 0 >> >>> >> >>> clf = DecisionTreeClassifier().fit(X, y) >> >>> print(clf.tree_.value) >> [[[1. 6. 0.] >> [1. 2. 4.] >> [4. 3. 0.]] >> >> [[1. 2. 0.] >> [1. 0. 2.] >> [0. 3. 0.]] >> >> [[0. 2. 0.] >> [0. 0. 2.] >> [0. 2. 0.]] >> >> [[1. 0. 0.] >> [1. 0. 0.] >> [0. 1. 0.]] >> >> [[0. 4. 0.] >> [0. 2. 2.] >> [4. 0. 0.]] >> >> [[0. 2. 0.] >> [0. 0. 2.] >> [2. 0. 0.]] >> >> [[0. 2. 0.] >> [0. 2. 0.] >> [2. 0. 0.]]] >> >> >> On Mon, 8 Oct 2018 at 20:53 Pranav Ashok wrote: >> >>> I have a multi-class multi-label decision tree learnt using >>> DecisionTreeClassifier class. The input looks like follows: >>> >>> X = [[2, 51], [3, 20], [5, 30], [7, 1], [20, 46], [25, 25], [45, 70]] >>> Y = [[1,2,3],[1,2,3],[1,2,3],[1,2],[1,2],[1],[1]] >>> >>> I have used MultiLabelBinarizer to convert Y into >>> >>> [[1 1 1] >>> [1 1 1] >>> [1 1 1] >>> [1 1 0] >>> [1 1 0] >>> [1 0 0] >>> [1 0 0]] >>> >>> >>> After training, the _tree.values looks like follows: >>> >>> array([[[7., 0.], >>> [2., 5.], >>> [4., 3.]], >>> >>> [[3., 0.], >>> [0., 3.], >>> [0., 3.]], >>> >>> [[4., 0.], >>> [2., 2.], >>> [4., 0.]], >>> >>> [[2., 0.], >>> [0., 2.], >>> [2., 0.]], >>> >>> [[2., 0.], >>> [2., 0.], >>> [2., 0.]]]) >>> >>> I had the impression that the value array contains for each node, a list of lists [[n_1, y_1], [n_2, y_2], [n_3, y_3]] >>> such that n_i are the number of samples disagreeing with class i and y_i are the number of samples agreeing with >>> class i. But after seeing this output, it does not make sense. >>> >>> For example, the root node has the value [[7,0],[2,5],[4,3]]. According to my interpretation, this would mean >>> 7 samples disagree with class 1; 2 disagree with class 2 and 5 agree with class 2; 4 disagree with class 3 and 3 agree with class 3. >>> >>> which, according to the input dataset is wrong. >>> >>> Could someone please help me understand the semantics of _tree.value for multi-label DTs? 
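A small sketch of the interpretation suggested above: pair each row of `clf.tree_.value` with the corresponding entry of `clf.classes_`, so every count is attached to the class value it belongs to. On the data from the original question this explains the root value [[7, 0], [2, 5], [4, 3]]; the printed mapping is illustrative only, not an official API.

# Sketch: read the root-node counts of a multi-output tree together with
# clf.classes_, so each count is paired with the class value it refers to.
from sklearn.preprocessing import MultiLabelBinarizer
from sklearn.tree import DecisionTreeClassifier

X = [[2, 51], [3, 20], [5, 30], [7, 1], [20, 46], [25, 25], [45, 70]]
Y = [[1, 2, 3], [1, 2, 3], [1, 2, 3], [1, 2], [1, 2], [1], [1]]

y = MultiLabelBinarizer().fit_transform(Y)
clf = DecisionTreeClassifier().fit(X, y)

root_counts = clf.tree_.value[0]   # shape: (n_outputs, max_n_classes)
for output, (classes, counts) in enumerate(zip(clf.classes_, root_counts)):
    # zip() drops the padding columns for outputs that have fewer classes;
    # the first label column has only the class 1, so its row is {1: 7.0}.
    print("label column", output, ":", dict(zip(classes, counts)))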
>>> >>> _______________________________________________ >>> scikit-learn mailing list >>> scikit-learn at python.org >>> https://mail.python.org/mailman/listinfo/scikit-learn >>> >> _______________________________________________ >> scikit-learn mailing list >> scikit-learn at python.org >> https://mail.python.org/mailman/listinfo/scikit-learn >> > _______________________________________________ > scikit-learn mailing list > scikit-learn at python.org > https://mail.python.org/mailman/listinfo/scikit-learn > -------------- next part -------------- An HTML attachment was scrubbed... URL: From Guillaume.Favelier at lip6.fr Tue Oct 9 05:17:30 2018 From: Guillaume.Favelier at lip6.fr (Guillaume Favelier) Date: Tue, 09 Oct 2018 11:17:30 +0200 Subject: [scikit-learn] Dimension Reduction - MDS Message-ID: <20181009111730.Horde.CXjVqVQCmKdjeCLGk28cK35@webmail.lip6.fr> Hi everyone, I'm trying to use some dimension reduction algorithm [1] on my dataset [2] in a python script [3] but for some reason, Python seems to consume a lot of my main memory and even swap on my configuration [4] so I don't have the expected result but a memory error instead. I have the impression that this behaviour is not intended so can you help me know what I did wrong or miss somewhere please? [1]: MDS - http://scikit-learn.org/stable/modules/generated/sklearn.manifold.MDS.html [2]: dragon.csv - 69827 rows, 3 columns (x,y,z) [3]: dragon.py - 10 lines [4]: dragon_swap.png - htop on my workstation TAR archive: https://drive.google.com/open?id=1d1S99XeI7wNEq131wkBUCBrctPQRgpxn Best regards, Guillaume Favelier From jbbrown at kuhp.kyoto-u.ac.jp Tue Oct 9 06:51:59 2018 From: jbbrown at kuhp.kyoto-u.ac.jp (Brown J.B.) Date: Tue, 9 Oct 2018 19:51:59 +0900 Subject: [scikit-learn] Dimension Reduction - MDS In-Reply-To: <20181009111730.Horde.CXjVqVQCmKdjeCLGk28cK35@webmail.lip6.fr> References: <20181009111730.Horde.CXjVqVQCmKdjeCLGk28cK35@webmail.lip6.fr> Message-ID: Hello Guillaume, You are computing a distance matrix of shape 70000x70000 to generate MDS coordinates. That is 49,000,000 entries, plus overhead for a data structure. If you try with a very small (e.g., 100 sample) data file, does your code employing MDS work? As you increase the number of samples, does the script continue to work? Hope this helps you get started. J.B. 2018?10?9?(?) 18:22 Guillaume Favelier : > Hi everyone, > > I'm trying to use some dimension reduction algorithm [1] on my dataset > [2] in a > python script [3] but for some reason, Python seems to consume a lot of my > main memory and even swap on my configuration [4] so I don't have the > expected result > but a memory error instead. > > I have the impression that this behaviour is not intended so can you > help me know > what I did wrong or miss somewhere please? > > [1]: MDS - > http://scikit-learn.org/stable/modules/generated/sklearn.manifold.MDS.html > [2]: dragon.csv - 69827 rows, 3 columns (x,y,z) > [3]: dragon.py - 10 lines > [4]: dragon_swap.png - htop on my workstation > > TAR archive: > https://drive.google.com/open?id=1d1S99XeI7wNEq131wkBUCBrctPQRgpxn > > Best regards, > > Guillaume Favelier > > _______________________________________________ > scikit-learn mailing list > scikit-learn at python.org > https://mail.python.org/mailman/listinfo/scikit-learn > -------------- next part -------------- An HTML attachment was scrubbed... 
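A quick back-of-the-envelope check on the size of that dissimilarity matrix: for roughly 70000 samples it holds about 4.9 billion entries (rather than 49 million), which is why the dense float64 matrix alone already strains a 64 GB machine.

# Back-of-the-envelope memory estimate for a dense float64 distance matrix.
n_samples = 69827
entries = n_samples ** 2            # ~4.9e9 entries, not 49 million
bytes_per_float64 = 8
gib = entries * bytes_per_float64 / 2 ** 30
print(entries, "entries -> about", round(gib), "GiB per copy")
# ~36 GiB for a single copy; the SMACOF-style MDS solver keeps several
# working arrays of roughly this size, so 64 GB of RAM is exhausted and
# the machine starts swapping.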
URL: From t3kcit at gmail.com Tue Oct 9 11:42:19 2018 From: t3kcit at gmail.com (Andreas Mueller) Date: Tue, 9 Oct 2018 11:42:19 -0400 Subject: [scikit-learn] Micro average in classification report In-Reply-To: References: Message-ID: <8fed3cda-d6a9-7957-9d2f-d63d9ab89916@gmail.com> On 10/05/2018 12:00 PM, Kevin Markham wrote: > Hello all, > > Congratulations on the release of 0.20! My questions are about the > updated classification_report: > http://scikit-learn.org/stable/modules/generated/sklearn.metrics.classification_report.html > > Here is the simple example shown in the documentation (apologies for > the formatting): > > >>> from sklearn.metrics import classification_report > >>> y_true = [0, 1, 2, 2, 2] > >>> y_pred = [0, 0, 2, 2, 1] > >>> target_names = ['class 0', 'class 1', 'class 2'] > >>> print(classification_report(y_true, y_pred, > target_names=target_names)) > ? ? ? ? ? ? ? precision? ? recall? f1-score ?support > > ? ? ?class 0? ? ? ?0.50? ? ? 1.00? ? ? 0.67 ?1 > ? ? ?class 1? ? ? ?0.00? ? ? 0.00? ? ? 0.00 ?1 > ? ? ?class 2? ? ? ?1.00? ? ? 0.67? ? ? 0.80 ?3 > > ? ?micro avg? ? ? ?0.60? ? ? 0.60? ? ? 0.60 ?5 > ? ?macro avg? ? ? ?0.50? ? ? 0.56? ? ? 0.49 ?5 > weighted avg? ? ? ?0.70? ? ? 0.60? ? ? 0.61 ?5 > > I understand how macro average and weighted average are calculated. My > questions are in regard to micro average: > > 1. From this and other examples, it appears to me that "micro average" > is identical to classification accuracy. Is that correct? > > 2. Is there a reason that micro average is listed three times (under > the precision, recall, and f1-score columns)? From my understanding, > that 0.60 number is being calculated once but is being displayed three > times. The display implies (at least in my mind) that 0.60 is being > calculated from the three precision numbers, and separately calculated > from the three recall numbers, and separately calculated from the > three f1-score numbers, which seems misleading. > > 3. The documentation explains micro average as "averaging the total > true positives, false negatives and false positives". If my > understanding is correct that micro average is the same as accuracy, > then why are true negatives any less relevant to the calculation? > (Also, I don't mean to be picky, but "true positives" etc. are whole > number counts rather than rates, and so it seems odd to say that you > are arriving at a rate by averaging counts.) > > These may be dumb questions arising from my ignorance... my apologies > if so! I had exactly the same comments and I find the current behavior confusing, see https://github.com/scikit-learn/scikit-learn/issues/12334 PR welcome! From Guillaume.Favelier at lip6.fr Thu Oct 11 04:28:46 2018 From: Guillaume.Favelier at lip6.fr (Guillaume Favelier) Date: Thu, 11 Oct 2018 10:28:46 +0200 Subject: [scikit-learn] Dimension Reduction - MDS In-Reply-To: References: <20181009111730.Horde.CXjVqVQCmKdjeCLGk28cK35@webmail.lip6.fr> Message-ID: <20181011102846.Horde.mTBtx9F_wzbHv2HetzjbOH5@webmail.lip6.fr> Hello J.B, Thank you for your quick reply. > If you try with a very small (e.g., 100 sample) data file, does your code > employing MDS work? > As you increase the number of samples, does the script continue to work? So I tried the same script while increasing the number of samples (100, 1000 and 10000) and it works indeed without swapping on my workstation. > That is 49,000,000 entries, plus overhead for a data structure. I thought that even 49M entries of doubles would be able to be processed with 64G of RAM. 
Is there something to configure to allow this computation? The typical datasets I use can have around 200-300k rows with a few columns (usually up to 3). Best regards, Guillaume Quoting "Brown J.B. via scikit-learn" : > Hello Guillaume, > > You are computing a distance matrix of shape 70000x70000 to generate MDS > coordinates. > That is 49,000,000 entries, plus overhead for a data structure. > > If you try with a very small (e.g., 100 sample) data file, does your code > employing MDS work? > As you increase the number of samples, does the script continue to work? > > Hope this helps you get started. > J.B. > > 2018?10?9?(?) 18:22 Guillaume Favelier : > >> Hi everyone, >> >> I'm trying to use some dimension reduction algorithm [1] on my dataset >> [2] in a >> python script [3] but for some reason, Python seems to consume a lot of my >> main memory and even swap on my configuration [4] so I don't have the >> expected result >> but a memory error instead. >> >> I have the impression that this behaviour is not intended so can you >> help me know >> what I did wrong or miss somewhere please? >> >> [1]: MDS - >> http://scikit-learn.org/stable/modules/generated/sklearn.manifold.MDS.html >> [2]: dragon.csv - 69827 rows, 3 columns (x,y,z) >> [3]: dragon.py - 10 lines >> [4]: dragon_swap.png - htop on my workstation >> >> TAR archive: >> https://drive.google.com/open?id=1d1S99XeI7wNEq131wkBUCBrctPQRgpxn >> >> Best regards, >> >> Guillaume Favelier >> >> _______________________________________________ >> scikit-learn mailing list >> scikit-learn at python.org >> https://mail.python.org/mailman/listinfo/scikit-learn >> From alexandre.gramfort at inria.fr Thu Oct 11 07:12:31 2018 From: alexandre.gramfort at inria.fr (Alexandre Gramfort) Date: Thu, 11 Oct 2018 13:12:31 +0200 Subject: [scikit-learn] Dimension Reduction - MDS In-Reply-To: <20181011102846.Horde.mTBtx9F_wzbHv2HetzjbOH5@webmail.lip6.fr> References: <20181009111730.Horde.CXjVqVQCmKdjeCLGk28cK35@webmail.lip6.fr> <20181011102846.Horde.mTBtx9F_wzbHv2HetzjbOH5@webmail.lip6.fr> Message-ID: hi Guillaume, I cannot use our MDS solver at this scale. Even if you fit it in RAM it will be slow. I would play with https://github.com/lmcinnes/umap unless you really what a classic MDS. Alex On Thu, Oct 11, 2018 at 10:31 AM Guillaume Favelier wrote: > > Hello J.B, > > Thank you for your quick reply. > > > If you try with a very small (e.g., 100 sample) data file, does your code > > employing MDS work? > > As you increase the number of samples, does the script continue to work? > So I tried the same script while increasing the number of samples (100, > 1000 and 10000) and it works indeed without swapping on my workstation. > > > That is 49,000,000 entries, plus overhead for a data structure. > I thought that even 49M entries of doubles would be able to be processed > with 64G of RAM. Is there something to configure to allow this computation? > > The typical datasets I use can have around 200-300k rows with a few columns > (usually up to 3). > > Best regards, > > Guillaume > > Quoting "Brown J.B. via scikit-learn" : > > > Hello Guillaume, > > > > You are computing a distance matrix of shape 70000x70000 to generate MDS > > coordinates. > > That is 49,000,000 entries, plus overhead for a data structure. > > > > If you try with a very small (e.g., 100 sample) data file, does your code > > employing MDS work? > > As you increase the number of samples, does the script continue to work? > > > > Hope this helps you get started. > > J.B. > > > > 2018?10?9?(?) 
18:22 Guillaume Favelier : > > > >> Hi everyone, > >> > >> I'm trying to use some dimension reduction algorithm [1] on my dataset > >> [2] in a > >> python script [3] but for some reason, Python seems to consume a lot of my > >> main memory and even swap on my configuration [4] so I don't have the > >> expected result > >> but a memory error instead. > >> > >> I have the impression that this behaviour is not intended so can you > >> help me know > >> what I did wrong or miss somewhere please? > >> > >> [1]: MDS - > >> http://scikit-learn.org/stable/modules/generated/sklearn.manifold.MDS.html > >> [2]: dragon.csv - 69827 rows, 3 columns (x,y,z) > >> [3]: dragon.py - 10 lines > >> [4]: dragon_swap.png - htop on my workstation > >> > >> TAR archive: > >> https://drive.google.com/open?id=1d1S99XeI7wNEq131wkBUCBrctPQRgpxn > >> > >> Best regards, > >> > >> Guillaume Favelier > >> > >> _______________________________________________ > >> scikit-learn mailing list > >> scikit-learn at python.org > >> https://mail.python.org/mailman/listinfo/scikit-learn > >> > > > > _______________________________________________ > scikit-learn mailing list > scikit-learn at python.org > https://mail.python.org/mailman/listinfo/scikit-learn From jbbrown at kuhp.kyoto-u.ac.jp Thu Oct 11 10:30:37 2018 From: jbbrown at kuhp.kyoto-u.ac.jp (Brown J.B.) Date: Thu, 11 Oct 2018 23:30:37 +0900 Subject: [scikit-learn] Dimension Reduction - MDS In-Reply-To: References: <20181009111730.Horde.CXjVqVQCmKdjeCLGk28cK35@webmail.lip6.fr> <20181011102846.Horde.mTBtx9F_wzbHv2HetzjbOH5@webmail.lip6.fr> Message-ID: Hi Guillaume, The good news is that your script works as-is on smaller datasets, and hopefully does the logic for your task correctly. In addition to Alex's comment about data size and MDS tractability, I would also point out a philosophical issue -- why consider MDS for such a large dataset? At least in two dimensions, once MDS gets beyond 1000 samples or so, the resulting sample coordinates and its visualization are potentially highly dispersed (e.g., like a 2D-uniform distribution) and may not lead to interpretability. One can move to three-dimensional MDS, but perhaps even then a few thousand samples gets to the limit of graphical interpretability. It very obviously depends on the relationships in your data. Also, as you continue your work, keep in mind that the per-sample dimensionality (number of entries in a single sample's descriptor vector) will not be the primary determinant of the memory consumption requirements for the MDS algorithm, because in any case you must compute (either inline or pre-compute) the distance matrix between each pair of samples, and that matrix stays in memory during coordinate generation (as far as I know). So, 10 chemical descriptors (since I noticed you mentioning Dragon) or 1000 descriptors will still result in the same memory requirement for the distance matrix, and then scaling to hundreds of thousands of samples will eat all of the compute node's RAM. Since you have 200k samples, you could potentially do some type of repeated partial clustering (e.g., on random subsamples of data) to find a reasonable number of clusters per repetition, analyze those results to make an estimate of a number of clusters for a global clustering, and then select a limited number of samples per cluster to use for projection to a coordinate space by MDS. 
Or a diversity selection (either by vector distance or in your case, differing compound scaffolds) may be a way to get a quick subset and visualize distance relationships. Hope this helps. Sincerely, J.B. Brown 2018?10?11?(?) 20:14 Alexandre Gramfort : > hi Guillaume, > > I cannot use our MDS solver at this scale. Even if you fit it in RAM > it will be slow. > > I would play with https://github.com/lmcinnes/umap unless you really > what a classic MDS. > > Alex > > On Thu, Oct 11, 2018 at 10:31 AM Guillaume Favelier > wrote: > > > > Hello J.B, > > > > Thank you for your quick reply. > > > > > If you try with a very small (e.g., 100 sample) data file, does your > code > > > employing MDS work? > > > As you increase the number of samples, does the script continue to > work? > > So I tried the same script while increasing the number of samples (100, > > 1000 and 10000) and it works indeed without swapping on my workstation. > > > > > That is 49,000,000 entries, plus overhead for a data structure. > > I thought that even 49M entries of doubles would be able to be processed > > with 64G of RAM. Is there something to configure to allow this > computation? > > > > The typical datasets I use can have around 200-300k rows with a few > columns > > (usually up to 3). > > > > Best regards, > > > > Guillaume > > > > Quoting "Brown J.B. via scikit-learn" : > > > > > Hello Guillaume, > > > > > > You are computing a distance matrix of shape 70000x70000 to generate > MDS > > > coordinates. > > > That is 49,000,000 entries, plus overhead for a data structure. > > > > > > If you try with a very small (e.g., 100 sample) data file, does your > code > > > employing MDS work? > > > As you increase the number of samples, does the script continue to > work? > > > > > > Hope this helps you get started. > > > J.B. > > > > > > 2018?10?9?(?) 18:22 Guillaume Favelier : > > > > > >> Hi everyone, > > >> > > >> I'm trying to use some dimension reduction algorithm [1] on my dataset > > >> [2] in a > > >> python script [3] but for some reason, Python seems to consume a lot > of my > > >> main memory and even swap on my configuration [4] so I don't have the > > >> expected result > > >> but a memory error instead. > > >> > > >> I have the impression that this behaviour is not intended so can you > > >> help me know > > >> what I did wrong or miss somewhere please? > > >> > > >> [1]: MDS - > > >> > http://scikit-learn.org/stable/modules/generated/sklearn.manifold.MDS.html > > >> [2]: dragon.csv - 69827 rows, 3 columns (x,y,z) > > >> [3]: dragon.py - 10 lines > > >> [4]: dragon_swap.png - htop on my workstation > > >> > > >> TAR archive: > > >> https://drive.google.com/open?id=1d1S99XeI7wNEq131wkBUCBrctPQRgpxn > > >> > > >> Best regards, > > >> > > >> Guillaume Favelier > > >> > > >> _______________________________________________ > > >> scikit-learn mailing list > > >> scikit-learn at python.org > > >> https://mail.python.org/mailman/listinfo/scikit-learn > > >> > > > > > > > > _______________________________________________ > > scikit-learn mailing list > > scikit-learn at python.org > > https://mail.python.org/mailman/listinfo/scikit-learn > _______________________________________________ > scikit-learn mailing list > scikit-learn at python.org > https://mail.python.org/mailman/listinfo/scikit-learn > -------------- next part -------------- An HTML attachment was scrubbed... 
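A minimal sketch of the subsampling route suggested above: embed a random subset with classic MDS on a precomputed distance matrix. The subset size, the random seed, and the random stand-in for dragon.csv are arbitrary choices for illustration; UMAP, mentioned earlier in the thread, is an alternative if the full dataset is needed.

# Sketch: classic MDS on a random subset with a precomputed distance matrix.
import numpy as np
from sklearn.manifold import MDS
from sklearn.metrics import pairwise_distances

rng = np.random.RandomState(0)
X = rng.rand(69827, 3)              # stand-in for the x,y,z columns of dragon.csv

subset = X[rng.choice(len(X), size=1000, replace=False)]
D = pairwise_distances(subset)      # 1000 x 1000 float64, ~8 MB

embedding = MDS(n_components=2, dissimilarity="precomputed", random_state=0)
coords = embedding.fit_transform(D)
print(coords.shape)                 # (1000, 2)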
URL: From g.lemaitre58 at gmail.com Fri Oct 12 12:44:37 2018 From: g.lemaitre58 at gmail.com (=?UTF-8?Q?Guillaume_Lema=C3=AEtre?=) Date: Fri, 12 Oct 2018 18:44:37 +0200 Subject: [scikit-learn] Release of imbalanced-learn 0.4 Message-ID: Hi folks, The new release of imbalanced-learn is available on PyPI and conda-forge. You can find all the new features and changes at: http://imbalanced-learn.org/en/stable/whats_new.html#version-0-4 The documentation is available at: http://imbalanced-learn.org/en/stable/ Hope this helps. Cheers, -- Guillaume Lemaitre INRIA Saclay - Parietal team Center for Data Science Paris-Saclay https://glemaitre.github.io/ From fabiansd1402 at gmail.com Fri Oct 19 03:52:18 2018 From: fabiansd1402 at gmail.com (fabian dietrichson) Date: Fri, 19 Oct 2018 09:52:18 +0200 Subject: [scikit-learn] Using Scikit-learn graphics for AI workshop, NOKIOS conference Message-ID: Hi Scikit-learn! I am employed at Accenture and have been using your machine learning library extensively! It is a well-designed library, and my weapon of choice whenever hosting AI workshops. Next week we will arrange an AI workshop at a conference in Norway, Trondheim, called NOKIOS http://www.nokios.no/english/ . For this workshop we would like to use Scikit-learn in the case study and have the participants use your model chart to select an estimator http://scikit-learn.org/stable/tutorial/machine_learning_map/index.html . This will promote your library to several executives, as this is a "how to use AI in your business" course. However, I noticed that your images are protected by copyright, and I'm asking if I'm allowed to use your illustration for this purpose? Kind regards Fabian Sødal Dietrichson Accenture Developer - IES Norway -------------- next part -------------- An HTML attachment was scrubbed... URL: From gael.varoquaux at normalesup.org Fri Oct 19 09:31:50 2018 From: gael.varoquaux at normalesup.org (Gael Varoquaux) Date: Fri, 19 Oct 2018 15:31:50 +0200 Subject: [scikit-learn] Using Scikit-learn graphics for AI workshop, NOKIOS conference In-Reply-To: References: Message-ID: <20181019133150.lfmpafidvkm6n433@phare.normalesup.org> On Fri, Oct 19, 2018 at 09:52:18AM +0200, fabian dietrichson wrote: > However, I noticed that your images are protected by copyright, and I'm > asking if I'm allowed to use your illustration for this purpose? Which images specifically do you have in mind? Gaël From fabiansd1402 at gmail.com Fri Oct 19 10:28:35 2018 From: fabiansd1402 at gmail.com (fabian dietrichson) Date: Fri, 19 Oct 2018 16:28:35 +0200 Subject: [scikit-learn] Using Scikit-learn graphics for AI workshop, NOKIOS conference In-Reply-To: <20181019133150.lfmpafidvkm6n433@phare.normalesup.org> References: <20181019133150.lfmpafidvkm6n433@phare.normalesup.org> Message-ID: Hi Gaël, The model chart https://www.google.no/search?q=scikit+learn+model+chart&rlz=1CDGOYI_enNO643NO643&hl=nb&prmd=ivsn&source=lnms&tbm=isch&sa=X&ved=0ahUKEwjZqbSX25LeAhXF1iwKHXwyCMYQ_AUIESgB&biw=375&bih=551#imgrc=BxrGpsOIhhOFJM I will then do some live coding in Scikit-Learn Fabian On Fri, 19 Oct 2018 at 15:33, Gael Varoquaux < gael.varoquaux at normalesup.org> wrote: > On Fri, Oct 19, 2018 at 09:52:18AM +0200, fabian dietrichson wrote: > > However, I noticed that your images are protected by copyright, and > I'm > > asking if I'm allowed to use your illustration for this purpose? > > Which images specifically do you have in mind?
> > Ga?l > _______________________________________________ > scikit-learn mailing list > scikit-learn at python.org > https://mail.python.org/mailman/listinfo/scikit-learn > -------------- next part -------------- An HTML attachment was scrubbed... URL: From t3kcit at gmail.com Fri Oct 19 13:13:15 2018 From: t3kcit at gmail.com (Andreas Mueller) Date: Fri, 19 Oct 2018 13:13:15 -0400 Subject: [scikit-learn] Using Scikit-learn graphics for AI workshop, NOKIOS conference In-Reply-To: References: <20181019133150.lfmpafidvkm6n433@phare.normalesup.org> Message-ID: <958d7d20-1ee9-c6de-1ee7-c1521e982fad@gmail.com> The original chart is CC-0, meaning you can use it legally without attribution, see http://peekaboo-vision.blogspot.com/2013/01/machine-learning-cheat-sheet-for-scikit.html Attribution is obviously still encouraged. On 10/19/2018 10:28 AM, fabian dietrichson wrote: > Hi Ga?l, > > The modem chart > > https://www.google.no/search?q=scikit+learn+model+chart&rlz=1CDGOYI_enNO643NO643&hl=nb&prmd=ivsn&source=lnms&tbm=isch&sa=X&ved=0ahUKEwjZqbSX25LeAhXF1iwKHXwyCMYQ_AUIESgB&biw=375&bih=551#imgrc=BxrGpsOIhhOFJM > > > I will then do some live coding in Scikit-Learn > > Fabian > > fre. 19. okt. 2018 kl. 15:33 skrev Gael Varoquaux > >: > > On Fri, Oct 19, 2018 at 09:52:18AM +0200, fabian dietrichson wrote: > > However, I noticed that your images are protected with copy > rights, and I?m > > asking if I?m allowed to use your illustration for this purpose? > > Which images specifically do you have in mind? > > Ga?l > _______________________________________________ > scikit-learn mailing list > scikit-learn at python.org > https://mail.python.org/mailman/listinfo/scikit-learn > > > > _______________________________________________ > scikit-learn mailing list > scikit-learn at python.org > https://mail.python.org/mailman/listinfo/scikit-learn -------------- next part -------------- An HTML attachment was scrubbed... URL: From mathieu at mblondel.org Tue Oct 23 09:10:49 2018 From: mathieu at mblondel.org (Mathieu Blondel) Date: Tue, 23 Oct 2018 22:10:49 +0900 Subject: [scikit-learn] Sparse predict_proba and Fenchel-Young losses Message-ID: Hi, Most scikit-learn users who need predict_proba use the logistic regression class. We've released a new package implementing more loss functions useful for probabilistic classification. https://github.com/mblondel/fenchel-young-losses/ This is based on our recently proposed family of loss functions called "Fenchel-Young losses" [*]. Two distinguishing features that should be of interest: 1) You can call fit(X, Y) where Y is a n_samples array of label integers *or* Y is a n_samples x n_classes array containing *label proportions*. 2) predict_proba(X) is able to output *sparse* probabilities for some choices of loss functions (loss="sparsemax" or loss="tsallis"). This means that some classes may get *exactly* zero probability. Both features are especially useful in a multi-label setting. We've also released drop-in replacements for PyTorch and Tensorflow loss functions in the same package. Feedback welcome! Cheers, Mathieu [*] https://arxiv.org/abs/1805.09717 -------------- next part -------------- An HTML attachment was scrubbed... 
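To see why "exactly zero probability" is possible, here is a small NumPy illustration of the sparsemax transform (Martins & Astudillo, 2016), one of the loss choices mentioned above. It is a from-scratch sketch of the idea, not code taken from the fenchel-young-losses package.

# Illustrative NumPy sparsemax (Euclidean projection onto the simplex):
# unlike softmax, low-scoring classes get probability exactly 0.
import numpy as np

def sparsemax(z):
    z = np.asarray(z, dtype=float)
    z_sorted = np.sort(z)[::-1]
    cumsum = np.cumsum(z_sorted)
    k = np.arange(1, len(z) + 1)
    support = 1 + k * z_sorted > cumsum       # sorted entries that stay positive
    k_z = k[support][-1]
    tau = (cumsum[support][-1] - 1) / k_z     # threshold subtracted from all scores
    return np.maximum(z - tau, 0.0)

scores = np.array([2.0, 1.2, -0.4])
print(sparsemax(scores))                      # [0.9 0.1 0. ] -- sums to 1, last entry exactly 0
print(np.exp(scores) / np.exp(scores).sum())  # softmax: every entry strictly positive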
URL: From manuel.castejon at gmail.com Wed Oct 24 04:11:06 2018 From: manuel.castejon at gmail.com (=?UTF-8?Q?Manuel_Castej=C3=B3n_Limas?=) Date: Wed, 24 Oct 2018 10:11:06 +0200 Subject: [scikit-learn] Pipegraph example: KMeans + LDA Message-ID: Dear all, as a way of improving the documentation of PipeGraph we intend to provide more examples of its usage. It was a popular demand to show application cases to motivate its usage, so here it is a very simple case with two steps: a KMeans followed by a LDA. https://mcasl.github.io/PipeGraph/auto_examples/plot_Finding_Number_of_clusters.html#sphx-glr-auto-examples-plot-finding-number-of-clusters-py This short example points out the following challenges: - KMeans is not a transformer but an estimator - LDA score function requires the y parameter, while its input does not come from a known set of labels, but from the previous KMeans - Moreover, the GridSearchCV.fit call would also require a 'y' parameter - It would be nice to have access to the output of the KMeans step as well. PipeGraph is capable of addressing these challenges. The rationale for this example lies in the identification-reconstruction realm. In a scenario where the class labels are unknown, we might want to associate the quality of the clustering structure to the capability of a later model to be able to reconstruct this structure. So the basic idea here is that if LDA is capable of getting good results it was because the information of the KMeans was good enough for that purpose, hinting the discovery of a good structure. -------------- next part -------------- An HTML attachment was scrubbed... URL: From sbrightaboh at gmail.com Wed Oct 24 07:29:03 2018 From: sbrightaboh at gmail.com (bright silas Aboh) Date: Wed, 24 Oct 2018 11:29:03 +0000 Subject: [scikit-learn] Error with Kfold cross vailidation Message-ID: Hi Everyone, I am Bright and am trying to build a machine learning model with sklearn I get the following error however, can someone please help me? kf = KFold(data.shape[0], n_splits=5) TypeError: __init__() got multiple values for argument 'n_splits' Thank you -------------- next part -------------- An HTML attachment was scrubbed... URL: From seralouk at hotmail.com Wed Oct 24 08:02:04 2018 From: seralouk at hotmail.com (serafim loukas) Date: Wed, 24 Oct 2018 12:02:04 +0000 Subject: [scikit-learn] Error with Kfold cross vailidation In-Reply-To: References: Message-ID: <42626690-7587-4026-A201-3E94EB2B7F08@hotmail.com> Hello, Do you import KFold from sklearn.model_selection ? On 24 Oct 2018, at 13:29, bright silas Aboh > wrote: Hi Everyone, I am Bright and am trying to build a machine learning model with sklearn I get the following error however, can someone please help me? kf = KFold(data.shape[0], n_splits=5) TypeError: __init__() got multiple values for argument 'n_splits' Thank you _______________________________________________ scikit-learn mailing list scikit-learn at python.org https://mail.python.org/mailman/listinfo/scikit-learn -------------- next part -------------- An HTML attachment was scrubbed... URL: From sbrightaboh at gmail.com Wed Oct 24 08:05:48 2018 From: sbrightaboh at gmail.com (bright silas Aboh) Date: Wed, 24 Oct 2018 12:05:48 +0000 Subject: [scikit-learn] Error with Kfold cross vailidation In-Reply-To: <42626690-7587-4026-A201-3E94EB2B7F08@hotmail.com> References: <42626690-7587-4026-A201-3E94EB2B7F08@hotmail.com> Message-ID: yes please. 
I import KFold from sklearn.model_selection On Wed, Oct 24, 2018 at 12:02 PM serafim loukas wrote: > Hello, > > > Do you import KFold from sklearn.model_selection ? > > > > On 24 Oct 2018, at 13:29, bright silas Aboh wrote: > > Hi Everyone, > > I am Bright and am trying to build a machine learning model with sklearn > I get the following error however, can someone please help me? > > kf = KFold(data.shape[0], n_splits=5) > TypeError: __init__() got multiple values for argument 'n_splits' > > Thank you > > > > _______________________________________________ > scikit-learn mailing list > scikit-learn at python.org > https://mail.python.org/mailman/listinfo/scikit-learn > > > _______________________________________________ > scikit-learn mailing list > scikit-learn at python.org > https://mail.python.org/mailman/listinfo/scikit-learn > -------------- next part -------------- An HTML attachment was scrubbed... URL: From emmanuelarias30 at gmail.com Wed Oct 24 08:45:13 2018 From: emmanuelarias30 at gmail.com (eamanu15) Date: Wed, 24 Oct 2018 09:45:13 -0300 Subject: [scikit-learn] =?utf-8?q?=E2=80=8BRe=3A_Error_with_Kfold_cross_v?= =?utf-8?q?ailidation?= In-Reply-To: References: Message-ID: Hello Bright! > I am Bright and am trying to build a machine learning model with sklearn > I get the following error however, can someone please help me? > > kf = KFold(data.shape[0], n_splits=5) > TypeError: __init__() got multiple values for argument 'n_splits' > > Reading the doc [1] I think that the problem is that you are setting the parameters from a wrong way. The first parameter is n_split, but you set data.shape[0] and then you set again "n_splits=5" That is the error. [1] http://scikit-learn.org/stable/modules/generated/sklearn.model_selection.KFold.html -- Arias Emmanuel http://eamanu.com Github/Gitlab; @eamanu Debian: @eamanu-guest -------------- next part -------------- An HTML attachment was scrubbed... URL: From gael.varoquaux at normalesup.org Wed Oct 24 09:23:00 2018 From: gael.varoquaux at normalesup.org (Gael Varoquaux) Date: Wed, 24 Oct 2018 15:23:00 +0200 Subject: [scikit-learn] Error with Kfold cross vailidation In-Reply-To: References: Message-ID: <20181024132300.frht4ziwkldpcoxa@phare.normalesup.org> > ? kf = KFold(data.shape[0], n_splits=5) > TypeError: __init__() got multiple values for argument 'n_splits' Don't specify data.shape[0], this is no longer necessary in the recent versions of scikit-learn. From sbrightaboh at gmail.com Wed Oct 24 09:35:24 2018 From: sbrightaboh at gmail.com (bright silas Aboh) Date: Wed, 24 Oct 2018 13:35:24 +0000 Subject: [scikit-learn] Error with Kfold cross vailidation In-Reply-To: <20181024132300.frht4ziwkldpcoxa@phare.normalesup.org> References: <20181024132300.frht4ziwkldpcoxa@phare.normalesup.org> Message-ID: Okey. I did removed the data.shape as suggested but I am now having a new error that says: Kfold object not iterable On Wed, 24 Oct 2018 at 13:23, Gael Varoquaux wrote: > > kf = KFold(data.shape[0], n_splits=5) > > TypeError: __init__() got multiple values for argument 'n_splits' > > Don't specify data.shape[0], this is no longer necessary in the recent > versions of scikit-learn. > _______________________________________________ > scikit-learn mailing list > scikit-learn at python.org > https://mail.python.org/mailman/listinfo/scikit-learn > -------------- next part -------------- An HTML attachment was scrubbed... 
URL: From seralouk at hotmail.com Wed Oct 24 09:37:31 2018 From: seralouk at hotmail.com (serafim loukas) Date: Wed, 24 Oct 2018 13:37:31 +0000 Subject: [scikit-learn] Error with Kfold cross vailidation In-Reply-To: References: <20181024132300.frht4ziwkldpcoxa@phare.normalesup.org>, Message-ID: What is your scikit learn version? In case you have the latest try to reinstall the module. On 24 Oct 2018, at 15:36, bright silas Aboh > wrote: Okey. I did removed the data.shape as suggested but I am now having a new error that says: Kfold object not iterable On Wed, 24 Oct 2018 at 13:23, Gael Varoquaux > wrote: > kf = KFold(data.shape[0], n_splits=5) > TypeError: __init__() got multiple values for argument 'n_splits' Don't specify data.shape[0], this is no longer necessary in the recent versions of scikit-learn. _______________________________________________ scikit-learn mailing list scikit-learn at python.org https://mail.python.org/mailman/listinfo/scikit-learn _______________________________________________ scikit-learn mailing list scikit-learn at python.org https://mail.python.org/mailman/listinfo/scikit-learn -------------- next part -------------- An HTML attachment was scrubbed... URL: From sbrightaboh at gmail.com Wed Oct 24 09:38:30 2018 From: sbrightaboh at gmail.com (bright silas Aboh) Date: Wed, 24 Oct 2018 13:38:30 +0000 Subject: [scikit-learn] Error with Kfold cross vailidation In-Reply-To: References: <20181024132300.frht4ziwkldpcoxa@phare.normalesup.org> Message-ID: Yes.Its the latest On Wed, 24 Oct 2018 at 13:37, serafim loukas wrote: > What is your scikit learn version? > > In case you have the latest try to reinstall the module. > > On 24 Oct 2018, at 15:36, bright silas Aboh wrote: > > Okey. I did removed the data.shape as suggested but I am now having a new > error that says: > Kfold object not iterable > > On Wed, 24 Oct 2018 at 13:23, Gael Varoquaux < > gael.varoquaux at normalesup.org> wrote: > >> > kf = KFold(data.shape[0], n_splits=5) >> > TypeError: __init__() got multiple values for argument 'n_splits' >> >> Don't specify data.shape[0], this is no longer necessary in the recent >> versions of scikit-learn. >> _______________________________________________ >> scikit-learn mailing list >> scikit-learn at python.org >> https://mail.python.org/mailman/listinfo/scikit-learn >> > _______________________________________________ > scikit-learn mailing list > scikit-learn at python.org > https://mail.python.org/mailman/listinfo/scikit-learn > > _______________________________________________ > scikit-learn mailing list > scikit-learn at python.org > https://mail.python.org/mailman/listinfo/scikit-learn > -------------- next part -------------- An HTML attachment was scrubbed... URL: From joel.nothman at gmail.com Wed Oct 24 18:53:15 2018 From: joel.nothman at gmail.com (Joel Nothman) Date: Thu, 25 Oct 2018 09:53:15 +1100 Subject: [scikit-learn] Error with Kfold cross vailidation In-Reply-To: References: <20181024132300.frht4ziwkldpcoxa@phare.normalesup.org> Message-ID: Yes, it is not iterable. You are copying a tutorial or code that describes the usage of sklearn.cross_validation.KFold, which no longer exists in version 0.20. Find an example with the newer sklearn.model_selection.KFold. On Thu, 25 Oct 2018 at 00:36, bright silas Aboh wrote: > Okey. 
I did removed the data.shape as suggested but I am now having a new > error that says: > Kfold object not iterable > > On Wed, 24 Oct 2018 at 13:23, Gael Varoquaux < > gael.varoquaux at normalesup.org> wrote: > >> > kf = KFold(data.shape[0], n_splits=5) >> > TypeError: __init__() got multiple values for argument 'n_splits' >> >> Don't specify data.shape[0], this is no longer necessary in the recent >> versions of scikit-learn. >> _______________________________________________ >> scikit-learn mailing list >> scikit-learn at python.org >> https://mail.python.org/mailman/listinfo/scikit-learn >> > _______________________________________________ > scikit-learn mailing list > scikit-learn at python.org > https://mail.python.org/mailman/listinfo/scikit-learn > -------------- next part -------------- An HTML attachment was scrubbed... URL: From sbrightaboh at gmail.com Thu Oct 25 09:15:12 2018 From: sbrightaboh at gmail.com (bright silas Aboh) Date: Thu, 25 Oct 2018 13:15:12 +0000 Subject: [scikit-learn] Error with Kfold cross vailidation In-Reply-To: References: <20181024132300.frht4ziwkldpcoxa@phare.normalesup.org> Message-ID: Ok. And thanks On Wed, 24 Oct 2018 at 22:53, Joel Nothman wrote: > Yes, it is not iterable. You are copying a tutorial or code that describes > the usage of sklearn.cross_validation.KFold, which no longer exists in > version 0.20. Find an example with the newer sklearn.model_selection.KFold. > > On Thu, 25 Oct 2018 at 00:36, bright silas Aboh > wrote: > >> Okey. I did removed the data.shape as suggested but I am now having a new >> error that says: >> Kfold object not iterable >> >> On Wed, 24 Oct 2018 at 13:23, Gael Varoquaux < >> gael.varoquaux at normalesup.org> wrote: >> >>> > kf = KFold(data.shape[0], n_splits=5) >>> > TypeError: __init__() got multiple values for argument 'n_splits' >>> >>> Don't specify data.shape[0], this is no longer necessary in the recent >>> versions of scikit-learn. >>> _______________________________________________ >>> scikit-learn mailing list >>> scikit-learn at python.org >>> https://mail.python.org/mailman/listinfo/scikit-learn >>> >> _______________________________________________ >> scikit-learn mailing list >> scikit-learn at python.org >> https://mail.python.org/mailman/listinfo/scikit-learn >> > _______________________________________________ > scikit-learn mailing list > scikit-learn at python.org > https://mail.python.org/mailman/listinfo/scikit-learn > -------------- next part -------------- An HTML attachment was scrubbed... URL: From t3kcit at gmail.com Thu Oct 25 12:26:15 2018 From: t3kcit at gmail.com (Andreas Mueller) Date: Thu, 25 Oct 2018 12:26:15 -0400 Subject: [scikit-learn] Sparse predict_proba and Fenchel-Young losses In-Reply-To: References: Message-ID: Awesome! On 10/23/18 9:10 AM, Mathieu Blondel wrote: > Hi, > > Most scikit-learn users who need predict_proba use the logistic > regression class. We've released a new package implementing more loss > functions useful for probabilistic classification. > > https://github.com/mblondel/fenchel-young-losses/ > > This is based on our recently proposed family of loss functions called > "Fenchel-Young losses" [*]. > > Two distinguishing features that should be of interest: > > 1) You can call fit(X, Y) where Y is a n_samples array of label > integers *or* Y is a n_samples x n_classes array containing *label > proportions*. We've gotten that feature request for logistic regression a couple of times, not sure it's in the scope of scikit-learn. 
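As a rough illustration of what fitting on label proportions means, here is a plain NumPy soft-target multinomial logistic regression; this is only a sketch of the idea, not the API of fenchel-young-losses or of scikit-learn, and the data below are made up.

import numpy as np

def fit_soft_logreg(X, Y, lr=0.1, n_iter=500):
    # gradient descent on mean cross-entropy with soft targets Y of shape (n_samples, n_classes)
    n, d = X.shape
    k = Y.shape[1]
    W = np.zeros((d, k))
    b = np.zeros(k)
    for _ in range(n_iter):
        Z = X @ W + b
        Z -= Z.max(axis=1, keepdims=True)      # numerical stability
        P = np.exp(Z)
        P /= P.sum(axis=1, keepdims=True)      # softmax probabilities
        G = (P - Y) / n                        # gradient of the loss w.r.t. the logits
        W -= lr * (X.T @ G)
        b -= lr * G.sum(axis=0)
    return W, b

# Y may hold one-hot encoded hard labels *or* per-sample class proportions:
X = np.random.RandomState(0).randn(6, 2)
Y = np.array([[0.9, 0.1], [0.8, 0.2], [0.7, 0.3],
              [0.2, 0.8], [0.1, 0.9], [0.3, 0.7]])
W, b = fit_soft_logreg(X, Y)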
Great to see that you've done it! > > 2) predict_proba(X) is able to output *sparse* probabilities for some > choices of loss functions (loss="sparsemax" or loss="tsallis"). This > means that some classes may get *exactly* zero probability. > > Both features are especially useful in a multi-label setting. > > We've also released drop-in replacements for PyTorch and Tensorflow > loss functions in the same package. > > Feedback welcome! > > Cheers, > Mathieu > > [*] https://arxiv.org/abs/1805.09717 > > _______________________________________________ > scikit-learn mailing list > scikit-learn at python.org > https://mail.python.org/mailman/listinfo/scikit-learn -------------- next part -------------- An HTML attachment was scrubbed... URL: From t3kcit at gmail.com Thu Oct 25 12:46:20 2018 From: t3kcit at gmail.com (Andreas Mueller) Date: Thu, 25 Oct 2018 12:46:20 -0400 Subject: [scikit-learn] Google Season of Docs Message-ID: <339fc052-fe50-0726-b704-91e5c2ce4657@gmail.com> Hey. Are we interested in the Google Season of Docs? https://docs.google.com/forms/d/e/1FAIpQLSf-njReSfmp5i2olgmsDzrFR0Ll0UB5LkCzrtyM5o9Yw0foPw/viewform https://docs.google.com/presentation/d/1ABqCc5uAoQv9aqGCxmNqOJ9S_Tst-adNV3fcWQ2Quwc/edit#slide=id.g42b115f18c_0_0 It requires a mentor, which has been an issue in the past. But it looks like the idea is to have professionals partner up with projects, not students. The other problem would of course be formulating a clearly defined project. I think we could probably use some restructuring, or more focused tutorials. Wdyt? Andy From sean.violante at gmail.com Fri Oct 26 11:06:15 2018 From: sean.violante at gmail.com (Sean Violante) Date: Fri, 26 Oct 2018 17:06:15 +0200 Subject: [scikit-learn] Sparse predict_proba and Fenchel-Young losses In-Reply-To: References: Message-ID: 1) You can call fit(X, Y) where Y is a n_samples array of label integers *or* Y is a n_samples x n_classes array containing *label proportions*. Matthieu - that's great. In glmnet it is implemented directly as counts (not proportions) - which may be more natural. I find it a shame this is not implemented in sklearn - if ever sample weights is properly added to sklearn (eg for testing) it would be great to handle this as well. For me the use case is grouped data (for memory efficiency) - where this comes naturally. it would then benefit to add a crossvalidation that 'ignored grouping' ie replicating sampling uniformly from ungrouped data. On Thu, Oct 25, 2018 at 6:27 PM Andreas Mueller wrote: > Awesome! > > On 10/23/18 9:10 AM, Mathieu Blondel wrote: > > Hi, > > Most scikit-learn users who need predict_proba use the logistic regression > class. We've released a new package implementing more loss functions useful > for probabilistic classification. > > https://github.com/mblondel/fenchel-young-losses/ > > This is based on our recently proposed family of loss functions called > "Fenchel-Young losses" [*]. > > Two distinguishing features that should be of interest: > > 1) You can call fit(X, Y) where Y is a n_samples array of label integers > *or* Y is a n_samples x n_classes array containing *label proportions*. > > We've gotten that feature request for logistic regression a couple of > times, not sure it's in the scope of scikit-learn. > Great to see that you've done it! > > > 2) predict_proba(X) is able to output *sparse* probabilities for some > choices of loss functions (loss="sparsemax" or loss="tsallis"). This means > that some classes may get *exactly* zero probability. 
> > Both features are especially useful in a multi-label setting. > > We've also released drop-in replacements for PyTorch and Tensorflow loss > functions in the same package. > > Feedback welcome! > > Cheers, > Mathieu > > [*] https://arxiv.org/abs/1805.09717 > > _______________________________________________ > scikit-learn mailing listscikit-learn at python.orghttps://mail.python.org/mailman/listinfo/scikit-learn > > > _______________________________________________ > scikit-learn mailing list > scikit-learn at python.org > https://mail.python.org/mailman/listinfo/scikit-learn > -------------- next part -------------- An HTML attachment was scrubbed... URL: From mail at sebastianraschka.com Sat Oct 27 18:39:42 2018 From: mail at sebastianraschka.com (Sebastian Raschka) Date: Sat, 27 Oct 2018 17:39:42 -0500 Subject: [scikit-learn] How does the random state influence the decision tree splits? Message-ID: <2675225F-ABAF-4898-B780-7F77D8E808A7@sebastianraschka.com> Hi all, when I was implementing a bagging classifier based on scikit-learn's DecisionTreeClassifier, I noticed that the results were not deterministic and found that this was due to the random_state in the DescisionTreeClassifier (which is set to None by default). I am wondering what exactly this random state is used for? I can imaging it being used for resolving ties if the information gain for multiple features is the same, or it could be that the feature splits of continuous features is different? (I thought the heuristic is to sort the features and to consider those feature values next to each associated with examples that have different class labels -- but is there maybe some random subselection involved?) If someone knows more about this, where the random_state is used, I'd be happy to hear it :) Also, we could then maybe add the info to the DecisionTreeClassifier's docstring, which is currently a bit too generic to be useful, I think: https://github.com/scikit-learn/scikit-learn/blob/master/sklearn/tree/tree.py random_state : int, RandomState instance or None, optional (default=None) If int, random_state is the seed used by the random number generator; If RandomState instance, random_state is the random number generator; If None, the random number generator is the RandomState instance used by `np.random`. Best, Sebastian From jlopez at ende.cc Sat Oct 27 19:16:39 2018 From: jlopez at ende.cc (=?UTF-8?Q?Javier_L=C3=B3pez?=) Date: Sun, 28 Oct 2018 00:16:39 +0100 Subject: [scikit-learn] How does the random state influence the decision tree splits? In-Reply-To: <2675225F-ABAF-4898-B780-7F77D8E808A7@sebastianraschka.com> References: <2675225F-ABAF-4898-B780-7F77D8E808A7@sebastianraschka.com> Message-ID: Hi Sebastian, I think the random state is used to select the features that go into each split (look at the `max_features` parameter) Cheers, Javier On Sun, Oct 28, 2018 at 12:07 AM Sebastian Raschka < mail at sebastianraschka.com> wrote: > Hi all, > > when I was implementing a bagging classifier based on scikit-learn's > DecisionTreeClassifier, I noticed that the results were not deterministic > and found that this was due to the random_state in the > DescisionTreeClassifier (which is set to None by default). > > I am wondering what exactly this random state is used for? I can imaging > it being used for resolving ties if the information gain for multiple > features is the same, or it could be that the feature splits of continuous > features is different? 
(I thought the heuristic is to sort the features and > to consider those feature values next to each associated with examples that > have different class labels -- but is there maybe some random subselection > involved?) > > If someone knows more about this, where the random_state is used, I'd be > happy to hear it :) > > Also, we could then maybe add the info to the DecisionTreeClassifier's > docstring, which is currently a bit too generic to be useful, I think: > > > https://github.com/scikit-learn/scikit-learn/blob/master/sklearn/tree/tree.py > > > random_state : int, RandomState instance or None, optional > (default=None) > If int, random_state is the seed used by the random number > generator; > If RandomState instance, random_state is the random number > generator; > If None, the random number generator is the RandomState instance > used > by `np.random`. > > > Best, > Sebastian > _______________________________________________ > scikit-learn mailing list > scikit-learn at python.org > https://mail.python.org/mailman/listinfo/scikit-learn > -------------- next part -------------- An HTML attachment was scrubbed... URL: From mail at sebastianraschka.com Sat Oct 27 20:24:50 2018 From: mail at sebastianraschka.com (Sebastian Raschka) Date: Sat, 27 Oct 2018 19:24:50 -0500 Subject: [scikit-learn] How does the random state influence the decision tree splits? In-Reply-To: References: <2675225F-ABAF-4898-B780-7F77D8E808A7@sebastianraschka.com> Message-ID: <49701ADA-9C16-463C-B78F-8F5F18BCAFA6@sebastianraschka.com> Thanks, Javier, however, the max_features is n_features by default. But if you execute sth like import numpy as np from sklearn.datasets import load_iris from sklearn.model_selection import train_test_split from sklearn.tree import DecisionTreeClassifier iris = load_iris() X, y = iris.data, iris.target X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=123, shuffle=True, stratify=y) for i in range(20): tree = DecisionTreeClassifier() tree.fit(X_train, y_train) print(tree.score(X_test, y_test)) You will find that the tree will produce different results if you don't fix the random seed. I suspect, related to what you said about the random feature selection if max_features is not n_features, that there is generally some sorting of the features going on, and the different trees are then due to tie-breaking if two features have the same information gain? Best, Sebastian > On Oct 27, 2018, at 6:16 PM, Javier L?pez wrote: > > Hi Sebastian, > > I think the random state is used to select the features that go into each split (look at the `max_features` parameter) > > Cheers, > Javier > > On Sun, Oct 28, 2018 at 12:07 AM Sebastian Raschka wrote: > Hi all, > > when I was implementing a bagging classifier based on scikit-learn's DecisionTreeClassifier, I noticed that the results were not deterministic and found that this was due to the random_state in the DescisionTreeClassifier (which is set to None by default). > > I am wondering what exactly this random state is used for? I can imaging it being used for resolving ties if the information gain for multiple features is the same, or it could be that the feature splits of continuous features is different? (I thought the heuristic is to sort the features and to consider those feature values next to each associated with examples that have different class labels -- but is there maybe some random subselection involved?) 
> > If someone knows more about this, where the random_state is used, I'd be happy to hear it :) > > Also, we could then maybe add the info to the DecisionTreeClassifier's docstring, which is currently a bit too generic to be useful, I think: > > https://github.com/scikit-learn/scikit-learn/blob/master/sklearn/tree/tree.py > > > random_state : int, RandomState instance or None, optional (default=None) > If int, random_state is the seed used by the random number generator; > If RandomState instance, random_state is the random number generator; > If None, the random number generator is the RandomState instance used > by `np.random`. > > > Best, > Sebastian > _______________________________________________ > scikit-learn mailing list > scikit-learn at python.org > https://mail.python.org/mailman/listinfo/scikit-learn > _______________________________________________ > scikit-learn mailing list > scikit-learn at python.org > https://mail.python.org/mailman/listinfo/scikit-learn From julio at esbet.es Sun Oct 28 00:07:26 2018 From: julio at esbet.es (Julio Antonio Soto de Vicente) Date: Sun, 28 Oct 2018 05:07:26 +0100 Subject: [scikit-learn] How does the random state influence the decision tree splits? In-Reply-To: <49701ADA-9C16-463C-B78F-8F5F18BCAFA6@sebastianraschka.com> References: <2675225F-ABAF-4898-B780-7F77D8E808A7@sebastianraschka.com> <49701ADA-9C16-463C-B78F-8F5F18BCAFA6@sebastianraschka.com> Message-ID: Hmmm that?s weird... Have you tried to plot the trees (the decision rules) for the tree with different seeds, and see if the gain for the first split is the same even if the split itself is different? I?d at least try that before diving into the source code... Cheers, -- Julio > El 28 oct 2018, a las 2:24, Sebastian Raschka escribi?: > > Thanks, Javier, > > however, the max_features is n_features by default. But if you execute sth like > > import numpy as np > from sklearn.datasets import load_iris > from sklearn.model_selection import train_test_split > from sklearn.tree import DecisionTreeClassifier > > iris = load_iris() > X, y = iris.data, iris.target > X_train, X_test, y_train, y_test = train_test_split(X, y, > test_size=0.3, > random_state=123, > shuffle=True, > stratify=y) > > for i in range(20): > tree = DecisionTreeClassifier() > tree.fit(X_train, y_train) > print(tree.score(X_test, y_test)) > > > > You will find that the tree will produce different results if you don't fix the random seed. I suspect, related to what you said about the random feature selection if max_features is not n_features, that there is generally some sorting of the features going on, and the different trees are then due to tie-breaking if two features have the same information gain? > > Best, > Sebastian > > > >> On Oct 27, 2018, at 6:16 PM, Javier L?pez wrote: >> >> Hi Sebastian, >> >> I think the random state is used to select the features that go into each split (look at the `max_features` parameter) >> >> Cheers, >> Javier >> >> On Sun, Oct 28, 2018 at 12:07 AM Sebastian Raschka wrote: >> Hi all, >> >> when I was implementing a bagging classifier based on scikit-learn's DecisionTreeClassifier, I noticed that the results were not deterministic and found that this was due to the random_state in the DescisionTreeClassifier (which is set to None by default). >> >> I am wondering what exactly this random state is used for? I can imaging it being used for resolving ties if the information gain for multiple features is the same, or it could be that the feature splits of continuous features is different? 
(I thought the heuristic is to sort the features and to consider those feature values next to each associated with examples that have different class labels -- but is there maybe some random subselection involved?) >> >> If someone knows more about this, where the random_state is used, I'd be happy to hear it :) >> >> Also, we could then maybe add the info to the DecisionTreeClassifier's docstring, which is currently a bit too generic to be useful, I think: >> >> https://github.com/scikit-learn/scikit-learn/blob/master/sklearn/tree/tree.py >> >> >> random_state : int, RandomState instance or None, optional (default=None) >> If int, random_state is the seed used by the random number generator; >> If RandomState instance, random_state is the random number generator; >> If None, the random number generator is the RandomState instance used >> by `np.random`. >> >> >> Best, >> Sebastian >> _______________________________________________ >> scikit-learn mailing list >> scikit-learn at python.org >> https://mail.python.org/mailman/listinfo/scikit-learn >> _______________________________________________ >> scikit-learn mailing list >> scikit-learn at python.org >> https://mail.python.org/mailman/listinfo/scikit-learn > > _______________________________________________ > scikit-learn mailing list > scikit-learn at python.org > https://mail.python.org/mailman/listinfo/scikit-learn From louis.abraham at yahoo.fr Sun Oct 28 02:40:36 2018 From: louis.abraham at yahoo.fr (Louis Abraham) Date: Sun, 28 Oct 2018 07:40:36 +0100 Subject: [scikit-learn] Strange code but that works Message-ID: <44746C14-07A8-4FEF-BFF1-1706F9D7CDAE@yahoo.fr> Hi, This is a code from sklearn.pipeline.Pipeline: @property def transform(self): """Apply transforms, and transform with the final estimator This also works where final estimator is ``None``: all prior transformations are applied. Parameters ---------- X : iterable Data to transform. Must fulfill input requirements of first step of the pipeline. Returns ------- Xt : array-like, shape = [n_samples, n_transformed_features] """ # _final_estimator is None or has transform, otherwise attribute error # XXX: Handling the None case means we can't use if_delegate_has_method if self._final_estimator is not None: self._final_estimator.transform return self._transform I don't understand why `self._final_estimator.transform` can be returned, ignoring all the previous transformers. However, when testing it works: ``` >>> p = make_pipeline(FunctionTransformer(lambda x: 2*x), FunctionTransformer(lambda x: x-1)) >>> p.transform(np.array([[1,2]])) array([[1, 3]]) ``` Could somebody explain that to me? Best, Louis Abraham -------------- next part -------------- An HTML attachment was scrubbed... URL: From mail at sebastianraschka.com Sun Oct 28 01:33:48 2018 From: mail at sebastianraschka.com (Sebastian Raschka) Date: Sun, 28 Oct 2018 00:33:48 -0500 Subject: [scikit-learn] How does the random state influence the decision tree splits? In-Reply-To: References: <2675225F-ABAF-4898-B780-7F77D8E808A7@sebastianraschka.com> <49701ADA-9C16-463C-B78F-8F5F18BCAFA6@sebastianraschka.com> Message-ID: <159253A2-F0B6-4341-90F7-DBABD9A6F04C@sebastianraschka.com> Good suggestion. The trees look different. I.e., there seems to be a tie at some point between choosing X[:, 0] <= 4.95 and X[:, 3] <= 1.65 So, I suspect that the features are shuffled, let's call it X_shuffled. Then at some point the max_features are selected, which is by default X_shuffled[:, :n_features]. 
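One quick way to surface such ties empirically is to inspect the root node of the fitted tree_ across seeds; a minimal sketch reusing the iris split from the thread (which feature index and threshold appear is whatever each fit happens to pick):

import numpy as np
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

iris = load_iris()
X_train, X_test, y_train, y_test = train_test_split(
    iris.data, iris.target, test_size=0.3, random_state=123,
    shuffle=True, stratify=iris.target)

for seed in range(5):
    tree = DecisionTreeClassifier(random_state=seed)
    tree.fit(X_train, y_train)
    # root split: chosen feature index, threshold, and impurity at the root node
    print(seed,
          tree.tree_.feature[0],
          tree.tree_.threshold[0],
          tree.tree_.impurity[0])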
Based on that, if there's a tie between impurities for the different features, it's probably selecting the first feature in the array among these ties. If this is true (have to look into the code more deeply then) I wonder if it would be worthwhile to change the implementation such that the shuffling only occurs if max_features < n_feature, because this way we could have deterministic behavior for the trees by default, which I'd find more intuitive for plain decision trees tbh. Let me know what you all think. Best, Sebastian > On Oct 27, 2018, at 11:07 PM, Julio Antonio Soto de Vicente wrote: > > Hmmm that?s weird... > > Have you tried to plot the trees (the decision rules) for the tree with different seeds, and see if the gain for the first split is the same even if the split itself is different? > > I?d at least try that before diving into the source code... > > Cheers, > > -- > Julio > >> El 28 oct 2018, a las 2:24, Sebastian Raschka escribi?: >> >> Thanks, Javier, >> >> however, the max_features is n_features by default. But if you execute sth like >> >> import numpy as np >> from sklearn.datasets import load_iris >> from sklearn.model_selection import train_test_split >> from sklearn.tree import DecisionTreeClassifier >> >> iris = load_iris() >> X, y = iris.data, iris.target >> X_train, X_test, y_train, y_test = train_test_split(X, y, >> test_size=0.3, >> random_state=123, >> shuffle=True, >> stratify=y) >> >> for i in range(20): >> tree = DecisionTreeClassifier() >> tree.fit(X_train, y_train) >> print(tree.score(X_test, y_test)) >> >> >> >> You will find that the tree will produce different results if you don't fix the random seed. I suspect, related to what you said about the random feature selection if max_features is not n_features, that there is generally some sorting of the features going on, and the different trees are then due to tie-breaking if two features have the same information gain? >> >> Best, >> Sebastian >> >> >> >>> On Oct 27, 2018, at 6:16 PM, Javier L?pez wrote: >>> >>> Hi Sebastian, >>> >>> I think the random state is used to select the features that go into each split (look at the `max_features` parameter) >>> >>> Cheers, >>> Javier >>> >>> On Sun, Oct 28, 2018 at 12:07 AM Sebastian Raschka wrote: >>> Hi all, >>> >>> when I was implementing a bagging classifier based on scikit-learn's DecisionTreeClassifier, I noticed that the results were not deterministic and found that this was due to the random_state in the DescisionTreeClassifier (which is set to None by default). >>> >>> I am wondering what exactly this random state is used for? I can imaging it being used for resolving ties if the information gain for multiple features is the same, or it could be that the feature splits of continuous features is different? (I thought the heuristic is to sort the features and to consider those feature values next to each associated with examples that have different class labels -- but is there maybe some random subselection involved?) 
>>> >>> If someone knows more about this, where the random_state is used, I'd be happy to hear it :) >>> >>> Also, we could then maybe add the info to the DecisionTreeClassifier's docstring, which is currently a bit too generic to be useful, I think: >>> >>> https://github.com/scikit-learn/scikit-learn/blob/master/sklearn/tree/tree.py >>> >>> >>> random_state : int, RandomState instance or None, optional (default=None) >>> If int, random_state is the seed used by the random number generator; >>> If RandomState instance, random_state is the random number generator; >>> If None, the random number generator is the RandomState instance used >>> by `np.random`. >>> >>> >>> Best, >>> Sebastian >>> _______________________________________________ >>> scikit-learn mailing list >>> scikit-learn at python.org >>> https://mail.python.org/mailman/listinfo/scikit-learn >>> _______________________________________________ >>> scikit-learn mailing list >>> scikit-learn at python.org >>> https://mail.python.org/mailman/listinfo/scikit-learn >> >> _______________________________________________ >> scikit-learn mailing list >> scikit-learn at python.org >> https://mail.python.org/mailman/listinfo/scikit-learn > _______________________________________________ > scikit-learn mailing list > scikit-learn at python.org > https://mail.python.org/mailman/listinfo/scikit-learn From niedakh at gmail.com Sun Oct 28 03:31:29 2018 From: niedakh at gmail.com (=?UTF-8?Q?Piotr_Szyma=C5=84ski?=) Date: Sun, 28 Oct 2018 08:31:29 +0100 Subject: [scikit-learn] How does the random state influence the decision tree splits? In-Reply-To: <159253A2-F0B6-4341-90F7-DBABD9A6F04C@sebastianraschka.com> References: <2675225F-ABAF-4898-B780-7F77D8E808A7@sebastianraschka.com> <49701ADA-9C16-463C-B78F-8F5F18BCAFA6@sebastianraschka.com> <159253A2-F0B6-4341-90F7-DBABD9A6F04C@sebastianraschka.com> Message-ID: Just a small side note that I've come across with Random Forests which in the end form an ensemble of Decision Trees. I ran a thousand iterations of RFs on multi-label data and managed to get a 4-10 percentage points difference in subset accuracy, depending on the data set, just as a random effect, while I've seen papers report differences of just a couple pp as statistically significant after a non-parametric rank test. On Sun, Oct 28, 2018 at 7:44 AM Sebastian Raschka wrote: > Good suggestion. The trees look different. I.e., there seems to be a tie > at some point between choosing X[:, 0] <= 4.95 and X[:, 3] <= 1.65 > > So, I suspect that the features are shuffled, let's call it X_shuffled. > Then at some point the max_features are selected, which is by default > X_shuffled[:, :n_features]. Based on that, if there's a tie between > impurities for the different features, it's probably selecting the first > feature in the array among these ties. > > If this is true (have to look into the code more deeply then) I wonder if > it would be worthwhile to change the implementation such that the shuffling > only occurs if max_features < n_feature, because this way we could have > deterministic behavior for the trees by default, which I'd find more > intuitive for plain decision trees tbh. > > Let me know what you all think. > > Best, > Sebastian > > > On Oct 27, 2018, at 11:07 PM, Julio Antonio Soto de Vicente < > julio at esbet.es> wrote: > > > > Hmmm that?s weird... 
> > > > Have you tried to plot the trees (the decision rules) for the tree with > different seeds, and see if the gain for the first split is the same even > if the split itself is different? > > > > I?d at least try that before diving into the source code... > > > > Cheers, > > > > -- > > Julio > > > >> El 28 oct 2018, a las 2:24, Sebastian Raschka < > mail at sebastianraschka.com> escribi?: > >> > >> Thanks, Javier, > >> > >> however, the max_features is n_features by default. But if you execute > sth like > >> > >> import numpy as np > >> from sklearn.datasets import load_iris > >> from sklearn.model_selection import train_test_split > >> from sklearn.tree import DecisionTreeClassifier > >> > >> iris = load_iris() > >> X, y = iris.data, iris.target > >> X_train, X_test, y_train, y_test = train_test_split(X, y, > >> test_size=0.3, > >> random_state=123, > >> shuffle=True, > >> stratify=y) > >> > >> for i in range(20): > >> tree = DecisionTreeClassifier() > >> tree.fit(X_train, y_train) > >> print(tree.score(X_test, y_test)) > >> > >> > >> > >> You will find that the tree will produce different results if you don't > fix the random seed. I suspect, related to what you said about the random > feature selection if max_features is not n_features, that there is > generally some sorting of the features going on, and the different trees > are then due to tie-breaking if two features have the same information gain? > >> > >> Best, > >> Sebastian > >> > >> > >> > >>> On Oct 27, 2018, at 6:16 PM, Javier L?pez wrote: > >>> > >>> Hi Sebastian, > >>> > >>> I think the random state is used to select the features that go into > each split (look at the `max_features` parameter) > >>> > >>> Cheers, > >>> Javier > >>> > >>> On Sun, Oct 28, 2018 at 12:07 AM Sebastian Raschka < > mail at sebastianraschka.com> wrote: > >>> Hi all, > >>> > >>> when I was implementing a bagging classifier based on scikit-learn's > DecisionTreeClassifier, I noticed that the results were not deterministic > and found that this was due to the random_state in the > DescisionTreeClassifier (which is set to None by default). > >>> > >>> I am wondering what exactly this random state is used for? I can > imaging it being used for resolving ties if the information gain for > multiple features is the same, or it could be that the feature splits of > continuous features is different? (I thought the heuristic is to sort the > features and to consider those feature values next to each associated with > examples that have different class labels -- but is there maybe some random > subselection involved?) > >>> > >>> If someone knows more about this, where the random_state is used, I'd > be happy to hear it :) > >>> > >>> Also, we could then maybe add the info to the DecisionTreeClassifier's > docstring, which is currently a bit too generic to be useful, I think: > >>> > >>> > https://github.com/scikit-learn/scikit-learn/blob/master/sklearn/tree/tree.py > >>> > >>> > >>> random_state : int, RandomState instance or None, optional > (default=None) > >>> If int, random_state is the seed used by the random number > generator; > >>> If RandomState instance, random_state is the random number > generator; > >>> If None, the random number generator is the RandomState instance > used > >>> by `np.random`. 
> >>> > >>> > >>> Best, > >>> Sebastian > >>> _______________________________________________ > >>> scikit-learn mailing list > >>> scikit-learn at python.org > >>> https://mail.python.org/mailman/listinfo/scikit-learn > >>> _______________________________________________ > >>> scikit-learn mailing list > >>> scikit-learn at python.org > >>> https://mail.python.org/mailman/listinfo/scikit-learn > >> > >> _______________________________________________ > >> scikit-learn mailing list > >> scikit-learn at python.org > >> https://mail.python.org/mailman/listinfo/scikit-learn > > _______________________________________________ > > scikit-learn mailing list > > scikit-learn at python.org > > https://mail.python.org/mailman/listinfo/scikit-learn > > _______________________________________________ > scikit-learn mailing list > scikit-learn at python.org > https://mail.python.org/mailman/listinfo/scikit-learn > -- Piotr Szyma?ski niedakh at gmail.com -------------- next part -------------- An HTML attachment was scrubbed... URL: From fernando.wittmann at gmail.com Sun Oct 28 04:20:21 2018 From: fernando.wittmann at gmail.com (Fernando Marcos Wittmann) Date: Sun, 28 Oct 2018 10:20:21 +0200 Subject: [scikit-learn] How does the random state influence the decision tree splits? In-Reply-To: References: <2675225F-ABAF-4898-B780-7F77D8E808A7@sebastianraschka.com> <49701ADA-9C16-463C-B78F-8F5F18BCAFA6@sebastianraschka.com> <159253A2-F0B6-4341-90F7-DBABD9A6F04C@sebastianraschka.com> Message-ID: The random_state is used in the splitters: SPLITTERS = SPARSE_SPLITTERS if issparse(X) else DENSE_SPLITTERS splitter = self.splitter if not isinstance(self.splitter, Splitter): splitter = SPLITTERS[self.splitter](criterion, self.max_features_, min_samples_leaf, min_weight_leaf, random_state, self.presort) Which is defined as: DENSE_SPLITTERS = {"best": _splitter.BestSplitter, "random": _splitter.RandomSplitter} SPARSE_SPLITTERS = {"best": _splitter.BestSparseSplitter, "random": _splitter.RandomSparseSplitter} Both 'best' and 'random' uses random states. The DecisionTreeClassifier uses 'best' as default `splitter` parameter. I am not sure how this 'best' strategy was defined. The docs define as "Supported strategies are ?best?. On Sun, Oct 28, 2018 at 9:32 AM Piotr Szyma?ski wrote: > Just a small side note that I've come across with Random Forests which in > the end form an ensemble of Decision Trees. I ran a thousand iterations of > RFs on multi-label data and managed to get a 4-10 percentage points > difference in subset accuracy, depending on the data set, just as a random > effect, while I've seen papers report differences of just a couple pp as > statistically significant after a non-parametric rank test. > > On Sun, Oct 28, 2018 at 7:44 AM Sebastian Raschka < > mail at sebastianraschka.com> wrote: > >> Good suggestion. The trees look different. I.e., there seems to be a tie >> at some point between choosing X[:, 0] <= 4.95 and X[:, 3] <= 1.65 >> >> So, I suspect that the features are shuffled, let's call it X_shuffled. >> Then at some point the max_features are selected, which is by default >> X_shuffled[:, :n_features]. Based on that, if there's a tie between >> impurities for the different features, it's probably selecting the first >> feature in the array among these ties. 
>> >> If this is true (have to look into the code more deeply then) I wonder if >> it would be worthwhile to change the implementation such that the shuffling >> only occurs if max_features < n_feature, because this way we could have >> deterministic behavior for the trees by default, which I'd find more >> intuitive for plain decision trees tbh. >> >> Let me know what you all think. >> >> Best, >> Sebastian >> >> > On Oct 27, 2018, at 11:07 PM, Julio Antonio Soto de Vicente < >> julio at esbet.es> wrote: >> > >> > Hmmm that?s weird... >> > >> > Have you tried to plot the trees (the decision rules) for the tree with >> different seeds, and see if the gain for the first split is the same even >> if the split itself is different? >> > >> > I?d at least try that before diving into the source code... >> > >> > Cheers, >> > >> > -- >> > Julio >> > >> >> El 28 oct 2018, a las 2:24, Sebastian Raschka < >> mail at sebastianraschka.com> escribi?: >> >> >> >> Thanks, Javier, >> >> >> >> however, the max_features is n_features by default. But if you execute >> sth like >> >> >> >> import numpy as np >> >> from sklearn.datasets import load_iris >> >> from sklearn.model_selection import train_test_split >> >> from sklearn.tree import DecisionTreeClassifier >> >> >> >> iris = load_iris() >> >> X, y = iris.data, iris.target >> >> X_train, X_test, y_train, y_test = train_test_split(X, y, >> >> test_size=0.3, >> >> random_state=123, >> >> shuffle=True, >> >> stratify=y) >> >> >> >> for i in range(20): >> >> tree = DecisionTreeClassifier() >> >> tree.fit(X_train, y_train) >> >> print(tree.score(X_test, y_test)) >> >> >> >> >> >> >> >> You will find that the tree will produce different results if you >> don't fix the random seed. I suspect, related to what you said about the >> random feature selection if max_features is not n_features, that there is >> generally some sorting of the features going on, and the different trees >> are then due to tie-breaking if two features have the same information gain? >> >> >> >> Best, >> >> Sebastian >> >> >> >> >> >> >> >>> On Oct 27, 2018, at 6:16 PM, Javier L?pez wrote: >> >>> >> >>> Hi Sebastian, >> >>> >> >>> I think the random state is used to select the features that go into >> each split (look at the `max_features` parameter) >> >>> >> >>> Cheers, >> >>> Javier >> >>> >> >>> On Sun, Oct 28, 2018 at 12:07 AM Sebastian Raschka < >> mail at sebastianraschka.com> wrote: >> >>> Hi all, >> >>> >> >>> when I was implementing a bagging classifier based on scikit-learn's >> DecisionTreeClassifier, I noticed that the results were not deterministic >> and found that this was due to the random_state in the >> DescisionTreeClassifier (which is set to None by default). >> >>> >> >>> I am wondering what exactly this random state is used for? I can >> imaging it being used for resolving ties if the information gain for >> multiple features is the same, or it could be that the feature splits of >> continuous features is different? (I thought the heuristic is to sort the >> features and to consider those feature values next to each associated with >> examples that have different class labels -- but is there maybe some random >> subselection involved?) 
>> >>> >> >>> If someone knows more about this, where the random_state is used, I'd >> be happy to hear it :) >> >>> >> >>> Also, we could then maybe add the info to the >> DecisionTreeClassifier's docstring, which is currently a bit too generic to >> be useful, I think: >> >>> >> >>> >> https://github.com/scikit-learn/scikit-learn/blob/master/sklearn/tree/tree.py >> >>> >> >>> >> >>> random_state : int, RandomState instance or None, optional >> (default=None) >> >>> If int, random_state is the seed used by the random number >> generator; >> >>> If RandomState instance, random_state is the random number >> generator; >> >>> If None, the random number generator is the RandomState >> instance used >> >>> by `np.random`. >> >>> >> >>> >> >>> Best, >> >>> Sebastian >> >>> _______________________________________________ >> >>> scikit-learn mailing list >> >>> scikit-learn at python.org >> >>> https://mail.python.org/mailman/listinfo/scikit-learn >> >>> _______________________________________________ >> >>> scikit-learn mailing list >> >>> scikit-learn at python.org >> >>> https://mail.python.org/mailman/listinfo/scikit-learn >> >> >> >> _______________________________________________ >> >> scikit-learn mailing list >> >> scikit-learn at python.org >> >> https://mail.python.org/mailman/listinfo/scikit-learn >> > _______________________________________________ >> > scikit-learn mailing list >> > scikit-learn at python.org >> > https://mail.python.org/mailman/listinfo/scikit-learn >> >> _______________________________________________ >> scikit-learn mailing list >> scikit-learn at python.org >> https://mail.python.org/mailman/listinfo/scikit-learn >> > > > -- > Piotr Szyma?ski > niedakh at gmail.com > _______________________________________________ > scikit-learn mailing list > scikit-learn at python.org > https://mail.python.org/mailman/listinfo/scikit-learn > -- Fernando Marcos Wittmann MS Student - Energy Systems Dept. School of Electrical and Computer Engineering, FEEC University of Campinas, UNICAMP, Brazil +55 (19) 987-211302 -------------- next part -------------- An HTML attachment was scrubbed... URL: From louis.abraham at yahoo.fr Sun Oct 28 04:29:21 2018 From: louis.abraham at yahoo.fr (Louis Abraham) Date: Sun, 28 Oct 2018 09:29:21 +0100 Subject: [scikit-learn] Question about get_params / set_params Message-ID: Hi, According to http://scikit-learn.org/0.16/developers/index.html#get-params-and-set-params , get_params and set_params are used to clone estimators. However, I don't understand how it is used in FeatureUnion: `return self._get_params('transformer_list', deep=deep)` Why doesn't it contain other arguments like n_jobs and transformer_weights? Best Louis -------------- next part -------------- An HTML attachment was scrubbed... URL: From g.lemaitre58 at gmail.com Sun Oct 28 04:32:25 2018 From: g.lemaitre58 at gmail.com (=?UTF-8?Q?Guillaume_Lema=C3=AEtre?=) Date: Sun, 28 Oct 2018 09:32:25 +0100 Subject: [scikit-learn] How does the random state influence the decision tree splits? In-Reply-To: References: <2675225F-ABAF-4898-B780-7F77D8E808A7@sebastianraschka.com> <49701ADA-9C16-463C-B78F-8F5F18BCAFA6@sebastianraschka.com> <159253A2-F0B6-4341-90F7-DBABD9A6F04C@sebastianraschka.com> Message-ID: There is always a shuffling when iteration over the features (even when going to all features). So in the case of a tie the split will be done on the first feature encounter which will be different due to the shuffling. 
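A tiny way to see that behaviour is to build an exact tie on purpose, for example by duplicating a column, and watch which copy the root split lands on under different seeds. This is only a sketch; the exact counts observed depend on the RNG draws.

import numpy as np
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)
X_dup = np.hstack([X, X[:, [2]]])   # last column is an exact copy of column 2, guaranteeing a tie

counts = {}
for seed in range(50):
    root = DecisionTreeClassifier(random_state=seed).fit(X_dup, y).tree_.feature[0]
    counts[root] = counts.get(root, 0) + 1
print(counts)   # with ties, different seeds can pick different (equivalent) columns for the root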
There is a PR which was intending to make the algorithm deterministic to always select the same feature in the case of tie. On Sun, 28 Oct 2018 at 09:22, Fernando Marcos Wittmann < fernando.wittmann at gmail.com> wrote: > The random_state is used in the splitters: > > SPLITTERS = SPARSE_SPLITTERS if issparse(X) else DENSE_SPLITTERS > > splitter = self.splitter > if not isinstance(self.splitter, Splitter): > splitter = SPLITTERS[self.splitter](criterion, > self.max_features_, > min_samples_leaf, > min_weight_leaf, > random_state, > self.presort) > > Which is defined as: > > DENSE_SPLITTERS = {"best": _splitter.BestSplitter, > "random": _splitter.RandomSplitter} > > SPARSE_SPLITTERS = {"best": _splitter.BestSparseSplitter, > "random": _splitter.RandomSparseSplitter} > > > Both 'best' and 'random' uses random states. The DecisionTreeClassifier > uses 'best' as default `splitter` parameter. I am not sure how this 'best' > strategy was defined. The docs define as "Supported strategies are ?best?. > > > > > On Sun, Oct 28, 2018 at 9:32 AM Piotr Szyma?ski wrote: > >> Just a small side note that I've come across with Random Forests which in >> the end form an ensemble of Decision Trees. I ran a thousand iterations of >> RFs on multi-label data and managed to get a 4-10 percentage points >> difference in subset accuracy, depending on the data set, just as a random >> effect, while I've seen papers report differences of just a couple pp as >> statistically significant after a non-parametric rank test. >> >> On Sun, Oct 28, 2018 at 7:44 AM Sebastian Raschka < >> mail at sebastianraschka.com> wrote: >> >>> Good suggestion. The trees look different. I.e., there seems to be a tie >>> at some point between choosing X[:, 0] <= 4.95 and X[:, 3] <= 1.65 >>> >>> So, I suspect that the features are shuffled, let's call it X_shuffled. >>> Then at some point the max_features are selected, which is by default >>> X_shuffled[:, :n_features]. Based on that, if there's a tie between >>> impurities for the different features, it's probably selecting the first >>> feature in the array among these ties. >>> >>> If this is true (have to look into the code more deeply then) I wonder >>> if it would be worthwhile to change the implementation such that the >>> shuffling only occurs if max_features < n_feature, because this way we >>> could have deterministic behavior for the trees by default, which I'd find >>> more intuitive for plain decision trees tbh. >>> >>> Let me know what you all think. >>> >>> Best, >>> Sebastian >>> >>> > On Oct 27, 2018, at 11:07 PM, Julio Antonio Soto de Vicente < >>> julio at esbet.es> wrote: >>> > >>> > Hmmm that?s weird... >>> > >>> > Have you tried to plot the trees (the decision rules) for the tree >>> with different seeds, and see if the gain for the first split is the same >>> even if the split itself is different? >>> > >>> > I?d at least try that before diving into the source code... >>> > >>> > Cheers, >>> > >>> > -- >>> > Julio >>> > >>> >> El 28 oct 2018, a las 2:24, Sebastian Raschka < >>> mail at sebastianraschka.com> escribi?: >>> >> >>> >> Thanks, Javier, >>> >> >>> >> however, the max_features is n_features by default. 
But if you >>> execute sth like >>> >> >>> >> import numpy as np >>> >> from sklearn.datasets import load_iris >>> >> from sklearn.model_selection import train_test_split >>> >> from sklearn.tree import DecisionTreeClassifier >>> >> >>> >> iris = load_iris() >>> >> X, y = iris.data, iris.target >>> >> X_train, X_test, y_train, y_test = train_test_split(X, y, >>> >> test_size=0.3, >>> >> random_state=123, >>> >> shuffle=True, >>> >> stratify=y) >>> >> >>> >> for i in range(20): >>> >> tree = DecisionTreeClassifier() >>> >> tree.fit(X_train, y_train) >>> >> print(tree.score(X_test, y_test)) >>> >> >>> >> >>> >> >>> >> You will find that the tree will produce different results if you >>> don't fix the random seed. I suspect, related to what you said about the >>> random feature selection if max_features is not n_features, that there is >>> generally some sorting of the features going on, and the different trees >>> are then due to tie-breaking if two features have the same information gain? >>> >> >>> >> Best, >>> >> Sebastian >>> >> >>> >> >>> >> >>> >>> On Oct 27, 2018, at 6:16 PM, Javier L?pez wrote: >>> >>> >>> >>> Hi Sebastian, >>> >>> >>> >>> I think the random state is used to select the features that go into >>> each split (look at the `max_features` parameter) >>> >>> >>> >>> Cheers, >>> >>> Javier >>> >>> >>> >>> On Sun, Oct 28, 2018 at 12:07 AM Sebastian Raschka < >>> mail at sebastianraschka.com> wrote: >>> >>> Hi all, >>> >>> >>> >>> when I was implementing a bagging classifier based on scikit-learn's >>> DecisionTreeClassifier, I noticed that the results were not deterministic >>> and found that this was due to the random_state in the >>> DescisionTreeClassifier (which is set to None by default). >>> >>> >>> >>> I am wondering what exactly this random state is used for? I can >>> imaging it being used for resolving ties if the information gain for >>> multiple features is the same, or it could be that the feature splits of >>> continuous features is different? (I thought the heuristic is to sort the >>> features and to consider those feature values next to each associated with >>> examples that have different class labels -- but is there maybe some random >>> subselection involved?) >>> >>> >>> >>> If someone knows more about this, where the random_state is used, >>> I'd be happy to hear it :) >>> >>> >>> >>> Also, we could then maybe add the info to the >>> DecisionTreeClassifier's docstring, which is currently a bit too generic to >>> be useful, I think: >>> >>> >>> >>> >>> https://github.com/scikit-learn/scikit-learn/blob/master/sklearn/tree/tree.py >>> >>> >>> >>> >>> >>> random_state : int, RandomState instance or None, optional >>> (default=None) >>> >>> If int, random_state is the seed used by the random number >>> generator; >>> >>> If RandomState instance, random_state is the random number >>> generator; >>> >>> If None, the random number generator is the RandomState >>> instance used >>> >>> by `np.random`. 
>>> >>> >>> >>> >>> >>> Best, >>> >>> Sebastian >>> >>> _______________________________________________ >>> >>> scikit-learn mailing list >>> >>> scikit-learn at python.org >>> >>> https://mail.python.org/mailman/listinfo/scikit-learn >>> >>> _______________________________________________ >>> >>> scikit-learn mailing list >>> >>> scikit-learn at python.org >>> >>> https://mail.python.org/mailman/listinfo/scikit-learn >>> >> >>> >> _______________________________________________ >>> >> scikit-learn mailing list >>> >> scikit-learn at python.org >>> >> https://mail.python.org/mailman/listinfo/scikit-learn >>> > _______________________________________________ >>> > scikit-learn mailing list >>> > scikit-learn at python.org >>> > https://mail.python.org/mailman/listinfo/scikit-learn >>> >>> _______________________________________________ >>> scikit-learn mailing list >>> scikit-learn at python.org >>> https://mail.python.org/mailman/listinfo/scikit-learn >>> >> >> >> -- >> Piotr Szyma?ski >> niedakh at gmail.com >> _______________________________________________ >> scikit-learn mailing list >> scikit-learn at python.org >> https://mail.python.org/mailman/listinfo/scikit-learn >> > > > -- > > Fernando Marcos Wittmann > MS Student - Energy Systems Dept. > School of Electrical and Computer Engineering, FEEC > University of Campinas, UNICAMP, Brazil > +55 (19) 987-211302 > > _______________________________________________ > scikit-learn mailing list > scikit-learn at python.org > https://mail.python.org/mailman/listinfo/scikit-learn > -- Guillaume Lemaitre INRIA Saclay - Parietal team Center for Data Science Paris-Saclay https://glemaitre.github.io/ -------------- next part -------------- An HTML attachment was scrubbed... URL: From g.lemaitre58 at gmail.com Sun Oct 28 04:34:41 2018 From: g.lemaitre58 at gmail.com (=?UTF-8?Q?Guillaume_Lema=C3=AEtre?=) Date: Sun, 28 Oct 2018 09:34:41 +0100 Subject: [scikit-learn] How does the random state influence the decision tree splits? In-Reply-To: References: <2675225F-ABAF-4898-B780-7F77D8E808A7@sebastianraschka.com> <49701ADA-9C16-463C-B78F-8F5F18BCAFA6@sebastianraschka.com> <159253A2-F0B6-4341-90F7-DBABD9A6F04C@sebastianraschka.com> Message-ID: FYI: https://github.com/scikit-learn/scikit-learn/pull/12364 On Sun, 28 Oct 2018 at 09:32, Guillaume Lema?tre wrote: > There is always a shuffling when iteration over the features (even when > going to all features). > So in the case of a tie the split will be done on the first feature > encounter which will be different due to the shuffling. > > There is a PR which was intending to make the algorithm deterministic to > always select the same feature in the case of tie. > > On Sun, 28 Oct 2018 at 09:22, Fernando Marcos Wittmann < > fernando.wittmann at gmail.com> wrote: > >> The random_state is used in the splitters: >> >> SPLITTERS = SPARSE_SPLITTERS if issparse(X) else DENSE_SPLITTERS >> >> splitter = self.splitter >> if not isinstance(self.splitter, Splitter): >> splitter = SPLITTERS[self.splitter](criterion, >> self.max_features_, >> min_samples_leaf, >> min_weight_leaf, >> random_state, >> self.presort) >> >> Which is defined as: >> >> DENSE_SPLITTERS = {"best": _splitter.BestSplitter, >> "random": _splitter.RandomSplitter} >> >> SPARSE_SPLITTERS = {"best": _splitter.BestSparseSplitter, >> "random": _splitter.RandomSparseSplitter} >> >> >> Both 'best' and 'random' uses random states. The DecisionTreeClassifier >> uses 'best' as default `splitter` parameter. 
I am not sure how this 'best' >> strategy was defined. The docs define as "Supported strategies are ?best?. >> >> >> >> >> On Sun, Oct 28, 2018 at 9:32 AM Piotr Szyma?ski >> wrote: >> >>> Just a small side note that I've come across with Random Forests which >>> in the end form an ensemble of Decision Trees. I ran a thousand iterations >>> of RFs on multi-label data and managed to get a 4-10 percentage points >>> difference in subset accuracy, depending on the data set, just as a random >>> effect, while I've seen papers report differences of just a couple pp as >>> statistically significant after a non-parametric rank test. >>> >>> On Sun, Oct 28, 2018 at 7:44 AM Sebastian Raschka < >>> mail at sebastianraschka.com> wrote: >>> >>>> Good suggestion. The trees look different. I.e., there seems to be a >>>> tie at some point between choosing X[:, 0] <= 4.95 and X[:, 3] <= 1.65 >>>> >>>> So, I suspect that the features are shuffled, let's call it X_shuffled. >>>> Then at some point the max_features are selected, which is by default >>>> X_shuffled[:, :n_features]. Based on that, if there's a tie between >>>> impurities for the different features, it's probably selecting the first >>>> feature in the array among these ties. >>>> >>>> If this is true (have to look into the code more deeply then) I wonder >>>> if it would be worthwhile to change the implementation such that the >>>> shuffling only occurs if max_features < n_feature, because this way we >>>> could have deterministic behavior for the trees by default, which I'd find >>>> more intuitive for plain decision trees tbh. >>>> >>>> Let me know what you all think. >>>> >>>> Best, >>>> Sebastian >>>> >>>> > On Oct 27, 2018, at 11:07 PM, Julio Antonio Soto de Vicente < >>>> julio at esbet.es> wrote: >>>> > >>>> > Hmmm that?s weird... >>>> > >>>> > Have you tried to plot the trees (the decision rules) for the tree >>>> with different seeds, and see if the gain for the first split is the same >>>> even if the split itself is different? >>>> > >>>> > I?d at least try that before diving into the source code... >>>> > >>>> > Cheers, >>>> > >>>> > -- >>>> > Julio >>>> > >>>> >> El 28 oct 2018, a las 2:24, Sebastian Raschka < >>>> mail at sebastianraschka.com> escribi?: >>>> >> >>>> >> Thanks, Javier, >>>> >> >>>> >> however, the max_features is n_features by default. But if you >>>> execute sth like >>>> >> >>>> >> import numpy as np >>>> >> from sklearn.datasets import load_iris >>>> >> from sklearn.model_selection import train_test_split >>>> >> from sklearn.tree import DecisionTreeClassifier >>>> >> >>>> >> iris = load_iris() >>>> >> X, y = iris.data, iris.target >>>> >> X_train, X_test, y_train, y_test = train_test_split(X, y, >>>> >> test_size=0.3, >>>> >> random_state=123, >>>> >> shuffle=True, >>>> >> stratify=y) >>>> >> >>>> >> for i in range(20): >>>> >> tree = DecisionTreeClassifier() >>>> >> tree.fit(X_train, y_train) >>>> >> print(tree.score(X_test, y_test)) >>>> >> >>>> >> >>>> >> >>>> >> You will find that the tree will produce different results if you >>>> don't fix the random seed. I suspect, related to what you said about the >>>> random feature selection if max_features is not n_features, that there is >>>> generally some sorting of the features going on, and the different trees >>>> are then due to tie-breaking if two features have the same information gain? 
>>>> >> >>>> >> Best, >>>> >> Sebastian >>>> >> >>>> >> >>>> >> >>>> >>> On Oct 27, 2018, at 6:16 PM, Javier L?pez wrote: >>>> >>> >>>> >>> Hi Sebastian, >>>> >>> >>>> >>> I think the random state is used to select the features that go >>>> into each split (look at the `max_features` parameter) >>>> >>> >>>> >>> Cheers, >>>> >>> Javier >>>> >>> >>>> >>> On Sun, Oct 28, 2018 at 12:07 AM Sebastian Raschka < >>>> mail at sebastianraschka.com> wrote: >>>> >>> Hi all, >>>> >>> >>>> >>> when I was implementing a bagging classifier based on >>>> scikit-learn's DecisionTreeClassifier, I noticed that the results were not >>>> deterministic and found that this was due to the random_state in the >>>> DescisionTreeClassifier (which is set to None by default). >>>> >>> >>>> >>> I am wondering what exactly this random state is used for? I can >>>> imaging it being used for resolving ties if the information gain for >>>> multiple features is the same, or it could be that the feature splits of >>>> continuous features is different? (I thought the heuristic is to sort the >>>> features and to consider those feature values next to each associated with >>>> examples that have different class labels -- but is there maybe some random >>>> subselection involved?) >>>> >>> >>>> >>> If someone knows more about this, where the random_state is used, >>>> I'd be happy to hear it :) >>>> >>> >>>> >>> Also, we could then maybe add the info to the >>>> DecisionTreeClassifier's docstring, which is currently a bit too generic to >>>> be useful, I think: >>>> >>> >>>> >>> >>>> https://github.com/scikit-learn/scikit-learn/blob/master/sklearn/tree/tree.py >>>> >>> >>>> >>> >>>> >>> random_state : int, RandomState instance or None, optional >>>> (default=None) >>>> >>> If int, random_state is the seed used by the random number >>>> generator; >>>> >>> If RandomState instance, random_state is the random number >>>> generator; >>>> >>> If None, the random number generator is the RandomState >>>> instance used >>>> >>> by `np.random`. >>>> >>> >>>> >>> >>>> >>> Best, >>>> >>> Sebastian >>>> >>> _______________________________________________ >>>> >>> scikit-learn mailing list >>>> >>> scikit-learn at python.org >>>> >>> https://mail.python.org/mailman/listinfo/scikit-learn >>>> >>> _______________________________________________ >>>> >>> scikit-learn mailing list >>>> >>> scikit-learn at python.org >>>> >>> https://mail.python.org/mailman/listinfo/scikit-learn >>>> >> >>>> >> _______________________________________________ >>>> >> scikit-learn mailing list >>>> >> scikit-learn at python.org >>>> >> https://mail.python.org/mailman/listinfo/scikit-learn >>>> > _______________________________________________ >>>> > scikit-learn mailing list >>>> > scikit-learn at python.org >>>> > https://mail.python.org/mailman/listinfo/scikit-learn >>>> >>>> _______________________________________________ >>>> scikit-learn mailing list >>>> scikit-learn at python.org >>>> https://mail.python.org/mailman/listinfo/scikit-learn >>>> >>> >>> >>> -- >>> Piotr Szyma?ski >>> niedakh at gmail.com >>> _______________________________________________ >>> scikit-learn mailing list >>> scikit-learn at python.org >>> https://mail.python.org/mailman/listinfo/scikit-learn >>> >> >> >> -- >> >> Fernando Marcos Wittmann >> MS Student - Energy Systems Dept. 
>> School of Electrical and Computer Engineering, FEEC >> University of Campinas, UNICAMP, Brazil >> +55 (19) 987-211302 >> >> _______________________________________________ >> scikit-learn mailing list >> scikit-learn at python.org >> https://mail.python.org/mailman/listinfo/scikit-learn >> > > > -- > Guillaume Lemaitre > INRIA Saclay - Parietal team > Center for Data Science Paris-Saclay > https://glemaitre.github.io/ > -- Guillaume Lemaitre INRIA Saclay - Parietal team Center for Data Science Paris-Saclay https://glemaitre.github.io/ -------------- next part -------------- An HTML attachment was scrubbed... URL: From g.lemaitre58 at gmail.com Sun Oct 28 05:37:57 2018 From: g.lemaitre58 at gmail.com (=?UTF-8?Q?Guillaume_Lema=C3=AEtre?=) Date: Sun, 28 Oct 2018 10:37:57 +0100 Subject: [scikit-learn] Question about get_params / set_params In-Reply-To: References: Message-ID: On Sun, 28 Oct 2018 at 09:31, Louis Abraham via scikit-learn < scikit-learn at python.org> wrote: > Hi, > > According to > http://scikit-learn.org/0.16/developers/index.html#get-params-and-set-params > , > get_params and set_params are used to clone estimators. > sklearn.base.clone is function used for cloning. get_params and set_params are accessors to attributes of an estimator and are defined by BaseEstimator. For Pipeline and FeatureUnion, those accessors rely on the _BaseComposition which manage the access to attributes to the sub-estimators. > However, I don't understand how it is used in FeatureUnion: > `return self._get_params('transformer_list', deep=deep)` > transformer_list contain all the estimators used in the FeatureUnion, and the _BaseComposition allow you to access the parameters of each transformer. > > Why doesn't it contain other arguments like n_jobs and transformer_weights? > The first line in _get_params in _BaseCompositin will list the attributes of FeatureUnion; https://github.com/scikit-learn/scikit-learn/blob/06ac22d06f54353ea5d5bba244371474c7baf938/sklearn/utils/metaestimators.py#L26 For instance: In [5]: trans = FeatureUnion([('trans1', StandardScaler()), ('trans2', MinMaxScaler())]) In [6]: trans.get_params() Out[6]: {'n_jobs': None, 'transformer_list': [('trans1', StandardScaler(copy=True, with_mean=True, with_std=True)), ('trans2', MinMaxScaler(copy=True, feature_range=(0, 1)))], 'transformer_weights': None, 'trans1': StandardScaler(copy=True, with_mean=True, with_std=True), 'trans2': MinMaxScaler(copy=True, feature_range=(0, 1)), 'trans1__copy': True, 'trans1__with_mean': True, 'trans1__with_std': True, 'trans2__copy': True, 'trans2__feature_range': (0, 1)} Then, n_jobs and transformer_weights are accessible. > > Best > Louis > > _______________________________________________ > scikit-learn mailing list > scikit-learn at python.org > https://mail.python.org/mailman/listinfo/scikit-learn > -- Guillaume Lemaitre INRIA Saclay - Parietal team Center for Data Science Paris-Saclay https://glemaitre.github.io/ -------------- next part -------------- An HTML attachment was scrubbed... 
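As a small illustrative follow-up to the FeatureUnion example above (the parameter
values here are arbitrary): the double-underscore names returned by get_params are
the same ones accepted by set_params, which is how grid searches reach into the
sub-estimators of a composite estimator:

from sklearn.pipeline import FeatureUnion
from sklearn.preprocessing import MinMaxScaler, StandardScaler

trans = FeatureUnion([('trans1', StandardScaler()), ('trans2', MinMaxScaler())])

# The nested names exposed by get_params() can be written back with set_params().
trans.set_params(trans1__with_mean=False, trans2__feature_range=(-1, 1))
print(trans.get_params()['trans1__with_mean'])      # False
print(trans.get_params()['trans2__feature_range'])  # (-1, 1)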
URL: From g.lemaitre58 at gmail.com Sun Oct 28 05:44:54 2018 From: g.lemaitre58 at gmail.com (=?UTF-8?Q?Guillaume_Lema=C3=AEtre?=) Date: Sun, 28 Oct 2018 10:44:54 +0100 Subject: [scikit-learn] Strange code but that works In-Reply-To: <44746C14-07A8-4FEF-BFF1-1706F9D7CDAE@yahoo.fr> References: <44746C14-07A8-4FEF-BFF1-1706F9D7CDAE@yahoo.fr> Message-ID: On Sun, 28 Oct 2018 at 07:42, Louis Abraham via scikit-learn < scikit-learn at python.org> wrote: > Hi, > > This is a code from sklearn.pipeline.Pipeline: > @property > def transform(self): > """Apply transforms, and transform with the final estimator > > This also works where final estimator is ``None``: all prior > transformations are applied. > > Parameters > ---------- > X : iterable > Data to transform. Must fulfill input requirements of first step > of the pipeline. > > Returns > ------- > Xt : array-like, shape = [n_samples, n_transformed_features] > """ > # _final_estimator is None or has transform, otherwise attribute error > # XXX: Handling the None case means we can't use if_delegate_has_method > if self._final_estimator is not None: > self._final_estimator.transform > return self._transform > > I don't understand why `self._final_estimator.transform` can be returned, > ignoring all the previous transformers. > It is not returned. It is called such that if the final estimator does not implement a transform method then it will raise an error. Otherwise, _transform is called, which is actually perform all the transform of all transformer (except the one that are set to None) This is actually what the comment is referring to above (_final_estimator is None or has transform, otherwise attribute error). > However, when testing it works: > > ``` > >>> p = make_pipeline(FunctionTransformer(lambda x: 2*x), > FunctionTransformer(lambda x: x-1)) > >>> p.transform(np.array([[1,2]])) > array([[1, 3]]) > ``` > > Could somebody explain that to me? > > Best, > Louis Abraham > > _______________________________________________ > scikit-learn mailing list > scikit-learn at python.org > https://mail.python.org/mailman/listinfo/scikit-learn > -- Guillaume Lemaitre INRIA Saclay - Parietal team Center for Data Science Paris-Saclay https://glemaitre.github.io/ -------------- next part -------------- An HTML attachment was scrubbed... URL: From mail at sebastianraschka.com Sun Oct 28 12:03:09 2018 From: mail at sebastianraschka.com (Sebastian Raschka) Date: Sun, 28 Oct 2018 11:03:09 -0500 Subject: [scikit-learn] How does the random state influence the decision tree splits? In-Reply-To: References: <2675225F-ABAF-4898-B780-7F77D8E808A7@sebastianraschka.com> <49701ADA-9C16-463C-B78F-8F5F18BCAFA6@sebastianraschka.com> <159253A2-F0B6-4341-90F7-DBABD9A6F04C@sebastianraschka.com> Message-ID: <58E9A9FC-BD53-4F5D-86DE-C55046C144D4@sebastianraschka.com> That's nice to know, thanks a lot for the reference! Best, Sebastian > On Oct 28, 2018, at 3:34 AM, Guillaume Lema?tre wrote: > > FYI: https://github.com/scikit-learn/scikit-learn/pull/12364 > > On Sun, 28 Oct 2018 at 09:32, Guillaume Lema?tre wrote: > There is always a shuffling when iteration over the features (even when going to all features). > So in the case of a tie the split will be done on the first feature encounter which will be different due to the shuffling. > > There is a PR which was intending to make the algorithm deterministic to always select the same feature in the case of tie. 
> > On Sun, 28 Oct 2018 at 09:22, Fernando Marcos Wittmann wrote: > The random_state is used in the splitters: > > SPLITTERS = SPARSE_SPLITTERS if issparse(X) else DENSE_SPLITTERS > > splitter = self.splitter > if not isinstance(self.splitter, Splitter): > splitter = SPLITTERS[self.splitter](criterion, > self.max_features_, > min_samples_leaf, > min_weight_leaf, > random_state, > self.presort) > > Which is defined as: > > DENSE_SPLITTERS = {"best": _splitter.BestSplitter, > "random": _splitter.RandomSplitter} > > SPARSE_SPLITTERS = {"best": _splitter.BestSparseSplitter, > "random": _splitter.RandomSparseSplitter} > > Both 'best' and 'random' uses random states. The DecisionTreeClassifier uses 'best' as default `splitter` parameter. I am not sure how this 'best' strategy was defined. The docs define as "Supported strategies are ?best?. > > > > > On Sun, Oct 28, 2018 at 9:32 AM Piotr Szyma?ski wrote: > Just a small side note that I've come across with Random Forests which in the end form an ensemble of Decision Trees. I ran a thousand iterations of RFs on multi-label data and managed to get a 4-10 percentage points difference in subset accuracy, depending on the data set, just as a random effect, while I've seen papers report differences of just a couple pp as statistically significant after a non-parametric rank test. > > On Sun, Oct 28, 2018 at 7:44 AM Sebastian Raschka wrote: > Good suggestion. The trees look different. I.e., there seems to be a tie at some point between choosing X[:, 0] <= 4.95 and X[:, 3] <= 1.65 > > So, I suspect that the features are shuffled, let's call it X_shuffled. Then at some point the max_features are selected, which is by default X_shuffled[:, :n_features]. Based on that, if there's a tie between impurities for the different features, it's probably selecting the first feature in the array among these ties. > > If this is true (have to look into the code more deeply then) I wonder if it would be worthwhile to change the implementation such that the shuffling only occurs if max_features < n_feature, because this way we could have deterministic behavior for the trees by default, which I'd find more intuitive for plain decision trees tbh. > > Let me know what you all think. > > Best, > Sebastian > > > On Oct 27, 2018, at 11:07 PM, Julio Antonio Soto de Vicente wrote: > > > > Hmmm that?s weird... > > > > Have you tried to plot the trees (the decision rules) for the tree with different seeds, and see if the gain for the first split is the same even if the split itself is different? > > > > I?d at least try that before diving into the source code... > > > > Cheers, > > > > -- > > Julio > > > >> El 28 oct 2018, a las 2:24, Sebastian Raschka escribi?: > >> > >> Thanks, Javier, > >> > >> however, the max_features is n_features by default. But if you execute sth like > >> > >> import numpy as np > >> from sklearn.datasets import load_iris > >> from sklearn.model_selection import train_test_split > >> from sklearn.tree import DecisionTreeClassifier > >> > >> iris = load_iris() > >> X, y = iris.data, iris.target > >> X_train, X_test, y_train, y_test = train_test_split(X, y, > >> test_size=0.3, > >> random_state=123, > >> shuffle=True, > >> stratify=y) > >> > >> for i in range(20): > >> tree = DecisionTreeClassifier() > >> tree.fit(X_train, y_train) > >> print(tree.score(X_test, y_test)) > >> > >> > >> > >> You will find that the tree will produce different results if you don't fix the random seed. 
I suspect, related to what you said about the random feature selection if max_features is not n_features, that there is generally some sorting of the features going on, and the different trees are then due to tie-breaking if two features have the same information gain? > >> > >> Best, > >> Sebastian > >> > >> > >> > >>> On Oct 27, 2018, at 6:16 PM, Javier L?pez wrote: > >>> > >>> Hi Sebastian, > >>> > >>> I think the random state is used to select the features that go into each split (look at the `max_features` parameter) > >>> > >>> Cheers, > >>> Javier > >>> > >>> On Sun, Oct 28, 2018 at 12:07 AM Sebastian Raschka wrote: > >>> Hi all, > >>> > >>> when I was implementing a bagging classifier based on scikit-learn's DecisionTreeClassifier, I noticed that the results were not deterministic and found that this was due to the random_state in the DescisionTreeClassifier (which is set to None by default). > >>> > >>> I am wondering what exactly this random state is used for? I can imaging it being used for resolving ties if the information gain for multiple features is the same, or it could be that the feature splits of continuous features is different? (I thought the heuristic is to sort the features and to consider those feature values next to each associated with examples that have different class labels -- but is there maybe some random subselection involved?) > >>> > >>> If someone knows more about this, where the random_state is used, I'd be happy to hear it :) > >>> > >>> Also, we could then maybe add the info to the DecisionTreeClassifier's docstring, which is currently a bit too generic to be useful, I think: > >>> > >>> https://github.com/scikit-learn/scikit-learn/blob/master/sklearn/tree/tree.py > >>> > >>> > >>> random_state : int, RandomState instance or None, optional (default=None) > >>> If int, random_state is the seed used by the random number generator; > >>> If RandomState instance, random_state is the random number generator; > >>> If None, the random number generator is the RandomState instance used > >>> by `np.random`. > >>> > >>> > >>> Best, > >>> Sebastian > >>> _______________________________________________ > >>> scikit-learn mailing list > >>> scikit-learn at python.org > >>> https://mail.python.org/mailman/listinfo/scikit-learn > >>> _______________________________________________ > >>> scikit-learn mailing list > >>> scikit-learn at python.org > >>> https://mail.python.org/mailman/listinfo/scikit-learn > >> > >> _______________________________________________ > >> scikit-learn mailing list > >> scikit-learn at python.org > >> https://mail.python.org/mailman/listinfo/scikit-learn > > _______________________________________________ > > scikit-learn mailing list > > scikit-learn at python.org > > https://mail.python.org/mailman/listinfo/scikit-learn > > _______________________________________________ > scikit-learn mailing list > scikit-learn at python.org > https://mail.python.org/mailman/listinfo/scikit-learn > > > -- > Piotr Szyma?ski > niedakh at gmail.com > _______________________________________________ > scikit-learn mailing list > scikit-learn at python.org > https://mail.python.org/mailman/listinfo/scikit-learn > > > -- > > Fernando Marcos Wittmann > MS Student - Energy Systems Dept. 
> School of Electrical and Computer Engineering, FEEC > University of Campinas, UNICAMP, Brazil > +55 (19) 987-211302 > > _______________________________________________ > scikit-learn mailing list > scikit-learn at python.org > https://mail.python.org/mailman/listinfo/scikit-learn > > > -- > Guillaume Lemaitre > INRIA Saclay - Parietal team > Center for Data Science Paris-Saclay > https://glemaitre.github.io/ > > > -- > Guillaume Lemaitre > INRIA Saclay - Parietal team > Center for Data Science Paris-Saclay > https://glemaitre.github.io/ > _______________________________________________ > scikit-learn mailing list > scikit-learn at python.org > https://mail.python.org/mailman/listinfo/scikit-learn From joel.nothman at gmail.com Sun Oct 28 18:48:56 2018 From: joel.nothman at gmail.com (Joel Nothman) Date: Mon, 29 Oct 2018 09:48:56 +1100 Subject: [scikit-learn] Strange code but that works In-Reply-To: <44746C14-07A8-4FEF-BFF1-1706F9D7CDAE@yahoo.fr> References: <44746C14-07A8-4FEF-BFF1-1706F9D7CDAE@yahoo.fr> Message-ID: Be careful: that @property is very significant here. It means that this is a description of how to *get* the method, not how to *run* the method. You will notice, for instance, that it says `def transform(self)`, not `def transform(self, X)` -------------- next part -------------- An HTML attachment was scrubbed... URL: From t3kcit at gmail.com Sun Oct 28 22:13:36 2018 From: t3kcit at gmail.com (Andreas Mueller) Date: Sun, 28 Oct 2018 22:13:36 -0400 Subject: [scikit-learn] Pipegraph example: KMeans + LDA In-Reply-To: References: Message-ID: <681604c2-9f15-6692-682c-728f81e1d2ef@gmail.com> On 10/24/18 4:11 AM, Manuel Castej?n Limas wrote: > Dear all, > as a way of improving the documentation of PipeGraph we intend to > provide more examples of its usage. It was a popular demand to show > application cases to motivate its usage, so here it is a very simple > case with two steps: a KMeans followed by a LDA. > > https://mcasl.github.io/PipeGraph/auto_examples/plot_Finding_Number_of_clusters.html#sphx-glr-auto-examples-plot-finding-number-of-clusters-py > > This short example points out the following challenges: > - KMeans is not a transformer but an estimator KMeans is a transformer in sklearn: http://scikit-learn.org/stable/modules/generated/sklearn.cluster.KMeans.html#sklearn.cluster.KMeans.transform (you can't get the labels to be the output which is what you're doing here, but it is a transformer) > - LDA score function requires the y parameter, while its input does > not come from a known set of labels, but from the previous KMeans > - Moreover, the GridSearchCV.fit call would also require?a 'y' parameter Not true if you provide a scoring that doesn't require y or if you don't specify scoring and the scoring method of the estimator doesn't require y. GridSearchCV.fit doesn't require y. > - It would be nice to have access to the output of the KMeans step as > well. > > PipeGraph?is capable of addressing these challenges. > > The rationale for this example lies in the > identification-reconstruction realm. In a scenario where the class > labels are unknown, we might want to associate the quality of the > clustering structure to the capability of a later model to be able to > reconstruct this structure. So the basic idea here is that if LDA is > capable of getting good results it was because the information of the > KMeans was good enough for that purpose, hinting the discovery of a > good structure. > Can you provide a citation for that? 
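For readers who want the gist of the linked example in plain scikit-learn terms, a
rough sketch of the idea is below. It is not the PipeGraph code itself, iris is only
a placeholder dataset, and whether this score is a sound criterion for choosing the
number of clusters is exactly the question raised in the reply that follows. Note
that KMeans.fit_predict is used to obtain the cluster labels, since KMeans.transform
returns distances to the cluster centers rather than labels:

from sklearn.cluster import KMeans
from sklearn.datasets import load_iris
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.model_selection import cross_val_score

X, _ = load_iris(return_X_y=True)

# Score each candidate number of clusters by how well LDA can re-learn the
# labels produced by KMeans (higher = easier to "reconstruct").
for n_clusters in range(2, 7):
    labels = KMeans(n_clusters=n_clusters, random_state=0).fit_predict(X)
    score = cross_val_score(LinearDiscriminantAnalysis(), X, labels, cv=3).mean()
    print(n_clusters, round(score, 3))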
That seems to heavily depend on the clustering algorithms and the classifier. To me, stability scoring seems more natural: https://arxiv.org/abs/1007.1075 This does seem interesting as well, though, haven't thought about this. It's cool that this is possible, but I feel this is still not really a "killer application" in that this is not a very common pattern. Also you could replicate something similar in sklearn with def estimator_scorer(testing_estimator): ??? def my_scorer(estimator, X, y=None): ??? ??? y = estimator.predict(X) ??????? return np.mean(cross_val_score(testing_estimator, X, y)) Though using that we'd be doing nested cross-validation on the test set... That's a bit of an issue in the current GridSearchCV implementation :-/ There's an issue by Joel somewhere to implement something that allows training without splitting which is what you'd want here. You could run the outer grid-search with a custom cross-validation iterator that returns all indices as training and test set and only does a single split, though... class NoSplitCV(object): ??? def split(self, X, y, class_weights): ??????? indices = np.arange(_num_samples(X)) ??????? yield indices, indices Though I acknowledge that your code only takes 4 lines, while mine takes 8 (thought if we'd add NoSplitCV to sklearn mine would also only take 4 lines :P) I think pipegraph is cool, not meaning to give you a hard time ;) -------------- next part -------------- An HTML attachment was scrubbed... URL: From joshua_feldman at g.harvard.edu Mon Oct 29 01:36:19 2018 From: joshua_feldman at g.harvard.edu (Feldman, Joshua) Date: Mon, 29 Oct 2018 01:36:19 -0400 Subject: [scikit-learn] Fairness Metrics Message-ID: Hi, I was wondering if there's any interest in adding fairness metrics to sklearn. Specifically, I was thinking of implementing the metrics described here: https://dsapp.uchicago.edu/projects/aequitas/ I recognize that these metrics are extremely simple to calculate, but given that sklearn is the standard machine learning package in python, I think it would be very powerful to explicitly include algorithmic fairness - it would make these methods more accessible and, as a matter of principle, demonstrate that ethics is part of ML and not an afterthought. I would love to hear the groups' thoughts and if there's interest in such a feature. Thanks! Josh -------------- next part -------------- An HTML attachment was scrubbed... URL: From manuel.castejon at gmail.com Mon Oct 29 11:08:01 2018 From: manuel.castejon at gmail.com (=?UTF-8?Q?Manuel_Castej=C3=B3n_Limas?=) Date: Mon, 29 Oct 2018 16:08:01 +0100 Subject: [scikit-learn] Pipegraph example: KMeans + LDA In-Reply-To: <681604c2-9f15-6692-682c-728f81e1d2ef@gmail.com> References: <681604c2-9f15-6692-682c-728f81e1d2ef@gmail.com> Message-ID: The long story short: Thank you for your time & sorry for inaccuracies; a few words selling a modular approach to your developments; and a request on your opinion on parallelizing Pipegraph using dask. Thank you Andreas for your patience showing me the sklearn ways. I admit that I'm still learning scikit-learn capabilities which is a tough thing as you all continue improving the library as in this new release. Keep up the good work with your developments and your teaching to the community. In particular, I learned A LOT with your answer. Big thanks! I'm inlining my comments: (....) 
> - KMeans is not a transformer but an estimator > > KMeans is a transformer in sklearn: > http://scikit-learn.org/stable/modules/generated/sklearn.cluster.KMeans.html#sklearn.cluster.KMeans.transform > > (you can't get the labels to be the output which is what you're doing > here, but it is a transformer) > My bad! I saw the predict method and did not check the source code: It is true, from the code class KMeans(BaseEstimator, ClusterMixin, TransformerMixin): The point was that, as you guessed, one cannot put it in a pipeline a KMeans followed by a LDA just like that without additional efforts. > - LDA score function requires the y parameter, while its input does not > come from a known set of labels, but from the previous KMeans > - Moreover, the GridSearchCV.fit call would also require a 'y' parameter > > Not true if you provide a scoring that doesn't require y or if you don't > specify scoring and the scoring method of the estimator doesn't require y. > > GridSearchCV.fit doesn't require y. > My bad again. I wanted to mean that without the scoring function and the CV iterator that you use below, gridsearchCV will call the scoring function of the final step, i.e. LDA, and LDA scoring function wants a y. But please, bear with me, I simply did not know the proper hacks.The test does not lie, I'm quite a newbie then :-) Can you provide a citation for that? That seems to heavily depend on the > clustering algorithms and the classifier. > To me, stability scoring seems more natural: > https://arxiv.org/abs/1007.1075 > > Good to know, thank you for the reference. You are right about the dependance, it's all about the nature of the clustering and the classifier; but I was just providing a scenario, not necessarily advocating for this strategy as the solution to the number of clusters question. It's cool that this is possible, but I feel this is still not really a > "killer application" in that this is not a very common pattern. > IMHO, the beauty of the example, if there is any :-D, was the simplicity and brevity. I agree that it is not a killer application, just a possible situation. > Though I acknowledge that your code only takes 4 lines, while mine takes 8 > (thought if we'd add NoSplitCV to sklearn mine would also only take 4 lines > :P) > I think pipegraph is cool, not meaning to give you a hard time ;) > Thank you again for your time. The thing is that I believe PipeGraph can be useful for you in terms of approaching your models following a modular approach. I'm going to work on a second example implementing something similar to the VotingClassifier class to show you the approach. The main weakness is the lack of parallelism in the inner working of PipeGraph, which was never a concern for me since as far as GridSearchCV can parallelize the training I was ok with that grain size. But, now, I reckon that paralellization can be useful for you in term of approaching your models as a PipeGraph and having Parallelization for free without having to directly call joblib (thank you joblib authors for such goodie). I guess that providing a dask backend for pipegraph would be nice. But let me continue with this issue after sending the VotingClassifier example :-) Thanks, truly, I need to study hard! Manuel -------------- next part -------------- An HTML attachment was scrubbed... 
URL: From adrin.jalali at gmail.com Mon Oct 29 12:08:57 2018 From: adrin.jalali at gmail.com (Adrin) Date: Mon, 29 Oct 2018 17:08:57 +0100 Subject: [scikit-learn] Google Season of Docs In-Reply-To: <339fc052-fe50-0726-b704-91e5c2ce4657@gmail.com> References: <339fc052-fe50-0726-b704-91e5c2ce4657@gmail.com> Message-ID: Related to docs, my 2 cents from the conversations I've had with people who are either new to "data science" or new to python and usually come from R: - People really like simple examples. The doctests we've added seem like a good start (at least the very few I've talked to have told me they really like it). I guess having more use-case oriented easy to find tutorials would really help new users. If I'm not mistaken, most tutorials on scikit-learn's website focus on the features and models available in the package, and not the use-case. - In many cases, when people search for something, they end up on the API page for a class or a method, which almost never include the main formula they're implementing (easy examples are the linear models which can be explain by a one liner formula). Even if the user clicks on the user guide link, they don't necessarily find how exactly it's done there. In some cases they'll need to go and read a reference paper if they want to understand the method in detail, many (or some) of which are not even open access articles. I guess it shouldn't be too hard to figure some well-formulated projects out of these ideas, and I care enough about documentation that I can give a hand wherever you think I can be useful. Cheers, Adrin. On Thu, 25 Oct 2018 at 18:47 Andreas Mueller wrote: > Hey. > Are we interested in the Google Season of Docs? > > https://docs.google.com/forms/d/e/1FAIpQLSf-njReSfmp5i2olgmsDzrFR0Ll0UB5LkCzrtyM5o9Yw0foPw/viewform > > > https://docs.google.com/presentation/d/1ABqCc5uAoQv9aqGCxmNqOJ9S_Tst-adNV3fcWQ2Quwc/edit#slide=id.g42b115f18c_0_0 > > It requires a mentor, which has been an issue in the past. > But it looks like the idea is to have professionals partner up with > projects, not students. > > The other problem would of course be formulating a clearly defined project. > I think we could probably use some restructuring, or more focused > tutorials. > > Wdyt? > > Andy > _______________________________________________ > scikit-learn mailing list > scikit-learn at python.org > https://mail.python.org/mailman/listinfo/scikit-learn > -------------- next part -------------- An HTML attachment was scrubbed... URL: From t3kcit at gmail.com Tue Oct 30 11:53:04 2018 From: t3kcit at gmail.com (Andreas Mueller) Date: Tue, 30 Oct 2018 11:53:04 -0400 Subject: [scikit-learn] Google Season of Docs In-Reply-To: References: <339fc052-fe50-0726-b704-91e5c2ce4657@gmail.com> Message-ID: Hey Adrin. Thanks for your input. I had also thought about the first one. It might be a bit tricky to maintain, but would be quite helpful. I'm not entirely sure about the second. How much detail should there be on an algorithm? The math behind the variational inference in some of the Bayesian models is pretty lengthy. If you want to write down the objective, that seems feasible, but not all models optimize an objective. So it's a bit unclear to me what the scope of the docs should be. Cheers, Andy On 10/29/18 12:08 PM, Adrin wrote: > Related to docs, my 2 cents from the conversations I've had with > people who are either new to "data science" or new to python and > usually come from R: > > - People really like simple examples. 
The doctests we've added seem > like a good start (at least the very few I've talked to have told me > they really like it). I guess having more use-case oriented easy to > find tutorials would really help new users. If I'm not mistaken, most > tutorials on scikit-learn's website focus on the features and models > available in the package, and not the use-case. > > - In many cases, when people search for something, they end up on the > API page for a class or a method, which almost never include the main > formula they're implementing (easy examples are the linear models > which can be explain by a one liner formula). Even if the user clicks > on the user guide link, they don't necessarily find how exactly it's > done there. In some cases they'll need to go and read a reference > paper if they want to understand the method in detail, many (or some) > of which are not even open access articles. > > I guess it shouldn't be too hard to figure some well-formulated > projects out of these ideas, and I care enough about documentation > that I can give a hand wherever you think I can be useful. > > Cheers, > Adrin. > > On Thu, 25 Oct 2018 at 18:47 Andreas Mueller > wrote: > > Hey. > Are we interested in the Google Season of Docs? > https://docs.google.com/forms/d/e/1FAIpQLSf-njReSfmp5i2olgmsDzrFR0Ll0UB5LkCzrtyM5o9Yw0foPw/viewform > > https://docs.google.com/presentation/d/1ABqCc5uAoQv9aqGCxmNqOJ9S_Tst-adNV3fcWQ2Quwc/edit#slide=id.g42b115f18c_0_0 > > It requires a mentor, which has been an issue in the past. > But it looks like the idea is to have professionals partner up with > projects, not students. > > The other problem would of course be formulating a clearly defined > project. > I think we could probably use some restructuring, or more focused > tutorials. > > Wdyt? > > Andy > _______________________________________________ > scikit-learn mailing list > scikit-learn at python.org > https://mail.python.org/mailman/listinfo/scikit-learn > > > _______________________________________________ > scikit-learn mailing list > scikit-learn at python.org > https://mail.python.org/mailman/listinfo/scikit-learn -------------- next part -------------- An HTML attachment was scrubbed... URL: From t3kcit at gmail.com Tue Oct 30 11:57:40 2018 From: t3kcit at gmail.com (Andreas Mueller) Date: Tue, 30 Oct 2018 11:57:40 -0400 Subject: [scikit-learn] Fairness Metrics In-Reply-To: References: Message-ID: <8a5e149e-cf6d-bdac-05c3-4a8d8b222d21@gmail.com> Hi Josh. I think this would be cool to add at some point, I'm not sure this is now. I'm a bit surprised by their "fairness report". They have 4 different metrics of fairness which are conflicting. If they are all included in the fairness report then you always fail the fairness report, right? I think it would also be great to provide a tool to change predictions to be fair according to one of these criteria. I don't think there is consensus yet that these metrics are "good", in particular since they are conflicting, and so people are trying to go beyond these, I think. Cheers, Andy On 10/29/18 1:36 AM, Feldman, Joshua wrote: > Hi, > > I was wondering if there's any interest in adding fairness metrics to > sklearn. 
Specifically, I was thinking of implementing the metrics > described here: > > https://dsapp.uchicago.edu/projects/aequitas/ > > I recognize that these metrics are extremely simple to calculate, but > given that sklearn is the standard machine learning package in python, > I think it would be very powerful to explicitly include algorithmic > fairness - it would make these methods more accessible and, as a > matter of principle, demonstrate that ethics is part of ML and not an > afterthought. I would love to hear the groups' thoughts and if there's > interest in such a feature. > > Thanks! > > Josh > > _______________________________________________ > scikit-learn mailing list > scikit-learn at python.org > https://mail.python.org/mailman/listinfo/scikit-learn -------------- next part -------------- An HTML attachment was scrubbed... URL: From t3kcit at gmail.com Tue Oct 30 11:57:56 2018 From: t3kcit at gmail.com (Andreas Mueller) Date: Tue, 30 Oct 2018 11:57:56 -0400 Subject: [scikit-learn] Fairness Metrics In-Reply-To: References: Message-ID: <409c637d-d18e-8448-869a-4a5f8ddf9ef8@gmail.com> Would be great for sklearn-contrib, though! On 10/29/18 1:36 AM, Feldman, Joshua wrote: > Hi, > > I was wondering if there's any interest in adding fairness metrics to > sklearn. Specifically, I was thinking of implementing the metrics > described here: > > https://dsapp.uchicago.edu/projects/aequitas/ > > I recognize that these metrics are extremely simple to calculate, but > given that sklearn is the standard machine learning package in python, > I think it would be very powerful to explicitly include algorithmic > fairness - it would make these methods more accessible and, as a > matter of principle, demonstrate that ethics is part of ML and not an > afterthought. I would love to hear the groups' thoughts and if there's > interest in such a feature. > > Thanks! > > Josh > > _______________________________________________ > scikit-learn mailing list > scikit-learn at python.org > https://mail.python.org/mailman/listinfo/scikit-learn -------------- next part -------------- An HTML attachment was scrubbed... URL: From joshua_feldman at g.harvard.edu Tue Oct 30 13:05:55 2018 From: joshua_feldman at g.harvard.edu (Feldman, Joshua) Date: Tue, 30 Oct 2018 13:05:55 -0400 Subject: [scikit-learn] Fairness Metrics Message-ID: Hi Andy, Yes, good point and thank you for your thoughts. The Aequitas project stood out to me more because of their flowchart than their auditing software because, as you mention, you always fail the report if you include all the measures! Just as with choosing a machine learning algorithm, there isn't a one size fits all solution to ML ethics, as evidenced by the contradicting metrics. A reason why I think implementing fairness metrics in sklearn might be a good idea is that it would empower people to choose the metric that's relevant to them and their users. If we were to implement these metrics, it would be very important to clarify this in the documentation. Tools that could change predictions to be fair according to one of these metrics would also be very cool. In the same vein as my thinking above, we would need to be careful about giving a false sense of security with the "fair" algorithms such a tool would produce. If you don't think now is the time to add these metrics, is there anything I could do to move this along? Best, Josh -------------- next part -------------- An HTML attachment was scrubbed... 
URL: From t3kcit at gmail.com Tue Oct 30 13:21:07 2018 From: t3kcit at gmail.com (Andreas Mueller) Date: Tue, 30 Oct 2018 13:21:07 -0400 Subject: [scikit-learn] Fairness Metrics In-Reply-To: References: Message-ID: <8decc995-a61d-fbcf-581d-28b7f70bd14d@gmail.com> Hi Josh. Yes, as I mentioned briefly in my second email, you could start a scikit-learn-contrib project that implements these. Or, if possible, show how to use Aequitas with sklearn. This would be interesting since it probably requires some changes to the API, as our scorers have no side-information, such as the protected class. This is actually an interesting instance of https://github.com/scikit-learn/scikit-learn/issues/4497 an API discussion that has been going on for at least 3 years now. Cheers, Andy On 10/30/18 1:05 PM, Feldman, Joshua wrote: > Hi Andy, > > Yes, good point and thank you for your thoughts. The Aequitas project > stood out to me more because of their flowchart than their auditing > software because, as you mention, you always fail the report if you > include all the measures! > > Just as with choosing a machine learning algorithm, there isn't a one > size fits all solution to ML ethics, as evidenced by the contradicting > metrics. A reason why I think implementing fairness metrics in sklearn > might be a good idea is that it would empower people to choose the > metric that's relevant to them and their users. If we were to > implement these metrics, it would be very important to clarify this in > the documentation. > > Tools that could change predictions to be fair according to one of > these metrics would also be very cool. In the same vein as my thinking > above, we would need to be careful about giving a false sense of > security with the "fair" algorithms such a tool would produce. > > If you don't think now is the time to add these metrics, is there > anything I could do to move this along? > > Best, > Josh > > _______________________________________________ > scikit-learn mailing list > scikit-learn at python.org > https://mail.python.org/mailman/listinfo/scikit-learn -------------- next part -------------- An HTML attachment was scrubbed... URL: From myabakhova at gmail.com Tue Oct 30 18:17:42 2018 From: myabakhova at gmail.com (Maiia Bakhova) Date: Tue, 30 Oct 2018 15:17:42 -0700 Subject: [scikit-learn] Elbow method function for K-means procedure Message-ID: Hello everybody! I would like to offer a new feature for consideration. Here is my presentation: https://github.com/Mathemilda/ElbowMethodForK-means/blob/master/Elbow_Method_for_K-Means_Clustering.ipynb Thanks for your time! If the feature is to be accepted, can you please tell me what are conventions if any for such function and if there is a template or other helpful material. Best regards, Maiia Bakhova --------------------- Mathematician in Data Science https://www.linkedin.com/in/myabakhova From cnaathan at gmail.com Wed Oct 31 13:05:05 2018 From: cnaathan at gmail.com (Chidhambaranathan R) Date: Wed, 31 Oct 2018 11:05:05 -0600 Subject: [scikit-learn] Can I use Sklearn Porter to Generate C++ version of Random Forest Predict function Message-ID: Hi, I'd like to know if I can use sklearn_porter to generate the C++ version of Random Forest Regression Predict function. If sklearn_porter doesn't work, is there any possible alternatives to generate c++ implementation of RF Regressor Predict function? Thanks. 
-- Regards, Chidhambaranathan R, PhD Student, Electrical and Computer Engineering, Utah State University -------------- next part -------------- An HTML attachment was scrubbed... URL:
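One hedged pointer on the question above: whether sklearn-porter covers
RandomForestRegressor is best checked against its own documentation. Independently
of any export tool, each fitted tree exposes its structure through the public
tree_ arrays (children_left, children_right, feature, threshold, value), so a small
generator can emit plain C++ for every tree, and the forest prediction is the mean
of the per-tree outputs. The sketch below is illustrative only; the function names
and the toy data are made up for the example:

from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor

def tree_to_cpp(decision_tree, name):
    # Walk the fitted tree's arrays and emit one nested if/else C++ function.
    t = decision_tree.tree_

    def recurse(node, depth):
        pad = "    " * depth
        if t.children_left[node] == -1:  # -1 marks a leaf node
            return "{}return {};\n".format(pad, repr(float(t.value[node][0][0])))
        out = "{}if (x[{}] <= {}) {{\n".format(
            pad, t.feature[node], repr(float(t.threshold[node])))
        out += recurse(t.children_left[node], depth + 1)
        out += "{}}} else {{\n".format(pad)
        out += recurse(t.children_right[node], depth + 1)
        out += "{}}}\n".format(pad)
        return out

    return "double {}(const double* x) {{\n{}}}\n".format(name, recurse(0, 1))

X, y = make_regression(n_samples=200, n_features=5, random_state=0)
forest = RandomForestRegressor(n_estimators=3, max_depth=3, random_state=0).fit(X, y)
code = "\n".join(tree_to_cpp(est, "tree_{}".format(i))
                 for i, est in enumerate(forest.estimators_))
print(code)  # paste into a .cpp file; predict(x) = mean of tree_0(x), tree_1(x), ...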