From guetlein at posteo.de Thu Mar 2 04:01:45 2023
From: guetlein at posteo.de (Martin Gütlein)
Date: Thu, 02 Mar 2023 09:01:45 +0000
Subject: [scikit-learn] classification model that can handle missing values w/o learning from missing values
In-Reply-To:
References:
Message-ID: <0f0363e36de7849ca9abe2bc2542e441@posteo.de>

It would already help us if someone could confirm that this is not possible in scikit-learn, because we are still not entirely sure that we have not missed something.

Regards,
Martin

On 21.02.2023 15:48, Martin Gütlein wrote:
> Hi,
>
> I am looking for a classification model in Python that can handle
> missing values, without imputation and "without learning from missing
> values", i.e. without using the fact that the information is missing
> for the inference.
>
> Explained with the help of decision trees:
> * The algorithm should NOT learn whether missing values should go to
>   the left or the right child (as the HistGradientBoostingClassifier
>   does).
> * Instead, it could build the prediction for each child node and
>   aggregate these (as some Random Forest implementations do).
>
> If that is not possible in scikit-learn, maybe you have already
> discussed this? Or do you know of a fork of scikit-learn that is able
> to do this, or some other Python library?
>
> Any help would be really appreciated, kind regards,
> Martin
>
> P.S. Here is my use case, in case you are interested: I have a binary
> classification problem with a positive and a negative class, and two
> types of features, A and B. In my training data, B is missing for most
> samples (90%). In my test data, I always have B, which is good because
> the B features are better than the A features. In the cases where B is
> present in the training data, the ratio of positive examples is much
> higher than when it is missing. So HistGradientBoostingClassifier uses
> the fact that B is not missing in the test data and predicts far too
> many positives. (Additionally, some feature values of type A are also
> often missing.)
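For context: scikit-learn's HistGradientBoostingClassifier accepts NaN natively and, at each split, learns whether samples with missing values should go to the left or the right child, which is exactly the coupling Martin wants to avoid. A minimal sketch on synthetic data (all names and numbers are illustrative):

    import numpy as np
    from sklearn.ensemble import HistGradientBoostingClassifier

    rng = np.random.RandomState(0)
    X = rng.normal(size=(1000, 2))       # column 0: feature A, column 1: feature B
    y = (X[:, 1] > 0).astype(int)        # B is the informative feature
    X[rng.rand(1000) < 0.9, 1] = np.nan  # B is missing in 90% of training rows

    clf = HistGradientBoostingClassifier(random_state=0).fit(X, y)
    # At predict time the learned "missing goes left/right" rule is applied,
    # so predictions depend on *whether* B is missing.
    print(clf.predict(X[:5]))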
From gael.varoquaux at normalesup.org Fri Mar 3 02:33:31 2023
From: gael.varoquaux at normalesup.org (Gael Varoquaux)
Date: Fri, 3 Mar 2023 08:33:31 +0100
Subject: [scikit-learn] classification model that can handle missing values w/o learning from missing values
In-Reply-To: <0f0363e36de7849ca9abe2bc2542e441@posteo.de>
References: <0f0363e36de7849ca9abe2bc2542e441@posteo.de>
Message-ID: <20230303073331.hwjdkfcw7gk2cljo@gaellaptop>

Dear Martin,

From what I understand, you want a classifier that:
1. Is not based on imputation
2. Ignores whether a value is missing or not for the inference

It seems to me that those two requirements are in contradiction, and it is not clear to me how such a classifier would be theoretically grounded.

Best,

Gaël

On Thu, Mar 02, 2023 at 09:01:45AM +0000, Martin Gütlein wrote:
> It would already help us if someone could confirm that this is not
> possible in scikit-learn, because we are still not entirely sure that
> we have not missed something.
> [...]

--
Gael Varoquaux
Research Director, INRIA
http://gael-varoquaux.info
http://twitter.com/GaelVaroquaux

From guetlein at posteo.de Fri Mar 3 05:22:04 2023
From: guetlein at posteo.de (Martin Gütlein)
Date: Fri, 03 Mar 2023 10:22:04 +0000
Subject: [scikit-learn] classification model that can handle missing values w/o learning from missing values
In-Reply-To: <20230303073331.hwjdkfcw7gk2cljo@gaellaptop>
References: <0f0363e36de7849ca9abe2bc2542e441@posteo.de> <20230303073331.hwjdkfcw7gk2cljo@gaellaptop>
Message-ID: <0ae4c5e830880b4353ca698cba93717c@posteo.de>

Dear Gaël,

Thanks for your response.

> 2. Ignores whether a value is missing or not for the inference

What I meant, rather, is that the missing value should NOT be treated as another possible value of the variable (which is, e.g., what the HistGradientBoostingClassifier implementation in scikit-learn does). Instead, multiple predictions could be made when a split attribute is missing, and those can be averaged.

This is how it is implemented in WEKA, for example (we cannot switch to Java, though ;-):
http://web.archive.org/web/20080601175721/http://wekadocs.com/node/2/#_edn4
and it is described by the inventors of the RF:
https://www.stat.berkeley.edu/~breiman/RandomForests/cc_home.htm#missing1

I am pretty sure something similar is done in other classification algorithms, like naive Bayes, where each feature is handled separately anyway and missing ones could just be omitted.

Regards,
Martin

On 03.03.2023 08:33, Gael Varoquaux wrote:
> From what I understand, you want a classifier that:
> 1. Is not based on imputation
> 2. Ignores whether a value is missing or not for the inference
>
> It seems to me that those two requirements are in contradiction, and
> it is not clear to me how such a classifier would be theoretically
> grounded.
> [...]
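The strategy Martin describes can be prototyped on top of a fitted scikit-learn tree. The sketch below is not a scikit-learn feature: it assumes a DecisionTreeClassifier fitted on complete (or already imputed) training rows and, whenever the split feature of a test sample is NaN, descends into both children and averages their class distributions, weighted by the number of training samples in each child.

    import numpy as np

    def predict_proba_missing(clf, x):
        """Class distribution for one sample x, averaging over both
        subtrees whenever the split feature is missing (NaN)."""
        t = clf.tree_

        def recurse(node):
            left, right = t.children_left[node], t.children_right[node]
            if left == -1:                     # leaf: normalized class counts
                counts = t.value[node][0]
                return counts / counts.sum()
            if np.isnan(x[t.feature[node]]):   # split feature missing:
                n_l = t.n_node_samples[left]   # average both children,
                n_r = t.n_node_samples[right]  # weighted by training samples
                return (n_l * recurse(left) + n_r * recurse(right)) / (n_l + n_r)
            if x[t.feature[node]] <= t.threshold[node]:
                return recurse(left)
            return recurse(right)

        return recurse(0)

Averaged over the trees of a forest, this mimics the behaviour of the WEKA/Breiman references above; it is a prototype for illustration, not a drop-in replacement for predict_proba.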
From gael.varoquaux at normalesup.org Fri Mar 3 09:41:09 2023
From: gael.varoquaux at normalesup.org (Gael Varoquaux)
Date: Fri, 3 Mar 2023 15:41:09 +0100
Subject: [scikit-learn] classification model that can handle missing values w/o learning from missing values
In-Reply-To: <0ae4c5e830880b4353ca698cba93717c@posteo.de>
References: <0f0363e36de7849ca9abe2bc2542e441@posteo.de> <20230303073331.hwjdkfcw7gk2cljo@gaellaptop> <0ae4c5e830880b4353ca698cba93717c@posteo.de>
Message-ID: <20230303144109.htwghqmoqgvfcooc@gaellaptop>

On Fri, Mar 03, 2023 at 10:22:04AM +0000, Martin Gütlein wrote:
> > 2. Ignores whether a value is missing or not for the inference
> What I meant, rather, is that the missing value should NOT be treated
> as another possible value of the variable (which is, e.g., what the
> HistGradientBoostingClassifier implementation in scikit-learn does).
> Instead, multiple predictions could be made when a split attribute is
> missing, and those can be averaged.
>
> This is how it is implemented in WEKA, for example (we cannot switch
> to Java, though ;-):
> http://web.archive.org/web/20080601175721/http://wekadocs.com/node/2/#_edn4
> and it is described by the inventors of the RF:
> https://www.stat.berkeley.edu/~breiman/RandomForests/cc_home.htm#missing1

The text that you link to describes two types of strategies: one that is similar to what is done in HistGradientBoosting, and another that amounts to imputation using a forest, which can be done in scikit-learn by setting up the IterativeImputer to use forests as a base learner (this will, however, be slow).

Cheers,

Gaël
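For completeness, the forest-based imputation Gaël mentions can be set up as follows. This is a sketch with illustrative hyperparameters; IterativeImputer is still experimental, hence the explicit enable import:

    from sklearn.experimental import enable_iterative_imputer  # noqa: F401
    from sklearn.impute import IterativeImputer
    from sklearn.ensemble import RandomForestClassifier, RandomForestRegressor
    from sklearn.pipeline import make_pipeline

    model = make_pipeline(
        IterativeImputer(
            estimator=RandomForestRegressor(n_estimators=50, random_state=0),
            max_iter=5,  # illustrative; expect this to be slow
        ),
        RandomForestClassifier(random_state=0),
    )
    # model.fit(X_train, y_train) fits the imputer, then the classifier;
    # model.predict(X_test) first imputes the test set with the fitted
    # imputer, the very step Martin objects to later in the thread.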
From lorentzen.ch at gmail.com Mon Mar 6 17:50:19 2023
From: lorentzen.ch at gmail.com (Christian Lorentzen)
Date: Mon, 6 Mar 2023 23:50:19 +0100
Subject: [scikit-learn] New core developer: Tim Head
Message-ID: <06b9ffea-8fed-6305-3790-0bc1c34ee20d@gmail.com>

Dear all,

I'm very excited to announce that Tim Head, https://github.com/betatim, is joining scikit-learn as a core developer. Congratulations and a warm welcome, Tim!

on behalf of the scikit-learn team,
Christian

From solegalli at protonmail.com Tue Mar 7 09:53:43 2023
From: solegalli at protonmail.com (Sole Galli)
Date: Tue, 07 Mar 2023 14:53:43 +0000
Subject: [scikit-learn] obtaining intervals from the decision tree structure
Message-ID:

Hello,

I would like to obtain the final intervals from the decision tree structure. I am not interested in every node, just in the limits that take a sample to a final decision/leaf.

For example, if the tree structure is this one:

|--- feature_0 <= 0.08
|   |--- class: 0
|--- feature_0 > 0.08
|   |--- feature_0 <= 8.50
|   |   |--- feature_0 <= 1.50
|   |   |   |--- class: 1
|   |   |--- feature_0 > 1.50
|   |   |   |--- class: 1
|   |--- feature_0 > 8.50
|   |   |--- feature_0 <= 60.25
|   |   |   |--- class: 0
|   |   |--- feature_0 > 60.25
|   |   |   |--- class: 0

then I would like to obtain these limits:
0-0.08; 0.08-1.50; 1.50-8.50; 8.50-60; >60

Potentially as the following numpy array:
[-np.inf, 0.08, 1.5, 8.5, 60, np.inf]

Is it possible?

I have a Stack Overflow question here with more details and code:
https://stackoverflow.com/questions/75663472/how-to-obtain-the-interval-limits-from-a-decision-tree-with-scikit-learn

Thank you!
Sole

Sent with Proton Mail (https://proton.me/) secure email.

From g.lemaitre58 at gmail.com Tue Mar 7 10:41:47 2023
From: g.lemaitre58 at gmail.com (Guillaume Lemaître)
Date: Tue, 7 Mar 2023 16:41:47 +0100
Subject: [scikit-learn] obtaining intervals from the decision tree structure
In-Reply-To:
References:
Message-ID: <82CAF07D-C86E-4F57-9CAE-98DA1F7B5BB8@gmail.com>

Hi Sole,

You can use `apply` on the training `X` to get the leaf that each sample falls into. A groupby should then give you the statistics that you want.

Cheers,
--
Guillaume Lemaitre
Scikit-learn @ Inria Foundation
https://glemaitre.github.io/

On 7 Mar 2023, at 15:53, Sole Galli via scikit-learn wrote:
> I would like to obtain the final intervals from the decision tree
> structure. I am not interested in every node, just in the limits that
> take a sample to a final decision/leaf.
> [...]
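Complementing Guillaume's answer: for a tree that splits on a single feature, the limits Sole asks for are the thresholds of the internal nodes. One possible sketch (not a built-in helper):

    import numpy as np
    from sklearn.tree import _tree

    def interval_limits(clf, feature=0):
        """Sorted split thresholds for `feature`, bracketed by -inf/inf."""
        t = clf.tree_
        # internal nodes that split on the requested feature
        mask = (t.feature != _tree.TREE_UNDEFINED) & (t.feature == feature)
        thresholds = np.sort(np.unique(t.threshold[mask]))
        return np.concatenate(([-np.inf], thresholds, [np.inf]))

For the tree above this would return [-inf, 0.08, 1.5, 8.5, 60.25, inf]; combined with clf.apply(X) and a groupby, as Guillaume suggests, it recovers the per-leaf statistics.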
From thomasjpfan at gmail.com Tue Mar 7 17:07:40 2023
From: thomasjpfan at gmail.com (Thomas J. Fan)
Date: Tue, 7 Mar 2023 17:07:40 -0500
Subject: [scikit-learn] scikit-learn monthly developer meeting: Monday March 27, 2023
Message-ID:

Dear all,

The scikit-learn developer monthly meeting will take place on Monday, March 27 at 11:00 UTC.

- Video call link: https://meet.google.com/gmn-acub-mrr
- Meeting notes / agenda: https://hackmd.io/0yokz72CTZSny8y3Re648Q
- Local times: https://www.timeanddate.com/worldclock/meetingdetails.html?year=2023&month=3&day=27&hour=11&min=0&sec=0&p1=1440&p2=240&p3=248&p4=195&p5=179&p6=224

The goal of this meeting is to discuss ongoing development topics for the project. Everybody is welcome. As usual, please follow the code of conduct of the project:
https://github.com/scikit-learn/scikit-learn/blob/main/CODE_OF_CONDUCT.md

Regards,
Thomas

From betatim at gmail.com Wed Mar 8 05:06:41 2023
From: betatim at gmail.com (Tim Head)
Date: Wed, 8 Mar 2023 11:06:41 +0100
Subject: [scikit-learn] New core developer: Tim Head
In-Reply-To: <06b9ffea-8fed-6305-3790-0bc1c34ee20d@gmail.com>
References: <06b9ffea-8fed-6305-3790-0bc1c34ee20d@gmail.com>
Message-ID:

Thanks a lot! I look forward to working together with the community and other contributors!

T

On Mon, 6 Mar 2023 at 23:51, Christian Lorentzen wrote:
> I'm very excited to announce that Tim Head, https://github.com/betatim,
> is joining scikit-learn as a core developer.
> [...]

From ruchika.work at gmail.com Wed Mar 8 09:33:10 2023
From: ruchika.work at gmail.com (Ruchika Nayyar)
Date: Wed, 8 Mar 2023 09:33:10 -0500
Subject: [scikit-learn] New core developer: Tim Head
In-Reply-To:
References: <06b9ffea-8fed-6305-3790-0bc1c34ee20d@gmail.com>
Message-ID:

Congratulations Tim! Good to see you virtually :)

Thanks,
Ruchika

****************
Dr. Ruchika Nayyar
Data Scientist, Greene Tweed & Co.

On Wed, Mar 8, 2023 at 5:09 AM, Tim Head wrote:
> Thanks a lot! I look forward to working together with the community
> and other contributors!
> [...]
From mail at sebastianraschka.com Wed Mar 8 09:37:57 2023
From: mail at sebastianraschka.com (Sebastian Raschka)
Date: Wed, 8 Mar 2023 08:37:57 -0600
Subject: [scikit-learn] New core developer: Tim Head
In-Reply-To:
References: <06b9ffea-8fed-6305-3790-0bc1c34ee20d@gmail.com>
Message-ID:

Awesome news! Congrats Tim!

Cheers,
Sebastian

On Mar 8, 2023, 8:35 AM -0600, Ruchika Nayyar wrote:
> Congratulations Tim! Good to see you virtually :)
> [...]

From chris at aridas.eu Wed Mar 8 10:33:04 2023
From: chris at aridas.eu (Chris Aridas)
Date: Wed, 8 Mar 2023 17:33:04 +0200
Subject: [scikit-learn] New core developer: Tim Head
In-Reply-To:
References: <06b9ffea-8fed-6305-3790-0bc1c34ee20d@gmail.com>
Message-ID:

Congrats Tim!

Best,
Chris

On Wed, Mar 8, 2023 at 5:02 PM, Sebastian Raschka wrote:
> Awesome news! Congrats Tim!
> [...]
From jeremie.du-boisberranger at inria.fr Thu Mar 9 05:15:42 2023
From: jeremie.du-boisberranger at inria.fr (Jeremie du Boisberranger)
Date: Thu, 9 Mar 2023 11:15:42 +0100
Subject: [scikit-learn] [ANN] scikit-learn 1.2.2 is online!
In-Reply-To:
References:
Message-ID: <1c32cfc9-8c4f-dc6d-4656-3ea53cff3a25@inria.fr>

scikit-learn 1.2.2 is out on pypi.org and conda-forge!

This is a maintenance release that fixes several regressions introduced in version 1.2:
https://scikit-learn.org/stable/whats_new/v1.2.html#version-1-2-2

You can upgrade with pip as usual:

    pip install -U scikit-learn

The conda-forge builds will be available shortly, which you can then install using:

    conda install -c conda-forge scikit-learn

Thanks to all contributors who helped on this release.

Jérémie,
On behalf of the scikit-learn maintainers team.

From g.lemaitre58 at gmail.com Thu Mar 9 05:32:37 2023
From: g.lemaitre58 at gmail.com (Guillaume Lemaître)
Date: Thu, 9 Mar 2023 11:32:37 +0100
Subject: [scikit-learn] [ANN] scikit-learn 1.2.2 is online!
In-Reply-To: <1c32cfc9-8c4f-dc6d-4656-3ea53cff3a25@inria.fr>
References: <1c32cfc9-8c4f-dc6d-4656-3ea53cff3a25@inria.fr>
Message-ID:

Thanks for taking care of this release, Jeremie.

Cheers,

On Thu, 9 Mar 2023 at 11:17, Jeremie du Boisberranger wrote:
> scikit-learn 1.2.2 is out on pypi.org and conda-forge!
> [...]

--
Guillaume Lemaitre
Scikit-learn @ Inria Foundation
https://glemaitre.github.io/

From guetlein at posteo.de Fri Mar 10 08:19:09 2023
From: guetlein at posteo.de (Martin Gütlein)
Date: Fri, 10 Mar 2023 13:19:09 +0000
Subject: [scikit-learn] classification model that can handle missing values w/o learning from missing values
In-Reply-To: <20230303144109.htwghqmoqgvfcooc@gaellaptop>
References: <0f0363e36de7849ca9abe2bc2542e441@posteo.de> <20230303073331.hwjdkfcw7gk2cljo@gaellaptop> <0ae4c5e830880b4353ca698cba93717c@posteo.de> <20230303144109.htwghqmoqgvfcooc@gaellaptop>
Message-ID:

Hi Gaël,

> [...] the other one that amounts to imputation using a forest, and can
> be done in scikit-learn by setting up the IterativeImputer to use
> forests as a base learner (this will however be slow).

The main difference is that when I use the IterativeImputer in scikit-learn, I still have to apply this imputation to the test set before being able to predict with the RF. Other implementations, however, do not impute missing values but instead split up the test instance.
In my experience this makes a big difference: you are able to use features where the majority of values are missing, and where at the same time the class ratio of the examples with missing values differs greatly from that of the examples without missing values.

Kind regards,
Martin

On 03.03.2023 15:41, Gael Varoquaux wrote:
> The text that you link to describes two types of strategies: one that
> is similar to what is done in HistGradientBoosting, and another that
> amounts to imputation using a forest, which can be done in
> scikit-learn by setting up the IterativeImputer to use forests as a
> base learner (this will however be slow).
> [...]

From g.lemaitre58 at gmail.com Fri Mar 10 08:38:31 2023
From: g.lemaitre58 at gmail.com (Guillaume Lemaître)
Date: Fri, 10 Mar 2023 14:38:31 +0100
Subject: [scikit-learn] classification model that can handle missing values w/o learning from missing values
In-Reply-To:
References: <0f0363e36de7849ca9abe2bc2542e441@posteo.de> <20230303073331.hwjdkfcw7gk2cljo@gaellaptop> <0ae4c5e830880b4353ca698cba93717c@posteo.de> <20230303144109.htwghqmoqgvfcooc@gaellaptop>
Message-ID:

Hi Martin,

I think that you could use `imbalanced-learn` and a bit of Pandas/NumPy to get the behaviour that you want. You can use a `FunctionSampler`
(https://imbalanced-learn.org/stable/references/generated/imblearn.FunctionSampler.html)
in which you remove the samples containing missing values. This process is only applied when calling `fit`. You will need to use the `Pipeline` from `imbalanced-learn` as well.

In some way, it seems that you want to resample the training set, which is what the samplers in `imbalanced-learn` are intended for.

Cheers,

On Fri, 10 Mar 2023 at 14:21, Martin Gütlein wrote:
> The main difference is that when I use the IterativeImputer in
> scikit-learn, I still have to apply this imputation to the test set
> before being able to predict with the RF. Other implementations,
> however, do not impute missing values but instead split up the test
> instance.
> [...]

--
Guillaume Lemaitre
Scikit-learn @ Inria Foundation
https://glemaitre.github.io/
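A sketch of Guillaume's suggestion, assuming imbalanced-learn is installed. drop_missing is an illustrative helper, and validate=False is needed so that the sampler accepts NaN:

    import numpy as np
    from imblearn import FunctionSampler
    from imblearn.pipeline import make_pipeline
    from sklearn.ensemble import HistGradientBoostingClassifier

    def drop_missing(X, y):
        """Keep only the training rows without missing values."""
        mask = ~np.isnan(X).any(axis=1)
        return X[mask], y[mask]

    model = make_pipeline(
        FunctionSampler(func=drop_missing, validate=False),
        HistGradientBoostingClassifier(random_state=0),
    )
    # The sampler runs during fit() only; predict() passes the test set
    # through unchanged, so the final estimator must still accept any NaN
    # left at predict time (HistGradientBoosting does).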
From adrin.jalali at gmail.com Tue Mar 14 07:14:16 2023
From: adrin.jalali at gmail.com (Adrin)
Date: Tue, 14 Mar 2023 12:14:16 +0100
Subject: [scikit-learn] VOTE: Governance update: elevating voting rights
Message-ID:

Hi,

Since SLEP020, updates to the governance model don't require a separate SLEP and can be done through a pull request on the repo.

This PR introduces certain changes to the governance, which in effect elevate the voting rights of two existing groups: the contributor experience team and the communication team. It also renames the "core developers" team to "maintainers", and puts it, together with the above two teams, in a "core contributors" group.

According to our governance, we need to call a vote for any such changes, hence I'm calling for a vote. Please vote on the pull request. The vote will conclude in a month, and we need a 2/3 majority of the cast votes to pass the motion.

Regards,
Adrin

From adrin.jalali at gmail.com Sat Mar 25 15:42:25 2023
From: adrin.jalali at gmail.com (Adrin)
Date: Sat, 25 Mar 2023 20:42:25 +0100
Subject: [scikit-learn] CFP: GitHub copilot for PRs
Message-ID:

What do we think of GitHub Copilot for PRs?

Has anybody tried it? Is it something we think is a good idea at this point?

I'm gonna try it on some smaller repos and see what it does.

https://github.com/features/preview/copilot-x

Cheers,
Adrin

From readyready15728 at gmail.com Sat Mar 25 15:49:23 2023
From: readyready15728 at gmail.com (Lynn Bradshaw)
Date: Sat, 25 Mar 2023 15:49:23 -0400
Subject: [scikit-learn] CFP: GitHub copilot for PRs
In-Reply-To:
References:
Message-ID:

I'm disinclined to use it, barring extensive human review, because of chats I've had like this:

[image: Screenshot 2023-03-19 at 20-47-32 ChatGPT.png]

On Sat, Mar 25, 2023 at 3:43 PM, Adrin wrote:
> What do we think of GitHub Copilot for PRs?
> [...]
From g.lemaitre58 at gmail.com Sat Mar 25 16:20:38 2023
From: g.lemaitre58 at gmail.com (Guillaume Lemaître)
Date: Sat, 25 Mar 2023 21:20:38 +0100
Subject: [scikit-learn] CFP: GitHub copilot for PRs
In-Reply-To:
References:
Message-ID:

I assume that we need to check which features could be used. For instance, providing an automatic description in PRs could be something that I kind of like. Proposing non-regression tests for newcomers who have never written any could also be useful. In the end, we will always add manual reviews before merging.

The billion-dollar question is: can Copilot X accelerate or ease the contribution or reviewing process?

Cheers,

On 25 Mar 2023, at 20:49, Lynn Bradshaw wrote:
> I'm disinclined to use it, barring extensive human review, because of
> chats I've had like this:
> [image: Screenshot 2023-03-19 at 20-47-32 ChatGPT.png]
> [...]