From adrin.jalali at gmail.com Wed Jun 2 05:16:33 2021
From: adrin.jalali at gmail.com (Adrin)
Date: Wed, 2 Jun 2021 11:16:33 +0200
Subject: [scikit-learn] Understanding Our Contributors - NumFOCUS survey
Message-ID:

Hi all,

NumFOCUS, our fiscal sponsorship organization, is conducting a research
project to understand diversity, inclusion, and barriers to participation
within NumFOCUS-sponsored projects and the wider open source community.

The survey will take 15-20 min to complete. We'd appreciate your
contribution. The results of this survey will help NumFOCUS work closely
with projects, including scikit-learn, to develop practices that will lead
to project success around diversity, inclusion and sustainability.

Click here to participate in the survey

Thank you for your participation!

From adrin.jalali at gmail.com Wed Jun 2 05:24:18 2021
From: adrin.jalali at gmail.com (Adrin)
Date: Wed, 2 Jun 2021 11:24:18 +0200
Subject: [scikit-learn] custom scorer needs group information: how?
In-Reply-To:
References:
Message-ID:

Hi Emanuele,

In the meantime, you could also try the hack I have written here:
https://stackoverflow.com/questions/49581104/sklearn-gridsearchcv-not-using-sample-weight-in-score-function/49598597#49598597

Cheers,
Adrin

On Sat, May 22, 2021 at 7:54 PM Emanuele Olivetti <emanuele.olivetti at gmail.com> wrote:

> Hi Alex,
>
> Thank you for the quick response. That SLEP looks very interesting! Indeed
> I had the impression that there was no easy way around the issue of
> automatically passing additional (meta)data to scorers. Irrespective of my
> issue, I hope the SLEP will get the green light soon.
>
> Best,
>
> Emanuele
>
> On Sat, May 22, 2021 at 10:27 AM Alexandre Gramfort <alexandre.gramfort at inria.fr> wrote:
>
>> Hi Emanuele,
>>
>> I would suggest you have a look at
>> https://github.com/scikit-learn/enhancement_proposals/pull/55
>>
>> it's work in progress though
>>
>> Alex
>>
>> _______________________________________________
>> scikit-learn mailing list
>> scikit-learn at python.org
>> https://mail.python.org/mailman/listinfo/scikit-learn
>>

From emanuele.olivetti at gmail.com Thu Jun 3 03:04:10 2021
From: emanuele.olivetti at gmail.com (Emanuele Olivetti)
Date: Thu, 3 Jun 2021 09:04:10 +0200
Subject: [scikit-learn] custom scorer needs group information: how?
In-Reply-To:
References:
Message-ID:

Thank you Adrin,

Your solution, which uses Pandas DataFrames and leverages the indexing that
comes with them, is pretty ingenious. Moreover, the whole StackOverflow page
is quite interesting. I'll also try your suggestion.

Best,

Emanuele

On Wed, Jun 2, 2021 at 11:26 AM Adrin wrote:

> Hi Emanuele,
>
> In the meantime, you could also try the hack I have written here:
> https://stackoverflow.com/questions/49581104/sklearn-gridsearchcv-not-using-sample-weight-in-score-function/49598597#49598597
>
> Cheers,
> Adrin
>
> On Sat, May 22, 2021 at 7:54 PM Emanuele Olivetti <emanuele.olivetti at gmail.com> wrote:
>
>> Hi Alex,
>>
>> Thank you for the quick response. That SLEP looks very interesting!
>> Indeed I had the impression that there was no easy way around the issue of
>> automatically passing additional (meta)data to scorers. Irrespective of my
>> issue, I hope the SLEP will get the green light soon.
>>
>> Best,
>>
>> Emanuele
>>
>> On Sat, May 22, 2021 at 10:27 AM Alexandre Gramfort <alexandre.gramfort at inria.fr> wrote:
>>
>>> Hi Emanuele,
>>>
>>> I would suggest you have a look at
>>> https://github.com/scikit-learn/enhancement_proposals/pull/55
>>>
>>> it's work in progress though
>>>
>>> Alex
>>>

From reshama.stat at gmail.com Fri Jun 4 13:33:42 2021
From: reshama.stat at gmail.com (Reshama Shaikh)
Date: Fri, 4 Jun 2021 13:33:42 -0400
Subject: [scikit-learn] [Data Umbrella] 3 Components for Reviewing a Pull Request (PR)
In-Reply-To:
References:
Message-ID:

Hello,

The video is up for Thomas Fan's talk: 3 Components of Reviewing a Pull Request
https://youtu.be/dyxS9KKCNzA

It's 75 minutes, with a nice Q&A at the end. We both agreed all the topics
discussed in the talk could be three separate talks.

Lots of good points in the video, especially if you do contribute to
scikit-learn or would like to understand the process better.

---
Reshama Shaikh
she/her
Blog | Twitter | LinkedIn | GitHub

Data Umbrella
NYC PyLadies

On Sun, May 23, 2021 at 10:38 AM Reshama Shaikh wrote:

> Hello,
> Thomas Fan, a core contributor to scikit-learn, will be presenting on
> "Reviewing a Pull Request." This live webinar is scheduled for Wednesday,
> June 2 at 6pm EDT.
>
> Sign-up info is here:
> https://www.meetup.com/data-umbrella/events/278045166/
>
> This presentation will be recorded and shared on YouTube about a day after
> the event. You can look for it here:
> https://www.youtube.com/c/DataUmbrella/featured
>
> Best,
> Reshama
> ---
> Reshama Shaikh
> she/her
> Blog | Twitter | LinkedIn | GitHub
>
> Data Umbrella
> NYC PyLadies
>

From mlists at ligand.eu Tue Jun 8 03:22:14 2021
From: mlists at ligand.eu (Francois Berenger)
Date: Tue, 08 Jun 2021 16:22:14 +0900
Subject: [scikit-learn] Is there a model for truncated regression in sklearn?
Message-ID:

Hello,

https://en.wikipedia.org/wiki/Truncated_regression_model

Sometimes, data have missing samples when the target variable
is above or below a threshold value.
This is very often the case for biochemical data (e.g. target
variable outside detection range of some lab equipment).

I highly suspect some specific models could handle such datasets
better than generic methods (i.e. train better models).

Some points of entry, if that might help:

- R has a truncreg package
  https://cran.r-project.org/web/packages/truncreg/index.html

- a related paper from the wikipedia page:
  "Local likelihood estimation of truncated regression and
  its partial derivatives: Theory and application"
  https://hal.archives-ouvertes.fr/hal-00520650/file/PEER_stage2_10.1016%252Fj.jeconom.2008.08.007.pdf

I can provide a cleaned public regression dataset, if someone is interested,
for tests (there are many such datasets in ChEMBL and PubChem by the way,
but you need to know how to "featurize"/encode molecules).

Regards,
F.

From gael.varoquaux at normalesup.org Tue Jun 8 03:31:03 2021
From: gael.varoquaux at normalesup.org (Gael Varoquaux)
Date: Tue, 8 Jun 2021 09:31:03 +0200
Subject: [scikit-learn] Is there a model for truncated regression in sklearn?
In-Reply-To:
References:
Message-ID: <20210608073103.yig66zd4zaohgpy3@phare.normalesup.org>

Hi,

Scikit-learn does not cover this problem. I think that it relates to what
is called survival analysis. You'll find a survival analysis package in
Python at https://lifelines.readthedocs.io/en/latest/

Best,

Gaël

On Tue, Jun 08, 2021 at 04:22:14PM +0900, Francois Berenger wrote:
> Hello,
>
> https://en.wikipedia.org/wiki/Truncated_regression_model
>
> Sometimes, data have missing samples when the target variable
> is above or below a threshold value.
> This is very often the case for biochemical data (e.g. target
> variable outside detection range of some lab equipment).
>
> I highly suspect some specific models could handle such datasets
> better than generic methods (i.e. train better models).
>
> Some points of entry, if that might help:
>
> - R has a truncreg package
>   https://cran.r-project.org/web/packages/truncreg/index.html
>
> - a related paper from the wikipedia page:
>   "Local likelihood estimation of truncated regression and
>   its partial derivatives: Theory and application"
>   https://hal.archives-ouvertes.fr/hal-00520650/file/PEER_stage2_10.1016%252Fj.jeconom.2008.08.007.pdf
>
> I can provide a cleaned public regression dataset, if someone is interested,
> for tests (there are many such datasets in ChEMBL and PubChem by the way,
> but you need to know how to "featurize"/encode molecules).
>
> Regards,
> F.

--
Gael Varoquaux
Research Director, INRIA
Visiting professor, McGill
http://gael-varoquaux.info
http://twitter.com/GaelVaroquaux

From g.lemaitre58 at gmail.com Thu Jun 10 03:25:08 2021
From: g.lemaitre58 at gmail.com (Guillaume Lemaître)
Date: Thu, 10 Jun 2021 09:25:08 +0200
Subject: [scikit-learn] New member of the triage team: Julien
Message-ID:

We are excited to welcome a new member of the triage team:

* Julien Jerphanion https://github.com/jjerphan

The thorough work of the triage team on helping the community is much
appreciated.

Cheers,
--
Guillaume Lemaitre
Scikit-learn @ Inria Foundation
https://glemaitre.github.io/

From adrin.jalali at gmail.com Fri Jun 11 05:44:55 2021
From: adrin.jalali at gmail.com (Adrin)
Date: Fri, 11 Jun 2021 11:44:55 +0200
Subject: [scikit-learn] New member of the triage team: Julien
In-Reply-To:
References:
Message-ID:

Congratulations Julien. Happy to have you in the team :)

On Thu, Jun 10, 2021 at 9:26 AM Guillaume Lemaître wrote:

> We are excited to welcome a new member of the triage team:
>
> * Julien Jerphanion https://github.com/jjerphan
>
> The thorough work of the triage team on helping the community is much
> appreciated.
>
> Cheers,
> --
> Guillaume Lemaitre
> Scikit-learn @ Inria Foundation
> https://glemaitre.github.io/
>

From g.lemaitre58 at gmail.com Thu Jun 17 05:33:19 2021
From: g.lemaitre58 at gmail.com (Guillaume Lemaître)
Date: Thu, 17 Jun 2021 11:33:19 +0200
Subject: [scikit-learn] New member of the triage team: Norbert
Message-ID:

We are excited to welcome a new member of the triage team:

* Norbert Preining https://github.com/norbusan

The thorough work of the triage team on helping the scikit-learn
community by triaging issues and PRs, organizing sprints, responding
to discussions, is extremely valuable and helpful in the development
and use of scikit-learn.

Cheers,
--
Guillaume Lemaitre
Scikit-learn @ Inria Foundation
https://glemaitre.github.io/

From adrin.jalali at gmail.com Thu Jun 17 05:38:15 2021
From: adrin.jalali at gmail.com (Adrin)
Date: Thu, 17 Jun 2021 11:38:15 +0200
Subject: [scikit-learn] New member of the triage team: Norbert
In-Reply-To:
References:
Message-ID:

Welcome to the team Norbert!

On Thu, Jun 17, 2021 at 11:34 AM Guillaume Lemaître wrote:

> We are excited to welcome a new member of the triage team:
>
> * Norbert Preining https://github.com/norbusan
>
> The thorough work of the triage team on helping the scikit-learn
> community by triaging issues and PRs, organizing sprints, responding
> to discussions, is extremely valuable and helpful in the development
> and use of scikit-learn.
>
> Cheers,
> --
> Guillaume Lemaitre
> Scikit-learn @ Inria Foundation
> https://glemaitre.github.io/
>

From bodor_sati at hotmail.com Thu Jun 17 06:35:09 2021
From: bodor_sati at hotmail.com (bodor sati)
Date: Thu, 17 Jun 2021 10:35:09 +0000
Subject: [scikit-learn] New member of the triage team: Norbert
In-Reply-To:
References:
Message-ID:

Hi,

I have only one question related to scikit-learn: how can I compute the
topic coherence of LDA models in scikit-learn? I can't find any function
that calculates a coherence value.

Please reply. Thanks.

-----------------------------------------------
Bodor Ali Bashir Sati
PhD Student
Sudan University of Science and Technology
________________________________
From: scikit-learn on behalf of Guillaume Lemaître
Sent: Thursday, June 17, 2021 12:33 PM
To: Scikit-learn user and developer mailing list
Subject: [scikit-learn] New member of the triage team: Norbert

We are excited to welcome a new member of the triage team:

* Norbert Preining https://github.com/norbusan

The thorough work of the triage team on helping the scikit-learn
community by triaging issues and PRs, organizing sprints, responding
to discussions, is extremely valuable and helpful in the development
and use of scikit-learn.

Cheers,
--
Guillaume Lemaitre
Scikit-learn @ Inria Foundation
https://glemaitre.github.io/

From manpritsinghece at gmail.com Fri Jun 18 06:45:10 2021
From: manpritsinghece at gmail.com (Manprit Singh)
Date: Fri, 18 Jun 2021 16:15:10 +0530
Subject: [scikit-learn] function transformer
Message-ID:

Dear sir,

Just need to know if I can use a function transformer to generate new
columns in the data set.

Please see the pipeline below:

from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

num_pipeline = Pipeline([('imputer', SimpleImputer(strategy="median")),
                         ('attribs_adder', column_adder),
                         ('std_scaler', StandardScaler()),
                         ])

This pipeline is for the numerical attributes in the dataset. First it
handles all missing values using SimpleImputer. Then a function I wrote,
wrapped in a function transformer, adds three more columns to the existing
data. Finally StandardScaler is applied.

The added columns are generated from existing columns (by element-wise
division of two columns). So is using a function transformer OK?

From solegalli at protonmail.com Mon Jun 21 02:43:30 2021
From: solegalli at protonmail.com (Sole Galli)
Date: Mon, 21 Jun 2021 06:43:30 +0000
Subject: [scikit-learn] function transformer
In-Reply-To:
References:
Message-ID:

The FunctionTransformer will apply the transformation coded in your function
to the entire dataset passed to the transform() method.

I find it hard to see how this could work to add additional columns to the
dataset, but I guess it might depend on how you designed your function.

Did you try passing your function to the FunctionTransformer, applying the
transform() method to your data, and checking the result?

Alternatively, you could create your own class to add additional columns to
your data and pass that class within the pipeline.

Or, easier, use the [CombineWithReferenceFeature](https://feature-engine.readthedocs.io/en/latest/creation/CombineWithReferenceFeature.html)
transformer from another open source package for feature engineering
(Feature-engine), which does exactly what you want to do.

Hope this helps

Soledad Galli
https://www.trainindata.com/

------- Original Message -------
On Friday, June 18th, 2021 at 12:45 PM, Manprit Singh wrote:

> Dear sir,
>
> Just need to know if I can use a function transformer to generate new
> columns in the data set.
>
> Please see the pipeline below:
>
> num_pipeline = Pipeline([('imputer', SimpleImputer(strategy="median")),
>                          ('attribs_adder', column_adder),
>                          ('std_scaler', StandardScaler()),
>                          ])
>
> This pipeline is for the numerical attributes in the dataset. First it
> handles all missing values using SimpleImputer. Then a function I wrote,
> wrapped in a function transformer, adds three more columns to the existing
> data. Finally StandardScaler is applied.
>
> The added columns are generated from existing columns (by element-wise
> division of two columns). So is using a function transformer OK?

From manpritsinghece at gmail.com Mon Jun 21 04:18:12 2021
From: manpritsinghece at gmail.com (Manprit Singh)
Date: Mon, 21 Jun 2021 13:48:12 +0530
Subject: [scikit-learn] function transformer
In-Reply-To:
References:
Message-ID:

Dear Sir,

I have made such a transformer. Below is an example that generates 3 new
columns from the 2 existing columns of a NumPy array: the first new column
is the element-wise sum, the second the element-wise product, and the third
the element-wise quotient.

>>> import numpy as np
>>> from sklearn.preprocessing import FunctionTransformer
>>> def col_add(x):
...     x1 = x[:, 0] + x[:, 1]   # element-wise sum
...     x2 = x[:, 0] * x[:, 1]   # element-wise product
...     x3 = x[:, 0] / x[:, 1]   # element-wise quotient
...     return np.c_[x, x1, x2, x3]
...
>>> col_adder = FunctionTransformer(col_add)
>>> arr = np.array([[2, 7], [4, 9], [3, 5]])
>>> arr
array([[2, 7],
       [4, 9],
       [3, 5]])
>>> col_adder.transform(arr)  # will add 3 columns
array([[ 2.        ,  7.        ,  9.        , 14.        ,  0.28571429],
       [ 4.        ,  9.        , 13.        , 36.        ,  0.44444444],
       [ 3.        ,  5.        ,  8.        , 15.        ,  0.6       ]])
>>>

So, in this way, can a function transformer be used to add new features
generated from existing columns?

On Fri, Jun 18, 2021 at 4:15 PM Manprit Singh wrote:

> Dear sir,
>
> Just need to know if I can use a function transformer to generate new
> columns in the data set.
>
> Please see the pipeline below:
>
> num_pipeline = Pipeline([('imputer', SimpleImputer(strategy="median")),
>                          ('attribs_adder', column_adder),
>                          ('std_scaler', StandardScaler()),
>                          ])
>
> This pipeline is for the numerical attributes in the dataset. First it
> handles all missing values using SimpleImputer. Then a function I wrote,
> wrapped in a function transformer, adds three more columns to the existing
> data. Finally StandardScaler is applied.
>
> The added columns are generated from existing columns (by element-wise
> division of two columns). So is using a function transformer OK?
>

From olivier.grisel at ensta.org Mon Jun 21 10:46:13 2021
From: olivier.grisel at ensta.org (Olivier Grisel)
Date: Mon, 21 Jun 2021 16:46:13 +0200
Subject: [scikit-learn] New member of the triage team: Norbert
In-Reply-To:
References:
Message-ID:

I am a bit late but I am very happy to see Norbert joining the triage
team! Welcome!

From olivier.grisel at ensta.org Mon Jun 21 11:11:33 2021
From: olivier.grisel at ensta.org (Olivier Grisel)
Date: Mon, 21 Jun 2021 17:11:33 +0200
Subject: [scikit-learn] New member of the triage team: Norbert
In-Reply-To:
References:
Message-ID:

> I have only one question related to scikit-learn: how can I compute the
> topic coherence of LDA models in scikit-learn? I can't find any function
> that calculates a coherence value.
>
> Please reply.

We don't have such a metric in scikit-learn. I assume you are referring to:

http://svn.aksw.org/papers/2015/WSDM_Topic_Evaluation/public.pdf

which is implemented in Gensim as:

https://radimrehurek.com/gensim/models/coherencemodel.html

If I understand correctly, this metric needs to compute relative
frequencies of occurrences and co-occurrences of words in the documents
of the training set. This feels very domain specific compared to the
more domain agnostic metrics that we have in scikit-learn.
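
To make the "occurrences and co-occurrences" point concrete, here is a minimal sketch of one simple measure from this family, usually called UMass coherence. It is an illustration only and not an API of scikit-learn or Gensim: the function name umass_coherence, the toy count matrix and the topic weights are all hypothetical. With a fitted scikit-learn LDA model, one would presumably pass lda.components_ and the dense document-term counts used for training (for example the CountVectorizer output after .toarray()).

```python
import numpy as np


def umass_coherence(components, X, top_n=10):
    """Return one UMass-style coherence score per topic (hypothetical helper).

    components: array of shape (n_topics, n_words) with topic-word weights.
    X: dense document-term count matrix of shape (n_docs, n_words).
    Assumes every top word occurs in at least one document of X.
    """
    # Binary occurrence matrix: does word j appear in document i?
    occ = (np.asarray(X) > 0).astype(np.int64)
    # co[i, j] = number of documents containing both word i and word j;
    # the diagonal co[i, i] is the document frequency of word i.
    co = occ.T @ occ
    scores = []
    for topic in np.asarray(components):
        top = np.argsort(topic)[::-1][:top_n]  # highest-weight words first
        score = 0.0
        for m in range(1, len(top)):
            for l in range(m):
                # The +1 avoids log(0) when two words never co-occur.
                score += np.log((co[top[m], top[l]] + 1) / co[top[l], top[l]])
        scores.append(score)
    return np.array(scores)


# Toy data (made up): 6 documents over a 5-word vocabulary.
X = np.array([
    [1, 1, 1, 0, 0],
    [2, 0, 1, 0, 0],
    [0, 1, 0, 0, 0],
    [0, 0, 1, 1, 1],
    [0, 0, 0, 1, 0],
    [0, 0, 0, 3, 0],
])
# Made-up topic-word weights standing in for lda.components_.
components = np.array([
    [5.0, 4.0, 6.0, 0.1, 0.2],
    [0.1, 0.2, 0.3, 7.0, 3.0],
])
scores = umass_coherence(components, X, top_n=3)
print(np.round(scores, 3))  # approximately [-0.405 -0.118]
```

Scores closer to zero (or above) mean the top words of a topic tend to appear in the same documents. Because the score depends on the tokenization and on which corpus the counts come from, it is tied to the text domain in a way most scikit-learn metrics are not.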

From norbert at preining.info Mon Jun 21 22:37:31 2021
From: norbert at preining.info (Norbert Preining)
Date: Tue, 22 Jun 2021 11:37:31 +0900
Subject: [scikit-learn] New member of the triage team: Norbert
In-Reply-To:
References:
Message-ID:

Hi everyone,

On Mon, 21 Jun 2021, Olivier Grisel wrote:
> I am a bit late but I am very happy to see Norbert joining the triage

Thanks everyone for the welcome and I am looking forward to our
collaboration.

Norbert

--
PREINING Norbert https://www.preining.info
Fujitsu Research + IFMGA Guide + TU Wien + TeX Live + Debian Dev
GPG: 0x860CDC13 fp: F7D8 A928 26E3 16A1 9FA0 ACF0 6CAC A448 860C DC13

From adrin.jalali at gmail.com Wed Jun 23 11:10:50 2021
From: adrin.jalali at gmail.com (Adrin)
Date: Wed, 23 Jun 2021 17:10:50 +0200
Subject: [scikit-learn] HOWTO fix your merge conflicts after we've applied `black`
Message-ID:

Hi,

This is to let you know that if you have an open PR, and you have merge
conflicts due to the fact that now we have applied `black` to the repo,
please refer to this issue which explains how you can fix your merge
conflicts.

Best,
Adrin

From olivier.grisel at ensta.org Fri Jun 25 05:55:54 2021
From: olivier.grisel at ensta.org (Olivier Grisel)
Date: Fri, 25 Jun 2021 11:55:54 +0200
Subject: [scikit-learn] scikit-learn monthly developer meeting: Monday June 28 2021
Message-ID:

Dear all,

The scikit-learn developer monthly meeting will take place on Monday
June 28th at 3PM UTC.

- Video call link: https://meet.google.com/qbg-ucpe-ngz

- Meeting notes / agenda: https://hackmd.io/0yokz72CTZSny8y3Re648Q

- Local times:
  https://www.timeanddate.com/worldclock/meetingdetails.html?year=2021&month=6&day=28&hour=15&min=0&sec=0&p1=1440&p2=240&p3=248&p4=195&p5=179&p6=224

The goal of this meeting is to discuss ongoing development topics for
the project. Everybody is welcome.

As usual, please follow the code of conduct of the project:
https://github.com/scikit-learn/scikit-learn/blob/main/CODE_OF_CONDUCT.md

Regards,

--
Olivier