From g.lemaitre58 at gmail.com Fri May 1 03:58:11 2020 From: g.lemaitre58 at gmail.com (=?UTF-8?Q?Guillaume_Lema=C3=AEtre?=) Date: Fri, 1 May 2020 09:58:11 +0200 Subject: [scikit-learn] Why does sklearn require one-hot-encoding for categorical features? Can we have a "factor" data type? In-Reply-To: References: <20200430201206.75tl2ohkxo5yerlo@phare.normalesup.org> Message-ID: OrdinalEncoder is the equivalent of pd.factorize and will work in the scikit-learn ecosystem. However, be aware that you should not simply swap OneHotEncoder for OrdinalEncoder at will: the right choice depends on your machine learning pipeline. As mentioned by Gael, tree-based algorithms will be fine with OrdinalEncoder. If you have a linear model, then you need to use the OneHotEncoder if the categories do not have any order. I will just refer to a notebook that we taught at EuroSciPy last year: https://github.com/lesteve/euroscipy-2019-scikit-learn-tutorial/blob/master/rendered_notebooks/02_basic_preprocessing.ipynb On Fri, 1 May 2020 at 05:11, C W wrote: > Hermes, > > That's an interesting function. Does it work with sklearn after > factorize? Is there any example? Thanks! > > On Thu, Apr 30, 2020 at 6:51 PM Hermes Morales > wrote: > >> Perhaps pd.factorize could help? >> >> Get Outlook for Android >> >> ------------------------------ >> *From:* scikit-learn > hotmail.com at python.org> on behalf of Gael Varoquaux < >> gael.varoquaux at normalesup.org> >> *Sent:* Thursday, April 30, 2020 5:12:06 PM >> *To:* Scikit-learn mailing list >> *Subject:* Re: [scikit-learn] Why does sklearn require one-hot-encoding >> for categorical features? Can we have a "factor" data type? >> >> On Thu, Apr 30, 2020 at 03:55:00PM -0400, C W wrote: >> > I've used R and Stata software, none needs such transformation. They >> have a >> > data type called "factors", which is different from "numeric". >> >> > My problem with OHE: >> > One-hot-encoding results in large number of features.
This really blows >> up > quickly. And I have to fight curse of dimensionality with PCA >> reduction. That's >> > not cool! >> >> Most statistical models still do one-hot encoding under the hood. So, R >> and Stata do it too. >> >> Typically, tree-based models can be adapted to work directly on >> categorical data. Ours don't. It's work in progress. >> >> G >> _______________________________________________ >> scikit-learn mailing list >> scikit-learn at python.org >> >> https://mail.python.org/mailman/listinfo/scikit-learn >> _______________________________________________ >> scikit-learn mailing list >> scikit-learn at python.org >> https://mail.python.org/mailman/listinfo/scikit-learn >> > _______________________________________________ > scikit-learn mailing list > scikit-learn at python.org > https://mail.python.org/mailman/listinfo/scikit-learn > -- Guillaume Lemaitre Scikit-learn @ Inria Foundation https://glemaitre.github.io/ -------------- next part -------------- An HTML attachment was scrubbed... URL: From tmrsg11 at gmail.com Fri May 1 11:53:45 2020 From: tmrsg11 at gmail.com (C W) Date: Fri, 1 May 2020 11:53:45 -0400 Subject: [scikit-learn] Why does sklearn require one-hot-encoding for categorical features? Can we have a "factor" data type? In-Reply-To: References: <20200430201206.75tl2ohkxo5yerlo@phare.normalesup.org> Message-ID: Thank you for the link, Guillaume. In my particular case, I am working on random forest classification. The notebook seems great. I will have to go through it in detail. I'm still fairly new at using sklearn. Thank you for everyone's quick response, always feeling loved on here!
:) On Fri, May 1, 2020 at 4:00 AM Guillaume Lema?tre wrote: > OrdinalEncoder is the equivalent of pd.factorize and will work in the > scikit-learn ecosystem. > > However, be aware that you should not just swap OneHotEncoder to > OrdinalEncoder just at your wish. > It depends of your machine learning pipeline. > > As mentioned by Gael, tree-based algorithm will be fine with > OrdinalEncoder. If you have a linear model, > then you need to use the OneHotEncoder if the categories do not have any > order. > > I will just refer to one notebook that we taught in EuroScipy last year: > > https://github.com/lesteve/euroscipy-2019-scikit-learn-tutorial/blob/master/rendered_notebooks/02_basic_preprocessing.ipynb > > On Fri, 1 May 2020 at 05:11, C W wrote: > >> Hermes, >> >> That's an interesting function. Does it work with sklearn after >> factorize? Is there any example? Thanks! >> >> On Thu, Apr 30, 2020 at 6:51 PM Hermes Morales >> wrote: >> >>> Perhaps pd.factorize could hello? >>> >>> Obtener Outlook para Android >>> >>> ------------------------------ >>> *From:* scikit-learn >> hotmail.com at python.org> on behalf of Gael Varoquaux < >>> gael.varoquaux at normalesup.org> >>> *Sent:* Thursday, April 30, 2020 5:12:06 PM >>> *To:* Scikit-learn mailing list >>> *Subject:* Re: [scikit-learn] Why does sklearn require one-hot-encoding >>> for categorical features? Can we have a "factor" data type? >>> >>> On Thu, Apr 30, 2020 at 03:55:00PM -0400, C W wrote: >>> > I've used R and Stata software, none needs such transformation. They >>> have a >>> > data type called "factors", which is different from "numeric". >>> >>> > My problem with OHE: >>> > One-hot-encoding results in large number of features. This really >>> blows up >>> > quickly. And I have to fight curse of dimensionality with PCA >>> reduction. That's >>> > not cool! >>> >>> Most statistical models still not one-hot encoding behind the hood. So, R >>> and stata do it too. 
>>> >>> Typically, tree-based models can be adapted to work directly on >>> categorical data. Ours don't. It's work in progress. >>> >>> G >>> _______________________________________________ >>> scikit-learn mailing list >>> scikit-learn at python.org >>> >>> https://mail.python.org/mailman/listinfo/scikit-learn >>> _______________________________________________ >>> scikit-learn mailing list >>> scikit-learn at python.org >>> https://mail.python.org/mailman/listinfo/scikit-learn >>> >> _______________________________________________ >> scikit-learn mailing list >> scikit-learn at python.org >> https://mail.python.org/mailman/listinfo/scikit-learn >> > _______________________________________________ > scikit-learn mailing list > scikit-learn at python.org > https://mail.python.org/mailman/listinfo/scikit-learn > -- Guillaume Lemaitre Scikit-learn @ Inria Foundation https://glemaitre.github.io/ -------------- next part -------------- An HTML attachment was scrubbed... URL: From jonpsy101 at gmail.com Sat May 2 00:57:00 2020 From: jonpsy101 at gmail.com (sai_ng) Date: Sat, 2 May 2020 10:27:00 +0530 Subject: [scikit-learn] Random Binning Features Message-ID: Hey folks! Hope you're all doing well. I'm developing a Random Fourier Features implementation in C++ for a repository. Scikit-learn's implementation of RBFSampler has been really helpful, and I must say that I'm charmed by how compact, yet powerful, each line of code is. I'm writing this mail because I couldn't find your implementation of Random Binning Features; is it under development? I tried searching in the issues, but to no avail.
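For reference, here is roughly how I've been exercising RBFSampler while porting it (a minimal sketch; the toy data and the linear model on top are made up for illustration):

```python
# Minimal sketch of scikit-learn's RBFSampler (Random Fourier Features);
# the toy data here is made up for illustration.
from sklearn.kernel_approximation import RBFSampler
from sklearn.linear_model import SGDClassifier

X = [[0, 0], [1, 1], [1, 0], [0, 1]]
y = [0, 0, 1, 1]

# Map the inputs through an approximate RBF kernel feature space.
rbf_feature = RBFSampler(gamma=1.0, n_components=100, random_state=1)
X_features = rbf_feature.fit_transform(X)

# A linear model on the transformed features approximates a kernel machine.
clf = SGDClassifier(max_iter=100, tol=1e-3)
clf.fit(X_features, y)
print(X_features.shape)  # (4, 100): one column per random Fourier component
```
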
I noticed you've put a few of your algorithms in a different repository, for example: https://scikit-learn-extra.readthedocs.io/en/latest/generated/sklearn_extra.kernel_approximation.Fastfood.html. Overall, I'd like to know if it's under development, or has there been any draft/proposal, or is it already implemented. I'd greatly appreciate it if you could point me to other sources (if not here) which have successfully implemented it in code (preferably Python/C++). "Hit me back, Just to chat, Your biggest fan, This is stan" ~ Eminem: Stan -------------- next part -------------- An HTML attachment was scrubbed... URL: From g.lemaitre58 at gmail.com Sat May 2 04:13:30 2020 From: g.lemaitre58 at gmail.com (=?UTF-8?Q?Guillaume_Lema=C3=AEtre?=) Date: Sat, 2 May 2020 10:13:30 +0200 Subject: [scikit-learn] Vote: Add Adrin Jalali to the scikit-learn technical committee In-Reply-To: References: <7d9ffac3-35d0-7e30-9c96-3c125b4f9fe7@gmail.com> <8f0f6d85-2d52-4bb5-bd5d-c13d60377364@Canary> Message-ID: +1 On Tue, 28 Apr 2020 at 20:59, Joris Van den Bossche < jorisvandenbossche at gmail.com> wrote: > +1 > > On Tue, 28 Apr 2020 at 01:34, Joel Nothman wrote: > >> +1 >> >> On Tue, 28 Apr 2020 at 02:23, Tom DLT wrote: >> >>> +1 >>> >>> Le lun. 27 avr. 2020, à
07 h 00, Alexandre Gramfort < >>> alexandre.gramfort at inria.fr> a ?crit : >>> >>>> +1 >>>> _______________________________________________ >>>> scikit-learn mailing list >>>> scikit-learn at python.org >>>> https://mail.python.org/mailman/listinfo/scikit-learn >>>> >>> _______________________________________________ >>> scikit-learn mailing list >>> scikit-learn at python.org >>> https://mail.python.org/mailman/listinfo/scikit-learn >>> >> _______________________________________________ >> scikit-learn mailing list >> scikit-learn at python.org >> https://mail.python.org/mailman/listinfo/scikit-learn >> > _______________________________________________ > scikit-learn mailing list > scikit-learn at python.org > https://mail.python.org/mailman/listinfo/scikit-learn > -- Guillaume Lemaitre Scikit-learn @ Inria Foundation https://glemaitre.github.io/ -------------- next part -------------- An HTML attachment was scrubbed... URL: From ahowe42 at gmail.com Tue May 5 08:29:18 2020 From: ahowe42 at gmail.com (Andrew Howe) Date: Tue, 5 May 2020 13:29:18 +0100 Subject: [scikit-learn] Fwd: StackingClassifier In-Reply-To: References: Message-ID: Hi All - gentle nudge in case anybody has an idea about this. Andrew <~~~~~~~~~~~~~~~~~~~~~~~~~~~> J. Andrew Howe, PhD LinkedIn Profile ResearchGate Profile Open Researcher and Contributor ID (ORCID) Github Profile Personal Website I live to learn, so I can learn to live. - me <~~~~~~~~~~~~~~~~~~~~~~~~~~~> ---------- Forwarded message --------- From: Andrew Howe Date: Thu, Apr 30, 2020 at 6:05 PM Subject: StackingClassifier To: Scikit-learn user and developer mailing list Hi All Quick question about the stacking classifier . How do I know the order of the features that the final estimator uses? 
I've got an example which I've created like this (the LGRG and KSVM objects were defined previously, and they are what their names suggest):

passThrough = True
finalEstim = DecisionTreeClassifier(random_state=42)
stkClas = StackingClassifier(estimators=[('Logistic Regression', LGRG), ('Kernel SVM', KSVM)], cv=crossValInput, passthrough=passThrough, final_estimator=finalEstim, n_jobs=-1)

Given this setup, I *think* the features input to the final estimator are
- Logistic regression prediction probabilities for all classes
- Kernel SVM prediction probabilities for all classes
- original features of data passed into the stacking classifier

I can find no documentation on this, though, and don't know of any relevant attribute on the final estimator. I need this to help interpret the final estimator tree - and specifically to provide feature labels for plot_tree. Thanks! Andrew <~~~~~~~~~~~~~~~~~~~~~~~~~~~> J. Andrew Howe, PhD LinkedIn Profile ResearchGate Profile Open Researcher and Contributor ID (ORCID) Github Profile Personal Website I live to learn, so I can learn to live. - me <~~~~~~~~~~~~~~~~~~~~~~~~~~~> -------------- next part -------------- An HTML attachment was scrubbed... URL: From g.lemaitre58 at gmail.com Tue May 5 08:40:28 2020 From: g.lemaitre58 at gmail.com (=?UTF-8?Q?Guillaume_Lema=C3=AEtre?=) Date: Tue, 5 May 2020 14:40:28 +0200 Subject: [scikit-learn] Fwd: StackingClassifier In-Reply-To: References: Message-ID: Your analysis is correct: https://github.com/scikit-learn/scikit-learn/blob/master/sklearn/ensemble/_stacking.py#L59 It will be the predictions of each learner, in the order given in the list, followed by the features that are passed through. It would be nice when we are able to propagate feature names :) On Tue, 5 May 2020 at 14:31, Andrew Howe wrote: > Hi All - gentle nudge in case anybody has an idea about this. > > Andrew > > <~~~~~~~~~~~~~~~~~~~~~~~~~~~> > J.
Andrew Howe, PhD > LinkedIn Profile > ResearchGate Profile > Open Researcher and Contributor ID (ORCID) > > Github Profile > Personal Website > I live to learn, so I can learn to live. - me > <~~~~~~~~~~~~~~~~~~~~~~~~~~~> > > > ---------- Forwarded message --------- > From: Andrew Howe > Date: Thu, Apr 30, 2020 at 6:05 PM > Subject: StackingClassifier > To: Scikit-learn user and developer mailing list > > > Hi All > > Quick question about the stacking classifier > . > How do I know the order of the features that the final estimator uses? I've > got an example which I've created like this (the LGRG and KSVM objects were > previously defined, but as they seem they would be): > > passThrough = True > finalEstim = DecisionTreeClassifier(random_state=42) > stkClas = StackingClassifier(estimators=[('Logistic Regression', LGRG), > ('Kernel SVM', KSVM)], > cv=crossValInput, passthrough=passThrough, > final_estimator=finalEstim, > n_jobs=-1) > > Given this setup, I *think* the features input to the final estimator are > > - Logistic regression prediction probabilities for all classes > - Kernel SVM prediction probabilities for all classes > - original features of data passed into the stacking classifier > > I can find no documentation on this, though, and don't know of any > relevant attribute on the final estimator. I need this to help interpret > the final estimator tree - and specifically to provide feature labels for > plot_tree. > > Thanks! > Andrew > > <~~~~~~~~~~~~~~~~~~~~~~~~~~~> > J. Andrew Howe, PhD > LinkedIn Profile > ResearchGate Profile > Open Researcher and Contributor ID (ORCID) > > Github Profile > Personal Website > I live to learn, so I can learn to live. 
- me > <~~~~~~~~~~~~~~~~~~~~~~~~~~~> > _______________________________________________ > scikit-learn mailing list > scikit-learn at python.org > https://mail.python.org/mailman/listinfo/scikit-learn > -- Guillaume Lemaitre Scikit-learn @ Inria Foundation https://glemaitre.github.io/ -------------- next part -------------- An HTML attachment was scrubbed... URL: From ahowe42 at gmail.com Tue May 5 08:47:55 2020 From: ahowe42 at gmail.com (Andrew Howe) Date: Tue, 5 May 2020 13:47:55 +0100 Subject: [scikit-learn] Fwd: StackingClassifier In-Reply-To: References: Message-ID: Great - thanks! Yes, it would be very nice to have feature names automatically propagate throughout sklearn. Andrew <~~~~~~~~~~~~~~~~~~~~~~~~~~~> J. Andrew Howe, PhD LinkedIn Profile ResearchGate Profile Open Researcher and Contributor ID (ORCID) Github Profile Personal Website I live to learn, so I can learn to live. - me <~~~~~~~~~~~~~~~~~~~~~~~~~~~> On Tue, May 5, 2020 at 1:42 PM Guillaume Lema?tre wrote: > Your analysis is correct: > https://github.com/scikit-learn/scikit-learn/blob/master/sklearn/ensemble/_stacking.py#L59 > > It will be the prediction of each learner in the order in the list given > and finally the features which are pass-through. > > It would nice when we will be able to propagate feature names :) > > On Tue, 5 May 2020 at 14:31, Andrew Howe wrote: > >> Hi All - gentle nudge in case anybody has an idea about this. >> >> Andrew >> >> <~~~~~~~~~~~~~~~~~~~~~~~~~~~> >> J. Andrew Howe, PhD >> LinkedIn Profile >> ResearchGate Profile >> Open Researcher and Contributor ID (ORCID) >> >> Github Profile >> Personal Website >> I live to learn, so I can learn to live. - me >> <~~~~~~~~~~~~~~~~~~~~~~~~~~~> >> >> >> ---------- Forwarded message --------- >> From: Andrew Howe >> Date: Thu, Apr 30, 2020 at 6:05 PM >> Subject: StackingClassifier >> To: Scikit-learn user and developer mailing list > > >> >> >> Hi All >> >> Quick question about the stacking classifier >> . 
>> How do I know the order of the features that the final estimator uses? I've >> got an example which I've created like this (the LGRG and KSVM objects were >> previously defined, but as they seem they would be): >> >> passThrough = True >> finalEstim = DecisionTreeClassifier(random_state=42) >> stkClas = StackingClassifier(estimators=[('Logistic Regression', LGRG), >> ('Kernel SVM', KSVM)], >> cv=crossValInput, passthrough=passThrough, >> final_estimator=finalEstim, >> n_jobs=-1) >> >> Given this setup, I *think* the features input to the final estimator are >> >> - Logistic regression prediction probabilities for all classes >> - Kernel SVM prediction probabilities for all classes >> - original features of data passed into the stacking classifier >> >> I can find no documentation on this, though, and don't know of any >> relevant attribute on the final estimator. I need this to help interpret >> the final estimator tree - and specifically to provide feature labels for >> plot_tree. >> >> Thanks! >> Andrew >> >> <~~~~~~~~~~~~~~~~~~~~~~~~~~~> >> J. Andrew Howe, PhD >> LinkedIn Profile >> ResearchGate Profile >> Open Researcher and Contributor ID (ORCID) >> >> Github Profile >> Personal Website >> I live to learn, so I can learn to live. - me >> <~~~~~~~~~~~~~~~~~~~~~~~~~~~> >> _______________________________________________ >> scikit-learn mailing list >> scikit-learn at python.org >> https://mail.python.org/mailman/listinfo/scikit-learn >> > > > -- > Guillaume Lemaitre > Scikit-learn @ Inria Foundation > https://glemaitre.github.io/ > _______________________________________________ > scikit-learn mailing list > scikit-learn at python.org > https://mail.python.org/mailman/listinfo/scikit-learn > -------------- next part -------------- An HTML attachment was scrubbed... 
URL: From adrin.jalali at gmail.com Wed May 6 02:56:52 2020 From: adrin.jalali at gmail.com (Adrin) Date: Wed, 6 May 2020 08:56:52 +0200 Subject: [scikit-learn] ANN: scikit-learn 0.23 RC1 Message-ID: Thanks to all our 200+ contributors, we are announcing a release candidate for the upcoming release. On top of a few exciting features, we're also deprecating positional arguments in many places where the constructor/method accepts many arguments. For example, SVC(.5, "poly") will need to be expressed as SVC(C=.5, kernel="poly"), and SVC(C, kernel) as SVC(C=C, kernel=kernel). Please give it a try and let us know if there are any issues, so we can fix them for the final release. Release highlights: https://scikit-learn.org/0.23/auto_examples/release_highlights/plot_release_highlights_0_23_0.html Changelog: https://scikit-learn.org/0.23/whats_new/v0.23.html#changes-0-23 Happy testing, Adrin On behalf of the scikit-learn team -------------- next part -------------- An HTML attachment was scrubbed... URL: From fernando.wittmann at gmail.com Wed May 6 09:36:55 2020 From: fernando.wittmann at gmail.com (Fernando Marcos Wittmann) Date: Wed, 6 May 2020 10:36:55 -0300 Subject: [scikit-learn] Why does sklearn require one-hot-encoding for categorical features? Can we have a "factor" data type? In-Reply-To: References: <20200430201206.75tl2ohkxo5yerlo@phare.normalesup.org> Message-ID: That's an excellent discussion! I've always wondered how other tools like R handle categorical variables natively. LightGBM has a scikit-learn API which handles categorical features by inputting their column names (or indexes):

```
import lightgbm
lgb = lightgbm.LGBMClassifier()
lgb.fit(X, y, feature_name=..., categorical_feature=...)
```

Where: - feature_name (list of strings or 'auto', optional (default='auto')) – Feature names. If 'auto' and data is pandas DataFrame, data column names are used. - categorical_feature (list of strings or int, or 'auto', optional (default='auto')) –
Categorical features. If list of int, interpreted as indices. If list of strings, interpreted as feature names (need to specify feature_name as well). If ?auto? and data is pandas DataFrame, pandas unordered categorical columns are used. All values in categorical features should be less than int32 max value (2147483647). As a suggestion, Scikit-Learn could add a `categorical_feature` parameter in the tree-based estimators in order to work on the same way. On Fri, May 1, 2020 at 12:54 PM C W wrote: > Thank you for the link, Guilaumme. In my particular case, I am working on > random forest classification. > > The notebook seems great. I will have to go through it in detail. I'm > still fairly new at using sklearn. > > Thank you for everyone's quick response, always feeling loved on here! :) > > > > On Fri, May 1, 2020 at 4:00 AM Guillaume Lema?tre > wrote: > >> OrdinalEncoder is the equivalent of pd.factorize and will work in the >> scikit-learn ecosystem. >> >> However, be aware that you should not just swap OneHotEncoder to >> OrdinalEncoder just at your wish. >> It depends of your machine learning pipeline. >> >> As mentioned by Gael, tree-based algorithm will be fine with >> OrdinalEncoder. If you have a linear model, >> then you need to use the OneHotEncoder if the categories do not have any >> order. >> >> I will just refer to one notebook that we taught in EuroScipy last year: >> >> https://github.com/lesteve/euroscipy-2019-scikit-learn-tutorial/blob/master/rendered_notebooks/02_basic_preprocessing.ipynb >> >> On Fri, 1 May 2020 at 05:11, C W wrote: >> >>> Hermes, >>> >>> That's an interesting function. Does it work with sklearn after >>> factorize? Is there any example? Thanks! >>> >>> On Thu, Apr 30, 2020 at 6:51 PM Hermes Morales < >>> paisanohermes at hotmail.com> wrote: >>> >>>> Perhaps pd.factorize could hello? 
>>>> >>>> Obtener Outlook para Android >>>> >>>> ------------------------------ >>>> *From:* scikit-learn >>> hotmail.com at python.org> on behalf of Gael Varoquaux < >>>> gael.varoquaux at normalesup.org> >>>> *Sent:* Thursday, April 30, 2020 5:12:06 PM >>>> *To:* Scikit-learn mailing list >>>> *Subject:* Re: [scikit-learn] Why does sklearn require >>>> one-hot-encoding for categorical features? Can we have a "factor" data type? >>>> >>>> On Thu, Apr 30, 2020 at 03:55:00PM -0400, C W wrote: >>>> > I've used R and Stata software, none needs such transformation. They >>>> have a >>>> > data type called "factors", which is different from "numeric". >>>> >>>> > My problem with OHE: >>>> > One-hot-encoding results in large number of features. This really >>>> blows up >>>> > quickly. And I have to fight curse of dimensionality with PCA >>>> reduction. That's >>>> > not cool! >>>> >>>> Most statistical models still not one-hot encoding behind the hood. So, >>>> R >>>> and stata do it too. >>>> >>>> Typically, tree-based models can be adapted to work directly on >>>> categorical data. Ours don't. It's work in progress. 
>>>> >>>> G >>>> _______________________________________________ >>>> scikit-learn mailing list >>>> scikit-learn at python.org >>>> >>>> https://nam04.safelinks.protection.outlook.com/?url=https%3A%2F%2Fmail.python.org%2Fmailman%2Flistinfo%2Fscikit-learn&data=02%7C01%7C%7Ce7aa6f99b7914a1f84b208d7ed430801%7C84df9e7fe9f640afb435aaaaaaaaaaaa%7C1%7C0%7C637238744453345410&sdata=e3BfHB4v5VFteeZ0Zh3FJ9Wcz9KmkUwur5i8Reue3mc%3D&reserved=0 >>>> _______________________________________________ >>>> scikit-learn mailing list >>>> scikit-learn at python.org >>>> https://mail.python.org/mailman/listinfo/scikit-learn >>>> >>> _______________________________________________ >>> scikit-learn mailing list >>> scikit-learn at python.org >>> https://mail.python.org/mailman/listinfo/scikit-learn >>> >> >> >> -- >> Guillaume Lemaitre >> Scikit-learn @ Inria Foundation >> https://glemaitre.github.io/ >> _______________________________________________ >> scikit-learn mailing list >> scikit-learn at python.org >> https://mail.python.org/mailman/listinfo/scikit-learn >> > _______________________________________________ > scikit-learn mailing list > scikit-learn at python.org > https://mail.python.org/mailman/listinfo/scikit-learn > -------------- next part -------------- An HTML attachment was scrubbed... URL: From joel.nothman at gmail.com Wed May 6 09:43:41 2020 From: joel.nothman at gmail.com (Joel Nothman) Date: Wed, 6 May 2020 23:43:41 +1000 Subject: [scikit-learn] Why does sklearn require one-hot-encoding for categorical features? Can we have a "factor" data type? In-Reply-To: References: <20200430201206.75tl2ohkxo5yerlo@phare.normalesup.org> Message-ID: When it comes to trees, the API for handling categoricals is simpler than the implementation. Traditionally, tree-based models' handling of categorical variables differs from both ordinal and one-hot encoding, while both of those will work reasonably well for many problems. 
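In the meantime, the encoder-per-model pattern discussed earlier in this thread looks roughly like this (a sketch; the toy DataFrame and column names are made up for illustration):

```python
# Sketch of choosing the encoder to match the model family;
# the toy DataFrame and column names are made up for illustration.
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import OneHotEncoder, OrdinalEncoder

X = pd.DataFrame({"color": ["red", "blue", "red", "green", "blue", "green"],
                  "size": [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]})
y = [0, 1, 0, 1, 0, 1]

# Tree-based model: one integer code per category is enough.
tree_model = make_pipeline(
    ColumnTransformer([("cat", OrdinalEncoder(), ["color"])],
                      remainder="passthrough"),
    RandomForestClassifier(random_state=0)).fit(X, y)

# Linear model: one-hot encode so no artificial ordering is imposed.
linear_model = make_pipeline(
    ColumnTransformer([("cat", OneHotEncoder(handle_unknown="ignore"), ["color"])],
                      remainder="passthrough"),
    LogisticRegression()).fit(X, y)

print(tree_model.predict(X).shape)  # (6,)
```
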
We are working on implementing categorical handling in trees ( https://github.com/scikit-learn/scikit-learn/issues/15550, https://github.com/scikit-learn/scikit-learn/pull/12866)... -------------- next part -------------- An HTML attachment was scrubbed... URL: From fernando.wittmann at gmail.com Fri May 8 17:04:14 2020 From: fernando.wittmann at gmail.com (Fernando Marcos Wittmann) Date: Fri, 8 May 2020 18:04:14 -0300 Subject: [scikit-learn] Why the default max_samples of Random Forest is X.shape[0]? Message-ID: When reading the documentation of Random Forest, I got the following:

```
max_samples : int or float, default=None
    If bootstrap is True, the number of samples to draw from X to train
    each base estimator.
    - *If None (default), then draw `X.shape[0]` samples.*
    - If int, then draw `max_samples` samples.
    - If float, then draw `max_samples * X.shape[0]` samples. Thus,
      `max_samples` should be in the interval `(0, 1)`.
```

Why is the whole dataset (i.e. X.shape[0] samples from X) used to build each tree? That would be equivalent to setting bootstrap to False, right? Wouldn't it be better practice to use 2/3 of the size of the dataset as the default? -------------- next part -------------- An HTML attachment was scrubbed... URL: From mlcnworkshop at gmail.com Sat May 9 14:41:16 2020 From: mlcnworkshop at gmail.com (MLCN Workshop) Date: Sat, 9 May 2020 20:41:16 +0200 Subject: [scikit-learn] [CFP] The 3rd International workshop on Machine Learning in Clinical Neuroimaging (MLCN 2020) Message-ID: *Please find below the call for papers for the International Workshop of Machine Learning in Clinical Neuroimaging (MLCN) on 4 October 2020 at MICCAI 2020 in Lima, Peru. We welcome contributions on novel machine learning methods and their applications to clinical neuroimaging data.* The submission deadline is *30 June 2020*, and all MLCN accepted papers will be eligible for the best paper award of 500 USD. For more information, please visit https://mlcnws.com/.
Best wishes, The MLCN 2020 committee Christos Davatzikos Andre Marquand Jonas Richiardi Emma Robinson Ahmed Abdulkadir Cher Bass Mohamad Habes Seyed Mostafa Kia Jane Maryam Rondina Chantal Tax Hongzhi Wang Thomas Wolfers International Workshop on Machine Learning in Clinical Neuroimaging 4 October 2020 in Lima, Peru The International Workshop of Machine Learning in Clinical Neuroimaging ( https://mlcnws.com/), a satellite event of MICCAI (https://miccai2020.org), calls for original papers in the field of clinical neuroimaging data analysis with machine learning. The two tracks of the workshop include methodological innovations as well as clinical applications. This highly interdisciplinary topic provides an excellent platform to connect researchers of varying disciplines and to collectively advance the field in multiple directions. For the machine learning track, we seek contributions with substantial methodological novelty in analyzing high-dimensional, longitudinal, and heterogeneous neuroimaging data using stable, scalable, and interpretable machine learning models. Topics of interest include but are not limited to: - Spatio-temporal brain data analysis - Structural data analysis - Graph theory and complex network analysis - Longitudinal data analysis - Model stability and interpretability - Model scalability in large neuroimaging datasets - Multi-source data integration and multi-view learning - Multi-site data analysis, from preprocessing to modeling - Domain adaptation, data harmonization, and transfer learning in neuroimaging - Unsupervised methods for stratifying brain disorders - Deep learning in clinical neuroimaging - Model uncertainty in clinical predictions - ... In the clinical neuroimaging track, we seek contributions that explore how the application of advanced machine learning methods help us to move towards precision medicine for complex brain disorders. 
Topics of interest include but are not limited to: - Biomarker discovery - Refinement of nosology and diagnostics - Biological validation of clinical syndromes - Treatment outcome prediction - Course prediction - Analysis of wearable sensors - Neurogenetics and brain imaging genetics - Mechanistic modeling - Brain aging - ... Submission Process: The workshop seeks high-quality, original, and unpublished work that addresses one or more challenges described above. Papers should be submitted electronically in Springer Lecture Notes in Computer Science (LNCS) style (see https://www.miccai2020.org/en/PAPER-SUBMISSION-GUIDELINE.html#manuscript-format for detailed author guidelines) using the CMT system at https://cmt3.research.microsoft.com/MLCN2020. The page limit is 8 pages (text, figures, and tables) plus up to 2 pages of references. We review the submissions in a double-blind process. Please make sure that your submission is anonymous. Accepted papers will be published in joint proceedings with the MICCAI 2020 conference. Best Paper Award: This year, all MLCN accepted papers will be eligible for the best paper award. The recipient of the award will be chosen by the MLCN scientific committee based on the scientific quality and novelty of contributions. The winner will be announced at the end of the workshop and will receive a 500 USD honorarium. Important Dates: - Paper submission deadline: June 30th, 2020 - Notification of Acceptance: July 24th, 2020 - Camera-ready Submission: July 31st, 2020 - Workshop Date: 4 October 2020 -------------- next part -------------- An HTML attachment was scrubbed... URL: From joel.nothman at gmail.com Sun May 10 08:49:33 2020 From: joel.nothman at gmail.com (Joel Nothman) Date: Sun, 10 May 2020 22:49:33 +1000 Subject: [scikit-learn] Why the default max_samples of Random Forest is X.shape[0]? In-Reply-To: References: Message-ID: A bootstrap is very commonly a random draw with replacement of equal size to the original sample.
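To see what such a draw looks like, a quick NumPy sketch (the sample size n is arbitrary):

```python
# Sketch: a bootstrap draw has the same size as the data and is drawn with
# replacement, so about 1 - 1/e (~63.2%) of distinct rows appear in each draw.
import numpy as np

rng = np.random.default_rng(0)
n = 100_000
bootstrap_idx = rng.integers(0, n, size=n)        # n draws with replacement
unique_fraction = np.unique(bootstrap_idx).size / n
print(f"{unique_fraction:.3f}")                   # close to 0.632
```
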
-------------- next part -------------- An HTML attachment was scrubbed... URL: From fernando.wittmann at gmail.com Sun May 10 18:40:40 2020 From: fernando.wittmann at gmail.com (Fernando Marcos Wittmann) Date: Sun, 10 May 2020 19:40:40 -0300 Subject: [scikit-learn] Why the default max_samples of Random Forest is X.shape[0]? In-Reply-To: References: Message-ID: My question is why the full dataset is being used by default when building each tree. That's not random forest. The main point of RF is to build each tree with a subsample of the full dataset. On Sun, May 10, 2020, 09:50 Joel Nothman wrote: > A bootstrap is very commonly a random draw with replacement of equal size > to the original sample. > _______________________________________________ > scikit-learn mailing list > scikit-learn at python.org > https://mail.python.org/mailman/listinfo/scikit-learn > -------------- next part -------------- An HTML attachment was scrubbed... URL: From fernando.wittmann at gmail.com Sun May 10 18:42:41 2020 From: fernando.wittmann at gmail.com (Fernando Marcos Wittmann) Date: Sun, 10 May 2020 19:42:41 -0300 Subject: [scikit-learn] Why the default max_samples of Random Forest is X.shape[0]? In-Reply-To: References: Message-ID: Okay, so it's sampling with replacement with the same size as the original dataset. That means that some of the samples would be repeated in each tree. On Sun, May 10, 2020, 19:40 Fernando Marcos Wittmann < fernando.wittmann at gmail.com> wrote: > My question is why the full dataset is being used by default when building > each tree. That's not random forest. The main point of RF is to build each > tree with a subsample of the full dataset. > > On Sun, May 10, 2020, 09:50 Joel Nothman wrote: > >> A bootstrap is very commonly a random draw with replacement of equal size >> to the original sample.
>> _______________________________________________ >> scikit-learn mailing list >> scikit-learn at python.org >> https://mail.python.org/mailman/listinfo/scikit-learn >> > -------------- next part -------------- An HTML attachment was scrubbed... URL: From fernando.wittmann at gmail.com Mon May 11 01:20:07 2020 From: fernando.wittmann at gmail.com (Fernando Marcos Wittmann) Date: Mon, 11 May 2020 02:20:07 -0300 Subject: [scikit-learn] Why the default max_samples of Random Forest is X.shape[0]? In-Reply-To: References: Message-ID: Ohh, I can see now my mistake after reviewing the concept of bootstrapping and sampling with replacement. I was assuming that the "replacement" was made only after finishing each tree (i.e. if I was sampling 2/3 of the data, the very same data could be selected again for each tree, but no element would be repeated in a given tree). My apologies. Everything makes sense again. On Sun, May 10, 2020, 19:42 Fernando Marcos Wittmann < fernando.wittmann at gmail.com> wrote: > Okay, so it's sampling with replacement with same size of the original > dataset. That mean that some of the samples would be repeated for each tree > > On Sun, May 10, 2020, 19:40 Fernando Marcos Wittmann < > fernando.wittmann at gmail.com> wrote: > >> My question is why the full dataset is being used as default when >> building each tree. That's not random forest. The main point of RF is to >> build each tree with a subsample of the full dataset >> >> On Sun, May 10, 2020, 09:50 Joel Nothman wrote: >> >>> A bootstrap is very commonly a random draw with replacement of equal >>> size to the original sample. >>> _______________________________________________ >>> scikit-learn mailing list >>> scikit-learn at python.org >>> https://mail.python.org/mailman/listinfo/scikit-learn >>> >> -------------- next part -------------- An HTML attachment was scrubbed...
URL: From nelle.varoquaux at gmail.com Wed May 13 03:18:17 2020 From: nelle.varoquaux at gmail.com (Nelle Varoquaux) Date: Wed, 13 May 2020 09:18:17 +0200 Subject: [scikit-learn] Announcing the 2020 John Hunter Excellence in Plotting Contest Message-ID: Dear all, I apologize for the cross posting. This is a reminder for the John Hunter Excellence in Plotting Contest. In memory of John Hunter, we are pleased to announce the John Hunter Excellence in Plotting Contest for 2020. This open competition aims to highlight the importance of data visualization to scientific progress and showcase the capabilities of open source software. Participants are invited to submit scientific plots to be judged by a panel. The winning entries will be announced and displayed at SciPy 2020 or announced in the John Hunter Excellence in Plotting Contest website and youtube channel. John Hunter's family are graciously sponsoring cash prizes for the winners in the following amounts: - 1st prize: $1000 - 2nd prize: $750 - 3rd prize: $500 - Entries must be submitted by June 1st to the form at https://forms.gle/SrexmkDwiAmDc7ej7 - Winners will be announced at Scipy 2020 in Austin, TX or publicly on the John Hunter Excellence in Plotting Contest website and youtube channel - Participants do not need to attend the Scipy conference. - Entries may take the definition of "visualization" rather broadly. Entries may be, for example, a traditional printed plot, an interactive visualization for the web, a dashboard, or an animation. - Source code for the plot must be provided, in the form of Python code and/or a Jupyter notebook, along with a rendering of the plot in a widely used format. The rendering may be, for example, PDF for print, standalone HTML and Javascript for an interactive plot, or MPEG-4 for a video. If the original data can not be shared for reasons of size or licensing, "fake" data may be substituted, along with an image of the plot using real data.
- Each entry must include a 300-500 word abstract describing the plot and its importance for a general scientific audience. - Entries will be judged on their clarity, innovation and aesthetics, but most importantly for their effectiveness in communicating a real-world problem. Entrants are encouraged to submit plots that were used during the course of research or work, rather than merely being hypothetical. - SciPy and the John Hunter Excellence in Plotting Contest organizers reserve the right to display any and all entries, whether prize-winning or not, at the conference, and to use them in any materials or on its website, with attribution to the original author(s). - Past entries can be found at https://jhepc.github.io/ - Questions regarding the contest can be sent to jhepc.organizers at gmail.com John Hunter Excellence in Plotting Contest Co-Chairs Madicken Munk Nelle Varoquaux -------------- next part -------------- An HTML attachment was scrubbed... URL: From adrin.jalali at gmail.com Wed May 13 09:19:43 2020 From: adrin.jalali at gmail.com (Adrin) Date: Wed, 13 May 2020 15:19:43 +0200 Subject: [scikit-learn] ANN scikit-learn 0.23.0 release Message-ID: We're happy to announce the 0.23.0 release. You can read the release highlights under https://scikit-learn.org/stable/auto_examples/release_highlights/plot_release_highlights_0_23_0.html and the long version of the change log under https://scikit-learn.org/stable/whats_new/v0.23.html#version-0-23-0 On top of a few exciting features, we're also deprecating positional arguments in many places where the constructor/method accepts many arguments. For example, SVC(.5, "poly") will need to be expressed as SVC(C=.5, kernel="poly"), and SVC(C, kernel) as SVC(C=C, kernel=kernel). This version supports Python versions 3.6 to 3.8. You can give it a go using `pip install -U scikit-learn` or alternatively `conda install -c conda-forge scikit-learn` which should be there soon.
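To make the deprecation above concrete: the keyword form already works in all released versions, so code can be migrated right away (SVC is just the example used in the announcement; the same applies to most estimators):

```python
from sklearn.svm import SVC

# Deprecated positional style: SVC(.5, "poly")
# Future-proof keyword style, identical behavior:
clf = SVC(C=0.5, kernel="poly")

print(clf.C, clf.kernel)  # 0.5 poly
```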
Regards, Adrin, on behalf of the scikit-learn maintainer team. -------------- next part -------------- An HTML attachment was scrubbed... URL: From olivier.grisel at ensta.org Wed May 13 11:22:05 2020 From: olivier.grisel at ensta.org (Olivier Grisel) Date: Wed, 13 May 2020 17:22:05 +0200 Subject: [scikit-learn] ANN scikit-learn 0.23.0 release In-Reply-To: References: Message-ID: Congrats on the release! And thank you very much to all those who were involved in making it happen (and Adrin in particular)! -- Olivier From godefroi.catherine at gmail.com Wed May 13 13:15:38 2020 From: godefroi.catherine at gmail.com (Compte.validation) Date: Wed, 13 May 2020 19:15:38 +0200 Subject: [scikit-learn] ANN scikit-learn 0.23.0 release In-Reply-To: References: Message-ID: STOP MAILING ME Le mer. 13 mai 2020 à 17:23, Olivier Grisel a écrit : > Congrats on the release! And thank you very much to all those who were > involved in making it happen (and Adrin in particular)! > > -- > Olivier > _______________________________________________ > scikit-learn mailing list > scikit-learn at python.org > https://mail.python.org/mailman/listinfo/scikit-learn > -------------- next part -------------- An HTML attachment was scrubbed... URL: From vimalthilak at gmail.com Wed May 13 13:32:27 2020 From: vimalthilak at gmail.com (Vimal Thilak) Date: Wed, 13 May 2020 10:32:27 -0700 Subject: [scikit-learn] ANN scikit-learn 0.23.0 release In-Reply-To: References: Message-ID: On Wed, May 13, 2020 at 10:17 AM Compte.validation < godefroi.catherine at gmail.com> wrote: > STOP MAILING ME > This mailing list is an opt-in and as such you have complete freedom to opt-out via https://mail.python.org/mailman/listinfo/scikit-learn Best, > > Le mer. 13 mai 2020 à 17:23, Olivier Grisel a > écrit : > >> Congrats on the release! And thank you very much to all those who were >> involved in making it happen (and Adrin in particular)!
>> >> -- >> Olivier >> _______________________________________________ >> scikit-learn mailing list >> scikit-learn at python.org >> https://mail.python.org/mailman/listinfo/scikit-learn >> > _______________________________________________ > scikit-learn mailing list > scikit-learn at python.org > https://mail.python.org/mailman/listinfo/scikit-learn > -------------- next part -------------- An HTML attachment was scrubbed... URL: From t3kcit at gmail.com Tue May 19 12:43:54 2020 From: t3kcit at gmail.com (t3kcit at gmail.com) Date: Tue, 19 May 2020 12:43:54 -0400 Subject: [scikit-learn] major league hacking summer internship program Message-ID: <066301d62dfc$af92cb90$0eb862b0$@gmail.com> Hey Folks. This program reached out to me: https://news.mlh.io/mlh-fellowship-the-future-of-tech-internships-05-04-2020 What do you think? Sounds like GSOC but with extra mentorship, so it might be a good fit for us? I would say it depends on what level of involvement they require from us. Best, Andy -------------- next part -------------- An HTML attachment was scrubbed... URL: From marmochiaskl at gmail.com Tue May 19 13:49:50 2020 From: marmochiaskl at gmail.com (Chiara Marmo) Date: Tue, 19 May 2020 19:49:50 +0200 Subject: [scikit-learn] Notes core-dev meeting May 25th Message-ID: Dear core-devs, I've taken the liberty of building a skeleton for the meeting notes. Please, have a look and let me know if this is useful. For each section, please consider adding no more than two issues or PRs per person to be discussed; this will allow everybody to speak and let us close the meeting in a reasonable time. Also, it would be really useful to have all the contributions before the week-end so that everybody can take a look at the referenced discussions. A more detailed reminder will be sent soon with a call for contributions from the community and the details for the connection. Thanks for your attention. Best, Chiara -------------- next part -------------- An HTML attachment was scrubbed...
URL: From adrin.jalali at gmail.com Tue May 19 15:32:52 2020 From: adrin.jalali at gmail.com (Adrin) Date: Tue, 19 May 2020 21:32:52 +0200 Subject: [scikit-learn] ANN: scikit-learn 0.23.1 release Message-ID: We're happy to announce the 0.23.1 release which fixes a few issues affecting many users, namely: K-Means should be faster for small sample sizes, and the representation of third-party estimators was fixed. You can check this version out using: pip install -U scikit-learn You can see the changelog here: https://scikit-learn.org/stable/whats_new/v0.23.html#version-0-23-1 The conda-forge builds will be available shortly, which you can then install using: conda install -c conda-forge scikit-learn On behalf of the scikit-learn development community, Adrin -------------- next part -------------- An HTML attachment was scrubbed... URL: From adrin.jalali at gmail.com Tue May 19 15:39:21 2020 From: adrin.jalali at gmail.com (Adrin) Date: Tue, 19 May 2020 21:39:21 +0200 Subject: [scikit-learn] Notes core-dev meeting May 25th In-Reply-To: References: Message-ID: Thanks Chiara, I think I'm missing the link to the agenda. Where should I find it? Thanks, Adrin On Tue, May 19, 2020 at 7:51 PM Chiara Marmo wrote: > Dear core-devs, > > I've taken the liberty to build a skeleton for the meeting notes. > Please, have a look and let me know if this is useful. > > For each section, please, consider to add no more than two issues or PRs > per person to be discussed, this will allow everybody speak and close the > meeting in a reasonable time. > Also, it could be really useful to have all the contributions before the > week-end as everybody could take a look to the referenced discussions. > > A more detailed reminder will be sent soon with a call for contribution > from the community and the details for the connection. > > Thanks for your attention. 
> > Best, > Chiara > _______________________________________________ > scikit-learn mailing list > scikit-learn at python.org > https://mail.python.org/mailman/listinfo/scikit-learn > -------------- next part -------------- An HTML attachment was scrubbed... URL: From niourf at gmail.com Tue May 19 15:41:07 2020 From: niourf at gmail.com (Nicolas Hug) Date: Tue, 19 May 2020 15:41:07 -0400 Subject: [scikit-learn] Notes core-dev meeting May 25th In-Reply-To: References: Message-ID: <05bef331-9f89-5012-3e2a-cd2e65db1bd6@gmail.com> https://hackmd.io/4VeWX5H9Tlmz132WAD-Q0w On 5/19/20 3:39 PM, Adrin wrote: > Thanks Chiara, > > I think I'm missing the link to the agenda. Where should I find it? > > Thanks, > Adrin > > On Tue, May 19, 2020 at 7:51 PM Chiara Marmo > wrote: > > Dear core-devs, > > I've taken the liberty to build a skeleton for the meeting notes. > Please, have a look and let me know if this is useful. > > For each section, please, consider to add no more than two issues > or PRs per person to be discussed, this will allow everybody speak > and close the meeting in a reasonable time. > Also, it could be really useful to have all the contributions > before the week-end as everybody could take a look to the > referenced discussions. > > A more detailed reminder will be sent soon with a call for > contribution from the community and the details for the connection. > > Thanks for your attention. > > Best, > Chiara > _______________________________________________ > scikit-learn mailing list > scikit-learn at python.org > https://mail.python.org/mailman/listinfo/scikit-learn > > > _______________________________________________ > scikit-learn mailing list > scikit-learn at python.org > https://mail.python.org/mailman/listinfo/scikit-learn -------------- next part -------------- An HTML attachment was scrubbed... 
URL: From adrin.jalali at gmail.com Tue May 19 15:42:26 2020 From: adrin.jalali at gmail.com (Adrin) Date: Tue, 19 May 2020 21:42:26 +0200 Subject: [scikit-learn] major league hacking summer internship program In-Reply-To: <066301d62dfc$af92cb90$0eb862b0$@gmail.com> References: <066301d62dfc$af92cb90$0eb862b0$@gmail.com> Message-ID: Sounds pretty cool to me. On Tue, May 19, 2020 at 6:45 PM wrote: > Hey Folks. > > This program reached out to me: > > > https://news.mlh.io/mlh-fellowship-the-future-of-tech-internships-05-04-2020 > > > > What do you think? > > Sounds like GSOC but with extra mentorship, so it might be a good fit for > us? > > I would say it depends on what level of involvement they require from us. > > > > Best, > > Andy > > > _______________________________________________ > scikit-learn mailing list > scikit-learn at python.org > https://mail.python.org/mailman/listinfo/scikit-learn > -------------- next part -------------- An HTML attachment was scrubbed... URL: From marmochiaskl at gmail.com Tue May 19 16:58:41 2020 From: marmochiaskl at gmail.com (Chiara Marmo) Date: Tue, 19 May 2020 22:58:41 +0200 Subject: [scikit-learn] Notes core-dev meeting May 25th In-Reply-To: <05bef331-9f89-5012-3e2a-cd2e65db1bd6@gmail.com> References: <05bef331-9f89-5012-3e2a-cd2e65db1bd6@gmail.com> Message-ID: Thanks Nicolas for answering. I think I'm missing the link to the agenda. Where should I find it? > > Adrin, just as a side note, you could also find the address of the pad in > the invitations I've sent for the meeting. > Best, > Chiara > -------------- next part -------------- An HTML attachment was scrubbed... URL: From godefroi.catherine at gmail.com Tue May 19 17:17:34 2020 From: godefroi.catherine at gmail.com (Compte.validation) Date: Tue, 19 May 2020 23:17:34 +0200 Subject: [scikit-learn] Notes core-dev meeting May 25th In-Reply-To: References: <05bef331-9f89-5012-3e2a-cd2e65db1bd6@gmail.com> Message-ID: STOP MAIL Le mar. 19 mai 2020 à
23:00, Chiara Marmo a écrit : > Thanks Nicolas for answering. > > I think I'm missing the link to the agenda. Where should I find it? >> >> Adrin, just as a side note, you could also find the address of the pad in > the invitations I've sent for the meeting. > Best, > Chiara > _______________________________________________ > scikit-learn mailing list > scikit-learn at python.org > https://mail.python.org/mailman/listinfo/scikit-learn > -------------- next part -------------- An HTML attachment was scrubbed... URL: From marmochiaskl at gmail.com Fri May 22 05:10:55 2020 From: marmochiaskl at gmail.com (Chiara Marmo) Date: Fri, 22 May 2020 11:10:55 +0200 Subject: [scikit-learn] scikit-learn monthly meeting May 25th Message-ID: Hi all, The next scikit-learn monthly meeting will take place on Monday May 25th at 12PM UTC: https://www.timeanddate.com/worldclock/meetingdetails.html?year=2020&month=5&day=25&hour=12&min=0&sec=0&p1=240&p2=33&p3=37&p4=179&p5=195 While these meetings are mainly for core-devs to discuss the current topics, we're also happy to welcome non-core devs and other projects' maintainers! Feel free to join, using the following link: https://meet.google.com/xhq-yoga-rtf If you plan to attend and you would like to discuss something specific about your contribution, please add your name (or github pseudo) in the "Contributors" section of the public pad: https://hackmd.io/4VeWX5H9Tlmz132WAD-Q0w *@core devs, please make sure to update your notes before the week-end.* Best, Chiara -------------- next part -------------- An HTML attachment was scrubbed... URL: From niourf at gmail.com Fri May 22 07:32:41 2020 From: niourf at gmail.com (Nicolas Hug) Date: Fri, 22 May 2020 07:32:41 -0400 Subject: [scikit-learn] scikit-learn monthly meeting May 25th In-Reply-To: References: Message-ID: <97816aab-adbd-f20d-f50b-fa20b4e83a70@gmail.com> These were last month's notes ;)
(the text of the link was correct, but the href wasn't) The new pad is at https://hackmd.io/4VeWX5H9Tlmz132WAD-Q0w On 5/22/20 5:10 AM, Chiara Marmo wrote: > > Hi all, > > The next scikit-learn monthly meeting will take place on Monday May > 25th at 12PM UTC: > https://www.timeanddate.com/worldclock/meetingdetails.html?year=2020&month=5&day=25&hour=12&min=0&sec=0&p1=240&p2=33&p3=37&p4=179&p5=195 > > While these meetings are mainly for core-devs to discuss the current > topics, we're also happy to welcome non-core devs and other projects > maintainers! Feel free to join, using the following link: > > https://meet.google.com/xhq-yoga-rtf > > If you plan to attend and you would like to discuss something specific > about your contribution please add your name (or github pseudo) in the > "Contributors " > section, of the public pad: > > https://hackmd.io/4VeWX5H9Tlmz132WAD-Q0w > > > > *@core devs, please make sure to update your notes before the week-end. > * > > > Best, > > Chiara > > > _______________________________________________ > scikit-learn mailing list > scikit-learn at python.org > https://mail.python.org/mailman/listinfo/scikit-learn -------------- next part -------------- An HTML attachment was scrubbed... URL: From andreasmuellerml at gmail.com Mon May 25 16:41:48 2020 From: andreasmuellerml at gmail.com (Andreas C. Mueller) Date: Mon, 25 May 2020 16:41:48 -0400 Subject: [scikit-learn] Welcome to the TC Adrin! Message-ID: <001101d632d4$ea133070$be399150$@gmail.com> Dear all! It's my pleasure to announce that Adrin has been accepted into the Technical Committee unanimously! Thank you Adrin for all the hard work so far and for your contributions to the community! Best, Andy Ps: yes I have a new email address -------------- next part -------------- An HTML attachment was scrubbed... 
URL: From adrin.jalali at gmail.com Tue May 26 03:23:11 2020 From: adrin.jalali at gmail.com (Adrin) Date: Tue, 26 May 2020 09:23:11 +0200 Subject: [scikit-learn] Welcome to the TC Adrin! In-Reply-To: <001101d632d4$ea133070$be399150$@gmail.com> References: <001101d632d4$ea133070$be399150$@gmail.com> Message-ID: Hi, Thank you all for all your support, patience, and your trust in me since I started :) I hope to serve the community well. Best, Adrin On Mon, May 25, 2020 at 10:43 PM Andreas C. Mueller < andreasmuellerml at gmail.com> wrote: > Dear all! > > > > It's my pleasure to announce that Adrin has been accepted into the > Technical Committee unanimously! > > Thank you Adrin for all the hard work so far and for your contributions to > the community! > > > > Best, > > Andy > > > > Ps: yes I have a new email address > _______________________________________________ > scikit-learn mailing list > scikit-learn at python.org > https://mail.python.org/mailman/listinfo/scikit-learn > -------------- next part -------------- An HTML attachment was scrubbed... URL: From g.lemaitre58 at gmail.com Wed May 27 10:25:55 2020 From: g.lemaitre58 at gmail.com (=?UTF-8?Q?Guillaume_Lema=C3=AEtre?=) Date: Wed, 27 May 2020 16:25:55 +0200 Subject: [scikit-learn] Random Binning Features In-Reply-To: References: Message-ID: The algorithms in scikit-learn-extra are usually algorithms which did not meet the inclusion criteria (publication too recent, not enough citations, etc.). However, the code quality is as good and as well tested as in scikit-learn (usually they were PRs in the main repository). Proceeding this way allows us to see the impact of the algorithms in practice and maybe consider waiving the inclusion criterion. On Sat, 2 May 2020 at 06:59, sai_ng wrote: > Hey folks ! > Hope you're all doing well. > > I'm developing a Random Fourier Feature implementation in C++ for a > repository.
Scikit's implementation of RBFSampler has been really helpful, > and I must say that I'm charmed by how compact, yet powerful, each line of > code is. I'm writing this mail because I couldn't find your implementation of > Random Binning Features; is it under development? I tried searching in the > issues but, to no avail. I noticed you've put a few of your algorithms in a > different repository, for example: > https://scikit-learn-extra.readthedocs.io/en/latest/generated/sklearn_extra.kernel_approximation.Fastfood.html. > > Overall, I'd like to know if it's under development or has there been any > draft/proposal or is it already implemented. I'd greatly appreciate it if you > could point me to other sources (if not here) which have successfully > implemented it in code (preferably Python/C++) > > "Hit me back, > Just to chat, > Your biggest fan, > This is stan" > ~ Eminem: Stan > > > > _______________________________________________ > scikit-learn mailing list > scikit-learn at python.org > https://mail.python.org/mailman/listinfo/scikit-learn > -- Guillaume Lemaitre Scikit-learn @ Inria Foundation https://glemaitre.github.io/ -------------- next part -------------- An HTML attachment was scrubbed... URL: From g.lemaitre58 at gmail.com Wed May 27 11:14:06 2020 From: g.lemaitre58 at gmail.com (=?UTF-8?Q?Guillaume_Lema=C3=AEtre?=) Date: Wed, 27 May 2020 17:14:06 +0200 Subject: [scikit-learn] Class weight SVC In-Reply-To: References: Message-ID: I don't think that we rescale the class weights, and therefore the results should be different. On Fri, 24 Apr 2020 at 12:41, Francesco basciani wrote: > Hi, I have a question regarding the class weights in SVC. I have an > imbalanced binary classification problem. In my case the ratio between the > positive class and the negative class is 4:1. I just want to know if > setting class weight to: > > class_weight = {1: 0.25, 0: 1} is the same as setting it to: > > class_weight = {1: 1, 0: 4}.
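Guillaume's answer above can be made concrete: in SVC, class_weight[i] multiplies C for class i and the weights are not normalized, so {1: 0.25, 0: 1} and {1: 1, 0: 4} keep the same 1:4 ratio but correspond to different absolute regularization strengths. A small sketch on synthetic data (the dataset and parameters are made up for illustration):

```python
import numpy as np
from sklearn.datasets import make_blobs
from sklearn.svm import SVC

# Overlapping two-class data, so that the regularization strength matters.
X, y = make_blobs(n_samples=200, centers=2, cluster_std=3.0, random_state=0)

# Same 1:4 ratio, but different absolute per-class C values -> different fits.
a = SVC(kernel="linear", C=1.0, class_weight={1: 0.25, 0: 1}).fit(X, y)
b = SVC(kernel="linear", C=1.0, class_weight={1: 1, 0: 4}).fit(X, y)
print(np.allclose(a.coef_, b.coef_))  # the two settings generally disagree

# What IS equivalent: multiplying every class weight by k is the same as
# multiplying C by k, because the effective per-class C is C * weight.
c = SVC(kernel="linear", C=1.0, class_weight={0: 4, 1: 4}).fit(X, y)
d = SVC(kernel="linear", C=4.0).fit(X, y)
print(np.allclose(c.coef_, d.coef_))  # True
```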
> > > > Because in my case I obtain different results using the two definitions of > the class weight > > > > Sent from Mail for > Windows 10 > > > _______________________________________________ > scikit-learn mailing list > scikit-learn at python.org > https://mail.python.org/mailman/listinfo/scikit-learn > -- Guillaume Lemaitre Scikit-learn @ Inria Foundation https://glemaitre.github.io/ -------------- next part -------------- An HTML attachment was scrubbed... URL: From g.lemaitre58 at gmail.com Wed May 27 11:53:15 2020 From: g.lemaitre58 at gmail.com (=?UTF-8?Q?Guillaume_Lema=C3=AEtre?=) Date: Wed, 27 May 2020 17:53:15 +0200 Subject: [scikit-learn] [GridSearchCV] Reduction of elapsed time at the second iteration In-Reply-To: References: Message-ID: Regarding scikit-learn, the only thing that we cache is the transformer processing in the pipeline (see the memory parameter in Pipeline). It seems that you are passing a different set of features at each iteration. Is the number of features different? On Sun, 29 Mar 2020 at 19:23, Pedro Cardoso wrote: > Hello fellows, > > I am new to sklearn and I have a question about GridSearchCV: > > I am running the following code in a Jupyter notebook: > > ----------------------*code*------------------------------- > > opt_models = dict() > for feature in [features1, features2, features3, features4]: > cmb = CMB(x_train, y_train, x_test, y_test, feature) > cmb.fit() > cmb.predict() > opt_models[str(feature)]=cmb.get_best_model() > > ------------------------------------------------------- > > The CMB class is just a class that contains different classification > models (SVC, decision tree, etc...). When cmb.fit() is running, a > GridSearchCV is performed on the SVC model (which is within the cmb > instance) in order to tune the hyperparameters C, gamma, and kernel. The > SVC model is implemented using the sklearn.svm.SVC class.
Here is the > output of the first and second iteration of the for loop: > > ---------------------*output*------------------------------------- > -> 1st iteration > > > Fitting 5 folds for each of 12 candidates, totalling 60 fits > > [Parallel(n_jobs=-1)]: Using backend LokyBackend with 4 concurrent workers. > [Parallel(n_jobs=-1)]: Done 1 tasks | elapsed: 6.1s > [Parallel(n_jobs=-1)]: Done 2 tasks | elapsed: 6.1s > [Parallel(n_jobs=-1)]: Done 3 tasks | elapsed: 6.1s > [Parallel(n_jobs=-1)]: Done 4 tasks | elapsed: 6.2s > [Parallel(n_jobs=-1)]: Done 5 tasks | elapsed: 6.2s > [Parallel(n_jobs=-1)]: Done 6 tasks | elapsed: 6.2s > [Parallel(n_jobs=-1)]: Done 7 tasks | elapsed: 6.2s > [Parallel(n_jobs=-1)]: Done 8 tasks | elapsed: 6.2s > [Parallel(n_jobs=-1)]: Done 9 tasks | elapsed: 6.2s > [Parallel(n_jobs=-1)]: Done 10 tasks | elapsed: 6.2s > [Parallel(n_jobs=-1)]: Done 11 tasks | elapsed: 6.2s > [Parallel(n_jobs=-1)]: Done 12 tasks | elapsed: 6.3s > [Parallel(n_jobs=-1)]: Done 13 tasks | elapsed: 6.3s > [Parallel(n_jobs=-1)]: Done 14 tasks | elapsed: 6.3s > [Parallel(n_jobs=-1)]: Done 15 tasks | elapsed: 6.4s > [Parallel(n_jobs=-1)]: Done 16 tasks | elapsed: 6.4s > [Parallel(n_jobs=-1)]: Done 17 tasks | elapsed: 6.4s > [Parallel(n_jobs=-1)]: Done 18 tasks | elapsed: 6.4s > [Parallel(n_jobs=-1)]: Done 19 tasks | elapsed: 6.5s > [Parallel(n_jobs=-1)]: Done 20 tasks | elapsed: 6.5s > [Parallel(n_jobs=-1)]: Done 21 tasks | elapsed: 6.5s > [Parallel(n_jobs=-1)]: Done 22 tasks | elapsed: 6.6s > [Parallel(n_jobs=-1)]: Done 23 tasks | elapsed: 6.7s > [Parallel(n_jobs=-1)]: Done 24 tasks | elapsed: 6.7s > [Parallel(n_jobs=-1)]: Done 25 tasks | elapsed: 6.7s > [Parallel(n_jobs=-1)]: Done 26 tasks | elapsed: 6.8s > [Parallel(n_jobs=-1)]: Done 27 tasks | elapsed: 6.8s > [Parallel(n_jobs=-1)]: Done 28 tasks | elapsed: 6.9s > [Parallel(n_jobs=-1)]: Done 29 tasks | elapsed: 6.9s > [Parallel(n_jobs=-1)]: Done 30 tasks | elapsed: 6.9s > [Parallel(n_jobs=-1)]: Done 31 tasks | elapsed: 
7.0s > [Parallel(n_jobs=-1)]: Done 32 tasks | elapsed: 7.0s > [Parallel(n_jobs=-1)]: Done 33 tasks | elapsed: 7.0s > [Parallel(n_jobs=-1)]: Done 34 tasks | elapsed: 7.0s > [Parallel(n_jobs=-1)]: Done 35 tasks | elapsed: 7.1s > [Parallel(n_jobs=-1)]: Done 36 tasks | elapsed: 7.1s > [Parallel(n_jobs=-1)]: Done 37 tasks | elapsed: 7.2s > [Parallel(n_jobs=-1)]: Done 38 tasks | elapsed: 7.2s > [Parallel(n_jobs=-1)]: Done 39 tasks | elapsed: 7.2s > [Parallel(n_jobs=-1)]: Done 40 tasks | elapsed: 7.2s > [Parallel(n_jobs=-1)]: Done 41 tasks | elapsed: 7.3s > [Parallel(n_jobs=-1)]: Done 42 tasks | elapsed: 7.3s > [Parallel(n_jobs=-1)]: Done 43 tasks | elapsed: 7.3s > [Parallel(n_jobs=-1)]: Done 44 tasks | elapsed: 7.4s > [Parallel(n_jobs=-1)]: Done 45 tasks | elapsed: 7.4s > [Parallel(n_jobs=-1)]: Done 46 tasks | elapsed: 7.5s > > > -> 2nd iteration > > Fitting 5 folds for each of 12 candidates, totalling 60 fits > > [Parallel(n_jobs=-1)]: Using backend LokyBackend with 4 concurrent workers. > [Parallel(n_jobs=-1)]: Done 1 tasks | elapsed: 0.0s > [Parallel(n_jobs=-1)]: Batch computation too fast (0.0260s.) Setting batch_size=14. > [Parallel(n_jobs=-1)]: Done 2 tasks | elapsed: 0.0s > [Parallel(n_jobs=-1)]: Done 3 tasks | elapsed: 0.0s > [Parallel(n_jobs=-1)]: Done 4 tasks | elapsed: 0.0s > [Parallel(n_jobs=-1)]: Done 5 tasks | elapsed: 0.0s > [Parallel(n_jobs=-1)]: Done 60 out of 60 | elapsed: 0.7s finished > > --------------------------------------------------------------------------------------------------------------------- > > > As you can see, the first iteration gets an elapsed time much larger than > the 2nd iteration. Does this make sense? I am afraid that the model is doing > some kind of caching or shortcut from the 1st iteration, which consequently > could decrease the model training/performance. I already read the sklearn > documentation and I didn't see any warning/note about this kind of > behaviour.
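As a side note on the caching Guillaume refers to: a Pipeline caches its fitted transformers only when the memory parameter is set; by default nothing is cached between fits. A minimal sketch (the estimators and data here are illustrative, not the poster's CMB class):

```python
from shutil import rmtree
from tempfile import mkdtemp

from sklearn.datasets import make_classification
from sklearn.decomposition import PCA
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

X, y = make_classification(n_samples=300, n_features=20, random_state=0)

cachedir = mkdtemp()
pipe = Pipeline(
    steps=[("scaler", StandardScaler()), ("pca", PCA(n_components=5)), ("clf", SVC())],
    memory=cachedir,  # fitted transformers are stored on disk here
)

pipe.fit(X, y)  # transformers are fitted and written to the cache
pipe.fit(X, y)  # identical data/params: transformer fits are loaded from the cache

print(pipe.score(X, y) > 0.5)
rmtree(cachedir)  # clean up the cache directory
```

Only the transformer steps are cached; the final estimator is always refit, so enabling the cache cannot silently change model quality.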
> > Thank you very much for your time :) > > > > > > _______________________________________________ > scikit-learn mailing list > scikit-learn at python.org > https://mail.python.org/mailman/listinfo/scikit-learn > -- Guillaume Lemaitre Scikit-learn @ Inria Foundation https://glemaitre.github.io/ -------------- next part -------------- An HTML attachment was scrubbed... URL: From andreasmuellerml at gmail.com Thu May 28 17:34:02 2020 From: andreasmuellerml at gmail.com (Andreas C. Mueller) Date: Thu, 28 May 2020 17:34:02 -0400 Subject: [scikit-learn] major league hacking summer internship program In-Reply-To: References: <066301d62dfc$af92cb90$0eb862b0$@gmail.com> Message-ID: <050001d63537$b4fdbe90$1ef93bb0$@gmail.com> Hi Folks. So this program sounds pretty cool. They preselected some people for an ML work group, who will be doing daily standups together and pair programming, and who might move around between some related projects over the 12 weeks of the program. They made sure to get a diverse set of students and they have an engineer who will supervise them. They would probably have 2-3 students working on sklearn. They don't expect one big feature but they do expect some guidance on what issues to work on. Also, the program starts on Monday, and they start contributing to OSS projects about a week after that. Ideally we'd tell them if we're in or not before Monday, and have a tentative list of issues / projects. What do you all think? Also, if we want to do it, who would have cycles for some reviewing? This seems to be well organized and they seem to have put quite some thought into it, but we do need to do a little bit of work on our end. I can try picking some issues but I probably can't commit a lot of reviewing time. Cheers, Andy From: scikit-learn On Behalf Of Adrin Sent: Tuesday, May 19, 2020 3:42 PM To: Scikit-learn mailing list Subject: Re: [scikit-learn] major league hacking summer internship program Sounds pretty cool to me.
On Tue, May 19, 2020 at 6:45 PM > wrote: Hey Folks. This program reached out to me: https://news.mlh.io/mlh-fellowship-the-future-of-tech-internships-05-04-2020 What do you think? Sounds like GSOC but with extra mentorship, so it might be a good fit for us? I would say it depends on what level of involvement they require from us. Best, Andy _______________________________________________ scikit-learn mailing list scikit-learn at python.org https://mail.python.org/mailman/listinfo/scikit-learn -------------- next part -------------- An HTML attachment was scrubbed... URL: From cdwijayarathna at gmail.com Fri May 29 01:03:12 2020 From: cdwijayarathna at gmail.com (Chamila Wijayarathna) Date: Fri, 29 May 2020 10:33:12 +0530 Subject: [scikit-learn] sklearn Pipeline: argument of type 'ColumnTransformer' is not iterable Message-ID: Hello all, I hope I am writing to the correct mailing list about this issue that I am having. Please excuse me if I am not. I am attempting to use a pipeline to feed an ensemble voting classifier, as I want the ensemble learner to use models that train on different feature sets. For this purpose, I followed the tutorial available at [1]. Following is the code that I could develop so far.
y = df1.index
x = preprocessing.scale(df1)

phy_features = ['A', 'B', 'C']
phy_transformer = Pipeline(steps=[('imputer', SimpleImputer(strategy='median')), ('scaler', StandardScaler())])
phy_processer = ColumnTransformer(transformers=[('phy', phy_transformer, phy_features)])

fa_features = ['D', 'E', 'F']
fa_transformer = Pipeline(steps=[('imputer', SimpleImputer(strategy='median')), ('scaler', StandardScaler())])
fa_processer = ColumnTransformer(transformers=[('fa', fa_transformer, fa_features)])

pipe_phy = Pipeline(steps=[('preprocessor', phy_processer), ('classifier', SVM)])
pipe_fa = Pipeline(steps=[('preprocessor', fa_processer), ('classifier', SVM)])

ens = VotingClassifier(estimators=[pipe_phy, pipe_fa])

cv = KFold(n_splits=10, random_state=None, shuffle=True)
for train_index, test_index in cv.split(x):
    x_train, x_test = x[train_index], x[test_index]
    y_train, y_test = y[train_index], y[test_index]
    ens.fit(x_train, y_train)
    print(ens.score(x_test, y_test))

However, when running the code, I am getting an error saying TypeError: argument of type 'ColumnTransformer' is not iterable, at the line ens.fit(x_train, y_train).

What is the reason for this and how can I fix it?

Thank you,
Chamila
-------------- next part -------------- An HTML attachment was scrubbed... URL: From cdwijayarathna at gmail.com Fri May 29 02:30:46 2020 From: cdwijayarathna at gmail.com (Chamila Wijayarathna) Date: Fri, 29 May 2020 12:00:46 +0530 Subject: [scikit-learn] sklearn Pipeline: argument of type 'ColumnTransformer' is not iterable In-Reply-To: References: Message-ID:

Hi all,

I did manage to get the code to run using a workaround, which is a bit ugly.

Following is the complete stacktrace of the error I was receiving.
Traceback (most recent call last):
  File "", line 1, in
  File "C:\Program Files\JetBrains\PyCharm 2020.1.1\plugins\python\helpers\pydev\_pydev_bundle\pydev_umd.py", line 197, in runfile
    pydev_imports.execfile(filename, global_vars, local_vars)  # execute the script
  File "C:\Program Files\JetBrains\PyCharm 2020.1.1\plugins\python\helpers\pydev\_pydev_imps\_pydev_execfile.py", line 18, in execfile
    exec(compile(contents+"\n", file, 'exec'), glob, loc)
  File "C:/Users/ASUS/PycharmProjects/swelltest/enemble.py", line 112, in
    ens.fit(x_train,y_train)
  File "C:\Users\ASUS\PycharmProjects\swelltest\venv\lib\site-packages\sklearn\ensemble\_voting.py", line 265, in fit
    return super().fit(X, transformed_y, sample_weight)
  File "C:\Users\ASUS\PycharmProjects\swelltest\venv\lib\site-packages\sklearn\ensemble\_voting.py", line 65, in fit
    names, clfs = self._validate_estimators()
  File "C:\Users\ASUS\PycharmProjects\swelltest\venv\lib\site-packages\sklearn\ensemble\_base.py", line 228, in _validate_estimators
    self._validate_names(names)
  File "C:\Users\ASUS\PycharmProjects\swelltest\venv\lib\site-packages\sklearn\utils\metaestimators.py", line 77, in _validate_names
    invalid_names = [name for name in names if '__' in name]
  File "C:\Users\ASUS\PycharmProjects\swelltest\venv\lib\site-packages\sklearn\utils\metaestimators.py", line 77, in
    invalid_names = [name for name in names if '__' in name]
TypeError: argument of type 'ColumnTransformer' is not iterable

Following are the inputs in the 'names' list at the time of the error.

1- ColumnTransformer(transformers=[('phy', Pipeline(steps=[('imputer', SimpleImputer(strategy='median')), ('scaler', StandardScaler())]), ['HR', 'RMSSD', 'SCL'])])
2- ColumnTransformer(transformers=[('fa', Pipeline(steps=[('imputer', SimpleImputer(strategy='median')), ('scaler', StandardScaler())]), ['Squality', 'Sneutral', 'Shappy'])])

It seems the library is attempting to search for the '__' substring in the ColumnTransformer object, which it is unable to do.
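The TypeError here is just Python's membership test failing: `'__' in obj` requires the right-hand object to be iterable (or to define `__contains__`). A minimal sketch, using a hypothetical stand-in class rather than scikit-learn itself:

```python
# Minimal reproduction of the failure mode, independent of scikit-learn.
# 'NotIterable' is a hypothetical stand-in for the ColumnTransformer that
# ended up where _validate_names expected a name string.
class NotIterable:
    pass

try:
    '__' in NotIterable()  # the same membership test _validate_names performs
except TypeError as exc:
    print(exc)  # argument of type 'NotIterable' is not iterable
```

So the check itself is fine; the problem is that a bare estimator object, rather than a name string, reached it.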
Since this name check doesn't have a significant effect on my functionality, I commented out the following snippet in sklearn\utils\metaestimators.py.

invalid_names = [name for name in names if '__' in name]
if invalid_names:
    raise ValueError('Estimator names must not contain __: got '
                     '{0!r}'.format(invalid_names))

Please let me know if there is a better workaround or whether there are any issues with commenting out this code.

Thanks

On Fri, May 29, 2020 at 10:33 AM Chamila Wijayarathna < cdwijayarathna at gmail.com> wrote:

> Hello all,
>
> I hope I am writing to the correct mailing list about this issue I am having.
>
> I am attempting to use a pipeline to feed an ensemble voting classifier as I want the ensemble learner to use models that train on different feature sets. For this purpose, I followed the tutorial available at [1].
>
> However, when running the code, I am getting an error saying TypeError: argument of type 'ColumnTransformer' is not iterable, at the line ens.fit(x_train,y_train).
>
> What is the reason for this and how can I fix it?
>
> Thank you,
> Chamila

-- Chamila Dilshan Wijayarathna, PhD Research Student The University of New South Wales (UNSW Canberra) Australian Centre for Cyber Security Australian Defence Force Academy PO Box 7916, Canberra BA ACT 2610 Australia Mobile:(+61)416895795 -------------- next part -------------- An HTML attachment was scrubbed... URL: From adrin.jalali at gmail.com Fri May 29 05:40:48 2020 From: adrin.jalali at gmail.com (Adrin) Date: Fri, 29 May 2020 11:40:48 +0200 Subject: [scikit-learn] major league hacking summer internship program In-Reply-To: <050001d63537$b4fdbe90$1ef93bb0$@gmail.com> References: <066301d62dfc$af92cb90$0eb862b0$@gmail.com> <050001d63537$b4fdbe90$1ef93bb0$@gmail.com> Message-ID:

Thanks Andy, sounds pretty cool.

I can commit some reviewing time. There should be maybe two of us at least that they know they can ping, and we can ping others if needed.

Cheers,
Adrin

On Thu, May 28, 2020 at 11:35 PM Andreas C. Mueller < andreasmuellerml at gmail.com> wrote:

> Hi Folks.
>
> So this program sounds pretty cool. They preselected some people for an ML work group, who will be doing daily standups together and pair programming, and who might move around between some related projects over the 12 weeks of the program.
>
> They made sure to get a diverse set of students, and they have an engineer who will supervise them.
>
> They would probably have 2-3 students working on sklearn.
>
> They don't expect one big feature, but they do expect some guidance on what issues to work on.
>
> Also, the program starts on Monday, and they start contributing to OSS projects about a week after that. Ideally we'd tell them if we're in or not before Monday, and have a tentative list of issues / projects.
>
> What do you all think?
> Also, if we want to do it, who would have cycles for some reviewing?
>
> This seems to be well organized and they seem to have put quite some thought into it, but we do need to do a little bit of work on our end. I can try picking some issues but I probably can't commit a lot of reviewing time.
>
> Cheers,
> Andy

_______________________________________________ scikit-learn mailing list scikit-learn at python.org https://mail.python.org/mailman/listinfo/scikit-learn -------------- next part -------------- An HTML attachment was scrubbed... URL: From g.lemaitre58 at gmail.com Fri May 29 05:48:24 2020 From: g.lemaitre58 at gmail.com (=?UTF-8?Q?Guillaume_Lema=C3=AEtre?=) Date: Fri, 29 May 2020 11:48:24 +0200 Subject: [scikit-learn] major league hacking summer internship program In-Reply-To: References: <066301d62dfc$af92cb90$0eb862b0$@gmail.com> <050001d63537$b4fdbe90$1ef93bb0$@gmail.com> Message-ID:

Hey,

I can dedicate some time to review.

Cheers,

On Fri, 29 May 2020 at 11:43, Adrin wrote:
> Thanks Andy, sounds pretty cool.
>
> I can commit some reviewing time. There should be maybe two of us at least that they know they can ping, and we can ping others if needed.
>
> Cheers,
> Adrin

-- Guillaume Lemaitre Scikit-learn @ Inria Foundation https://glemaitre.github.io/ -------------- next part -------------- An HTML attachment was scrubbed... URL: From thomasjpfan at gmail.com Fri May 29 09:44:16 2020 From: thomasjpfan at gmail.com (Thomas J Fan) Date: Fri, 29 May 2020 09:44:16 -0400 Subject: [scikit-learn] major league hacking summer internship program In-Reply-To: References: <066301d62dfc$af92cb90$0eb862b0$@gmail.com> <050001d63537$b4fdbe90$1ef93bb0$@gmail.com> Message-ID:

I can commit to reviewing. Diving into their program, it looks like they are hiring supervisors through: https://raise.dev/Apply/?ref=mlh which is titled "Software Developer Coach". By looking at their https://fellowship.mlh.io/students they have about 9 weeks of actual contributing.

Given they have an engineer to help, maybe they can work on documenting the production aspects:

1. Roadmap item 19: Documentation and tooling for model lifecycle management
2. Roadmap item 21: Document good practices to detect temporal distribution drift

Regards,
Thomas

> On Thursday, May 28, 2020 at 5:36 PM, Andreas C. Mueller wrote:
>
> Hi Folks.
>
> So this program sounds pretty cool.
> They preselected some people for an ML work group, who will be doing daily standups together and pair programming, and who might move around between some related projects over the 12 weeks of the program.
>
> They made sure to get a diverse set of students, and they have an engineer who will supervise them.
>
> They would probably have 2-3 students working on sklearn.
>
> They don't expect one big feature, but they do expect some guidance on what issues to work on.
>
> Also, the program starts on Monday, and they start contributing to OSS projects about a week after that. Ideally we'd tell them if we're in or not before Monday, and have a tentative list of issues / projects.
>
> What do you all think? Also, if we want to do it, who would have cycles for some reviewing? This seems to be well organized and they seem to have put quite some thought into it, but we do need to do a little bit of work on our end. I can try picking some issues but I probably can't commit a lot of reviewing time.
>
> Cheers,
> Andy
>
> On Tue, May 19, 2020 at 6:45 PM wrote:
> Hey Folks. This program reached out to me: https://news.mlh.io/mlh-fellowship-the-future-of-tech-internships-05-04-2020 What do you think? Sounds like GSOC but with extra mentorship, so it might be a good fit for us? I would say it depends on what level of involvement they require from us.
> Best,
> Andy

_______________________________________________ scikit-learn mailing list scikit-learn at python.org https://mail.python.org/mailman/listinfo/scikit-learn -------------- next part -------------- An HTML attachment was scrubbed... URL: From thomasjpfan at gmail.com Fri May 29 10:02:26 2020 From: thomasjpfan at gmail.com (Thomas J Fan) Date: Fri, 29 May 2020 10:02:26 -0400 Subject: [scikit-learn] sklearn Pipeline: argument of type 'ColumnTransformer' is not iterable In-Reply-To: References: Message-ID: <52354c90-c217-42e4-86d7-77d1eb8c30a9@Canary>

VotingClassifier also needs names:

ens = VotingClassifier(estimators=[('pipe1', pipe_phy), ('pipe2', pipe_fa)])

Thomas

> On Friday, May 29, 2020 at 2:33 AM, Chamila Wijayarathna wrote:
>
> Hi all,
>
> I did manage to get the code to run using a workaround, which is a bit ugly. Since this name check doesn't have a significant effect on my functionality, I commented out the following snippet in sklearn\utils\metaestimators.py:
>
> invalid_names = [name for name in names if '__' in name]
> if invalid_names:
>     raise ValueError('Estimator names must not contain __: got '
>                      '{0!r}'.format(invalid_names))
>
> Please let me know if there is a better workaround or whether there are any issues with commenting out this code.
>
> Thanks

_______________________________________________ scikit-learn mailing list scikit-learn at python.org https://mail.python.org/mailman/listinfo/scikit-learn -------------- next part -------------- An HTML attachment was scrubbed...
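To illustrate the fix Thomas describes, here is a self-contained sketch of a VotingClassifier built from (name, estimator) pairs. The dataset and estimator instances (make_classification, SVC with default parameters) are invented for the example and stand in for the original `pipe_phy`/`pipe_fa`; passing the bare pipelines instead of the named tuples triggers the validation error from the traceback above.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import VotingClassifier
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Toy data standing in for the original df1.
X, y = make_classification(n_samples=100, n_features=6, random_state=0)

pipe_phy = Pipeline(steps=[('scaler', StandardScaler()),
                           ('classifier', SVC())])
pipe_fa = Pipeline(steps=[('scaler', StandardScaler()),
                          ('classifier', SVC())])

# estimators takes (name, estimator) tuples, just like Pipeline steps;
# VotingClassifier(estimators=[pipe_phy, pipe_fa]) would fail name validation.
ens = VotingClassifier(estimators=[('pipe_phy', pipe_phy),
                                   ('pipe_fa', pipe_fa)])
ens.fit(X, y)
print(ens.score(X, y))
```

With the names in place, fitting proceeds past `_validate_names` and the original TypeError disappears.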
URL: From andreasmuellerml at gmail.com Fri May 29 10:21:55 2020 From: andreasmuellerml at gmail.com (Andreas Mueller) Date: Fri, 29 May 2020 10:21:55 -0400 Subject: [scikit-learn] major league hacking summer internship program In-Reply-To: References: <066301d62dfc$af92cb90$0eb862b0$@gmail.com> <050001d63537$b4fdbe90$1ef93bb0$@gmail.com> Message-ID:

Thanks folks! That gives us a good start I think!

Re documentation: honestly I'm not entirely sure those are good issues, because I'm not sure we have consensus on what we want to recommend. We can certainly include these, but they require some decisions and a lot of expertise.

Maybe we can discuss further issues either here or on gitter?

Andy

On Fri, May 29, 2020, 09:45 Thomas J Fan wrote:
> I can commit to reviewing. Diving into their program, it looks like they are hiring supervisors through: https://raise.dev/Apply/?ref=mlh which is titled "Software Developer Coach". By looking at their https://fellowship.mlh.io/students they have about 9 weeks of actual contributing.
>
> Given they have an engineer to help, maybe they can work on documenting the production aspects:
>
> 1. Roadmap item 19: Documentation and tooling for model lifecycle management
> 2. Roadmap item 21: Document good practices to detect temporal distribution drift
>
> Regards,
> Thomas

_______________________________________________ scikit-learn mailing list scikit-learn at python.org https://mail.python.org/mailman/listinfo/scikit-learn -------------- next part -------------- An HTML attachment was scrubbed...
URL: From cdwijayarathna at gmail.com Fri May 29 11:42:56 2020 From: cdwijayarathna at gmail.com (Chamila Wijayarathna) Date: Fri, 29 May 2020 21:12:56 +0530 Subject: [scikit-learn] sklearn Pipeline: argument of type 'ColumnTransformer' is not iterable In-Reply-To: <52354c90-c217-42e4-86d7-77d1eb8c30a9@Canary> References: <52354c90-c217-42e4-86d7-77d1eb8c30a9@Canary> Message-ID: Hi, Thanks, this solution fixed the issue. However, it introduces a new error, which was not there before. Traceback (most recent call last): File "C:\Users\ASUS\PycharmProjects\swelltest\venv\lib\site-packages\sklearn\utils\__init__.py", line 425, in _get_column_indices all_columns = X.columns AttributeError: 'numpy.ndarray' object has no attribute 'columns' During handling of the above exception, another exception occurred: Traceback (most recent call last): File "", line 1, in File "C:\Program Files\JetBrains\PyCharm 2020.1.1\plugins\python\helpers\pydev\_pydev_bundle\pydev_umd.py", line 197, in runfile pydev_imports.execfile(filename, global_vars, local_vars) # execute the script File "C:\Program Files\JetBrains\PyCharm 2020.1.1\plugins\python\helpers\pydev\_pydev_imps\_pydev_execfile.py", line 18, in execfile exec(compile(contents+"\n", file, 'exec'), glob, loc) File "C:/Users/ASUS/PycharmProjects/swelltest/enemble.py", line 127, in ens.fit(x_train,y_train) File "C:\Users\ASUS\PycharmProjects\swelltest\venv\lib\site-packages\sklearn\ensemble\_voting.py", line 265, in fit return super().fit(X, transformed_y, sample_weight) File "C:\Users\ASUS\PycharmProjects\swelltest\venv\lib\site-packages\sklearn\ensemble\_voting.py", line 81, in fit for idx, clf in enumerate(clfs) if clf not in (None, 'drop') File "C:\Users\ASUS\PycharmProjects\swelltest\venv\lib\site-packages\joblib\parallel.py", line 1029, in __call__ if self.dispatch_one_batch(iterator): File "C:\Users\ASUS\PycharmProjects\swelltest\venv\lib\site-packages\joblib\parallel.py", line 847, in dispatch_one_batch self._dispatch(tasks) 
File "C:\Users\ASUS\PycharmProjects\swelltest\venv\lib\site-packages\joblib\parallel.py", line 765, in _dispatch job = self._backend.apply_async(batch, callback=cb) File "C:\Users\ASUS\PycharmProjects\swelltest\venv\lib\site-packages\joblib\_parallel_backends.py", line 206, in apply_async result = ImmediateResult(func) File "C:\Users\ASUS\PycharmProjects\swelltest\venv\lib\site-packages\joblib\_parallel_backends.py", line 570, in __init__ self.results = batch() File "C:\Users\ASUS\PycharmProjects\swelltest\venv\lib\site-packages\joblib\parallel.py", line 253, in __call__ for func, args, kwargs in self.items] File "C:\Users\ASUS\PycharmProjects\swelltest\venv\lib\site-packages\joblib\parallel.py", line 253, in for func, args, kwargs in self.items] File "C:\Users\ASUS\PycharmProjects\swelltest\venv\lib\site-packages\sklearn\ensemble\_base.py", line 40, in _fit_single_estimator estimator.fit(X, y) File "C:\Users\ASUS\PycharmProjects\swelltest\venv\lib\site-packages\sklearn\pipeline.py", line 330, in fit Xt = self._fit(X, y, **fit_params_steps) File "C:\Users\ASUS\PycharmProjects\swelltest\venv\lib\site-packages\sklearn\pipeline.py", line 296, in _fit **fit_params_steps[name]) File "C:\Users\ASUS\PycharmProjects\swelltest\venv\lib\site-packages\joblib\memory.py", line 352, in __call__ return self.func(*args, **kwargs) File "C:\Users\ASUS\PycharmProjects\swelltest\venv\lib\site-packages\sklearn\pipeline.py", line 740, in _fit_transform_one res = transformer.fit_transform(X, y, **fit_params) File "C:\Users\ASUS\PycharmProjects\swelltest\venv\lib\site-packages\sklearn\compose\_column_transformer.py", line 529, in fit_transform self._validate_remainder(X) File "C:\Users\ASUS\PycharmProjects\swelltest\venv\lib\site-packages\sklearn\compose\_column_transformer.py", line 327, in _validate_remainder cols.extend(_get_column_indices(X, columns)) File "C:\Users\ASUS\PycharmProjects\swelltest\venv\lib\site-packages\sklearn\utils\__init__.py", line 427, in _get_column_indices raise 
ValueError("Specifying the columns using strings is only " ValueError: Specifying the columns using strings is only supported for pandas DataFrames Thanks On Fri, May 29, 2020 at 7:33 PM Thomas J Fan wrote: > VotingClassifer also needs names: > > ens = VotingClassifier(estimators=[('pipe1', pipe_phy), ('pipe2', > pipe_fa)]) > > Thomas > > On Friday, May 29, 2020 at 2:33 AM, Chamila Wijayarathna < > cdwijayarathna at gmail.com> wrote: > Hi all, > > I did manage to get the code to run using a workaround, which is bit ugly. > > Following is the complete stacktrace of the error I was receiving. > > > > > > > > > > > > > > > > > > > > *Traceback (most recent call last): File "", line 1, in > File "C:\Program Files\JetBrains\PyCharm > 2020.1.1\plugins\python\helpers\pydev\_pydev_bundle\pydev_umd.py", line > 197, in runfile pydev_imports.execfile(filename, global_vars, > local_vars) # execute the script File "C:\Program Files\JetBrains\PyCharm > 2020.1.1\plugins\python\helpers\pydev\_pydev_imps\_pydev_execfile.py", line > 18, in execfile exec(compile(contents+"\n", file, 'exec'), glob, loc) > File "C:/Users/ASUS/PycharmProjects/swelltest/enemble.py", line 112, in > ens.fit(x_train,y_train) File > "C:\Users\ASUS\PycharmProjects\swelltest\venv\lib\site-packages\sklearn\ensemble\_voting.py", > line 265, in fit return super().fit(X, transformed_y, sample_weight) > File > "C:\Users\ASUS\PycharmProjects\swelltest\venv\lib\site-packages\sklearn\ensemble\_voting.py", > line 65, in fit names, clfs = self._validate_estimators() File > "C:\Users\ASUS\PycharmProjects\swelltest\venv\lib\site-packages\sklearn\ensemble\_base.py", > line 228, in _validate_estimators self._validate_names(names) File > "C:\Users\ASUS\PycharmProjects\swelltest\venv\lib\site-packages\sklearn\utils\metaestimators.py", > line 77, in _validate_names invalid_names = [name for name in names if > '__' in name] File > "C:\Users\ASUS\PycharmProjects\swelltest\venv\lib\site-packages\sklearn\utils\metaestimators.py", 
> line 77, in invalid_names = [name for name in names if '__' > in name]TypeError: argument of type 'ColumnTransformer' is not iterable* > > Following are the inputs in 'names' list at the time of the error. > > 1- > *ColumnTransformer(transformers=[('phy', Pipeline(steps=[('imputer', > SimpleImputer(strategy='median')), ('scaler', StandardScaler())]), ['HR', > 'RMSSD', 'SCL'])])2- > ColumnTransformer(transformers=[('fa',Pipeline(steps=[('imputer',SimpleImputer(strategy='median')),('scaler', > StandardScaler())]),['Squality', 'Sneutral', 'Shappy'])])* > > Seems like that the library is attempting to search for '__' substring of > the ColumnTransform object, which it is unable to perform. > > Since this name check doesn't have a signiticant effect on my > functionality, I commented following snippet at > *sklearn\utils\metaestimators.py.* > > > > > *invalid_names = [name for name in names if '__' in name]if > invalid_names: raise ValueError('Estimator names must not contain __: > got ' '{0!r}'.format(invalid_names))* > > Please let me know if there is a better workaround or that their are any > issues of commenting out this code. > > Thanks > > On Fri, May 29, 2020 at 10:33 AM Chamila Wijayarathna < > cdwijayarathna at gmail.com> wrote: > >> Hello all, >> >> I hope I am writing to the correct mailing list about this issue that I >> am having. Please apologize me if I am not. >> >> I am attempting to use a pipeline to feed an ensemble voting classifier >> as I want the ensemble learner to use models that train on different >> feature sets. For this purpose, I followed the tutorial available at [1]. >> >> Following is the code that I could develop so far. 
>>
>> y = df1.index
>> x = preprocessing.scale(df1)
>>
>> phy_features = ['A', 'B', 'C']
>> phy_transformer = Pipeline(steps=[('imputer', SimpleImputer(strategy='median')), ('scaler', StandardScaler())])
>> phy_processer = ColumnTransformer(transformers=[('phy', phy_transformer, phy_features)])
>>
>> fa_features = ['D', 'E', 'F']
>> fa_transformer = Pipeline(steps=[('imputer', SimpleImputer(strategy='median')), ('scaler', StandardScaler())])
>> fa_processer = ColumnTransformer(transformers=[('fa', fa_transformer, fa_features)])
>>
>> pipe_phy = Pipeline(steps=[('preprocessor', phy_processer), ('classifier', SVM)])
>> pipe_fa = Pipeline(steps=[('preprocessor', fa_processer), ('classifier', SVM)])
>>
>> ens = VotingClassifier(estimators=[pipe_phy, pipe_fa])
>>
>> cv = KFold(n_splits=10, random_state=None, shuffle=True)
>> for train_index, test_index in cv.split(x):
>>     x_train, x_test = x[train_index], x[test_index]
>>     y_train, y_test = y[train_index], y[test_index]
>>     ens.fit(x_train, y_train)
>>     print(ens.score(x_test, y_test))
>>
>> However, when running the code, I am getting an error saying TypeError: argument of type 'ColumnTransformer' is not iterable, at the line ens.fit(x_train, y_train).
>>
>> What is the reason for this and how can I fix it?
>> Thank you,
>> Chamila

-- 
Chamila Dilshan Wijayarathna,
PhD Research Student
The University of New South Wales (UNSW Canberra)
Australian Centre for Cyber Security
Australian Defence Force Academy
PO Box 7916, Canberra BA ACT 2610
Australia
Mobile:(+61)416895795

_______________________________________________
scikit-learn mailing list
scikit-learn at python.org
https://mail.python.org/mailman/listinfo/scikit-learn

-------------- next part --------------
An HTML attachment was scrubbed...
URL:

From thomasjpfan at gmail.com  Fri May 29 11:52:03 2020
From: thomasjpfan at gmail.com (Thomas J Fan)
Date: Fri, 29 May 2020 11:52:03 -0400
Subject: [scikit-learn] sklearn Pipeline: argument of type 'ColumnTransformer' is not iterable
In-Reply-To: 
References: <52354c90-c217-42e4-86d7-77d1eb8c30a9@Canary>
Message-ID: 

Once

x = preprocessing.scale(df1)

is called, the input to your estimator is no longer a dataframe, so the column transformer cannot use strings to select columns.

Thomas

> On Friday, May 29, 2020 at 11:46 AM, Chamila Wijayarathna <cdwijayarathna at gmail.com> wrote:
> Hi,
>
> Thanks, this solution fixed the issue. However, it introduces a new error, which was not there before.
>
> Traceback (most recent call last):
>   File "C:\Users\ASUS\PycharmProjects\swelltest\venv\lib\site-packages\sklearn\utils\__init__.py", line 425, in _get_column_indices
>     all_columns = X.columns
> AttributeError: 'numpy.ndarray' object has no attribute 'columns'
>
> During handling of the above exception, another exception occurred:
>
> Traceback (most recent call last):
>   File "", line 1, in
>   File "C:\Program Files\JetBrains\PyCharm 2020.1.1\plugins\python\helpers\pydev\_pydev_bundle\pydev_umd.py", line 197, in runfile
>     pydev_imports.execfile(filename, global_vars, local_vars)  # execute the script
>   File "C:\Program Files\JetBrains\PyCharm 2020.1.1\plugins\python\helpers\pydev\_pydev_imps\_pydev_execfile.py", line 18, in execfile
>     exec(compile(contents+"\n", file, 'exec'), glob, loc)
>   File "C:/Users/ASUS/PycharmProjects/swelltest/enemble.py", line 127, in
>     ens.fit(x_train,y_train)
>   File "C:\Users\ASUS\PycharmProjects\swelltest\venv\lib\site-packages\sklearn\ensemble\_voting.py", line 265, in fit
>     return super().fit(X, transformed_y, sample_weight)
>   File "C:\Users\ASUS\PycharmProjects\swelltest\venv\lib\site-packages\sklearn\ensemble\_voting.py", line 81, in fit
>     for idx, clf in enumerate(clfs) if clf not in (None, 'drop')
>   File "C:\Users\ASUS\PycharmProjects\swelltest\venv\lib\site-packages\joblib\parallel.py", line 1029, in __call__
>     if self.dispatch_one_batch(iterator):
>   File "C:\Users\ASUS\PycharmProjects\swelltest\venv\lib\site-packages\joblib\parallel.py", line 847, in dispatch_one_batch
>     self._dispatch(tasks)
>   File "C:\Users\ASUS\PycharmProjects\swelltest\venv\lib\site-packages\joblib\parallel.py", line 765, in _dispatch
>     job = self._backend.apply_async(batch, callback=cb)
>   File "C:\Users\ASUS\PycharmProjects\swelltest\venv\lib\site-packages\joblib\_parallel_backends.py", line 206, in apply_async
>     result = ImmediateResult(func)
>   File "C:\Users\ASUS\PycharmProjects\swelltest\venv\lib\site-packages\joblib\_parallel_backends.py", line 570, in
__init__
>     self.results = batch()
>   File "C:\Users\ASUS\PycharmProjects\swelltest\venv\lib\site-packages\joblib\parallel.py", line 253, in __call__
>     for func, args, kwargs in self.items]
>   File "C:\Users\ASUS\PycharmProjects\swelltest\venv\lib\site-packages\joblib\parallel.py", line 253, in
>     for func, args, kwargs in self.items]
>   File "C:\Users\ASUS\PycharmProjects\swelltest\venv\lib\site-packages\sklearn\ensemble\_base.py", line 40, in _fit_single_estimator
>     estimator.fit(X, y)
>   File "C:\Users\ASUS\PycharmProjects\swelltest\venv\lib\site-packages\sklearn\pipeline.py", line 330, in fit
>     Xt = self._fit(X, y, **fit_params_steps)
>   File "C:\Users\ASUS\PycharmProjects\swelltest\venv\lib\site-packages\sklearn\pipeline.py", line 296, in _fit
>     **fit_params_steps[name])
>   File "C:\Users\ASUS\PycharmProjects\swelltest\venv\lib\site-packages\joblib\memory.py", line 352, in __call__
>     return self.func(*args, **kwargs)
>   File "C:\Users\ASUS\PycharmProjects\swelltest\venv\lib\site-packages\sklearn\pipeline.py", line 740, in _fit_transform_one
>     res = transformer.fit_transform(X, y, **fit_params)
>   File "C:\Users\ASUS\PycharmProjects\swelltest\venv\lib\site-packages\sklearn\compose\_column_transformer.py", line 529, in fit_transform
>     self._validate_remainder(X)
>   File "C:\Users\ASUS\PycharmProjects\swelltest\venv\lib\site-packages\sklearn\compose\_column_transformer.py", line 327, in _validate_remainder
>     cols.extend(_get_column_indices(X, columns))
>   File "C:\Users\ASUS\PycharmProjects\swelltest\venv\lib\site-packages\sklearn\utils\__init__.py", line 427, in _get_column_indices
>     raise ValueError("Specifying the columns using strings is only "
> ValueError: Specifying the columns using strings is only supported for pandas DataFrames
>
> Thanks
>
> On Fri, May 29, 2020 at 7:33 PM Thomas J Fan wrote:
> > VotingClassifier also needs names:
> >
> > ens = VotingClassifier(estimators=[('pipe1', pipe_phy), ('pipe2', pipe_fa)])
> >
> > Thomas
> >
> > On Friday, May 29, 2020 at
2:33 AM, Chamila Wijayarathna wrote:
> > > Hi all,
> > >
> > > I did manage to get the code to run using a workaround, which is bit ugly.
> > >
> > > [earlier quoted stack trace, workaround, and code trimmed]

-------------- next part --------------
An HTML attachment was scrubbed...
URL:

From niourf at gmail.com  Fri May 29 12:25:58 2020
From: niourf at gmail.com (Nicolas Hug)
Date: Fri, 29 May 2020 12:25:58 -0400
Subject: [scikit-learn] sklearn Pipeline: argument of type 'ColumnTransformer' is not iterable
In-Reply-To: 
References: <52354c90-c217-42e4-86d7-77d1eb8c30a9@Canary>
Message-ID: 

Also, you should not scale your input before computing cross-validation scores.
By doing that you are biasing your results because each test set knows something about the rest of the data (even if it's not target data).

The scaling should be applied independently on each (train / test) pair.

This can be done through a pipeline:
https://scikit-learn.org/stable/modules/compose.html

On 5/29/20 11:52 AM, Thomas J Fan wrote:
> Once
>
> x = preprocessing.scale(df1)
>
> is called, the input to your estimator is no longer a dataframe, so
> the column transformer can not use strings to select columns.
>
> Thomas
>
> [earlier quoted messages trimmed]

-------------- next part --------------
An HTML attachment was scrubbed...
URL:

From cdwijayarathna at gmail.com  Sat May 30 01:20:31 2020
From: cdwijayarathna at gmail.com (Chamila Wijayarathna)
Date: Sat, 30 May 2020 10:50:31 +0530
Subject: [scikit-learn] sklearn Pipeline: argument of type 'ColumnTransformer' is not iterable
In-Reply-To: 
References: <52354c90-c217-42e4-86d7-77d1eb8c30a9@Canary>
Message-ID: 

Thank you both for your inputs.

On Fri, May 29, 2020 at 9:57 PM Nicolas Hug wrote:
> Also, you should not scale your input before computing cross-validation scores.
By doing that you are biasing your results because each test set > knows something about the rest of the data (even if it's not target data) > > The scaling should be applied independently on each (train / test) pair. > > This can be done through a pipeline: > https://scikit-learn.org/stable/modules/compose.html > > > On 5/29/20 11:52 AM, Thomas J Fan wrote: > > Once > > *x = preprocessing.scale(df1)* > > is called, the input to your estimator is no longer a dataframe, so the > column transformer can not use strings to select columns. > > Thomas > > On Friday, May 29, 2020 at 11:46 AM, Chamila Wijayarathna < > cdwijayarathna at gmail.com> wrote: > Hi, > > Thanks, this solution fixed the issue. However, it introduces a new error, > which was not there before. > > Traceback (most recent call last): > File > "C:\Users\ASUS\PycharmProjects\swelltest\venv\lib\site-packages\sklearn\utils\__init__.py", > line 425, in _get_column_indices > all_columns = X.columns > AttributeError: 'numpy.ndarray' object has no attribute 'columns' > During handling of the above exception, another exception occurred: > Traceback (most recent call last): > File "", line 1, in > File "C:\Program Files\JetBrains\PyCharm > 2020.1.1\plugins\python\helpers\pydev\_pydev_bundle\pydev_umd.py", line > 197, in runfile > pydev_imports.execfile(filename, global_vars, local_vars) # execute > the script > File "C:\Program Files\JetBrains\PyCharm > 2020.1.1\plugins\python\helpers\pydev\_pydev_imps\_pydev_execfile.py", line > 18, in execfile > exec(compile(contents+"\n", file, 'exec'), glob, loc) > File "C:/Users/ASUS/PycharmProjects/swelltest/enemble.py", line 127, in > > ens.fit(x_train,y_train) > File > "C:\Users\ASUS\PycharmProjects\swelltest\venv\lib\site-packages\sklearn\ensemble\_voting.py", > line 265, in fit > return super().fit(X, transformed_y, sample_weight) > File > "C:\Users\ASUS\PycharmProjects\swelltest\venv\lib\site-packages\sklearn\ensemble\_voting.py", > line 81, in fit > for idx, clf in 
enumerate(clfs) if clf not in (None, 'drop') > File > "C:\Users\ASUS\PycharmProjects\swelltest\venv\lib\site-packages\joblib\parallel.py", > line 1029, in __call__ > if self.dispatch_one_batch(iterator): > File > "C:\Users\ASUS\PycharmProjects\swelltest\venv\lib\site-packages\joblib\parallel.py", > line 847, in dispatch_one_batch > self._dispatch(tasks) > File > "C:\Users\ASUS\PycharmProjects\swelltest\venv\lib\site-packages\joblib\parallel.py", > line 765, in _dispatch > job = self._backend.apply_async(batch, callback=cb) > File > "C:\Users\ASUS\PycharmProjects\swelltest\venv\lib\site-packages\joblib\_parallel_backends.py", > line 206, in apply_async > result = ImmediateResult(func) > File > "C:\Users\ASUS\PycharmProjects\swelltest\venv\lib\site-packages\joblib\_parallel_backends.py", > line 570, in __init__ > self.results = batch() > File > "C:\Users\ASUS\PycharmProjects\swelltest\venv\lib\site-packages\joblib\parallel.py", > line 253, in __call__ > for func, args, kwargs in self.items] > File > "C:\Users\ASUS\PycharmProjects\swelltest\venv\lib\site-packages\joblib\parallel.py", > line 253, in > for func, args, kwargs in self.items] > File > "C:\Users\ASUS\PycharmProjects\swelltest\venv\lib\site-packages\sklearn\ensemble\_base.py", > line 40, in _fit_single_estimator > estimator.fit(X, y) > File > "C:\Users\ASUS\PycharmProjects\swelltest\venv\lib\site-packages\sklearn\pipeline.py", > line 330, in fit > Xt = self._fit(X, y, **fit_params_steps) > File > "C:\Users\ASUS\PycharmProjects\swelltest\venv\lib\site-packages\sklearn\pipeline.py", > line 296, in _fit > **fit_params_steps[name]) > File > "C:\Users\ASUS\PycharmProjects\swelltest\venv\lib\site-packages\joblib\memory.py", > line 352, in __call__ > return self.func(*args, **kwargs) > File > "C:\Users\ASUS\PycharmProjects\swelltest\venv\lib\site-packages\sklearn\pipeline.py", > line 740, in _fit_transform_one > res = transformer.fit_transform(X, y, **fit_params) > File > 
"C:\Users\ASUS\PycharmProjects\swelltest\venv\lib\site-packages\sklearn\compose\_column_transformer.py", > line 529, in fit_transform > self._validate_remainder(X) > File > "C:\Users\ASUS\PycharmProjects\swelltest\venv\lib\site-packages\sklearn\compose\_column_transformer.py", > line 327, in _validate_remainder > cols.extend(_get_column_indices(X, columns)) > File > "C:\Users\ASUS\PycharmProjects\swelltest\venv\lib\site-packages\sklearn\utils\__init__.py", > line 427, in _get_column_indices > raise ValueError("Specifying the columns using strings is only " > ValueError: Specifying the columns using strings is only supported for > pandas DataFrames > > Thanks > > On Fri, May 29, 2020 at 7:33 PM Thomas J Fan > wrote: > >> VotingClassifer also needs names: >> >> ens = VotingClassifier(estimators=[('pipe1', pipe_phy), ('pipe2', >> pipe_fa)]) >> >> Thomas >> >> On Friday, May 29, 2020 at 2:33 AM, Chamila Wijayarathna < >> cdwijayarathna at gmail.com> wrote: >> Hi all, >> >> I did manage to get the code to run using a workaround, which is bit ugly. >> >> Following is the complete stacktrace of the error I was receiving. 
>> >> >> >> >> >> >> >> >> >> >> >> >> >> >> >> >> >> >> >> *Traceback (most recent call last): File "", line 1, in >> File "C:\Program Files\JetBrains\PyCharm >> 2020.1.1\plugins\python\helpers\pydev\_pydev_bundle\pydev_umd.py", line >> 197, in runfile pydev_imports.execfile(filename, global_vars, >> local_vars) # execute the script File "C:\Program >> Files\JetBrains\PyCharm >> 2020.1.1\plugins\python\helpers\pydev\_pydev_imps\_pydev_execfile.py", line >> 18, in execfile exec(compile(contents+"\n", file, 'exec'), glob, loc) >> File "C:/Users/ASUS/PycharmProjects/swelltest/enemble.py", line 112, in >> ens.fit(x_train,y_train) File >> "C:\Users\ASUS\PycharmProjects\swelltest\venv\lib\site-packages\sklearn\ensemble\_voting.py", >> line 265, in fit return super().fit(X, transformed_y, sample_weight) >> File >> "C:\Users\ASUS\PycharmProjects\swelltest\venv\lib\site-packages\sklearn\ensemble\_voting.py", >> line 65, in fit names, clfs = self._validate_estimators() File >> "C:\Users\ASUS\PycharmProjects\swelltest\venv\lib\site-packages\sklearn\ensemble\_base.py", >> line 228, in _validate_estimators self._validate_names(names) File >> "C:\Users\ASUS\PycharmProjects\swelltest\venv\lib\site-packages\sklearn\utils\metaestimators.py", >> line 77, in _validate_names invalid_names = [name for name in names if >> '__' in name] File >> "C:\Users\ASUS\PycharmProjects\swelltest\venv\lib\site-packages\sklearn\utils\metaestimators.py", >> line 77, in invalid_names = [name for name in names if '__' >> in name] TypeError: argument of type 'ColumnTransformer' is not iterable* >> >> Following are the inputs in 'names' list at the time of the error. 
>> >> 1- >> *ColumnTransformer(transformers=[('phy', Pipeline(steps=[('imputer', >> SimpleImputer(strategy='median')), ('scaler', StandardScaler())]), ['HR', >> 'RMSSD', 'SCL'])]) 2- >> ColumnTransformer(transformers=[('fa',Pipeline(steps=[('imputer',SimpleImputer(strategy='median')),('scaler', >> StandardScaler())]),['Squality', 'Sneutral', 'Shappy'])])* >> >> Seems like that the library is attempting to search for '__' substring of >> the ColumnTransform object, which it is unable to perform. >> >> Since this name check doesn't have a signiticant effect on my >> functionality, I commented following snippet at >> *sklearn\utils\metaestimators.py.* >> >> >> >> >> *invalid_names = [name for name in names if '__' in name] if >> invalid_names: raise ValueError('Estimator names must not contain __: >> got ' '{0!r}'.format(invalid_names))* >> >> Please let me know if there is a better workaround or that their are any >> issues of commenting out this code. >> >> Thanks >> >> On Fri, May 29, 2020 at 10:33 AM Chamila Wijayarathna < >> cdwijayarathna at gmail.com> wrote: >> >>> Hello all, >>> >>> I hope I am writing to the correct mailing list about this issue that I >>> am having. Please apologize me if I am not. >>> >>> I am attempting to use a pipeline to feed an ensemble voting classifier >>> as I want the ensemble learner to use models that train on different >>> feature sets. For this purpose, I followed the tutorial available at [1]. >>> >>> Following is the code that I could develop so far. 
>>>
>>> y = df1.index
>>> x = preprocessing.scale(df1)
>>>
>>> phy_features = ['A', 'B', 'C']
>>> phy_transformer = Pipeline(steps=[('imputer', SimpleImputer(strategy='median')),
>>>                                   ('scaler', StandardScaler())])
>>> phy_processer = ColumnTransformer(transformers=[('phy', phy_transformer, phy_features)])
>>>
>>> fa_features = ['D', 'E', 'F']
>>> fa_transformer = Pipeline(steps=[('imputer', SimpleImputer(strategy='median')),
>>>                                  ('scaler', StandardScaler())])
>>> fa_processer = ColumnTransformer(transformers=[('fa', fa_transformer, fa_features)])
>>>
>>> pipe_phy = Pipeline(steps=[('preprocessor', phy_processer), ('classifier', SVM)])
>>> pipe_fa = Pipeline(steps=[('preprocessor', fa_processer), ('classifier', SVM)])
>>>
>>> ens = VotingClassifier(estimators=[pipe_phy, pipe_fa])
>>>
>>> cv = KFold(n_splits=10, random_state=None, shuffle=True)
>>> for train_index, test_index in cv.split(x):
>>>     x_train, x_test = x[train_index], x[test_index]
>>>     y_train, y_test = y[train_index], y[test_index]
>>>     ens.fit(x_train, y_train)
>>>     print(ens.score(x_test, y_test))
>>>
>>> However, when running the code, I am getting an error saying TypeError: argument of type 'ColumnTransformer' is not iterable at the line ens.fit(x_train, y_train).
>>>
>>> What is the reason for this and how can I fix it?
>>>
>>> Thank you,
>>> Chamila
>>
>> --
>> Chamila Dilshan Wijayarathna,
>> PhD Research Student
>> The University of New South Wales (UNSW Canberra)
>> Australian Centre for Cyber Security
>> Australian Defence Force Academy
>> PO Box 7916, Canberra BA ACT 2610
>> Australia
>> Mobile: (+61)416895795

_______________________________________________
scikit-learn mailing list
scikit-learn at python.org
https://mail.python.org/mailman/listinfo/scikit-learn
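For reference, the TypeError above comes from passing bare pipelines to VotingClassifier, which expects a list of (name, estimator) tuples; without the names, scikit-learn treats each pipeline's first step (a ColumnTransformer) as a "name" and then tries to search it for '__'. A minimal sketch of the tuple form, using made-up column names, branch names, and toy data rather than the original dataset:

```python
# Sketch only: the column names ('A'..'F'), the branch names 'phy'/'fa',
# and the random toy data are illustrative assumptions.
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.ensemble import VotingClassifier
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

rng = np.random.RandomState(0)
# Keep X as a DataFrame: selecting ColumnTransformer columns by string
# name only works on pandas DataFrames, not on plain NumPy arrays (so
# preprocessing.scale(df1), which returns an array, would also break it).
X = pd.DataFrame(rng.randn(40, 6), columns=list("ABCDEF"))
y = rng.randint(0, 2, size=40)

def make_branch(columns):
    # One preprocessing + SVC pipeline restricted to a subset of columns.
    pre = ColumnTransformer(transformers=[
        ("pre", Pipeline(steps=[("imputer", SimpleImputer(strategy="median")),
                                ("scaler", StandardScaler())]), columns)])
    return Pipeline(steps=[("preprocessor", pre), ("classifier", SVC())])

# (name, estimator) tuples -- this is the form VotingClassifier expects.
ens = VotingClassifier(estimators=[("phy", make_branch(["A", "B", "C"])),
                                   ("fa", make_branch(["D", "E", "F"]))])
ens.fit(X, y)
print(ens.score(X, y))
```

With the tuples in place there is no need to comment out the name check in metaestimators.py, which would otherwise silently disable a validation the library relies on.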
From pedro.cardoso.code at gmail.com Sat May 30 14:34:40 2020
From: pedro.cardoso.code at gmail.com (Pedro Cardoso)
Date: Sat, 30 May 2020 19:34:40 +0100
Subject: [scikit-learn] [GridSearchCV] Reduction of elapsed time at the second iteration
In-Reply-To:
References:
Message-ID:

Hey Guillaume,

First of all, thank you for the help. I checked my code and memory is turned off (the parameter is using the default). And yes, I am using a different number of features every time.

Guillaume Lemaître wrote on Wednesday, 27/05/2020 at 16:55:

> Regarding scikit-learn, the only thing that we cache is the transformer
> processing in the pipeline (see the memory parameter in Pipeline).
>
> It seems that you are passing a different set of features at each
> iteration. Is the number of features different?
>
> On Sun, 29 Mar 2020 at 19:23, Pedro Cardoso wrote:
>
>> Hello fellows,
>>
>> I am new to sklearn and I have a question about GridSearchCV:
>>
>> I am running the following code in a Jupyter notebook:
>>
>> ----------------------code-------------------------------
>>
>> opt_models = dict()
>> for feature in [features1, features2, features3, features4]:
>>     cmb = CMB(x_train, y_train, x_test, y_test, feature)
>>     cmb.fit()
>>     cmb.predict()
>>     opt_models[str(feature)] = cmb.get_best_model()
>>
>> -------------------------------------------------------
>>
>> The CMB class is just a class that contains different classification
>> models (SVC, decision tree, etc...). When cmb.fit() is running, a
>> GridSearchCV is performed on the SVC model (which is within the cmb
>> instance) in order to tune the hyperparameters C, gamma, and kernel. The
>> SVC model is implemented using the sklearn.svm.SVC class.
Here is the >> output of the first and second iteration of the for loop: >> >> ---------------------*output*------------------------------------- >> -> 1st iteration >> >> >> Fitting 5 folds for each of 12 candidates, totalling 60 fits >> >> [Parallel(n_jobs=-1)]: Using backend LokyBackend with 4 concurrent workers. >> [Parallel(n_jobs=-1)]: Done 1 tasks | elapsed: 6.1s >> [Parallel(n_jobs=-1)]: Done 2 tasks | elapsed: 6.1s >> [Parallel(n_jobs=-1)]: Done 3 tasks | elapsed: 6.1s >> [Parallel(n_jobs=-1)]: Done 4 tasks | elapsed: 6.2s >> [Parallel(n_jobs=-1)]: Done 5 tasks | elapsed: 6.2s >> [Parallel(n_jobs=-1)]: Done 6 tasks | elapsed: 6.2s >> [Parallel(n_jobs=-1)]: Done 7 tasks | elapsed: 6.2s >> [Parallel(n_jobs=-1)]: Done 8 tasks | elapsed: 6.2s >> [Parallel(n_jobs=-1)]: Done 9 tasks | elapsed: 6.2s >> [Parallel(n_jobs=-1)]: Done 10 tasks | elapsed: 6.2s >> [Parallel(n_jobs=-1)]: Done 11 tasks | elapsed: 6.2s >> [Parallel(n_jobs=-1)]: Done 12 tasks | elapsed: 6.3s >> [Parallel(n_jobs=-1)]: Done 13 tasks | elapsed: 6.3s >> [Parallel(n_jobs=-1)]: Done 14 tasks | elapsed: 6.3s >> [Parallel(n_jobs=-1)]: Done 15 tasks | elapsed: 6.4s >> [Parallel(n_jobs=-1)]: Done 16 tasks | elapsed: 6.4s >> [Parallel(n_jobs=-1)]: Done 17 tasks | elapsed: 6.4s >> [Parallel(n_jobs=-1)]: Done 18 tasks | elapsed: 6.4s >> [Parallel(n_jobs=-1)]: Done 19 tasks | elapsed: 6.5s >> [Parallel(n_jobs=-1)]: Done 20 tasks | elapsed: 6.5s >> [Parallel(n_jobs=-1)]: Done 21 tasks | elapsed: 6.5s >> [Parallel(n_jobs=-1)]: Done 22 tasks | elapsed: 6.6s >> [Parallel(n_jobs=-1)]: Done 23 tasks | elapsed: 6.7s >> [Parallel(n_jobs=-1)]: Done 24 tasks | elapsed: 6.7s >> [Parallel(n_jobs=-1)]: Done 25 tasks | elapsed: 6.7s >> [Parallel(n_jobs=-1)]: Done 26 tasks | elapsed: 6.8s >> [Parallel(n_jobs=-1)]: Done 27 tasks | elapsed: 6.8s >> [Parallel(n_jobs=-1)]: Done 28 tasks | elapsed: 6.9s >> [Parallel(n_jobs=-1)]: Done 29 tasks | elapsed: 6.9s >> [Parallel(n_jobs=-1)]: Done 30 tasks | elapsed: 6.9s >> 
[Parallel(n_jobs=-1)]: Done 31 tasks | elapsed: 7.0s >> [Parallel(n_jobs=-1)]: Done 32 tasks | elapsed: 7.0s >> [Parallel(n_jobs=-1)]: Done 33 tasks | elapsed: 7.0s >> [Parallel(n_jobs=-1)]: Done 34 tasks | elapsed: 7.0s >> [Parallel(n_jobs=-1)]: Done 35 tasks | elapsed: 7.1s >> [Parallel(n_jobs=-1)]: Done 36 tasks | elapsed: 7.1s >> [Parallel(n_jobs=-1)]: Done 37 tasks | elapsed: 7.2s >> [Parallel(n_jobs=-1)]: Done 38 tasks | elapsed: 7.2s >> [Parallel(n_jobs=-1)]: Done 39 tasks | elapsed: 7.2s >> [Parallel(n_jobs=-1)]: Done 40 tasks | elapsed: 7.2s >> [Parallel(n_jobs=-1)]: Done 41 tasks | elapsed: 7.3s >> [Parallel(n_jobs=-1)]: Done 42 tasks | elapsed: 7.3s >> [Parallel(n_jobs=-1)]: Done 43 tasks | elapsed: 7.3s >> [Parallel(n_jobs=-1)]: Done 44 tasks | elapsed: 7.4s >> [Parallel(n_jobs=-1)]: Done 45 tasks | elapsed: 7.4s >> [Parallel(n_jobs=-1)]: Done 46 tasks | elapsed: 7.5s
>>
>> -> 2nd iteration
>>
>> Fitting 5 folds for each of 12 candidates, totalling 60 fits
>>
>> [Parallel(n_jobs=-1)]: Using backend LokyBackend with 4 concurrent workers.
>> [Parallel(n_jobs=-1)]: Done 1 tasks | elapsed: 0.0s
>> [Parallel(n_jobs=-1)]: Batch computation too fast (0.0260s.) Setting batch_size=14.
>> [Parallel(n_jobs=-1)]: Done 2 tasks | elapsed: 0.0s
>> [Parallel(n_jobs=-1)]: Done 3 tasks | elapsed: 0.0s
>> [Parallel(n_jobs=-1)]: Done 4 tasks | elapsed: 0.0s
>> [Parallel(n_jobs=-1)]: Done 5 tasks | elapsed: 0.0s
>> [Parallel(n_jobs=-1)]: Done 60 out of 60 | elapsed: 0.7s finished
>>
>> ---------------------------------------------------------------------
>>
>> As you can see, the first iteration gets an elapsed time much larger than
>> the 2nd iteration. Does this make sense? I am afraid that the model is doing
>> some kind of caching or taking a shortcut from the 1st iteration, which
>> could consequently affect the model training/performance.
>> I already read the sklearn documentation and I didn't see any
>> warning/note about this kind of behaviour.
>>
>> Thank you very much for your time :)
>
> --
> Guillaume Lemaitre
> Scikit-learn @ Inria Foundation
> https://glemaitre.github.io/

_______________________________________________
scikit-learn mailing list
scikit-learn at python.org
https://mail.python.org/mailman/listinfo/scikit-learn
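A likely explanation for the 1st-vs-2nd iteration gap in the logs above: with joblib's default loky backend, the worker processes spawned for the first Parallel call are kept alive and reused by later calls in the same process, so only the first GridSearchCV pays the worker start-up cost. No fitted models or data are cached unless the Pipeline memory parameter is set, so the speedup should not affect training results. A small sketch of the warm-worker effect (the `square` helper and the timing variables are made up for illustration; actual timings vary by machine):

```python
# Sketch: run two identical Parallel calls. The first typically includes
# spawning the loky worker processes; the second reuses the warm pool,
# mirroring the 1st-vs-2nd iteration gap in the logs above.
import time
from joblib import Parallel, delayed

def square(i):
    return i * i

t0 = time.time()
first_results = Parallel(n_jobs=2)(delayed(square)(i) for i in range(8))
first_elapsed = time.time() - t0  # includes worker start-up cost

t0 = time.time()
second_results = Parallel(n_jobs=2)(delayed(square)(i) for i in range(8))
second_elapsed = time.time() - t0  # warm workers: usually much faster

print(first_results)  # [0, 1, 4, 9, 16, 25, 36, 49]
print(first_elapsed, second_elapsed)
```

Both calls return identical results; only the elapsed time differs, which is consistent with Guillaume's point that scikit-learn itself caches nothing here by default.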