From g.lemaitre58 at gmail.com Tue Aug 4 09:26:57 2020
From: g.lemaitre58 at gmail.com (=?UTF-8?Q?Guillaume_Lema=C3=AEtre?=)
Date: Tue, 4 Aug 2020 15:26:57 +0200
Subject: [scikit-learn] ANN: scikit-learn 0.23.2 release
Message-ID:

We are happy to announce the 0.23.2 release, which fixes a couple of issues.

You can see the changelog here:
https://scikit-learn.org/stable/whats_new/v0.23.html#version-0-23-2

You can check this version out via pip:

    pip install -U scikit-learn

The conda-forge builds will be available shortly, which you can then install using:

    conda install -c conda-forge scikit-learn

On behalf of the scikit-learn development community,
--
Guillaume Lemaitre
Scikit-learn @ Inria Foundation
https://glemaitre.github.io/

From davidkleiven446 at gmail.com Tue Aug 11 02:36:38 2020
From: davidkleiven446 at gmail.com (David Kleiven)
Date: Tue, 11 Aug 2020 08:36:38 +0200
Subject: [scikit-learn] Tikhonov regularization
Message-ID:

Hi,

I was looking at the docs for Ridge regression, which state that it minimizes

    ||y - Xw||^2 + alpha*||w||^2

I would like to minimize the function

    ||y - Xw||^2 + ||Tx||^2, where T is a matrix,

in order to impose certain properties on the solution vectors, but I haven't found any way to achieve that in scikit-learn. Is this type of regularisation supported in scikit-learn?

More details on the ||Tx||^2 regularisation can be found here:
https://en.wikipedia.org/wiki/Tikhonov_regularization

Best,
David

From michael.eickenberg at gmail.com Tue Aug 11 11:23:14 2020
From: michael.eickenberg at gmail.com (Michael Eickenberg)
Date: Tue, 11 Aug 2020 08:23:14 -0700
Subject: [scikit-learn] Tikhonov regularization
In-Reply-To:
References:
Message-ID:

Hi David,

I am assuming you mean that T acts on w.

If T is invertible, you can absorb it into the design matrix by making a change of variable v = Tw, w = T^-1 v, and use standard ridge regression for v. If it is not (e.g. when T is a standard finite difference derivative operator) then this trick won't work.

A second thing you can do is to fit standard linear regression on the augmented data matrix vstack([X, factor * T]) and the augmented target concatenate([y, np.zeros(T.shape[0])]).

At worst you can compute the gradient of your loss function, X^T(Xw - y) + T^T T w, and perform gradient descent, or compute w = (X^T X + T^T T)^{-1} X^T y.

Hope this helps

Michael

On Mon, Aug 10, 2020 at 11:39 PM David Kleiven wrote:

> Hi,
>
> I was looking at the docs for Ridge regression, which state that it minimizes
>
>     ||y - Xw||^2 + alpha*||w||^2
>
> I would like to minimize the function
>
>     ||y - Xw||^2 + ||Tx||^2, where T is a matrix,
>
> in order to impose certain properties on the solution vectors, but I haven't found any way to achieve that in scikit-learn. Is this type of regularisation supported in scikit-learn?
>
> More details on the ||Tx||^2 regularisation can be found here:
> https://en.wikipedia.org/wiki/Tikhonov_regularization
>
> Best,
> David
> _______________________________________________
> scikit-learn mailing list
> scikit-learn at python.org
> https://mail.python.org/mailman/listinfo/scikit-learn
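Below is a minimal sketch of the augmented-data trick and the closed form from Michael's reply, assuming as he does that T acts on w; the data and the finite-difference choice of T are made-up placeholders, not anything from the thread:

    import numpy as np
    from sklearn.linear_model import LinearRegression

    rng = np.random.RandomState(0)
    X = rng.randn(50, 5)              # placeholder design matrix
    y = rng.randn(50)                 # placeholder targets
    T = np.diff(np.eye(5), axis=0)    # first-difference operator as the Tikhonov matrix

    # ||y - Xw||^2 + ||Tw||^2 equals ||y_aug - X_aug w||^2 for the stacked system:
    X_aug = np.vstack([X, T])
    y_aug = np.concatenate([y, np.zeros(T.shape[0])])
    w_aug = LinearRegression(fit_intercept=False).fit(X_aug, y_aug).coef_

    # Closed form from the same message: w = (X^T X + T^T T)^{-1} X^T y
    w_closed = np.linalg.solve(X.T @ X + T.T @ T, X.T @ y)

    print(np.allclose(w_aug, w_closed))  # True up to numerical tolerance

Scaling T by a factor before stacking plays the same role as alpha does in Ridge.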
From mainakjas at gmail.com Tue Aug 11 11:38:26 2020
From: mainakjas at gmail.com (Mainak Jas)
Date: Tue, 11 Aug 2020 11:38:26 -0400
Subject: [scikit-learn] Tikhonov regularization
In-Reply-To:
References:
Message-ID:

Hi David,

Michael has great ideas and they might serve your purpose. If not, and if you are willing to try another software package that is compatible with the scikit-learn ecosystem, you can look into pyglmnet:

http://glm-tools.github.io/pyglmnet/auto_examples/plot_tikhonov.html#sphx-glr-auto-examples-plot-tikhonov-py

Hope this helps,
Mainak

On Tue, Aug 11, 2020 at 11:24 AM Michael Eickenberg <michael.eickenberg at gmail.com> wrote:

> Hi David,
>
> I am assuming you mean that T acts on w.
> If T is invertible, you can absorb it into the design matrix by making a change of variable v = Tw, w = T^-1 v, and use standard ridge regression for v.
> If it is not (e.g. when T is a standard finite difference derivative operator) then this trick won't work.
> A second thing you can do is to fit standard linear regression on the augmented data matrix vstack([X, factor * T]) and the augmented target concatenate([y, np.zeros(T.shape[0])]).
>
> At worst you can compute the gradient of your loss function, X^T(Xw - y) + T^T T w, and perform gradient descent, or compute w = (X^T X + T^T T)^{-1} X^T y.
>
> Hope this helps
>
> Michael
>
> On Mon, Aug 10, 2020 at 11:39 PM David Kleiven wrote:
>
>> Hi,
>>
>> I was looking at the docs for Ridge regression, which state that it minimizes
>>
>>     ||y - Xw||^2 + alpha*||w||^2
>>
>> I would like to minimize the function
>>
>>     ||y - Xw||^2 + ||Tx||^2, where T is a matrix,
>>
>> in order to impose certain properties on the solution vectors, but I haven't found any way to achieve that in scikit-learn. Is this type of regularisation supported in scikit-learn?
>>
>> More details on the ||Tx||^2 regularisation can be found here:
>> https://en.wikipedia.org/wiki/Tikhonov_regularization
>>
>> Best,
>> David
>> _______________________________________________
>> scikit-learn mailing list
>> scikit-learn at python.org
>> https://mail.python.org/mailman/listinfo/scikit-learn
>
> _______________________________________________
> scikit-learn mailing list
> scikit-learn at python.org
> https://mail.python.org/mailman/listinfo/scikit-learn

From davidkleiven446 at gmail.com Wed Aug 12 01:52:13 2020
From: davidkleiven446 at gmail.com (David Kleiven)
Date: Wed, 12 Aug 2020 07:52:13 +0200
Subject: [scikit-learn] Tikhonov regularization
In-Reply-To:
References:
Message-ID:

Hi,

thanks for your suggestions. I will try both options.

Best,
David

On Tue, Aug 11, 2020 at 5:39 PM Mainak Jas wrote:

> Hi David,
>
> Michael has great ideas and they might serve your purpose. If not, and if you are willing to try another software package that is compatible with the scikit-learn ecosystem, you can look into pyglmnet:
>
> http://glm-tools.github.io/pyglmnet/auto_examples/plot_tikhonov.html#sphx-glr-auto-examples-plot-tikhonov-py
>
> Hope this helps,
> Mainak
>
> On Tue, Aug 11, 2020 at 11:24 AM Michael Eickenberg <michael.eickenberg at gmail.com> wrote:
>
>> Hi David,
>>
>> I am assuming you mean that T acts on w.
>> If T is invertible, you can absorb it into the design matrix by making a change of variable v = Tw, w = T^-1 v, and use standard ridge regression for v.
>> If it is not (e.g. when T is a standard finite difference derivative operator) then this trick won't work.
>> A second thing you can do is to fit standard linear regression on the augmented data matrix vstack([X, factor * T]) and the augmented target concatenate([y, np.zeros(T.shape[0])]).
>>
>> At worst you can compute the gradient of your loss function, X^T(Xw - y) + T^T T w, and perform gradient descent, or compute w = (X^T X + T^T T)^{-1} X^T y.
>>
>> Hope this helps
>>
>> Michael
>>
>> On Mon, Aug 10, 2020 at 11:39 PM David Kleiven wrote:
>>
>>> Hi,
>>>
>>> I was looking at the docs for Ridge regression, which state that it minimizes
>>>
>>>     ||y - Xw||^2 + alpha*||w||^2
>>>
>>> I would like to minimize the function
>>>
>>>     ||y - Xw||^2 + ||Tx||^2, where T is a matrix,
>>>
>>> in order to impose certain properties on the solution vectors, but I haven't found any way to achieve that in scikit-learn. Is this type of regularisation supported in scikit-learn?
>>>
>>> More details on the ||Tx||^2 regularisation can be found here:
>>> https://en.wikipedia.org/wiki/Tikhonov_regularization
>>>
>>> Best,
>>> David
>>> _______________________________________________
>>> scikit-learn mailing list
>>> scikit-learn at python.org
>>> https://mail.python.org/mailman/listinfo/scikit-learn
>>
>> _______________________________________________
>> scikit-learn mailing list
>> scikit-learn at python.org
>> https://mail.python.org/mailman/listinfo/scikit-learn
>
> _______________________________________________
> scikit-learn mailing list
> scikit-learn at python.org
> https://mail.python.org/mailman/listinfo/scikit-learn

From anna.jenul at nmbu.no Wed Aug 12 09:35:16 2020
From: anna.jenul at nmbu.no (Anna Jenul)
Date: Wed, 12 Aug 2020 13:35:16 +0000
Subject: [scikit-learn] make_classification question
Message-ID: <66d8d42122674f33b2a80a2f29f9e9ab@EXCH-MBX05.NMBU.NO>

Hi!

I am generating my own datasets with sklearn.datasets.make_classification. Unfortunately, I cannot figure out which of the generated features are the informative ones. In my example I generate "n_features=1000" and "n_informative=20". Is there any possibility to get the informative features after the dataset is generated?

Thanks,
Anna

From mail at sebastianraschka.com Wed Aug 12 11:12:37 2020
From: mail at sebastianraschka.com (Sebastian Raschka)
Date: Wed, 12 Aug 2020 10:12:37 -0500
Subject: [scikit-learn] make_classification question
In-Reply-To: <66d8d42122674f33b2a80a2f29f9e9ab@EXCH-MBX05.NMBU.NO>
References: <66d8d42122674f33b2a80a2f29f9e9ab@EXCH-MBX05.NMBU.NO>
Message-ID: <447426C3-1840-4BD6-8DEC-4DE0758F73EB@sebastianraschka.com>

Hi Anna,

You can set shuffle=False (it's set to True by default in the make_classification function). Then, the resulting features will be sorted as follows: X[:, :n_informative + n_redundant + n_repeated]. I.e., if you set "n_features=1000" and "n_informative=20", the first 20 features will be the informative ones.

Best,
Sebastian

> On Aug 12, 2020, at 8:35 AM, Anna Jenul wrote:
>
> Hi!
> I am generating my own datasets with sklearn.datasets.make_classification. Unfortunately, I cannot figure out which of the generated features are the informative ones. In my example I generate "n_features=1000" and "n_informative=20". Is there any possibility to get the informative features after the dataset is generated?
> Thanks,
> Anna
> _______________________________________________
> scikit-learn mailing list
> scikit-learn at python.org
> https://mail.python.org/mailman/listinfo/scikit-learn
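A small sketch of Sebastian's shuffle=False suggestion, with the feature counts shrunk from Anna's example so it runs quickly; the sizes are otherwise arbitrary:

    from sklearn.datasets import make_classification

    # With shuffle=False the columns come out ordered: informative first,
    # then redundant, then repeated, then noise features.
    n_informative = 20
    X, y = make_classification(n_samples=500, n_features=100,
                               n_informative=n_informative, n_redundant=5,
                               n_repeated=0, shuffle=False, random_state=0)

    X_informative = X[:, :n_informative]  # the informative features, by position
    print(X_informative.shape)            # (500, 20)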
From anna.jenul at nmbu.no Thu Aug 13 04:23:13 2020
From: anna.jenul at nmbu.no (Anna Jenul)
Date: Thu, 13 Aug 2020 08:23:13 +0000
Subject: [scikit-learn] make_classification question
Message-ID: <656ec966d46a4080a437c5ecfeba326b@EXCH-MBX05.NMBU.NO>
In-Reply-To: <66d8d42122674f33b2a80a2f29f9e9ab@EXCH-MBX05.NMBU.NO>

Thank you very much for your help!

Best,
Anna

From fernando.wittmann at gmail.com Sun Aug 16 14:04:11 2020
From: fernando.wittmann at gmail.com (Fernando Marcos Wittmann)
Date: Sun, 16 Aug 2020 15:04:11 -0300
Subject: [scikit-learn] Opinion on reference mentioning that RF uses weak learners
Message-ID:

Hello guys,

The following reference states that Random Forests uses weak learners:
- https://blog.citizennet.com/blog/2012/11/10/random-forests-ensembles-and-performance-metrics#:~:text=The%20random%20forest%20starts%20with,corresponds%20to%20our%20weak%20learner.&text=Thus%2C%20in%20ensemble%20terms%2C%20the,forest%20is%20a%20strong%20learner

> The random forest starts with a standard machine learning technique called a "decision tree" which, in ensemble terms, corresponds to our weak learner.
>
> ...
>
> Thus, in ensemble terms, the trees are weak learners and the random forest is a strong learner.

I completely disagree with that statement. But I would like the opinion of the community to double check if I am not missing something.

From matthieu.brucher at gmail.com Sun Aug 16 14:11:19 2020
From: matthieu.brucher at gmail.com (Matthieu Brucher)
Date: Sun, 16 Aug 2020 19:11:19 +0100
Subject: [scikit-learn] Opinion on reference mentioning that RF uses weak learners
In-Reply-To:
References:
Message-ID:

Hi,

What are you wondering? The individual tree is weakened by design (accepts more errors), so indeed, the individual trees are weak learners and the combination of them (the forest) becomes the strong learner. You can have a strong tree as well (deeper, more parameters), but that's not what is searched for in a random forest.

Cheers,

Matthieu

On Sun, Aug 16, 2020 at 7:06 PM Fernando Marcos Wittmann wrote:
>
> Hello guys,
>
> The following reference states that Random Forests uses weak learners:
> - https://blog.citizennet.com/blog/2012/11/10/random-forests-ensembles-and-performance-metrics#:~:text=The%20random%20forest%20starts%20with,corresponds%20to%20our%20weak%20learner.&text=Thus%2C%20in%20ensemble%20terms%2C%20the,forest%20is%20a%20strong%20learner
>
>> The random forest starts with a standard machine learning technique called a "decision tree" which, in ensemble terms, corresponds to our weak learner.
>>
>> ...
>>
>> Thus, in ensemble terms, the trees are weak learners and the random forest is a strong learner.
>
> I completely disagree with that statement. But I would like the opinion of the community to double check if I am not missing something.
>
> _______________________________________________
> scikit-learn mailing list
> scikit-learn at python.org
> https://mail.python.org/mailman/listinfo/scikit-learn

--
Quantitative researcher, Ph.D.
Blog: http://blog.audio-tk.com/
LinkedIn: http://www.linkedin.com/in/matthieubrucher

From g.lemaitre58 at gmail.com Sun Aug 16 14:29:14 2020
From: g.lemaitre58 at gmail.com (=?ISO-8859-1?Q?Guillaume_Lema=EEtre?=)
Date: Sun, 16 Aug 2020 20:29:14 +0200
Subject: [scikit-learn] Opinion on reference mentioning that RF uses weak learners
In-Reply-To:
Message-ID:

An HTML attachment was scrubbed...

From niourf at gmail.com Sun Aug 16 15:22:26 2020
From: niourf at gmail.com (Nicolas Hug)
Date: Sun, 16 Aug 2020 15:22:26 -0400
Subject: [scikit-learn] Opinion on reference mentioning that RF uses weak learners
In-Reply-To:
References:
Message-ID: <9285014a-a591-ec41-796b-cf74f6dd9ed1@gmail.com>

As previously mentioned, a "weak learner" is just a learner that barely performs better than random. It's more common in the context of boosting, but I think weak learning predates boosting, and the original RF paper by Breiman does make reference to "weak learners":

> It's interesting that Forest-RI could produce error rates not far above the Bayes error rate. The individual classifiers are weak. For F=1, the average tree error rate is 80%; for F=10, it is 65%; and for F=25, it is 60%. Forests seem to have the ability to work with very weak classifiers as long as their correlation is low

Nicolas

On 8/16/20 2:29 PM, Guillaume Lemaître wrote:
> One needs to define what is the definition of weak learner.
>
> In boosting, if I recall the literature well, weak learner refers to a learner which underfits, performing slightly better than a random learner. In this regard, a tree with shallow depth will be a weak learner and is used in adaboost or gradient boosting.
>
> However, in random forest the trees used are trees that overfit (deep trees) so they are not weak for the same reason. However, one will never be able to do what a forest will do with a single tree. In this regard, a single tree is weaker than the forest. However, I never read the term "weak learner" in the context of the random forest.
>
> Sent from my phone - sorry to be brief and potential misspell.
>
> *From:* fernando.wittmann at gmail.com
> *Sent:* 16 August 2020 20:06
> *To:* scikit-learn at python.org
> *Reply to:* scikit-learn at python.org
> *Subject:* [scikit-learn] Opinion on reference mentioning that RF uses weak learners
>
> Hello guys,
>
> The following reference states that Random Forests uses weak learners:
> - https://blog.citizennet.com/blog/2012/11/10/random-forests-ensembles-and-performance-metrics#:~:text=The%20random%20forest%20starts%20with,corresponds%20to%20our%20weak%20learner.&text=Thus%2C%20in%20ensemble%20terms%2C%20the,forest%20is%20a%20strong%20learner
>
>     The random forest starts with a standard machine learning technique called a "decision tree" which, in ensemble terms, corresponds to our weak learner.
>
>     ...
>
>     Thus, in ensemble terms, the trees are weak learners and the random forest is a strong learner.
>
> I completely disagree with that statement. But I would like the opinion of the community to double check if I am not missing something.
>
> _______________________________________________
> scikit-learn mailing list
> scikit-learn at python.org
> https://mail.python.org/mailman/listinfo/scikit-learn
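To make the single-tree-versus-forest comparison above concrete, here is a small sketch; the dataset and settings are arbitrary illustrations rather than anything from the thread:

    from sklearn.datasets import make_classification
    from sklearn.ensemble import RandomForestClassifier
    from sklearn.model_selection import train_test_split
    from sklearn.tree import DecisionTreeClassifier

    X, y = make_classification(n_samples=2000, n_features=30,
                               n_informative=10, random_state=0)
    X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

    # One deep (unpruned) tree: the kind of learner a random forest averages.
    tree = DecisionTreeClassifier(random_state=0).fit(X_tr, y_tr)

    # A forest of 100 such trees; bootstrapped rows and random feature subsets
    # decorrelate the trees, and the ensemble typically beats any single tree.
    forest = RandomForestClassifier(n_estimators=100, random_state=0).fit(X_tr, y_tr)

    print("single tree:", tree.score(X_te, y_te))
    print("forest:     ", forest.score(X_te, y_te))  # usually noticeably higher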
From fernando.wittmann at gmail.com Sun Aug 16 15:57:36 2020
From: fernando.wittmann at gmail.com (Fernando Marcos Wittmann)
Date: Sun, 16 Aug 2020 16:57:36 -0300
Subject: [scikit-learn] Opinion on reference mentioning that RF uses weak learners
In-Reply-To: <9285014a-a591-ec41-796b-cf74f6dd9ed1@gmail.com>
References: <9285014a-a591-ec41-796b-cf74f6dd9ed1@gmail.com>
Message-ID:

In my opinion the reference is distorting a concept that has a consolidated definition in the community. I am also familiar with the definition of a WL as "an estimator slightly better than guessing", mostly decision stumps (https://en.m.wikipedia.org/wiki/Decision_stump), which is not a component of RFs.

On Sun, Aug 16, 2020, 16:22 Nicolas Hug wrote:

> As previously mentioned, a "weak learner" is just a learner that barely performs better than random. It's more common in the context of boosting, but I think weak learning predates boosting, and the original RF paper by Breiman does make reference to "weak learners":
>
> It's interesting that Forest-RI could produce error rates not far above the Bayes error rate. The individual classifiers are weak. For F=1, the average tree error rate is 80%; for F=10, it is 65%; and for F=25, it is 60%. Forests seem to have the ability to work with very weak classifiers as long as their correlation is low
>
> Nicolas
>
> On 8/16/20 2:29 PM, Guillaume Lemaître wrote:
>
> One needs to define what is the definition of weak learner.
>
> In boosting, if I recall the literature well, weak learner refers to a learner which underfits, performing slightly better than a random learner. In this regard, a tree with shallow depth will be a weak learner and is used in adaboost or gradient boosting.
>
> However, in random forest the trees used are trees that overfit (deep trees) so they are not weak for the same reason. However, one will never be able to do what a forest will do with a single tree. In this regard, a single tree is weaker than the forest. However, I never read the term "weak learner" in the context of the random forest.
>
> Sent from my phone - sorry to be brief and potential misspell.
> *From:* fernando.wittmann at gmail.com
> *Sent:* 16 August 2020 20:06
> *To:* scikit-learn at python.org
> *Reply to:* scikit-learn at python.org
> *Subject:* [scikit-learn] Opinion on reference mentioning that RF uses weak learners
>
> Hello guys,
>
> The following reference states that Random Forests uses weak learners:
> - https://blog.citizennet.com/blog/2012/11/10/random-forests-ensembles-and-performance-metrics#:~:text=The%20random%20forest%20starts%20with,corresponds%20to%20our%20weak%20learner.&text=Thus%2C%20in%20ensemble%20terms%2C%20the,forest%20is%20a%20strong%20learner
>
> The random forest starts with a standard machine learning technique called a "decision tree" which, in ensemble terms, corresponds to our weak learner.
>
> ...
>
> Thus, in ensemble terms, the trees are weak learners and the random forest is a strong learner.
>
> I completely disagree with that statement. But I would like the opinion of the community to double check if I am not missing something.
>
> _______________________________________________
> scikit-learn mailing list
> scikit-learn at python.org
> https://mail.python.org/mailman/listinfo/scikit-learn
>
> _______________________________________________
> scikit-learn mailing list
> scikit-learn at python.org
> https://mail.python.org/mailman/listinfo/scikit-learn

From jbbrown at kuhp.kyoto-u.ac.jp Sun Aug 16 20:37:43 2020
From: jbbrown at kuhp.kyoto-u.ac.jp (Brown J.B.)
Date: Mon, 17 Aug 2020 09:37:43 +0900
Subject: [scikit-learn] Opinion on reference mentioning that RF uses weak learners
In-Reply-To:
References:
Message-ID:

> As previously mentioned, a "weak learner" is just a learner that barely performs better than random.

To continue with what the definition of a random learner refers to, does it mean the following contexts?
(1) Classification: a learner which uniformly samples from one of the N endpoints in the training data (e.g., the set of unique values in the response vector "y").
(2) Regression: a learner which uniformly samples from the range of values in the endpoint/response vector (e.g., uniform sampling from [min(y), max(y)]).

Should even more context be explicitly declared (e.g., not uniform sampling but any distribution sampler)?

J.B.

From ram at rachum.com Mon Aug 17 03:53:36 2020
From: ram at rachum.com (Ram Rachum)
Date: Mon, 17 Aug 2020 10:53:36 +0300
Subject: [scikit-learn] Imputers and DataFrame objects
Message-ID:

Hey guys,

This is a bit of a complicated question.

I was helping my friend do a task with Pandas/sklearn for her data science class. I figured it'll be a breeze, since I'm a fancy-pancy Python programmer. Oh wow, it was so not.

I was trying to do things that felt simple to me, but there were so many problems, I spent 2 hours and only had a partial solution. I'm wondering whether I'm missing something.

She got a CSV with lots of data about cars. Some of the data had missing values (marked with "?"). Additionally, some columns had small numbers written as strings like "one", "two", "three", etc. There were maybe a few more issues like these.

The task was to remove these irregularities. So for the "?" items, replace them with the mean, and for the "one", "two" etc. replace with a numerical value.

I could easily write my own logic that does that, but she told me I should use the tools that come with sklearn: SimpleImputer, OneHotEncoder, BinaryEncoder for the "one" "two" "three".

They gave me so, so many problems. For one, I couldn't figure out how to apply SimpleImputer on just one column in the DataFrame, and then get the results in the form of a dataframe. (Either changing in-place or creating a new DataFrame.) I think I spent an hour on this problem alone. Eventually I found a way, but it definitely felt like I was doing something wrong, like this is supposed to be simpler.

Also, when trying to use BinaryEncoder for "one" "two" "three", it raised an exception because there were NaN values there. Well, I wanted to first convert them to real numbers and then use the same SimpleImputer to fix these. But I couldn't, because of the exception.

Any insight you could give me would be useful.

Thanks,
Ram.
From niourf at gmail.com Mon Aug 17 09:33:10 2020
From: niourf at gmail.com (Nicolas Hug)
Date: Mon, 17 Aug 2020 09:33:10 -0400
Subject: [scikit-learn] Opinion on reference mentioning that RF uses weak learners
In-Reply-To:
References:
Message-ID: <30b33c58-b92e-f8ac-dd6e-4b8dd2b771f1@gmail.com>

I'm not sure honestly, but I think you'll find more details in Schapire's paper (http://rob.schapire.net/papers/strengthofweak.pdf) and its refs. In particular page 5 (201).

On 8/16/20 8:37 PM, Brown J.B. via scikit-learn wrote:
>
> > As previously mentioned, a "weak learner" is just a learner that barely performs better than random.
>
> To continue with what the definition of a random learner refers to, does it mean the following contexts?
> (1) Classification: a learner which uniformly samples from one of the N endpoints in the training data (e.g., the set of unique values in the response vector "y").
> (2) Regression: a learner which uniformly samples from the range of values in the endpoint/response vector (e.g., uniform sampling from [min(y), max(y)]).
>
> Should even more context be explicitly declared (e.g., not uniform sampling but any distribution sampler)?
>
> J.B.
>
> _______________________________________________
> scikit-learn mailing list
> scikit-learn at python.org
> https://mail.python.org/mailman/listinfo/scikit-learn

From niourf at gmail.com Mon Aug 17 12:36:05 2020
From: niourf at gmail.com (Nicolas Hug)
Date: Mon, 17 Aug 2020 12:36:05 -0400
Subject: [scikit-learn] =?utf-8?q?ANN=3A_Welcoming_Christian_Lorentzen_an?= =?utf-8?q?d_Juan_Carlos_Alfaro_Jim=C3=A9nez?=
Message-ID: <3c072c5b-1d1c-d91c-9278-2a578140d1e9@gmail.com>

The core developers of Scikit-learn have recently voted to welcome Christian Lorentzen to the core dev team, and Juan Carlos Alfaro Jiménez to the triage team, in recognition of their efforts and trustworthiness as contributors.

Congratulations to you both and thank you for your contributions!

Welcome!

From kevin at dataschool.io Mon Aug 17 13:53:29 2020
From: kevin at dataschool.io (Kevin Markham)
Date: Mon, 17 Aug 2020 13:53:29 -0400
Subject: [scikit-learn] Imputers and DataFrame objects
In-Reply-To:
References:
Message-ID:

Hi Ram,

These are great questions!

> The task was to remove these irregularities. So for the "?" items, replace them with the mean, and for the "one", "two" etc. replace with a numerical value.

If your primary task is "data cleaning", then pandas is usually the optimal tool. If "preprocessing your data for Machine Learning" is your primary task, then scikit-learn is usually the optimal tool. There is some overlap between what is considered "cleaning" and "preprocessing", but I mention this distinction because it can help you decide what tool to use.

> she told me I should use the tools that come with sklearn: SimpleImputer, OneHotEncoder, BinaryEncoder for the "one" "two" "three".

Just for clarification, BinaryEncoder is not part of scikit-learn. Instead, it's part of the Category Encoders library, which is a related project to scikit-learn.

> For one, I couldn't figure out how to apply SimpleImputer on just one column in the DataFrame, and then get the results in the form of a dataframe.

Like most scikit-learn transformers, SimpleImputer expects 2-dimensional input. In your case, this would be a 1-column DataFrame (such as df[['col']]) rather than a Series (such as df['col']).
Also like most scikit-learn transformers, SimpleImputer outputs a NumPy array. If you need the output to be a DataFrame, one option is to convert the array to a pandas object and concatenate it to the original DataFrame.

> Also, when trying to use BinaryEncoder for "one" "two" "three", it raised an exception because there were NaN values there.

Neither OneHotEncoder nor BinaryEncoder will help you to replace these string values with the corresponding numbers. Instead, I recommend using the pandas DataFrame map method.

Alternatively, if you need to do this mapping operation within scikit-learn, you could wrap the pandas functionality into a custom scikit-learn transformer using FunctionTransformer. That is a bit more complicated, though it does have the benefit that you can chain it into a Pipeline with a SimpleImputer. But again, this is more complicated and is not the recommended approach unless you are already fluent with the scikit-learn API.

> Any insight you could give me would be useful.

It sounds like using pandas for the tasks you described is the optimal approach, but I'm basing that opinion purely on what I know from your email.

Hope that helps!

Kevin

On Mon, Aug 17, 2020 at 3:54 AM Ram Rachum wrote:

> Hey guys,
>
> This is a bit of a complicated question.
>
> I was helping my friend do a task with Pandas/sklearn for her data science class. I figured it'll be a breeze, since I'm a fancy-pancy Python programmer. Oh wow, it was so not.
>
> I was trying to do things that felt simple to me, but there were so many problems, I spent 2 hours and only had a partial solution. I'm wondering whether I'm missing something.
>
> She got a CSV with lots of data about cars. Some of the data had missing values (marked with "?"). Additionally, some columns had small numbers written as strings like "one", "two", "three", etc. There were maybe a few more issues like these.
>
> The task was to remove these irregularities. So for the "?" items, replace them with the mean, and for the "one", "two" etc. replace with a numerical value.
>
> I could easily write my own logic that does that, but she told me I should use the tools that come with sklearn: SimpleImputer, OneHotEncoder, BinaryEncoder for the "one" "two" "three".
>
> They gave me so, so many problems. For one, I couldn't figure out how to apply SimpleImputer on just one column in the DataFrame, and then get the results in the form of a dataframe. (Either changing in-place or creating a new DataFrame.) I think I spent an hour on this problem alone. Eventually I found a way, but it definitely felt like I was doing something wrong, like this is supposed to be simpler.
>
> Also, when trying to use BinaryEncoder for "one" "two" "three", it raised an exception because there were NaN values there. Well, I wanted to first convert them to real numbers and then use the same SimpleImputer to fix these. But I couldn't, because of the exception.
>
> Any insight you could give me would be useful.
>
> Thanks,
> Ram.
> _______________________________________________
> scikit-learn mailing list
> scikit-learn at python.org
> https://mail.python.org/mailman/listinfo/scikit-learn

--
Kevin Markham
Founder, Data School
https://www.dataschool.io
https://www.youtube.com/dataschool
https://www.patreon.com/dataschool
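A compact sketch of what Kevin describes, imputing a single column and putting the result back into the DataFrame; the column name, the "?" placeholder, and the toy values follow Ram's description but are otherwise invented:

    import numpy as np
    import pandas as pd
    from sklearn.impute import SimpleImputer

    df = pd.DataFrame({"price": ["13495", "16500", "?", "13950"]})

    # Turn "?" into NaN and strings into floats so the imputer sees numbers.
    df["price"] = pd.to_numeric(df["price"].replace("?", np.nan))

    # SimpleImputer wants 2-D input (df[["price"]], not df["price"]) and
    # returns a NumPy array, which is assigned back into the frame.
    imputer = SimpleImputer(strategy="mean")
    df["price"] = imputer.fit_transform(df[["price"]]).ravel()

    print(df)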
From ram at rachum.com Tue Aug 18 07:56:11 2020
From: ram at rachum.com (Ram Rachum)
Date: Tue, 18 Aug 2020 14:56:11 +0300
Subject: [scikit-learn] Imputers and DataFrame objects
In-Reply-To:
References:
Message-ID:

On Mon, Aug 17, 2020 at 8:55 PM Kevin Markham wrote:

> Hi Ram,
>
> These are great questions!

Thank you for the detailed answers.

> > The task was to remove these irregularities. So for the "?" items, replace them with the mean, and for the "one", "two" etc. replace with a numerical value.
>
> If your primary task is "data cleaning", then pandas is usually the optimal tool. If "preprocessing your data for Machine Learning" is your primary task, then scikit-learn is usually the optimal tool. There is some overlap between what is considered "cleaning" and "preprocessing", but I mention this distinction because it can help you decide what tool to use.

Okay, but here's one example where it gets tricky. For a column with numbers written like "one", "two" and missing values "?", I had to do two things: Change them to numbers (1, 2), and then, instead of the missing values, add the most common element, or mean or whatever. When I tried to use LabelEncoder to do the first part, it complained about the missing values. I couldn't fix these missing values until the labels were changed to ints. So that put me in a frustrating Catch-22 situation, and all the while I'm thinking "It would be so much simpler to just write my own logic in a for-loop rather than try to get Pandas and scikit-learn working together."

Any insights about that?

> > For one, I couldn't figure out how to apply SimpleImputer on just one column in the DataFrame, and then get the results in the form of a dataframe.
>
> Like most scikit-learn transformers, SimpleImputer expects 2-dimensional input. In your case, this would be a 1-column DataFrame (such as df[['col']]) rather than a Series (such as df['col']).
>
> Also like most scikit-learn transformers, SimpleImputer outputs a NumPy array. If you need the output to be a DataFrame, one option is to convert the array to a pandas object and concatenate it to the original DataFrame.

Well, I did do that in the `process_column` helper function in the code I linked to above. But it kind of felt like... What am I using a framework for to begin with? Because that kind of logistics is the reason I want to use a framework instead of managing my own arrays and imputing logic.

Thanks for your help Kevin.

From adrin.jalali at gmail.com Tue Aug 18 11:11:46 2020
From: adrin.jalali at gmail.com (Adrin)
Date: Tue, 18 Aug 2020 17:11:46 +0200
Subject: [scikit-learn] Fwd: [Numpy-discussion] start of an array (tensor) and dataframe API standardization initiative
In-Reply-To:
References:
Message-ID:

FYI: Related to the data-frame like discussions we've been having.

---------- Forwarded message ---------
From: Ralf Gommers
Date: Mon., Aug. 17, 2020, 22:35
Subject: [Numpy-discussion] start of an array (tensor) and dataframe API standardization initiative
To: Discussion of Numerical Python

Hi all,

I'd like to share this announcement blog post about the creation of a consortium for array and dataframe API standardization here: https://data-apis.org/blog/announcing_the_consortium/. It's still in the beginning stages, but starting to take shape. We have participation from one or more maintainers of most array and tensor libraries - NumPy, TensorFlow, PyTorch, MXNet, Dask, JAX, Xarray.
Stephan Hoyer, Travis Oliphant and myself have been providing input from a NumPy perspective.

The effort is very much related to some of the interoperability work we've been doing in NumPy (e.g. it could provide an answer to what's described in https://numpy.org/neps/nep-0037-array-module.html#requesting-restricted-subsets-of-numpy-s-api).

At this point we're looking for feedback from maintainers at a high level (see the blog post for details).

Also important: the python-record-api tooling and data in its repo has very granular API usage data, of the kind we could really use when making decisions that impact backwards compatibility.

Cheers,
Ralf

_______________________________________________
NumPy-Discussion mailing list
NumPy-Discussion at python.org
https://mail.python.org/mailman/listinfo/numpy-discussion

From gael.varoquaux at normalesup.org Tue Aug 18 11:24:11 2020
From: gael.varoquaux at normalesup.org (Gael Varoquaux)
Date: Tue, 18 Aug 2020 17:24:11 +0200
Subject: [scikit-learn] Fwd: [Numpy-discussion] start of an array (tensor) and dataframe API standardization initiative
In-Reply-To:
References:
Message-ID: <20200818152411.uqzjoxz7423man6m@phare.normalesup.org>

Yes, I think that I kickstarted this a few months ago:
https://discuss.ossdata.org/t/a-dataframe-protocol-for-the-pydata-ecosystem/267

I really hope that this will help us serve the community better in scikit-learn!

G

On Tue, Aug 18, 2020 at 05:11:46PM +0200, Adrin wrote:
> FYI: Related to the data-frame like discussions we've been having.

> ---------- Forwarded message ---------
> From: Ralf Gommers
> Date: Mon., Aug. 17, 2020, 22:35
> Subject: [Numpy-discussion] start of an array (tensor) and dataframe API standardization initiative
> To: Discussion of Numerical Python

> Hi all,

> I'd like to share this announcement blog post about the creation of a consortium for array and dataframe API standardization here: https://data-apis.org/blog/announcing_the_consortium/. It's still in the beginning stages, but starting to take shape. We have participation from one or more maintainers of most array and tensor libraries - NumPy, TensorFlow, PyTorch, MXNet, Dask, JAX, Xarray. Stephan Hoyer, Travis Oliphant and myself have been providing input from a NumPy perspective.

> The effort is very much related to some of the interoperability work we've been doing in NumPy (e.g. it could provide an answer to what's described in https://numpy.org/neps/nep-0037-array-module.html#requesting-restricted-subsets-of-numpy-s-api).

> At this point we're looking for feedback from maintainers at a high level (see the blog post for details).

> Also important: the python-record-api tooling and data in its repo has very granular API usage data, of the kind we could really use when making decisions that impact backwards compatibility.
> Cheers,
> Ralf
> _______________________________________________
> NumPy-Discussion mailing list
> NumPy-Discussion at python.org
> https://mail.python.org/mailman/listinfo/numpy-discussion

> _______________________________________________
> scikit-learn mailing list
> scikit-learn at python.org
> https://mail.python.org/mailman/listinfo/scikit-learn

--
Gael Varoquaux
Research Director, INRIA
Visiting professor, McGill
http://gael-varoquaux.info
http://twitter.com/GaelVaroquaux

From kevin at dataschool.io Tue Aug 18 11:39:19 2020
From: kevin at dataschool.io (Kevin Markham)
Date: Tue, 18 Aug 2020 11:39:19 -0400
Subject: [scikit-learn] Imputers and DataFrame objects
In-Reply-To:
References:
Message-ID:

Hi Ram,

> For a column with numbers written like "one", "two" and missing values "?", I had to do two things: Change them to numbers (1, 2), and then, instead of the missing values, add the most common element, or mean or whatever. When I tried to use LabelEncoder to do the first part, it complained about the missing values.

LabelEncoder is not the right tool for this task. It does map strings to integers, but it's not a tool for mapping *particular* strings to *particular* integers. More generally: LabelEncoder is a tool for encoding a label, not a tool for data cleaning (which is how I would describe your task).

> all the while I'm thinking "It would be so much simpler to just write my own logic in a for-loop rather than try to get Pandas and scikit-learn working together."

I wouldn't describe this as a case in which "pandas and scikit-learn aren't working well together." Rather, I would describe this as a case of trying to use a scikit-learn function when what you actually need is a pandas function.

Here's a solution to your problem in two lines of pandas code:

    df['col'] = df['col'].map({'one':1, 'two':2, '?':np.nan})
    df['col'] = df['col'].fillna(df['col'].mean())

Showing you that there is a simple solution is not a critique of you. Rather, pandas and scikit-learn are complex tools with huge APIs, and it takes time to master them. And to be clear, I'm not critiquing the tools either: they are complex tools with huge APIs because they are addressing complex problems with lots of functional areas.

> But it kind of felt like... What am I using a framework for to begin with?

I think you will find that pandas and scikit-learn can save you a lot of code, but it does require finding the right function or class. Learning these tools requires an investment of time, and many people have found that this investment is well worth it.

However, solving your problems with custom code is always an option, and it's totally fine if that is your preferred option!

Hope that helps,

Kevin

On Tue, Aug 18, 2020 at 7:56 AM Ram Rachum wrote:

> On Mon, Aug 17, 2020 at 8:55 PM Kevin Markham wrote:
>
>> Hi Ram,
>>
>> These are great questions!
>
> Thank you for the detailed answers.
>
>> > The task was to remove these irregularities. So for the "?" items, replace them with the mean, and for the "one", "two" etc. replace with a numerical value.
>>
>> If your primary task is "data cleaning", then pandas is usually the optimal tool. If "preprocessing your data for Machine Learning" is your primary task, then scikit-learn is usually the optimal tool. There is some overlap between what is considered "cleaning" and "preprocessing", but I mention this distinction because it can help you decide what tool to use.
>
> Okay, but here's one example where it gets tricky.
> For a column with numbers written like "one", "two" and missing values "?", I had to do two things: Change them to numbers (1, 2), and then, instead of the missing values, add the most common element, or mean or whatever. When I tried to use LabelEncoder to do the first part, it complained about the missing values. I couldn't fix these missing values until the labels were changed to ints. So that put me in a frustrating Catch-22 situation, and all the while I'm thinking "It would be so much simpler to just write my own logic in a for-loop rather than try to get Pandas and scikit-learn working together."
>
> Any insights about that?
>
>> > For one, I couldn't figure out how to apply SimpleImputer on just one column in the DataFrame, and then get the results in the form of a dataframe.
>>
>> Like most scikit-learn transformers, SimpleImputer expects 2-dimensional input. In your case, this would be a 1-column DataFrame (such as df[['col']]) rather than a Series (such as df['col']).
>>
>> Also like most scikit-learn transformers, SimpleImputer outputs a NumPy array. If you need the output to be a DataFrame, one option is to convert the array to a pandas object and concatenate it to the original DataFrame.
>
> Well, I did do that in the `process_column` helper function in the code I linked to above. But it kind of felt like... What am I using a framework for to begin with? Because that kind of logistics is the reason I want to use a framework instead of managing my own arrays and imputing logic.
>
> Thanks for your help Kevin.

--
Kevin Markham
Founder, Data School
https://www.dataschool.io
https://www.youtube.com/dataschool
https://www.patreon.com/dataschool
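For completeness, a sketch of the FunctionTransformer route Kevin mentioned in his earlier message, wrapping the pandas mapping so it can sit in a Pipeline ahead of SimpleImputer; the column values follow the thread, everything else is illustrative:

    import numpy as np
    import pandas as pd
    from sklearn.impute import SimpleImputer
    from sklearn.pipeline import make_pipeline
    from sklearn.preprocessing import FunctionTransformer

    def words_to_numbers(X):
        # X arrives as a DataFrame; map word labels and turn "?" into NaN.
        return X.replace({"one": 1, "two": 2, "three": 3, "?": np.nan})

    pipe = make_pipeline(
        FunctionTransformer(words_to_numbers),
        SimpleImputer(strategy="mean"),
    )

    df = pd.DataFrame({"num_doors": ["two", "?", "one", "two"]})
    print(pipe.fit_transform(df))  # [[2.], [1.666...], [1.], [2.]]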
From ram at rachum.com Tue Aug 18 14:41:10 2020
From: ram at rachum.com (Ram Rachum)
Date: Tue, 18 Aug 2020 21:41:10 +0300
Subject: [scikit-learn] Imputers and DataFrame objects
In-Reply-To:
References:
Message-ID:

On Tue, Aug 18, 2020 at 6:53 PM Kevin Markham wrote:

> Hi Ram,
>
> > For a column with numbers written like "one", "two" and missing values "?", I had to do two things: Change them to numbers (1, 2), and then, instead of the missing values, add the most common element, or mean or whatever. When I tried to use LabelEncoder to do the first part, it complained about the missing values.
>
> LabelEncoder is not the right tool for this task. It does map strings to integers, but it's not a tool for mapping *particular* strings to *particular* integers. More generally: LabelEncoder is a tool for encoding a label, not a tool for data cleaning (which is how I would describe your task).
>
> > all the while I'm thinking "It would be so much simpler to just write my own logic in a for-loop rather than try to get Pandas and scikit-learn working together."
>
> I wouldn't describe this as a case in which "pandas and scikit-learn aren't working well together." Rather, I would describe this as a case of trying to use a scikit-learn function when what you actually need is a pandas function.
>
> Here's a solution to your problem in two lines of pandas code:
> df['col'] = df['col'].map({'one':1, 'two':2, '?':np.nan})
> df['col'] = df['col'].fillna(df['col'].mean())
>
> Showing you that there is a simple solution is not a critique of you. Rather, pandas and scikit-learn are complex tools with huge APIs, and it takes time to master them. And to be clear, I'm not critiquing the tools either: they are complex tools with huge APIs because they are addressing complex problems with lots of functional areas.

I understand, that makes sense. Thank you.

> > But it kind of felt like... What am I using a framework for to begin with?
>
> I think you will find that pandas and scikit-learn can save you a lot of code, but it does require finding the right function or class. Learning these tools requires an investment of time, and many people have found that this investment is well worth it.
>
> However, solving your problems with custom code is always an option, and it's totally fine if that is your preferred option!
>
> Hope that helps,
>
> Kevin

Thanks for your help Kevin.

From solegalli at protonmail.com Wed Aug 19 02:35:41 2020
From: solegalli at protonmail.com (Sole Galli)
Date: Wed, 19 Aug 2020 06:35:41 +0000
Subject: [scikit-learn] Imputers and DataFrame objects
In-Reply-To:
References:
Message-ID:

Did you have a look at the package feature-engine? It has its own imputers and encoders that allow you to select the columns to transform and return a dataframe. It also has a sklearn wrapper that wraps sklearn transformers so that they return a dataframe instead of a numpy array.

Cheers.

Sole

Sent from ProtonMail mobile

-------- Original Message --------
On 18 Aug 2020, 13:56, Ram Rachum wrote:

> On Mon, Aug 17, 2020 at 8:55 PM Kevin Markham wrote:
>
>> Hi Ram,
>>
>> These are great questions!
>
> Thank you for the detailed answers.
>
>>> The task was to remove these irregularities. So for the "?" items, replace them with the mean, and for the "one", "two" etc. replace with a numerical value.
>>
>> If your primary task is "data cleaning", then pandas is usually the optimal tool. If "preprocessing your data for Machine Learning" is your primary task, then scikit-learn is usually the optimal tool. There is some overlap between what is considered "cleaning" and "preprocessing", but I mention this distinction because it can help you decide what tool to use.
>
> Okay, but here's one example where it gets tricky. For a column with numbers written like "one", "two" and missing values "?", I had to do two things: Change them to numbers (1, 2), and then, instead of the missing values, add the most common element, or mean or whatever. When I tried to use LabelEncoder to do the first part, it complained about the missing values. I couldn't fix these missing values until the labels were changed to ints. So that put me in a frustrating Catch-22 situation, and all the while I'm thinking "It would be so much simpler to just write my own logic in a for-loop rather than try to get Pandas and scikit-learn working together."
>
> Any insights about that?
>
>>> For one, I couldn't figure out how to apply SimpleImputer on just one column in the DataFrame, and then get the results in the form of a dataframe.
>>
>> Like most scikit-learn transformers, SimpleImputer expects 2-dimensional input. In your case, this would be a 1-column DataFrame (such as df[['col']]) rather than a Series (such as df['col']).
>>
>> Also like most scikit-learn transformers, SimpleImputer outputs a NumPy array. If you need the output to be a DataFrame, one option is to convert the array to a pandas object and concatenate it to the original DataFrame.
>
> Well, I did do that in the `process_column` helper function in the code I linked to above. But it kind of felt like...
> What am I using a framework for to begin with? Because that kind of logistics is the reason I want to use a framework instead of managing my own arrays and imputing logic.
>
> Thanks for your help Kevin.

From ram at rachum.com Wed Aug 19 03:35:46 2020
From: ram at rachum.com (Ram Rachum)
Date: Wed, 19 Aug 2020 10:35:46 +0300
Subject: [scikit-learn] Imputers and DataFrame objects
In-Reply-To:
References:
Message-ID:

I'll check it out. Thank you.

On Wed, Aug 19, 2020 at 9:46 AM Sole Galli via scikit-learn <scikit-learn at python.org> wrote:

> Did you have a look at the package feature-engine? It has its own imputers and encoders that allow you to select the columns to transform and return a dataframe. It also has a sklearn wrapper that wraps sklearn transformers so that they return a dataframe instead of a numpy array.
>
> Cheers.
>
> Sole
>
> Sent from ProtonMail mobile
>
> -------- Original Message --------
> On 18 Aug 2020, 13:56, Ram Rachum wrote:
>
> On Mon, Aug 17, 2020 at 8:55 PM Kevin Markham wrote:
>
>> Hi Ram,
>>
>> These are great questions!
>
> Thank you for the detailed answers.
>
>> > The task was to remove these irregularities. So for the "?" items, replace them with the mean, and for the "one", "two" etc. replace with a numerical value.
>>
>> If your primary task is "data cleaning", then pandas is usually the optimal tool. If "preprocessing your data for Machine Learning" is your primary task, then scikit-learn is usually the optimal tool. There is some overlap between what is considered "cleaning" and "preprocessing", but I mention this distinction because it can help you decide what tool to use.
>
> Okay, but here's one example where it gets tricky. For a column with numbers written like "one", "two" and missing values "?", I had to do two things: Change them to numbers (1, 2), and then, instead of the missing values, add the most common element, or mean or whatever. When I tried to use LabelEncoder to do the first part, it complained about the missing values. I couldn't fix these missing values until the labels were changed to ints. So that put me in a frustrating Catch-22 situation, and all the while I'm thinking "It would be so much simpler to just write my own logic in a for-loop rather than try to get Pandas and scikit-learn working together."
>
> Any insights about that?
>
>> > For one, I couldn't figure out how to apply SimpleImputer on just one column in the DataFrame, and then get the results in the form of a dataframe.
>>
>> Like most scikit-learn transformers, SimpleImputer expects 2-dimensional input. In your case, this would be a 1-column DataFrame (such as df[['col']]) rather than a Series (such as df['col']).
>>
>> Also like most scikit-learn transformers, SimpleImputer outputs a NumPy array. If you need the output to be a DataFrame, one option is to convert the array to a pandas object and concatenate it to the original DataFrame.
>
> Well, I did do that in the `process_column` helper function in the code I linked to above. But it kind of felt like... What am I using a framework for to begin with? Because that kind of logistics is the reason I want to use a framework instead of managing my own arrays and imputing logic.
>
> Thanks for your help Kevin.
> _______________________________________________
> scikit-learn mailing list
> scikit-learn at python.org
> https://mail.python.org/mailman/listinfo/scikit-learn

From info at martin-thoma.de Fri Aug 21 10:05:19 2020
From: info at martin-thoma.de (Martin Thoma)
Date: Fri, 21 Aug 2020 16:05:19 +0200
Subject: [scikit-learn] Python Version Support Policy
Message-ID:

Hi :-)

Did you discuss at some point a policy for which Python versions you want to support? I see that scikit-learn supports at the moment 3.6, 3.7, and 3.8 (badge in README). In a couple of weeks (October?) there will be 3.9, but I don't see any issue opened discussing 3.9.

A minor Python increment (e.g. 3.8 -> 3.9) will now happen every year, IIRC. So maybe it would be nice to just say "we support the latest 3 Python versions". At the very least, this would mean that wheels are published and that the CI pipeline is run for the newer versions.

I think it would also be nice to "deprecate" old features and drop support explicitly. For example, if scikit-learn dropped support for 3.6 we could use future annotations. The implication of such a drop of support would be a major version increase.

Best regards,
Martin

From adrin.jalali at gmail.com Fri Aug 21 10:41:51 2020
From: adrin.jalali at gmail.com (Adrin)
Date: Fri, 21 Aug 2020 16:41:51 +0200
Subject: [scikit-learn] Python Version Support Policy
In-Reply-To:
References:
Message-ID:

I'd be in favor of whatever the conclusion of the same question on the scipy-dev thread is (https://mail.python.org/pipermail/scipy-dev/2020-August/024318.html), which seems to be dropping 3.6 for the next release.

But I don't think we're going to be supporting only the 3 latest releases, especially since Python has moved to the annual release cycle.

On Fri, Aug 21, 2020 at 4:23 PM Martin Thoma wrote:

> Hi :-)
>
> Did you discuss at some point a policy for which Python versions you want to support? I see that scikit-learn supports at the moment 3.6, 3.7, and 3.8 (badge in README). In a couple of weeks (October?) there will be 3.9, but I don't see any issue opened discussing 3.9.
>
> A minor Python increment (e.g. 3.8 -> 3.9) will now happen every year, IIRC. So maybe it would be nice to just say "we support the latest 3 Python versions". At the very least, this would mean that wheels are published and that the CI pipeline is run for the newer versions.
>
> I think it would also be nice to "deprecate" old features and drop support explicitly. For example, if scikit-learn dropped support for 3.6 we could use future annotations. The implication of such a drop of support would be a major version increase.
>
> Best regards,
> Martin
> _______________________________________________
> scikit-learn mailing list
> scikit-learn at python.org
> https://mail.python.org/mailman/listinfo/scikit-learn

From jayantb1019 at gmail.com Mon Aug 24 22:03:08 2020
From: jayantb1019 at gmail.com (Jayanth B)
Date: Tue, 25 Aug 2020 07:33:08 +0530
Subject: [scikit-learn] Feature Request
Message-ID:

Is there a class / module for Hierarchical Divisive Clustering in scikit-learn?

--
Jayanth Boddu | 7710092075

From g.lemaitre58 at gmail.com Tue Aug 25 04:06:49 2020
From: g.lemaitre58 at gmail.com (=?UTF-8?Q?Guillaume_Lema=C3=AEtre?=)
Date: Tue, 25 Aug 2020 10:06:49 +0200
Subject: [scikit-learn] Feature Request
In-Reply-To:
References:
Message-ID:

In scikit-learn, you have the agglomerative approach (bottom-up):
https://scikit-learn.org/stable/modules/clustering.html#hierarchical-clustering

On Tue, 25 Aug 2020 at 04:06, Jayanth B wrote:

> Is there a class / module for Hierarchical Divisive Clustering in scikit-learn?
> --
> Jayanth Boddu | 7710092075
> _______________________________________________
> scikit-learn mailing list
> scikit-learn at python.org
> https://mail.python.org/mailman/listinfo/scikit-learn

--
Guillaume Lemaitre
Scikit-learn @ Inria Foundation
https://glemaitre.github.io/
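A minimal sketch of the agglomerative (bottom-up) alternative Guillaume points to; the data and parameter choices are placeholders:

    from sklearn.cluster import AgglomerativeClustering
    from sklearn.datasets import make_blobs

    X, _ = make_blobs(n_samples=100, centers=3, random_state=0)

    # Bottom-up hierarchical clustering: every point starts as its own cluster
    # and pairs are merged (here with Ward linkage) until 3 clusters remain.
    labels = AgglomerativeClustering(n_clusters=3, linkage="ward").fit_predict(X)
    print(labels[:10])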
From marmochiaskl at gmail.com Fri Aug 28 18:42:09 2020
From: marmochiaskl at gmail.com (Chiara Marmo)
Date: Sat, 29 Aug 2020 00:42:09 +0200
Subject: [scikit-learn] scikit-learn monthly meeting August 31st
Message-ID:

Dear list,

The next scikit-learn monthly meeting will take place on Monday August 31st at 12PM UTC:
https://www.timeanddate.com/worldclock/meetingdetails.html?year=2020&month=8&day=31&hour=12&min=0&sec=0&p1=240&p2=33&p3=37&p4=179&p5=195

While these meetings are mainly for core-devs to discuss the current topics, we are also happy to welcome non-core devs and other project maintainers. Feel free to join, using the following link:
https://meet.google.com/xhq-yoga-rtf

If you plan to attend and you would like to discuss something specific about your contribution, please add your name (or github pseudo) in the "Contributors" section of the public pad:
https://hackmd.io/AuqfmgwvTf-bFz60yjVG1g

Best,
Chiara

From prashantasaha at montana.edu Fri Aug 28 19:10:23 2020
From: prashantasaha at montana.edu (Saha, Prashanta)
Date: Fri, 28 Aug 2020 23:10:23 +0000
Subject: [scikit-learn] Generating biased dataset using make_classification
Message-ID:

To generate a biased dataset using the make_classification method, which parameter should be used?

Thanks.

From ivomarbsoares at gmail.com Sat Aug 29 13:38:33 2020
From: ivomarbsoares at gmail.com (Ivomar Brito Soares)
Date: Sat, 29 Aug 2020 14:38:33 -0300
Subject: [scikit-learn] New Contributor: Greetings from Brazil
Message-ID:

Hello everyone,

My name is Ivomar and I am a machine learning engineer | data scientist from Brazil. I have been working with machine learning since 2013 and I am a frequent user of scikit-learn. I decided to start contributing to scikit-learn to deepen my knowledge of machine learning. I am very happy to be part of this community and hope to be interacting with you all.

Best regards,
--
Ivomar Brito Soares
https://www.linkedin.com/in/ivomar-brito-soares-26b3b9151/

From cwjbrian50 at gmail.com Sun Aug 30 01:36:23 2020
From: cwjbrian50 at gmail.com (=?UTF-8?B?7LWc7Jqw7KCV?=)
Date: Sun, 30 Aug 2020 14:36:23 +0900
Subject: [scikit-learn] Some typos in documentation and errors
Message-ID:

To whom it may concern,

I am a regular user of scikit-learn. First of all, I am grateful to scikit-learn for providing a good service. The reason I write this email is to report some typos.

While studying AUC, I think I found some typos in the API documentation. As written in the One-vs-one Algorithm part of "https://scikit-learn.org/stable/modules/model_evaluation.html#roc-metrics", the multiclass macro AUC metric was defined in the reference [HT2001]. But the macro OvO AUC in the documentation differs from the one in the reference by a factor of two. Furthermore, under the macro AUC, there is a weighted AUC which, as the documentation says, is defined in the reference [FC2009]. But the reference contains no metric identical to the one in the documentation; there is only a similar one, AU1P. After reviewing the code that defines roc_auc_score, I noticed that it differs from both the expression in the documentation and the one in the reference. Additionally, my scikit-learn version is 0.23.1. I hope this part will be fixed. Thank you for reading my email.

From alexandre.gramfort at inria.fr Sun Aug 30 08:19:40 2020
From: alexandre.gramfort at inria.fr (Alexandre Gramfort)
Date: Sun, 30 Aug 2020 14:19:40 +0200
Subject: [scikit-learn] Some typos in documentation and errors
In-Reply-To:
References:
Message-ID:

Hi,

the best way to get this fixed is to send us a PR updating this file:
https://github.com/scikit-learn/scikit-learn/blob/master/doc/modules/model_evaluation.rst

thanks for your help

Alex

On Sun, Aug 30, 2020 at 7:38 AM 최우정 wrote:
>
> To whom it may concern,
>
> I am a regular user of scikit-learn. First of all, I am grateful to scikit-learn for providing a good service. The reason I write this email is to report some typos.
>
> While studying AUC, I think I found some typos in the API documentation. As written in the One-vs-one Algorithm part of "https://scikit-learn.org/stable/modules/model_evaluation.html#roc-metrics", the multiclass macro AUC metric was defined in the reference [HT2001]. But the macro OvO AUC in the documentation differs from the one in the reference by a factor of two. Furthermore, under the macro AUC, there is a weighted AUC which, as the documentation says, is defined in the reference [FC2009]. But the reference contains no metric identical to the one in the documentation; there is only a similar one, AU1P. After reviewing the code that defines roc_auc_score, I noticed that it differs from both the expression in the documentation and the one in the reference. Additionally, my scikit-learn version is 0.23.1. I hope this part will be fixed. Thank you for reading my email.
> _______________________________________________
> scikit-learn mailing list
> scikit-learn at python.org
> https://mail.python.org/mailman/listinfo/scikit-learn
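For readers following along, a small sketch of the metric under discussion, multiclass ROC AUC with one-vs-one averaging; the model and data are illustrative only:

    from sklearn.datasets import load_iris
    from sklearn.linear_model import LogisticRegression
    from sklearn.metrics import roc_auc_score
    from sklearn.model_selection import train_test_split

    X, y = load_iris(return_X_y=True)
    X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)
    proba = LogisticRegression(max_iter=1000).fit(X_tr, y_tr).predict_proba(X_te)

    # One-vs-one multiclass AUC: 'macro' averages the pairwise AUCs uniformly,
    # 'weighted' weights them by class prevalence (the [HT2001]/[FC2009] variants).
    print(roc_auc_score(y_te, proba, multi_class="ovo", average="macro"))
    print(roc_auc_score(y_te, proba, multi_class="ovo", average="weighted"))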
From g.lemaitre58 at gmail.com Mon Aug 31 04:21:33 2020
From: g.lemaitre58 at gmail.com (=?UTF-8?Q?Guillaume_Lema=C3=AEtre?=)
Date: Mon, 31 Aug 2020 10:21:33 +0200
Subject: [scikit-learn] Generating biased dataset using make_classification
In-Reply-To:
References:
Message-ID:

It depends on which type of biases you want to induce. I would think that the current function is pretty limited for introducing biases, though.

On Sat, 29 Aug 2020 at 01:12, Saha, Prashanta wrote:

> To generate a biased dataset using the make_classification method, which parameter should be used?
> Thanks.
>
> _______________________________________________
> scikit-learn mailing list
> scikit-learn at python.org
> https://mail.python.org/mailman/listinfo/scikit-learn

--
Guillaume Lemaitre
Scikit-learn @ Inria Foundation
https://glemaitre.github.io/
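One knob that does exist, covering only the narrow kind of bias that is class imbalance, is the weights parameter of make_classification; a sketch, not a claim that it matches the original use case:

    from collections import Counter
    from sklearn.datasets import make_classification

    # weights sets the fraction of samples drawn from each class, giving a
    # class-imbalanced dataset; other kinds of bias need custom generation code.
    X, y = make_classification(n_samples=1000, n_features=20,
                               weights=[0.9, 0.1], random_state=0)
    print(Counter(y))  # roughly Counter({0: 900, 1: 100})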
From adrin.jalali at gmail.com Mon Aug 31 04:34:09 2020
From: adrin.jalali at gmail.com (Adrin)
Date: Mon, 31 Aug 2020 10:34:09 +0200
Subject: [scikit-learn] New Contributor: Greetings from Brazil
In-Reply-To:
References:
Message-ID:

Hi,

Hope you enjoy contributing and stick around :)

Happy learning.
Adrin

On Sat, Aug 29, 2020 at 7:42 PM Ivomar Brito Soares wrote:

> Hello everyone,
>
> My name is Ivomar and I am a machine learning engineer | data scientist from Brazil. I have been working with machine learning since 2013 and I am a frequent user of scikit-learn. I decided to start contributing to scikit-learn to deepen my knowledge of machine learning. I am very happy to be part of this community and hope to be interacting with you all.
>
> Best regards,
> --
> Ivomar Brito Soares
> https://www.linkedin.com/in/ivomar-brito-soares-26b3b9151/
> _______________________________________________
> scikit-learn mailing list
> scikit-learn at python.org
> https://mail.python.org/mailman/listinfo/scikit-learn