From randy.heiland at gmail.com  Tue Oct  1 19:33:32 2019
From: randy.heiland at gmail.com (Randy Heiland)
Date: Tue, 1 Oct 2019 19:33:32 -0400
Subject: [scikit-learn] AffinityProp to classify 2D points
Message-ID:

This is surely a well-studied problem, but I'm enjoying just playing with it for now. I have a bunch of 2D points (they are actually circles with possibly varying radii... later) and I'd like to devise a metric of sorts to quantify their arrangement. At first I was thinking K-means, but I don't know how many clusters there might be. So I began playing with AffinityPropagation (for the first time). The results weren't exactly what I was expecting, and I was wondering which parameters I should tweak to get different results.

In the 2 sample datasets/outcomes at https://github.com/rheiland/PhysiCell_tools/tree/master/cell_metrics, I have what I call "uniform" and "clumpy". Can someone offer a general explanation of why they both have ~25 clusters? I'm probably making false assumptions about the AP algorithm. Thanks for any insights. Next, I'll probably explore some image processing and graph algorithms, but I'd welcome other ideas.

Randy

-------------- next part --------------
An HTML attachment was scrubbed...
URL:

From tmrsg11 at gmail.com  Fri Oct  4 12:48:24 2019
From: tmrsg11 at gmail.com (C W)
Date: Fri, 4 Oct 2019 12:48:24 -0400
Subject: [scikit-learn] Can Scikit-learn decision tree (CART) have both continuous and categorical features?
In-Reply-To:
References:
Message-ID:

I'm getting some funny results. I am doing a regression decision tree, and the categorical predictors are assigned to integer levels.

The funny part is: the tree is taking one-hot-encoding (BMW=0, Toyota=1, Audi=2) as numerical values, not categories. The tree splits at 0.5 and 1.5. Am I doing one-hot encoding wrong? How does sklearn know internally that 0 vs. 1 is categorical, not numerical?

In R, for instance, you do as.factor(), which explicitly states the data type.

Thank you!
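Regarding Randy's AffinityPropagation question above: the number of exemplars AP finds is driven largely by the `preference` parameter, which defaults to the median of the pairwise similarities and so tends to produce many clusters for both "uniform" and "clumpy" layouts. A minimal sketch on synthetic 2D points (stand-ins for the linked datasets, which are not reproduced here):

```python
import numpy as np
from sklearn.cluster import AffinityPropagation
from sklearn.datasets import make_blobs

# Synthetic stand-in for the 2D cell centers; the real data is in the linked repo.
X, _ = make_blobs(n_samples=200, centers=4, cluster_std=0.60, random_state=0)

# `preference` controls how many exemplars (clusters) emerge:
# more negative -> fewer clusters. None means "median similarity" (the default).
results = {}
for pref in (None, -100, -1000):
    ap = AffinityPropagation(preference=pref, damping=0.9, random_state=0).fit(X)
    results[pref] = len(ap.cluster_centers_indices_)
    print(f"preference={pref}: {results[pref]} clusters")
```

Sweeping `preference` (and raising `damping` if the messages oscillate) is the usual first thing to try when AP returns many more clusters than expected.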
On Wed, Sep 18, 2019 at 11:13 AM Andreas Mueller wrote:
>
> On 9/15/19 8:16 AM, Guillaume Lemaître wrote:
>
> On Sat, 14 Sep 2019 at 20:59, C W wrote:
>
>> Thanks, Guillaume.
>> Column transformer looks pretty neat. I've also heard, though, that this
>> pipeline can be tedious to set up? Specifying what you want for every
>> feature is a pain.
>
> It would be interesting for us to know which part of the pipeline is
> tedious to set up, so we can see if we can improve something there.
> Do you mean that you would like to automatically detect the type of each
> feature (categorical/numerical) and apply a default encoder/scaling, such
> as discussed there:
> https://github.com/scikit-learn/scikit-learn/issues/10603#issuecomment-401155127
>
> IMO, from a user perspective, it would be cleaner in some cases, at the
> cost of blindly applying a black box, which might be dangerous.
>
> Also see
> https://amueller.github.io/dabl/dev/generated/dabl.EasyPreprocessor.html#dabl.EasyPreprocessor
> which basically does that.
>
>> Javier,
>> Actually, you guessed right. My real data has only one numerical
>> variable; it looks more like this:
>>
>> Gender   Date        Income   Car      Attendance
>> Male     2019/3/01   10000    BMW      Yes
>> Female   2019/5/02   9000     Toyota   No
>> Male     2019/7/15   12000    Audi     Yes
>>
>> I am predicting income using all other categorical variables. Maybe it is
>> catboost!
>>
>> Thanks,
>>
>> M
>>
>> On Sat, Sep 14, 2019 at 9:25 AM Javier López wrote:
>>
>>> If you have datasets with many categorical features, and perhaps many
>>> categories, the tools in sklearn are quite limited, but there are
>>> alternative implementations of boosted trees that are designed with
>>> categorical features in mind. Take a look at catboost [1], which has an
>>> sklearn-compatible API.
>>>
>>> J
>>>
>>> [1] https://catboost.ai/
>>>
>>> On Sat, Sep 14, 2019 at 3:40 AM C W wrote:
>>>
>>>> Hello all,
>>>> I'm very confused. Can the decision tree module handle both continuous
>>>> and categorical features in the dataset? In this case, it's just CART
>>>> (Classification and Regression Trees).
>>>>
>>>> For example,
>>>> Gender   Age   Income   Car      Attendance
>>>> Male     30    10000    BMW      Yes
>>>> Female   35    9000     Toyota   No
>>>> Male     50    12000    Audi     Yes
>>>>
>>>> According to the documentation
>>>> https://scikit-learn.org/stable/modules/tree.html#tree-algorithms-id3-c4-5-c5-0-and-cart,
>>>> it cannot!
>>>>
>>>> It says: "scikit-learn implementation does not support categorical
>>>> variables for now".
>>>>
>>>> Is this true? If not, can someone point me to an example? If yes, what
>>>> do people do?
>>>>
>>>> Thank you very much!
>
> --
> Guillaume Lemaitre
> INRIA Saclay - Parietal team
> Center for Data Science Paris-Saclay
> https://glemaitre.github.io/
>
> _______________________________________________
> scikit-learn mailing list
> scikit-learn at python.org
> https://mail.python.org/mailman/listinfo/scikit-learn

-------------- next part --------------
An HTML attachment was scrubbed...
URL:

From mail at sebastianraschka.com  Fri Oct  4 13:03:17 2019
From: mail at sebastianraschka.com (Sebastian Raschka)
Date: Fri, 4 Oct 2019 12:03:17 -0500
Subject: [scikit-learn] Can Scikit-learn decision tree (CART) have both continuous and categorical features?
In-Reply-To:
References:
Message-ID:

Hi,

> The funny part is: the tree is taking one-hot-encoding (BMW=0, Toyota=1,
> Audi=2) as numerical values, not category. The tree splits at 0.5 and 1.5

that's not a one-hot encoding then. For an Audi data point, it should be

BMW=0
Toyota=0
Audi=1

for BMW

BMW=1
Toyota=0
Audi=0

and for Toyota

BMW=0
Toyota=1
Audi=0

The split threshold should then be at 0.5 for any of these features.

Based on your email, I think you were assuming that the DT does the one-hot encoding internally, which it doesn't. In practice, it is hard to guess what is a nominal and what is an ordinal variable, so you have to do the one-hot encoding before you give the data to the decision tree.

Best,
Sebastian

> On Oct 4, 2019, at 11:48 AM, C W wrote:
>
>> The tree splits at 0.5 and 1.5. Am I doing one-hot-encoding wrong?

-------------- next part --------------
An HTML attachment was scrubbed...
URL:

From tmrsg11 at gmail.com  Fri Oct  4 14:01:23 2019
From: tmrsg11 at gmail.com (C W)
Date: Fri, 4 Oct 2019 14:01:23 -0400
Subject: [scikit-learn] Can Scikit-learn decision tree (CART) have both continuous and categorical features?
In-Reply-To:
References:
Message-ID:

Yes, you are right. It was 0.5 and 0.5 for the splits, not 1.5. So, typo on my part.

Looks like I did the one-hot encoding correctly. My new variable names are: car_Audi, car_BMW, etc.

But the decision tree is still mistaking the one-hot encoding for numerical input and splitting at 0.5. This is not right. Perhaps I'm doing something wrong?

Is there a good toy example on the sklearn website?
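There is no categorical-input tree example in the sklearn docs (trees there are numeric-only), but an end-to-end sketch is short. The column names follow the thread's toy table, and the ColumnTransformer + OneHotEncoder setup is the pipeline Guillaume suggested earlier, not an official example:

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import OneHotEncoder
from sklearn.tree import DecisionTreeRegressor

# Toy data from the thread: predict Income from the categorical columns.
df = pd.DataFrame({
    "Gender":     ["Male", "Female", "Male"],
    "Car":        ["BMW", "Toyota", "Audi"],
    "Attendance": ["Yes", "No", "Yes"],
    "Income":     [10000, 9000, 12000],
})
X, y = df[["Gender", "Car", "Attendance"]], df["Income"]

# One-hot encode the categoricals before they reach the tree; the tree
# itself only ever sees 0/1 columns and will split them at 0.5.
pre = ColumnTransformer(
    [("onehot", OneHotEncoder(), ["Gender", "Car", "Attendance"])]
)
model = make_pipeline(pre, DecisionTreeRegressor(random_state=0))
model.fit(X, y)
print(model.predict(X))  # a fully grown tree fits these 3 distinct rows exactly
```

With real data you would of course hold out a test set; the point here is only that the encoder, not the tree, is what makes the categories categorical.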
I can only see this:
https://scikit-learn.org/stable/auto_examples/tree/plot_tree_regression.html

Thanks!

On Fri, Oct 4, 2019 at 1:28 PM Sebastian Raschka wrote:
>
> that's not a one-hot encoding then. [...] The split threshold should then
> be at 0.5 for any of these features. Based on your email, I think you
> were assuming that the DT does the one-hot encoding internally, which it
> doesn't.

-------------- next part --------------
An HTML attachment was scrubbed...
URL:

From niourf at gmail.com  Fri Oct  4 14:44:04 2019
From: niourf at gmail.com (Nicolas Hug)
Date: Fri, 4 Oct 2019 14:44:04 -0400
Subject: [scikit-learn] Can Scikit-learn decision tree (CART) have both continuous and categorical features?
In-Reply-To:
References:
Message-ID: <5e9661ff-dfb2-cc2e-b71f-ba18024374a1@gmail.com>

> But the decision tree is still mistaking the one-hot encoding for
> numerical input and splitting at 0.5. This is not right. Perhaps I'm
> doing something wrong?

You're not doing anything wrong, and neither is the tree. Trees don't support categorical variables in sklearn, so everything is treated as numerical.

This is why we do one-hot encoding: so that a set of numerical (one-hot encoded) features can be treated as if they were just one categorical feature.

Nicolas

On 10/4/19 2:01 PM, C W wrote:
> Yes, you are right. It was 0.5 and 0.5 for the splits, not 1.5. So, typo
> on my part.
>
> Looks like I did the one-hot encoding correctly. My new variable names
> are: car_Audi, car_BMW, etc.

-------------- next part --------------
An HTML attachment was scrubbed...
URL:

From mail at sebastianraschka.com  Fri Oct  4 15:35:42 2019
From: mail at sebastianraschka.com (Sebastian Raschka)
Date: Fri, 4 Oct 2019 14:35:42 -0500
Subject: [scikit-learn] Can Scikit-learn decision tree (CART) have both continuous and categorical features?
In-Reply-To: <5e9661ff-dfb2-cc2e-b71f-ba18024374a1@gmail.com> References: <5e9661ff-dfb2-cc2e-b71f-ba18024374a1@gmail.com> Message-ID: <7E3EE86D-4B8A-438A-B03A-8DFC8E1D8AB4@sebastianraschka.com> Like Nicolas said, the 0.5 is just a workaround but will do the right thing on the one-hot encoded variables, here. You will find that the threshold is always at 0.5 for these variables. I.e., what it will do is to use the following conversion: treat as car_Audi=1 if car_Audi >= 0.5 treat as car_Audi=0 if car_Audi < 0.5 or, it may be treat as car_Audi=1 if car_Audi > 0.5 treat as car_Audi=0 if car_Audi <= 0.5 (Forgot which one sklearn is using, but either way. it will be fine.) Best, Sebastian > On Oct 4, 2019, at 1:44 PM, Nicolas Hug wrote: > > >> But, decision tree is still mistaking one-hot-encoding as numerical input and split at 0.5. This is not right. Perhaps, I'm doing something wrong? > > You're not doing anything wrong, and neither is the tree. Trees don't support categorical variables in sklearn, so everything is treated as numerical. > > This is why we do one-hot-encoding: so that a set of numerical (one hot encoded) features can be treated as if they were just one categorical feature. > > > > Nicolas > > On 10/4/19 2:01 PM, C W wrote: >> Yes, you are right. it was 0.5 and 0.5 for split, not 1.5. So, typo on my part. >> >> Looks like I did one-hot-encoding correctly. My new variable names are: car_Audi, car_BMW, etc. >> >> But, decision tree is still mistaking one-hot-encoding as numerical input and split at 0.5. This is not right. Perhaps, I'm doing something wrong? >> >> Is there a good toy example on the sklearn website? I am only see this: https://scikit-learn.org/stable/auto_examples/tree/plot_tree_regression.html . >> >> Thanks! 
>> >> >> >> On Fri, Oct 4, 2019 at 1:28 PM Sebastian Raschka > wrote: >> Hi, >> >>> The funny part is: the tree is taking one-hot-encoding (BMW=0, Toyota=1, Audi=2) as numerical values, not category.The tree splits at 0.5 and 1.5 >> >> that's not a onehot encoding then. >> >> For an Audi datapoint, it should be >> >> BMW=0 >> Toyota=0 >> Audi=1 >> >> for BMW >> >> BMW=1 >> Toyota=0 >> Audi=0 >> >> and for Toyota >> >> BMW=0 >> Toyota=1 >> Audi=0 >> >> The split threshold should then be at 0.5 for any of these features. >> >> Based on your email, I think you were assuming that the DT does the one-hot encoding internally, which it doesn't. In practice, it is hard to guess what is a nominal and what is a ordinal variable, so you have to do the onehot encoding before you give the data to the decision tree. >> >> Best, >> Sebastian >> >>> On Oct 4, 2019, at 11:48 AM, C W > wrote: >>> >>> I'm getting some funny results. I am doing a regression decision tree, the response variables are assigned to levels. >>> >>> The funny part is: the tree is taking one-hot-encoding (BMW=0, Toyota=1, Audi=2) as numerical values, not category. >>> >>> The tree splits at 0.5 and 1.5. Am I doing one-hot-encoding wrong? How does the sklearn know internally 0 vs. 1 is categorical, not numerical? >>> >>> In R for instance, you do as.factor(), which explicitly states the data type. >>> >>> Thank you! >>> >>> >>> On Wed, Sep 18, 2019 at 11:13 AM Andreas Mueller > wrote: >>> >>> >>> On 9/15/19 8:16 AM, Guillaume Lema?tre wrote: >>>> >>>> >>>> On Sat, 14 Sep 2019 at 20:59, C W > wrote: >>>> Thanks, Guillaume. >>>> Column transformer looks pretty neat. I've also heard though, this pipeline can be tedious to set up? Specifying what you want for every feature is a pain. >>>> >>>> It would be interesting for us which part of the pipeline is tedious to set up to know if we can improve something there. 
>>>> Do you mean, that you would like to automatically detect of which type of feature (categorical/numerical) and apply a >>>> default encoder/scaling such as discuss there: https://github.com/scikit-learn/scikit-learn/issues/10603#issuecomment-401155127 >>>> >>>> IMO, one a user perspective, it would be cleaner in some cases at the cost of applying blindly a black box >>>> which might be dangerous. >>> Also see https://amueller.github.io/dabl/dev/generated/dabl.EasyPreprocessor.html#dabl.EasyPreprocessor >>> Which basically does that. >>> >>> >>>> >>>> >>>> Jaiver, >>>> Actually, you guessed right. My real data has only one numerical variable, looks more like this: >>>> >>>> Gender Date Income Car Attendance >>>> Male 2019/3/01 10000 BMW Yes >>>> Female 2019/5/02 9000 Toyota No >>>> Male 2019/7/15 12000 Audi Yes >>>> >>>> I am predicting income using all other categorical variables. Maybe it is catboost! >>>> >>>> Thanks, >>>> >>>> M >>>> >>>> >>>> >>>> >>>> >>>> >>>> On Sat, Sep 14, 2019 at 9:25 AM Javier L?pez wrote: >>>> If you have datasets with many categorical features, and perhaps many categories, the tools in sklearn are quite limited, >>>> but there are alternative implementations of boosted trees that are designed with categorical features in mind. Take a look >>>> at catboost [1], which has an sklearn-compatible API. >>>> >>>> J >>>> >>>> [1] https://catboost.ai/ >>>> On Sat, Sep 14, 2019 at 3:40 AM C W > wrote: >>>> Hello all, >>>> I'm very confused. Can the decision tree module handle both continuous and categorical features in the dataset? In this case, it's just CART (Classification and Regression Trees). >>>> >>>> For example, >>>> Gender Age Income Car Attendance >>>> Male 30 10000 BMW Yes >>>> Female 35 9000 Toyota No >>>> Male 50 12000 Audi Yes >>>> >>>> According to the documentation https://scikit-learn.org/stable/modules/tree.html#tree-algorithms-id3-c4-5-c5-0-and-cart , it can not! 
>>>> >>>> It says: "scikit-learn implementation does not support categorical variables for now". >>>> >>>> Is this true? If not, can someone point me to an example? If yes, what do people do? >>>> >>>> Thank you very much! >>>> >>>> >>>> >>>> _______________________________________________ >>>> scikit-learn mailing list >>>> scikit-learn at python.org >>>> https://mail.python.org/mailman/listinfo/scikit-learn >>>> _______________________________________________ >>>> scikit-learn mailing list >>>> scikit-learn at python.org >>>> https://mail.python.org/mailman/listinfo/scikit-learn >>>> _______________________________________________ >>>> scikit-learn mailing list >>>> scikit-learn at python.org >>>> https://mail.python.org/mailman/listinfo/scikit-learn >>>> >>>> >>>> -- >>>> Guillaume Lemaitre >>>> INRIA Saclay - Parietal team >>>> Center for Data Science Paris-Saclay >>>> https://glemaitre.github.io/ >>>> >>>> _______________________________________________ >>>> scikit-learn mailing list >>>> scikit-learn at python.org >>>> https://mail.python.org/mailman/listinfo/scikit-learn >>> >>> _______________________________________________ >>> scikit-learn mailing list >>> scikit-learn at python.org >>> https://mail.python.org/mailman/listinfo/scikit-learn >>> _______________________________________________ >>> scikit-learn mailing list >>> scikit-learn at python.org >>> https://mail.python.org/mailman/listinfo/scikit-learn >> >> _______________________________________________ >> scikit-learn mailing list >> scikit-learn at python.org >> https://mail.python.org/mailman/listinfo/scikit-learn >> >> >> _______________________________________________ >> scikit-learn mailing list >> scikit-learn at python.org >> https://mail.python.org/mailman/listinfo/scikit-learn > _______________________________________________ > scikit-learn mailing list > scikit-learn at python.org > https://mail.python.org/mailman/listinfo/scikit-learn -------------- next part -------------- An HTML 
attachment was scrubbed... URL: From tmrsg11 at gmail.com Fri Oct 4 18:34:50 2019 From: tmrsg11 at gmail.com (C W) Date: Fri, 4 Oct 2019 18:34:50 -0400 Subject: [scikit-learn] Can Scikit-learn decision tree (CART) have both continuous and categorical features? In-Reply-To: <7E3EE86D-4B8A-438A-B03A-8DFC8E1D8AB4@sebastianraschka.com> References: <5e9661ff-dfb2-cc2e-b71f-ba18024374a1@gmail.com> <7E3EE86D-4B8A-438A-B03A-8DFC8E1D8AB4@sebastianraschka.com> Message-ID:

I don't understand your answer.

Why, after one-hot-encoding, does the tree still split on greater than or less than 0.5? Does the sklearn website have a working example of categorical input?

Thanks!

On Fri, Oct 4, 2019 at 3:48 PM Sebastian Raschka wrote:
> Like Nicolas said, the 0.5 is just a workaround, but it will do the right
> thing on the one-hot encoded variables here. You will find that the
> threshold is always at 0.5 for these variables. I.e., what it will do is
> use the following conversion:
>
> treat as car_Audi=1 if car_Audi >= 0.5
> treat as car_Audi=0 if car_Audi < 0.5
>
> or, it may be
>
> treat as car_Audi=1 if car_Audi > 0.5
> treat as car_Audi=0 if car_Audi <= 0.5
>
> (I forgot which one sklearn uses, but either way it will be fine.)
>
> Best,
> Sebastian
>
> > On Oct 4, 2019, at 1:44 PM, Nicolas Hug wrote:
> >
> > > But the decision tree is still treating the one-hot encoding as numerical
> > > input and splitting at 0.5. This is not right. Perhaps I'm doing something wrong?
> >
> > You're not doing anything wrong, and neither is the tree. Trees don't
> > support categorical variables in sklearn, so everything is treated as
> > numerical.
> >
> > This is why we do one-hot-encoding: so that a set of numerical (one-hot
> > encoded) features can be treated as if they were just one categorical
> > feature.
> >
> > Nicolas
> > On 10/4/19 2:01 PM, C W wrote:
> > Yes, you are right. It was 0.5 and 0.5 for the split, not 1.5. So, a typo on my
> > part.
> >
> > Looks like I did the one-hot-encoding correctly. My new variable names are:
> > car_Audi, car_BMW, etc.
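For reference, the behavior being discussed is easy to reproduce. Below is a minimal sketch with made-up toy data mirroring the car_* columns (assumes pandas and scikit-learn are installed; the data values are invented for illustration):

```python
# Sketch: one-hot encode a toy Car column and fit a regression tree.
# Every split the tree makes on a 0/1 dummy column lands at 0.5.
import pandas as pd
from sklearn.tree import DecisionTreeRegressor

df = pd.DataFrame({
    "car": ["BMW", "Toyota", "Audi", "BMW", "Audi", "Toyota"],
    "income": [10000, 9000, 12000, 10500, 11800, 9200],
})

# get_dummies produces car_Audi, car_BMW, car_Toyota columns of 0s and 1s.
X = pd.get_dummies(df["car"], prefix="car", dtype=int)
tree = DecisionTreeRegressor(random_state=0).fit(X, df["income"])

# tree_.feature is negative for leaves; internal nodes carry a threshold.
thresholds = [t for f, t in zip(tree.tree_.feature, tree.tree_.threshold) if f >= 0]
print(thresholds)  # every entry is 0.5
```

Every threshold comes out at 0.5 because each dummy column only takes the values 0 and 1, and sklearn places the cut midway between adjacent observed values.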
> > But the decision tree is still treating the one-hot encoding as numerical input
> and splitting at 0.5. This is not right. Perhaps I'm doing something wrong?
>
> Is there a good toy example on the sklearn website? I only see this:
> https://scikit-learn.org/stable/auto_examples/tree/plot_tree_regression.html
>
> Thanks!
>
> On Fri, Oct 4, 2019 at 1:28 PM Sebastian Raschka <
> mail at sebastianraschka.com> wrote:
>
>> Hi,
>>
>>> The funny part is: the tree is taking the one-hot-encoding (BMW=0, Toyota=1,
>>> Audi=2) as numerical values, not categories. The tree splits at 0.5 and 1.5
>>
>> that's not a one-hot encoding then.
>>
>> For an Audi datapoint, it should be
>>
>> BMW=0
>> Toyota=0
>> Audi=1
>>
>> for BMW
>>
>> BMW=1
>> Toyota=0
>> Audi=0
>>
>> and for Toyota
>>
>> BMW=0
>> Toyota=1
>> Audi=0
>>
>> The split threshold should then be at 0.5 for any of these features.
>>
>> Based on your email, I think you were assuming that the DT does the
>> one-hot encoding internally, which it doesn't. In practice, it is hard to
>> guess what is a nominal and what is an ordinal variable, so you have to do
>> the one-hot encoding before you give the data to the decision tree.
>>
>> Best,
>> Sebastian
>>
>> On Oct 4, 2019, at 11:48 AM, C W wrote:
>>
>> I'm getting some funny results. I am doing a regression decision tree;
>> the response variables are assigned to levels.
>>
>> The funny part is: the tree is taking the one-hot-encoding (BMW=0, Toyota=1,
>> Audi=2) as numerical values, not categories.
>>
>> The tree splits at 0.5 and 1.5. Am I doing one-hot-encoding wrong? How
>> does sklearn know internally that 0 vs. 1 is categorical, not numerical?
>>
>> In R for instance, you do as.factor(), which explicitly states the data
>> type.
>>
>> Thank you!
>>
>> On Wed, Sep 18, 2019 at 11:13 AM Andreas Mueller
>> wrote:
>>
>>> On 9/15/19 8:16 AM, Guillaume Lemaître wrote:
>>>
>>> On Sat, 14 Sep 2019 at 20:59, C W wrote:
>>>
>>>> Thanks, Guillaume.
>>>> Column transformer looks pretty neat. I've also heard, though, that this
>>>> pipeline can be tedious to set up? Specifying what you want for every
>>>> feature is a pain.
>>>>
>>> It would be interesting for us to know which part of the pipeline is tedious to
>>> set up, so we can see if we can improve something there.
>>> Do you mean that you would like to automatically detect which type
>>> of feature (categorical/numerical) you have and apply a
>>> default encoder/scaling such as discussed there:
>>> https://github.com/scikit-learn/scikit-learn/issues/10603#issuecomment-401155127
>>>
>>> IMO, from a user perspective, it would be cleaner in some cases at the
>>> cost of blindly applying a black box
>>> which might be dangerous.
>>>
>>> Also see
>>> https://amueller.github.io/dabl/dev/generated/dabl.EasyPreprocessor.html#dabl.EasyPreprocessor
>>> Which basically does that.
>>>
>>>> Javier,
>>>> Actually, you guessed right. My real data has only one numerical
>>>> variable; it looks more like this:
>>>>
>>>> Gender Date Income Car Attendance
>>>> Male 2019/3/01 10000 BMW Yes
>>>> Female 2019/5/02 9000 Toyota No
>>>> Male 2019/7/15 12000 Audi Yes
>>>>
>>>> I am predicting income using all other categorical variables. Maybe the
>>>> answer is catboost!
>>>>
>>>> Thanks,
>>>>
>>>> M
>>>>
>>>> On Sat, Sep 14, 2019 at 9:25 AM Javier López
>>>> wrote:
>>>>
>>>>> If you have datasets with many categorical features, and perhaps many
>>>>> categories, the tools in sklearn are quite limited,
>>>>> but there are alternative implementations of boosted trees that are
>>>>> designed with categorical features in mind. Take a look
>>>>> at catboost [1], which has an sklearn-compatible API.
>>>>>
>>>>> J
>>>>>
>>>>> [1] https://catboost.ai/
>>>>>
>>>>> On Sat, Sep 14, 2019 at 3:40 AM C W wrote:
>>>>>
>>>>>> Hello all,
>>>>>> I'm very confused. Can the decision tree module handle both
>>>>>> continuous and categorical features in the dataset?
In this case, it's just
>>>>>> CART (Classification and Regression Trees).
>>>>>>
>>>>>> For example,
>>>>>> Gender Age Income Car Attendance
>>>>>> Male 30 10000 BMW Yes
>>>>>> Female 35 9000 Toyota No
>>>>>> Male 50 12000 Audi Yes
>>>>>>
>>>>>> According to the documentation
>>>>>> https://scikit-learn.org/stable/modules/tree.html#tree-algorithms-id3-c4-5-c5-0-and-cart,
>>>>>> it cannot!
>>>>>>
>>>>>> It says: "scikit-learn implementation does not support categorical
>>>>>> variables for now".
>>>>>>
>>>>>> Is this true? If not, can someone point me to an example? If yes,
>>>>>> what do people do?
>>>>>>
>>>>>> Thank you very much!
>>>>>>
>>>>>> _______________________________________________
>>>>>> scikit-learn mailing list
>>>>>> scikit-learn at python.org
>>>>>> https://mail.python.org/mailman/listinfo/scikit-learn
>>>
>>> --
>>> Guillaume Lemaitre
>>> INRIA Saclay - Parietal team
>>> Center for Data Science Paris-Saclay
>>> https://glemaitre.github.io/
>>
>> _______________________________________________
>> scikit-learn mailing list
>> scikit-learn at python.org
>> 
https://mail.python.org/mailman/listinfo/scikit-learn
-------------- next part -------------- An HTML attachment was scrubbed... URL: From mail at sebastianraschka.com Fri Oct 4 18:50:41 2019 From: mail at sebastianraschka.com (Sebastian Raschka) Date: Fri, 4 Oct 2019 17:50:41 -0500 Subject: [scikit-learn] Can Scikit-learn decision tree (CART) have both continuous and categorical features? In-Reply-To: References: <5e9661ff-dfb2-cc2e-b71f-ba18024374a1@gmail.com> <7E3EE86D-4B8A-438A-B03A-8DFC8E1D8AB4@sebastianraschka.com> Message-ID:

Not sure if there's a website for that. In any case, to explain this differently: as discussed earlier, sklearn assumes continuous features for decision trees. So, it will use a binary threshold for splitting along a feature attribute. In other words, it cannot do something like

if x == 1 then right child node
else left child node

Instead, what it does is

if x >= 0.5 then right child node
else left child node

These are basically equivalent, as you can see when you just plug in the values 0 and 1 for x.

Best,
Sebastian

> On Oct 4, 2019, at 5:34 PM, C W wrote:
>
> I don't understand your answer.
>
> Why, after one-hot-encoding, does the tree still split on greater than or less
> than 0.5? Does the sklearn website have a working example of categorical input?
>
> Thanks!
>
> On Fri, Oct 4, 2019 at 3:48 PM Sebastian Raschka wrote:
> Like Nicolas said, the 0.5 is just a workaround but will do the right thing on the one-hot encoded variables, here.
You will find that the threshold is always at 0.5 for these variables. I.e., what it will do is to use the following conversion: > > treat as car_Audi=1 if car_Audi >= 0.5 > treat as car_Audi=0 if car_Audi < 0.5 > > or, it may be > > treat as car_Audi=1 if car_Audi > 0.5 > treat as car_Audi=0 if car_Audi <= 0.5 > > (Forgot which one sklearn is using, but either way. it will be fine.) > > Best, > Sebastian > > >> On Oct 4, 2019, at 1:44 PM, Nicolas Hug wrote: >> >> >>> But, decision tree is still mistaking one-hot-encoding as numerical input and split at 0.5. This is not right. Perhaps, I'm doing something wrong? >> >> You're not doing anything wrong, and neither is the tree. Trees don't support categorical variables in sklearn, so everything is treated as numerical. >> >> This is why we do one-hot-encoding: so that a set of numerical (one hot encoded) features can be treated as if they were just one categorical feature. >> >> >> >> Nicolas >> >> On 10/4/19 2:01 PM, C W wrote: >>> Yes, you are right. it was 0.5 and 0.5 for split, not 1.5. So, typo on my part. >>> >>> Looks like I did one-hot-encoding correctly. My new variable names are: car_Audi, car_BMW, etc. >>> >>> But, decision tree is still mistaking one-hot-encoding as numerical input and split at 0.5. This is not right. Perhaps, I'm doing something wrong? >>> >>> Is there a good toy example on the sklearn website? I am only see this: https://scikit-learn.org/stable/auto_examples/tree/plot_tree_regression.html. >>> >>> Thanks! >>> >>> >>> >>> On Fri, Oct 4, 2019 at 1:28 PM Sebastian Raschka wrote: >>> Hi, >>> >>>> The funny part is: the tree is taking one-hot-encoding (BMW=0, Toyota=1, Audi=2) as numerical values, not category.The tree splits at 0.5 and 1.5 >>> >>> that's not a onehot encoding then. 
>>> >>> For an Audi datapoint, it should be >>> >>> BMW=0 >>> Toyota=0 >>> Audi=1 >>> >>> for BMW >>> >>> BMW=1 >>> Toyota=0 >>> Audi=0 >>> >>> and for Toyota >>> >>> BMW=0 >>> Toyota=1 >>> Audi=0 >>> >>> The split threshold should then be at 0.5 for any of these features. >>> >>> Based on your email, I think you were assuming that the DT does the one-hot encoding internally, which it doesn't. In practice, it is hard to guess what is a nominal and what is a ordinal variable, so you have to do the onehot encoding before you give the data to the decision tree. >>> >>> Best, >>> Sebastian >>> >>>> On Oct 4, 2019, at 11:48 AM, C W wrote: >>>> >>>> I'm getting some funny results. I am doing a regression decision tree, the response variables are assigned to levels. >>>> >>>> The funny part is: the tree is taking one-hot-encoding (BMW=0, Toyota=1, Audi=2) as numerical values, not category. >>>> >>>> The tree splits at 0.5 and 1.5. Am I doing one-hot-encoding wrong? How does the sklearn know internally 0 vs. 1 is categorical, not numerical? >>>> >>>> In R for instance, you do as.factor(), which explicitly states the data type. >>>> >>>> Thank you! >>>> >>>> >>>> On Wed, Sep 18, 2019 at 11:13 AM Andreas Mueller wrote: >>>> >>>> >>>> On 9/15/19 8:16 AM, Guillaume Lema?tre wrote: >>>>> >>>>> >>>>> On Sat, 14 Sep 2019 at 20:59, C W wrote: >>>>> Thanks, Guillaume. >>>>> Column transformer looks pretty neat. I've also heard though, this pipeline can be tedious to set up? Specifying what you want for every feature is a pain. >>>>> >>>>> It would be interesting for us which part of the pipeline is tedious to set up to know if we can improve something there. 
>>>>> Do you mean, that you would like to automatically detect of which type of feature (categorical/numerical) and apply a >>>>> default encoder/scaling such as discuss there: https://github.com/scikit-learn/scikit-learn/issues/10603#issuecomment-401155127 >>>>> >>>>> IMO, one a user perspective, it would be cleaner in some cases at the cost of applying blindly a black box >>>>> which might be dangerous. >>>> Also see https://amueller.github.io/dabl/dev/generated/dabl.EasyPreprocessor.html#dabl.EasyPreprocessor >>>> Which basically does that. >>>> >>>> >>>>> >>>>> >>>>> Jaiver, >>>>> Actually, you guessed right. My real data has only one numerical variable, looks more like this: >>>>> >>>>> Gender Date Income Car Attendance >>>>> Male 2019/3/01 10000 BMW Yes >>>>> Female 2019/5/02 9000 Toyota No >>>>> Male 2019/7/15 12000 Audi Yes >>>>> >>>>> I am predicting income using all other categorical variables. Maybe it is catboost! >>>>> >>>>> Thanks, >>>>> >>>>> M >>>>> >>>>> >>>>> >>>>> >>>>> >>>>> >>>>> On Sat, Sep 14, 2019 at 9:25 AM Javier L?pez wrote: >>>>> If you have datasets with many categorical features, and perhaps many categories, the tools in sklearn are quite limited, >>>>> but there are alternative implementations of boosted trees that are designed with categorical features in mind. Take a look >>>>> at catboost [1], which has an sklearn-compatible API. >>>>> >>>>> J >>>>> >>>>> [1] https://catboost.ai/ >>>>> >>>>> On Sat, Sep 14, 2019 at 3:40 AM C W wrote: >>>>> Hello all, >>>>> I'm very confused. Can the decision tree module handle both continuous and categorical features in the dataset? In this case, it's just CART (Classification and Regression Trees). >>>>> >>>>> For example, >>>>> Gender Age Income Car Attendance >>>>> Male 30 10000 BMW Yes >>>>> Female 35 9000 Toyota No >>>>> Male 50 12000 Audi Yes >>>>> >>>>> According to the documentation https://scikit-learn.org/stable/modules/tree.html#tree-algorithms-id3-c4-5-c5-0-and-cart, it can not! 
> >>>>> It says: "scikit-learn implementation does not support categorical
> >>>>> variables for now".
> >>>>>
> >>>>> Is this true? If not, can someone point me to an example? If yes,
> >>>>> what do people do?
> >>>>>
> >>>>> Thank you very much!
> >>>>>
> >>>>> _______________________________________________
> >>>>> scikit-learn mailing list
> >>>>> scikit-learn at python.org
> >>>>> https://mail.python.org/mailman/listinfo/scikit-learn
> >>>>>
> >>>>> --
> >>>>> Guillaume Lemaitre
> >>>>> INRIA Saclay - Parietal team
> >>>>> Center for Data Science Paris-Saclay
> >>>>> https://glemaitre.github.io/
> >> _______________________________________________
> >> scikit-learn mailing list
> >> scikit-learn at python.org
> >> 
https://mail.python.org/mailman/listinfo/scikit-learn
-------------- next part -------------- An HTML attachment was scrubbed... URL: From tmrsg11 at gmail.com Fri Oct 4 19:33:15 2019 From: tmrsg11 at gmail.com (C W) Date: Fri, 4 Oct 2019 19:33:15 -0400 Subject: [scikit-learn] Can Scikit-learn decision tree (CART) have both continuous and categorical features? In-Reply-To: References: <5e9661ff-dfb2-cc2e-b71f-ba18024374a1@gmail.com> <7E3EE86D-4B8A-438A-B03A-8DFC8E1D8AB4@sebastianraschka.com> Message-ID:

Thanks Sebastian, I think I get it.

It's just that I have never seen it this way. Quite different from what I'm used to in Elements of Statistical Learning.

On Fri, Oct 4, 2019 at 7:13 PM Sebastian Raschka wrote:
> Not sure if there's a website for that. In any case, to explain this
> differently: as discussed earlier, sklearn assumes continuous features for
> decision trees. So, it will use a binary threshold for splitting along a
> feature attribute. In other words, it cannot do something like
>
> if x == 1 then right child node
> else left child node
>
> Instead, what it does is
>
> if x >= 0.5 then right child node
> else left child node
>
> These are basically equivalent, as you can see when you just plug in the values
> 0 and 1 for x.
>
> Best,
> Sebastian
>
> > On Oct 4, 2019, at 5:34 PM, C W wrote:
> >
> > I don't understand your answer.
> >
> > Why, after one-hot-encoding, does the tree still split on greater than or less
> than 0.5? Does the sklearn website have a working example of categorical input?
> >
> > Thanks!
> >
> > On Fri, Oct 4, 2019 at 3:48 PM Sebastian Raschka <
> mail at sebastianraschka.com> wrote:
> > Like Nicolas said, the 0.5 is just a workaround but will do the right
> thing on the one-hot encoded variables, here.
You will find that the > threshold is always at 0.5 for these variables. I.e., what it will do is to > use the following conversion: > > > > treat as car_Audi=1 if car_Audi >= 0.5 > > treat as car_Audi=0 if car_Audi < 0.5 > > > > or, it may be > > > > treat as car_Audi=1 if car_Audi > 0.5 > > treat as car_Audi=0 if car_Audi <= 0.5 > > > > (Forgot which one sklearn is using, but either way. it will be fine.) > > > > Best, > > Sebastian > > > > > >> On Oct 4, 2019, at 1:44 PM, Nicolas Hug wrote: > >> > >> > >>> But, decision tree is still mistaking one-hot-encoding as numerical > input and split at 0.5. This is not right. Perhaps, I'm doing something > wrong? > >> > >> You're not doing anything wrong, and neither is the tree. Trees don't > support categorical variables in sklearn, so everything is treated as > numerical. > >> > >> This is why we do one-hot-encoding: so that a set of numerical (one hot > encoded) features can be treated as if they were just one categorical > feature. > >> > >> > >> > >> Nicolas > >> > >> On 10/4/19 2:01 PM, C W wrote: > >>> Yes, you are right. it was 0.5 and 0.5 for split, not 1.5. So, typo on > my part. > >>> > >>> Looks like I did one-hot-encoding correctly. My new variable names > are: car_Audi, car_BMW, etc. > >>> > >>> But, decision tree is still mistaking one-hot-encoding as numerical > input and split at 0.5. This is not right. Perhaps, I'm doing something > wrong? > >>> > >>> Is there a good toy example on the sklearn website? I am only see > this: > https://scikit-learn.org/stable/auto_examples/tree/plot_tree_regression.html > . > >>> > >>> Thanks! > >>> > >>> > >>> > >>> On Fri, Oct 4, 2019 at 1:28 PM Sebastian Raschka < > mail at sebastianraschka.com> wrote: > >>> Hi, > >>> > >>>> The funny part is: the tree is taking one-hot-encoding (BMW=0, > Toyota=1, Audi=2) as numerical values, not category.The tree splits at 0.5 > and 1.5 > >>> > >>> that's not a onehot encoding then. 
> >>> > >>> For an Audi datapoint, it should be > >>> > >>> BMW=0 > >>> Toyota=0 > >>> Audi=1 > >>> > >>> for BMW > >>> > >>> BMW=1 > >>> Toyota=0 > >>> Audi=0 > >>> > >>> and for Toyota > >>> > >>> BMW=0 > >>> Toyota=1 > >>> Audi=0 > >>> > >>> The split threshold should then be at 0.5 for any of these features. > >>> > >>> Based on your email, I think you were assuming that the DT does the > one-hot encoding internally, which it doesn't. In practice, it is hard to > guess what is a nominal and what is a ordinal variable, so you have to do > the onehot encoding before you give the data to the decision tree. > >>> > >>> Best, > >>> Sebastian > >>> > >>>> On Oct 4, 2019, at 11:48 AM, C W wrote: > >>>> > >>>> I'm getting some funny results. I am doing a regression decision > tree, the response variables are assigned to levels. > >>>> > >>>> The funny part is: the tree is taking one-hot-encoding (BMW=0, > Toyota=1, Audi=2) as numerical values, not category. > >>>> > >>>> The tree splits at 0.5 and 1.5. Am I doing one-hot-encoding wrong? > How does the sklearn know internally 0 vs. 1 is categorical, not numerical? > >>>> > >>>> In R for instance, you do as.factor(), which explicitly states the > data type. > >>>> > >>>> Thank you! > >>>> > >>>> > >>>> On Wed, Sep 18, 2019 at 11:13 AM Andreas Mueller > wrote: > >>>> > >>>> > >>>> On 9/15/19 8:16 AM, Guillaume Lema?tre wrote: > >>>>> > >>>>> > >>>>> On Sat, 14 Sep 2019 at 20:59, C W wrote: > >>>>> Thanks, Guillaume. > >>>>> Column transformer looks pretty neat. I've also heard though, this > pipeline can be tedious to set up? Specifying what you want for every > feature is a pain. > >>>>> > >>>>> It would be interesting for us which part of the pipeline is tedious > to set up to know if we can improve something there. 
> >>>>> Do you mean, that you would like to automatically detect of which > type of feature (categorical/numerical) and apply a > >>>>> default encoder/scaling such as discuss there: > https://github.com/scikit-learn/scikit-learn/issues/10603#issuecomment-401155127 > >>>>> > >>>>> IMO, one a user perspective, it would be cleaner in some cases at > the cost of applying blindly a black box > >>>>> which might be dangerous. > >>>> Also see > https://amueller.github.io/dabl/dev/generated/dabl.EasyPreprocessor.html#dabl.EasyPreprocessor > >>>> Which basically does that. > >>>> > >>>> > >>>>> > >>>>> > >>>>> Jaiver, > >>>>> Actually, you guessed right. My real data has only one numerical > variable, looks more like this: > >>>>> > >>>>> Gender Date Income Car Attendance > >>>>> Male 2019/3/01 10000 BMW Yes > >>>>> Female 2019/5/02 9000 Toyota No > >>>>> Male 2019/7/15 12000 Audi Yes > >>>>> > >>>>> I am predicting income using all other categorical variables. Maybe > it is catboost! > >>>>> > >>>>> Thanks, > >>>>> > >>>>> M > >>>>> > >>>>> > >>>>> > >>>>> > >>>>> > >>>>> > >>>>> On Sat, Sep 14, 2019 at 9:25 AM Javier L?pez wrote: > >>>>> If you have datasets with many categorical features, and perhaps > many categories, the tools in sklearn are quite limited, > >>>>> but there are alternative implementations of boosted trees that are > designed with categorical features in mind. Take a look > >>>>> at catboost [1], which has an sklearn-compatible API. > >>>>> > >>>>> J > >>>>> > >>>>> [1] https://catboost.ai/ > >>>>> > >>>>> On Sat, Sep 14, 2019 at 3:40 AM C W wrote: > >>>>> Hello all, > >>>>> I'm very confused. Can the decision tree module handle both > continuous and categorical features in the dataset? In this case, it's just > CART (Classification and Regression Trees). 
> >>>>> For example,
> >>>>> Gender Age Income Car Attendance
> >>>>> Male 30 10000 BMW Yes
> >>>>> Female 35 9000 Toyota No
> >>>>> Male 50 12000 Audi Yes
> >>>>>
> >>>>> According to the documentation
> https://scikit-learn.org/stable/modules/tree.html#tree-algorithms-id3-c4-5-c5-0-and-cart,
> it cannot!
> >>>>>
> >>>>> It says: "scikit-learn implementation does not support categorical
> variables for now".
> >>>>>
> >>>>> Is this true? If not, can someone point me to an example? If yes,
> what do people do?
> >>>>>
> >>>>> Thank you very much!
> >>>>>
> >>>>> _______________________________________________
> >>>>> scikit-learn mailing list
> >>>>> scikit-learn at python.org
> >>>>> https://mail.python.org/mailman/listinfo/scikit-learn
> >>>>>
> >>>>> --
> >>>>> Guillaume Lemaitre
> >>>>> INRIA Saclay - Parietal team
> >>>>> Center for Data Science Paris-Saclay
> >>>>> https://glemaitre.github.io/
-------------- next part -------------- An HTML attachment was scrubbed... URL: From mail at sebastianraschka.com Fri Oct 4 21:17:54 2019 From: mail at sebastianraschka.com (Sebastian Raschka) Date: Fri, 4 Oct 2019 20:17:54 -0500 Subject: [scikit-learn] Can Scikit-learn decision tree (CART) have both continuous and categorical features? In-Reply-To: References: <5e9661ff-dfb2-cc2e-b71f-ba18024374a1@gmail.com> <7E3EE86D-4B8A-438A-B03A-8DFC8E1D8AB4@sebastianraschka.com> Message-ID: <7A0589D1-D990-4FD6-9D11-AA804E34F3BC@sebastianraschka.com>

Yeah, think of it more as a computational workaround for achieving the same thing more efficiently (although it looks inelegant/weird); something like that wouldn't be mentioned in textbooks.

Best,
Sebastian

> On Oct 4, 2019, at 6:33 PM, C W wrote:
>
> Thanks Sebastian, I think I get it.
>
> It's just that I have never seen it this way. Quite different from what I'm used to in Elements of Statistical Learning.
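The equivalence Sebastian describes can be checked directly: on features restricted to {0, 1}, the categorical test x == 1 and the numeric threshold x >= 0.5 route every sample to the same child node. A tiny sketch (illustrative toy functions, not sklearn's internals):

```python
# For 0/1 one-hot values, `x == 1` and `x >= 0.5` make identical routing
# decisions, which is why the numeric workaround is harmless here.
def route_categorical(x):
    return "right" if x == 1 else "left"

def route_threshold(x):
    return "right" if x >= 0.5 else "left"

for x in (0, 1):
    assert route_categorical(x) == route_threshold(x)
print("identical routing on {0, 1}")
```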
> > On Fri, Oct 4, 2019 at 7:13 PM Sebastian Raschka wrote: > Not sure if there's a website for that. In any case, to explain this differently, as discussed earlier sklearn assumes continuous features for decision trees. So, it will use a binary threshold for splitting along a feature attribute. In other words, it cannot do sth like > > if x == 1 then right child node > else left child node > > Instead, what it does is > > if x >= 0.5 then right child node > else left child node > > These are basically equivalent as you can see when you just plug in values 0 and 1 for x. > > Best, > Sebastian > > > On Oct 4, 2019, at 5:34 PM, C W wrote: > > > > I don't understand your answer. > > > > Why after one-hot-encoding it still outputs greater than 0.5 or less than? Does sklearn website have a working example on categorical input? > > > > Thanks! > > > > On Fri, Oct 4, 2019 at 3:48 PM Sebastian Raschka wrote: > > Like Nicolas said, the 0.5 is just a workaround but will do the right thing on the one-hot encoded variables, here. You will find that the threshold is always at 0.5 for these variables. I.e., what it will do is to use the following conversion: > > > > treat as car_Audi=1 if car_Audi >= 0.5 > > treat as car_Audi=0 if car_Audi < 0.5 > > > > or, it may be > > > > treat as car_Audi=1 if car_Audi > 0.5 > > treat as car_Audi=0 if car_Audi <= 0.5 > > > > (Forgot which one sklearn is using, but either way. it will be fine.) > > > > Best, > > Sebastian > > > > > >> On Oct 4, 2019, at 1:44 PM, Nicolas Hug wrote: > >> > >> > >>> But, decision tree is still mistaking one-hot-encoding as numerical input and split at 0.5. This is not right. Perhaps, I'm doing something wrong? > >> > >> You're not doing anything wrong, and neither is the tree. Trees don't support categorical variables in sklearn, so everything is treated as numerical. 
> >> > >> This is why we do one-hot-encoding: so that a set of numerical (one hot encoded) features can be treated as if they were just one categorical feature. > >> > >> > >> > >> Nicolas > >> > >> On 10/4/19 2:01 PM, C W wrote: > >>> Yes, you are right. it was 0.5 and 0.5 for split, not 1.5. So, typo on my part. > >>> > >>> Looks like I did one-hot-encoding correctly. My new variable names are: car_Audi, car_BMW, etc. > >>> > >>> But, decision tree is still mistaking one-hot-encoding as numerical input and split at 0.5. This is not right. Perhaps, I'm doing something wrong? > >>> > >>> Is there a good toy example on the sklearn website? I am only see this: https://scikit-learn.org/stable/auto_examples/tree/plot_tree_regression.html. > >>> > >>> Thanks! > >>> > >>> > >>> > >>> On Fri, Oct 4, 2019 at 1:28 PM Sebastian Raschka wrote: > >>> Hi, > >>> > >>>> The funny part is: the tree is taking one-hot-encoding (BMW=0, Toyota=1, Audi=2) as numerical values, not category.The tree splits at 0.5 and 1.5 > >>> > >>> that's not a onehot encoding then. > >>> > >>> For an Audi datapoint, it should be > >>> > >>> BMW=0 > >>> Toyota=0 > >>> Audi=1 > >>> > >>> for BMW > >>> > >>> BMW=1 > >>> Toyota=0 > >>> Audi=0 > >>> > >>> and for Toyota > >>> > >>> BMW=0 > >>> Toyota=1 > >>> Audi=0 > >>> > >>> The split threshold should then be at 0.5 for any of these features. > >>> > >>> Based on your email, I think you were assuming that the DT does the one-hot encoding internally, which it doesn't. In practice, it is hard to guess what is a nominal and what is a ordinal variable, so you have to do the onehot encoding before you give the data to the decision tree. > >>> > >>> Best, > >>> Sebastian > >>> > >>>> On Oct 4, 2019, at 11:48 AM, C W wrote: > >>>> > >>>> I'm getting some funny results. I am doing a regression decision tree, the response variables are assigned to levels. 
> >>>> > >>>> The funny part is: the tree is taking one-hot-encoding (BMW=0, Toyota=1, Audi=2) as numerical values, not category. > >>>> > >>>> The tree splits at 0.5 and 1.5. Am I doing one-hot-encoding wrong? How does the sklearn know internally 0 vs. 1 is categorical, not numerical? > >>>> > >>>> In R for instance, you do as.factor(), which explicitly states the data type. > >>>> > >>>> Thank you! > >>>> > >>>> > >>>> On Wed, Sep 18, 2019 at 11:13 AM Andreas Mueller wrote: > >>>> > >>>> > >>>> On 9/15/19 8:16 AM, Guillaume Lema?tre wrote: > >>>>> > >>>>> > >>>>> On Sat, 14 Sep 2019 at 20:59, C W wrote: > >>>>> Thanks, Guillaume. > >>>>> Column transformer looks pretty neat. I've also heard though, this pipeline can be tedious to set up? Specifying what you want for every feature is a pain. > >>>>> > >>>>> It would be interesting for us which part of the pipeline is tedious to set up to know if we can improve something there. > >>>>> Do you mean, that you would like to automatically detect of which type of feature (categorical/numerical) and apply a > >>>>> default encoder/scaling such as discuss there: https://github.com/scikit-learn/scikit-learn/issues/10603#issuecomment-401155127 > >>>>> > >>>>> IMO, one a user perspective, it would be cleaner in some cases at the cost of applying blindly a black box > >>>>> which might be dangerous. > >>>> Also see https://amueller.github.io/dabl/dev/generated/dabl.EasyPreprocessor.html#dabl.EasyPreprocessor > >>>> Which basically does that. > >>>> > >>>> > >>>>> > >>>>> > >>>>> Jaiver, > >>>>> Actually, you guessed right. My real data has only one numerical variable, looks more like this: > >>>>> > >>>>> Gender Date Income Car Attendance > >>>>> Male 2019/3/01 10000 BMW Yes > >>>>> Female 2019/5/02 9000 Toyota No > >>>>> Male 2019/7/15 12000 Audi Yes > >>>>> > >>>>> I am predicting income using all other categorical variables. Maybe it is catboost! 
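For what it's worth, the ColumnTransformer setup discussed above need not be verbose. A sketch on a toy frame mirroring the example data (all values invented):

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import OneHotEncoder
from sklearn.tree import DecisionTreeRegressor

df = pd.DataFrame({
    "Gender": ["Male", "Female", "Male"],
    "Car": ["BMW", "Toyota", "Audi"],
    "Attendance": ["Yes", "No", "Yes"],
    "Income": [10000, 9000, 12000],
})

# One-hot encode the listed categorical columns; any remaining
# (numeric) columns are passed through unchanged.
pre = ColumnTransformer(
    [("onehot", OneHotEncoder(), ["Gender", "Car", "Attendance"])],
    remainder="passthrough",
)
model = make_pipeline(pre, DecisionTreeRegressor(random_state=0))
model.fit(df.drop(columns="Income"), df["Income"])

# Three distinct rows -> the unpruned tree fits them exactly.
print(model.predict(df.drop(columns="Income")).tolist())
```

So the per-feature work reduces to maintaining one list of categorical column names; the transformer handles the rest.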
> >>>>> > >>>>> Thanks, > >>>>> > >>>>> M > >>>>> > >>>>> > >>>>> > >>>>> > >>>>> > >>>>> > >>>>> On Sat, Sep 14, 2019 at 9:25 AM Javier L?pez wrote: > >>>>> If you have datasets with many categorical features, and perhaps many categories, the tools in sklearn are quite limited, > >>>>> but there are alternative implementations of boosted trees that are designed with categorical features in mind. Take a look > >>>>> at catboost [1], which has an sklearn-compatible API. > >>>>> > >>>>> J > >>>>> > >>>>> [1] https://catboost.ai/ > >>>>> > >>>>> On Sat, Sep 14, 2019 at 3:40 AM C W wrote: > >>>>> Hello all, > >>>>> I'm very confused. Can the decision tree module handle both continuous and categorical features in the dataset? In this case, it's just CART (Classification and Regression Trees). > >>>>> > >>>>> For example, > >>>>> Gender Age Income Car Attendance > >>>>> Male 30 10000 BMW Yes > >>>>> Female 35 9000 Toyota No > >>>>> Male 50 12000 Audi Yes > >>>>> > >>>>> According to the documentation https://scikit-learn.org/stable/modules/tree.html#tree-algorithms-id3-c4-5-c5-0-and-cart, it can not! > >>>>> > >>>>> It says: "scikit-learn implementation does not support categorical variables for now". > >>>>> > >>>>> Is this true? If not, can someone point me to an example? If yes, what do people do? > >>>>> > >>>>> Thank you very much! 
> >>>>> > >>>>> > >>>>> > >>>>> _______________________________________________ > >>>>> scikit-learn mailing list > >>>>> scikit-learn at python.org > >>>>> https://mail.python.org/mailman/listinfo/scikit-learn > >>>>> _______________________________________________ > >>>>> scikit-learn mailing list > >>>>> scikit-learn at python.org > >>>>> https://mail.python.org/mailman/listinfo/scikit-learn > >>>>> _______________________________________________ > >>>>> scikit-learn mailing list > >>>>> scikit-learn at python.org > >>>>> https://mail.python.org/mailman/listinfo/scikit-learn > >>>>> > >>>>> > >>>>> -- > >>>>> Guillaume Lemaitre > >>>>> INRIA Saclay - Parietal team > >>>>> Center for Data Science Paris-Saclay > >>>>> https://glemaitre.github.io/ > >>>>> > >>>>> > >>>>> _______________________________________________ > >>>>> scikit-learn mailing list > >>>>> > >>>>> scikit-learn at python.org > >>>>> https://mail.python.org/mailman/listinfo/scikit-learn > >>>> > >>>> _______________________________________________ > >>>> scikit-learn mailing list > >>>> scikit-learn at python.org > >>>> https://mail.python.org/mailman/listinfo/scikit-learn > >>>> _______________________________________________ > >>>> scikit-learn mailing list > >>>> scikit-learn at python.org > >>>> https://mail.python.org/mailman/listinfo/scikit-learn > >>> > >>> _______________________________________________ > >>> scikit-learn mailing list > >>> scikit-learn at python.org > >>> https://mail.python.org/mailman/listinfo/scikit-learn > >>> > >>> > >>> _______________________________________________ > >>> scikit-learn mailing list > >>> > >>> scikit-learn at python.org > >>> https://mail.python.org/mailman/listinfo/scikit-learn > >> _______________________________________________ > >> scikit-learn mailing list > >> scikit-learn at python.org > >> https://mail.python.org/mailman/listinfo/scikit-learn > > > > _______________________________________________ > > scikit-learn mailing list > > 
scikit-learn at python.org > > https://mail.python.org/mailman/listinfo/scikit-learn From tmrsg11 at gmail.com Fri Oct 4 23:09:11 2019 From: tmrsg11 at gmail.com (C W) Date: Fri, 4 Oct 2019 23:09:11 -0400 Subject: [scikit-learn] Can Scikit-learn decision tree (CART) have both continuous and categorical features? In-Reply-To: <7A0589D1-D990-4FD6-9D11-AA804E34F3BC@sebastianraschka.com> References: <5e9661ff-dfb2-cc2e-b71f-ba18024374a1@gmail.com> <7E3EE86D-4B8A-438A-B03A-8DFC8E1D8AB4@sebastianraschka.com> <7A0589D1-D990-4FD6-9D11-AA804E34F3BC@sebastianraschka.com> Message-ID: On a separate note, what do you use for plotting? I found graphviz, but you have to first save it as a png on your computer. That's a lot of work for just one plot. Is there something like matplotlib? Thanks! On Fri, Oct 4, 2019 at 9:42 PM Sebastian Raschka wrote: > Yeah, think of it more as a computational workaround for achieving the > same thing more efficiently (although it looks inelegant/weird) -- something > like that wouldn't be mentioned in textbooks. > > Best, > Sebastian > > > On Oct 4, 2019, at 6:33 PM, C W wrote: > > > > Thanks Sebastian, I think I get it. > > > > It's just that I have never seen it this way. Quite different from what I'm > used to in The Elements of Statistical Learning. > > > > On Fri, Oct 4, 2019 at 7:13 PM Sebastian Raschka < mail at sebastianraschka.com> wrote: > > Not sure if there's a website for that. 
In any case, to explain this > differently, as discussed earlier sklearn assumes continuous features for > decision trees. So, it will use a binary threshold for splitting along a > feature attribute. In other words, it cannot do sth like > > > > if x == 1 then right child node > > else left child node > > > > Instead, what it does is > > > > if x >= 0.5 then right child node > > else left child node > > > > These are basically equivalent as you can see when you just plug in > values 0 and 1 for x. > > > > Best, > > Sebastian > > > > > On Oct 4, 2019, at 5:34 PM, C W wrote: > > > > > > I don't understand your answer. > > > > > > Why after one-hot-encoding it still outputs greater than 0.5 or less > than? Does sklearn website have a working example on categorical input? > > > > > > Thanks! > > > > > > On Fri, Oct 4, 2019 at 3:48 PM Sebastian Raschka < > mail at sebastianraschka.com> wrote: > > > Like Nicolas said, the 0.5 is just a workaround but will do the right > thing on the one-hot encoded variables, here. You will find that the > threshold is always at 0.5 for these variables. I.e., what it will do is to > use the following conversion: > > > > > > treat as car_Audi=1 if car_Audi >= 0.5 > > > treat as car_Audi=0 if car_Audi < 0.5 > > > > > > or, it may be > > > > > > treat as car_Audi=1 if car_Audi > 0.5 > > > treat as car_Audi=0 if car_Audi <= 0.5 > > > > > > (Forgot which one sklearn is using, but either way. it will be fine.) > > > > > > Best, > > > Sebastian > > > > > > > > >> On Oct 4, 2019, at 1:44 PM, Nicolas Hug wrote: > > >> > > >> > > >>> But, decision tree is still mistaking one-hot-encoding as numerical > input and split at 0.5. This is not right. Perhaps, I'm doing something > wrong? > > >> > > >> You're not doing anything wrong, and neither is the tree. Trees don't > support categorical variables in sklearn, so everything is treated as > numerical. 
> > >> > > >> This is why we do one-hot-encoding: so that a set of numerical (one > hot encoded) features can be treated as if they were just one categorical > feature. > > >> > > >> > > >> > > >> Nicolas > > >> > > >> On 10/4/19 2:01 PM, C W wrote: > > >>> Yes, you are right. it was 0.5 and 0.5 for split, not 1.5. So, typo > on my part. > > >>> > > >>> Looks like I did one-hot-encoding correctly. My new variable names > are: car_Audi, car_BMW, etc. > > >>> > > >>> But, decision tree is still mistaking one-hot-encoding as numerical > input and split at 0.5. This is not right. Perhaps, I'm doing something > wrong? > > >>> > > >>> Is there a good toy example on the sklearn website? I am only see > this: > https://scikit-learn.org/stable/auto_examples/tree/plot_tree_regression.html > . > > >>> > > >>> Thanks! > > >>> > > >>> > > >>> > > >>> On Fri, Oct 4, 2019 at 1:28 PM Sebastian Raschka < > mail at sebastianraschka.com> wrote: > > >>> Hi, > > >>> > > >>>> The funny part is: the tree is taking one-hot-encoding (BMW=0, > Toyota=1, Audi=2) as numerical values, not category.The tree splits at 0.5 > and 1.5 > > >>> > > >>> that's not a onehot encoding then. > > >>> > > >>> For an Audi datapoint, it should be > > >>> > > >>> BMW=0 > > >>> Toyota=0 > > >>> Audi=1 > > >>> > > >>> for BMW > > >>> > > >>> BMW=1 > > >>> Toyota=0 > > >>> Audi=0 > > >>> > > >>> and for Toyota > > >>> > > >>> BMW=0 > > >>> Toyota=1 > > >>> Audi=0 > > >>> > > >>> The split threshold should then be at 0.5 for any of these features. > > >>> > > >>> Based on your email, I think you were assuming that the DT does the > one-hot encoding internally, which it doesn't. In practice, it is hard to > guess what is a nominal and what is a ordinal variable, so you have to do > the onehot encoding before you give the data to the decision tree. > > >>> > > >>> Best, > > >>> Sebastian > > >>> > > >>>> On Oct 4, 2019, at 11:48 AM, C W wrote: > > >>>> > > >>>> I'm getting some funny results. 
I am doing a regression decision > tree, the response variables are assigned to levels. > > >>>> > > >>>> The funny part is: the tree is taking one-hot-encoding (BMW=0, > Toyota=1, Audi=2) as numerical values, not category. > > >>>> > > >>>> The tree splits at 0.5 and 1.5. Am I doing one-hot-encoding wrong? > How does the sklearn know internally 0 vs. 1 is categorical, not numerical? > > >>>> > > >>>> In R for instance, you do as.factor(), which explicitly states the > data type. > > >>>> > > >>>> Thank you! > > >>>> > > >>>> > > >>>> On Wed, Sep 18, 2019 at 11:13 AM Andreas Mueller > wrote: > > >>>> > > >>>> > > >>>> On 9/15/19 8:16 AM, Guillaume Lema?tre wrote: > > >>>>> > > >>>>> > > >>>>> On Sat, 14 Sep 2019 at 20:59, C W wrote: > > >>>>> Thanks, Guillaume. > > >>>>> Column transformer looks pretty neat. I've also heard though, this > pipeline can be tedious to set up? Specifying what you want for every > feature is a pain. > > >>>>> > > >>>>> It would be interesting for us which part of the pipeline is > tedious to set up to know if we can improve something there. > > >>>>> Do you mean, that you would like to automatically detect of which > type of feature (categorical/numerical) and apply a > > >>>>> default encoder/scaling such as discuss there: > https://github.com/scikit-learn/scikit-learn/issues/10603#issuecomment-401155127 > > >>>>> > > >>>>> IMO, one a user perspective, it would be cleaner in some cases at > the cost of applying blindly a black box > > >>>>> which might be dangerous. > > >>>> Also see > https://amueller.github.io/dabl/dev/generated/dabl.EasyPreprocessor.html#dabl.EasyPreprocessor > > >>>> Which basically does that. > > >>>> > > >>>> > > >>>>> > > >>>>> > > >>>>> Jaiver, > > >>>>> Actually, you guessed right. 
My real data has only one numerical > variable, looks more like this: > > >>>>> > > >>>>> Gender Date Income Car Attendance > > >>>>> Male 2019/3/01 10000 BMW Yes > > >>>>> Female 2019/5/02 9000 Toyota No > > >>>>> Male 2019/7/15 12000 Audi Yes > > >>>>> > > >>>>> I am predicting income using all other categorical variables. > Maybe it is catboost! > > >>>>> > > >>>>> Thanks, > > >>>>> > > >>>>> M > > >>>>> > > >>>>> > > >>>>> > > >>>>> > > >>>>> > > >>>>> > > >>>>> On Sat, Sep 14, 2019 at 9:25 AM Javier L?pez > wrote: > > >>>>> If you have datasets with many categorical features, and perhaps > many categories, the tools in sklearn are quite limited, > > >>>>> but there are alternative implementations of boosted trees that > are designed with categorical features in mind. Take a look > > >>>>> at catboost [1], which has an sklearn-compatible API. > > >>>>> > > >>>>> J > > >>>>> > > >>>>> [1] https://catboost.ai/ > > >>>>> > > >>>>> On Sat, Sep 14, 2019 at 3:40 AM C W wrote: > > >>>>> Hello all, > > >>>>> I'm very confused. Can the decision tree module handle both > continuous and categorical features in the dataset? In this case, it's just > CART (Classification and Regression Trees). > > >>>>> > > >>>>> For example, > > >>>>> Gender Age Income Car Attendance > > >>>>> Male 30 10000 BMW Yes > > >>>>> Female 35 9000 Toyota No > > >>>>> Male 50 12000 Audi Yes > > >>>>> > > >>>>> According to the documentation > https://scikit-learn.org/stable/modules/tree.html#tree-algorithms-id3-c4-5-c5-0-and-cart, > it can not! > > >>>>> > > >>>>> It says: "scikit-learn implementation does not support categorical > variables for now". > > >>>>> > > >>>>> Is this true? If not, can someone point me to an example? If yes, > what do people do? > > >>>>> > > >>>>> Thank you very much! 
https://mail.python.org/mailman/listinfo/scikit-learn > > > > > > _______________________________________________ > > > scikit-learn mailing list > > > scikit-learn at python.org > > > https://mail.python.org/mailman/listinfo/scikit-learn > > > _______________________________________________ > > > scikit-learn mailing list > > > scikit-learn at python.org > > > https://mail.python.org/mailman/listinfo/scikit-learn > > > > _______________________________________________ > > scikit-learn mailing list > > scikit-learn at python.org > > https://mail.python.org/mailman/listinfo/scikit-learn > > _______________________________________________ > > scikit-learn mailing list > > scikit-learn at python.org > > https://mail.python.org/mailman/listinfo/scikit-learn > > _______________________________________________ > scikit-learn mailing list > scikit-learn at python.org > https://mail.python.org/mailman/listinfo/scikit-learn > -------------- next part -------------- An HTML attachment was scrubbed... URL: From mail at sebastianraschka.com Fri Oct 4 23:28:46 2019 From: mail at sebastianraschka.com (Sebastian Raschka) Date: Fri, 4 Oct 2019 22:28:46 -0500 Subject: [scikit-learn] Can Scikit-learn decision tree (CART) have both continuous and categorical features? In-Reply-To: References: <5e9661ff-dfb2-cc2e-b71f-ba18024374a1@gmail.com> <7E3EE86D-4B8A-438A-B03A-8DFC8E1D8AB4@sebastianraschka.com> <7A0589D1-D990-4FD6-9D11-AA804E34F3BC@sebastianraschka.com> Message-ID: <4FC33890-94D3-4AA8-8FA9-EF1FADFD4C20@sebastianraschka.com> The docs show a way such that you don't need to write it as png file using tree.plot_tree: https://scikit-learn.org/stable/modules/tree.html#classification I don't remember why, but I think I had problems with that in the past (I think it didn't look so nice visually, but don't remember), which is why I still stick to graphviz. 
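Since scikit-learn 0.21, the matplotlib route is built in as sklearn.tree.plot_tree: it draws onto a regular Axes, so no graphviz install and no intermediate .dot or .png file is needed. A minimal sketch (iris used only as stand-in data):

```python
import matplotlib
matplotlib.use("Agg")  # headless backend; drop this when plotting interactively
import matplotlib.pyplot as plt
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier, plot_tree

X, y = load_iris(return_X_y=True)
clf = DecisionTreeClassifier(max_depth=2, random_state=0).fit(X, y)

# plot_tree renders onto a matplotlib Axes and returns one
# annotation artist per drawn node, so it composes with
# subplots, titles, and savefig like any other figure.
fig, ax = plt.subplots(figsize=(8, 5))
annotations = plot_tree(clf, filled=True, ax=ax)
fig.savefig("tree.png")  # optional; plt.show() in an interactive session
```

Saving to a file here is a choice, not a requirement of the API.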
For my use cases, it's not much hassle -- it used to be a bit of a hassle to get GraphViz working, but now you can do conda install pydotplus conda install graphviz Coincidentally, I just made an example for a lecture I was teaching on Tue: https://github.com/rasbt/stat479-machine-learning-fs19/blob/master/06_trees/code/06-trees_demo.ipynb Best, Sebastian > On Oct 4, 2019, at 10:09 PM, C W wrote: > > On a separate note, what do you use for plotting? > > I found graphviz, but you have to first save it as a png on your computer. That's a lot work for just one plot. Is there something like a matplotlib? > > Thanks! > > On Fri, Oct 4, 2019 at 9:42 PM Sebastian Raschka wrote: > Yeah, think of it more as a computational workaround for achieving the same thing more efficiently (although it looks inelegant/weird)-- something like that wouldn't be mentioned in textbooks. > > Best, > Sebastian > > > On Oct 4, 2019, at 6:33 PM, C W wrote: > > > > Thanks Sebastian, I think I get it. > > > > It's just have never seen it this way. Quite different from what I'm used in Elements of Statistical Learning. > > > > On Fri, Oct 4, 2019 at 7:13 PM Sebastian Raschka wrote: > > Not sure if there's a website for that. In any case, to explain this differently, as discussed earlier sklearn assumes continuous features for decision trees. So, it will use a binary threshold for splitting along a feature attribute. In other words, it cannot do sth like > > > > if x == 1 then right child node > > else left child node > > > > Instead, what it does is > > > > if x >= 0.5 then right child node > > else left child node > > > > These are basically equivalent as you can see when you just plug in values 0 and 1 for x. > > > > Best, > > Sebastian > > > > > On Oct 4, 2019, at 5:34 PM, C W wrote: > > > > > > I don't understand your answer. > > > > > > Why after one-hot-encoding it still outputs greater than 0.5 or less than? Does sklearn website have a working example on categorical input? 
> > > > > > Thanks! > > > > > > On Fri, Oct 4, 2019 at 3:48 PM Sebastian Raschka wrote: > > > Like Nicolas said, the 0.5 is just a workaround but will do the right thing on the one-hot encoded variables, here. You will find that the threshold is always at 0.5 for these variables. I.e., what it will do is to use the following conversion: > > > > > > treat as car_Audi=1 if car_Audi >= 0.5 > > > treat as car_Audi=0 if car_Audi < 0.5 > > > > > > or, it may be > > > > > > treat as car_Audi=1 if car_Audi > 0.5 > > > treat as car_Audi=0 if car_Audi <= 0.5 > > > > > > (Forgot which one sklearn is using, but either way. it will be fine.) > > > > > > Best, > > > Sebastian > > > > > > > > >> On Oct 4, 2019, at 1:44 PM, Nicolas Hug wrote: > > >> > > >> > > >>> But, decision tree is still mistaking one-hot-encoding as numerical input and split at 0.5. This is not right. Perhaps, I'm doing something wrong? > > >> > > >> You're not doing anything wrong, and neither is the tree. Trees don't support categorical variables in sklearn, so everything is treated as numerical. > > >> > > >> This is why we do one-hot-encoding: so that a set of numerical (one hot encoded) features can be treated as if they were just one categorical feature. > > >> > > >> > > >> > > >> Nicolas > > >> > > >> On 10/4/19 2:01 PM, C W wrote: > > >>> Yes, you are right. it was 0.5 and 0.5 for split, not 1.5. So, typo on my part. > > >>> > > >>> Looks like I did one-hot-encoding correctly. My new variable names are: car_Audi, car_BMW, etc. > > >>> > > >>> But, decision tree is still mistaking one-hot-encoding as numerical input and split at 0.5. This is not right. Perhaps, I'm doing something wrong? > > >>> > > >>> Is there a good toy example on the sklearn website? I am only see this: https://scikit-learn.org/stable/auto_examples/tree/plot_tree_regression.html. > > >>> > > >>> Thanks! 
> > >>> > > >>> > > >>> > > >>> On Fri, Oct 4, 2019 at 1:28 PM Sebastian Raschka wrote: > > >>> Hi, > > >>> > > >>>> The funny part is: the tree is taking one-hot-encoding (BMW=0, Toyota=1, Audi=2) as numerical values, not category.The tree splits at 0.5 and 1.5 > > >>> > > >>> that's not a onehot encoding then. > > >>> > > >>> For an Audi datapoint, it should be > > >>> > > >>> BMW=0 > > >>> Toyota=0 > > >>> Audi=1 > > >>> > > >>> for BMW > > >>> > > >>> BMW=1 > > >>> Toyota=0 > > >>> Audi=0 > > >>> > > >>> and for Toyota > > >>> > > >>> BMW=0 > > >>> Toyota=1 > > >>> Audi=0 > > >>> > > >>> The split threshold should then be at 0.5 for any of these features. > > >>> > > >>> Based on your email, I think you were assuming that the DT does the one-hot encoding internally, which it doesn't. In practice, it is hard to guess what is a nominal and what is a ordinal variable, so you have to do the onehot encoding before you give the data to the decision tree. > > >>> > > >>> Best, > > >>> Sebastian > > >>> > > >>>> On Oct 4, 2019, at 11:48 AM, C W wrote: > > >>>> > > >>>> I'm getting some funny results. I am doing a regression decision tree, the response variables are assigned to levels. > > >>>> > > >>>> The funny part is: the tree is taking one-hot-encoding (BMW=0, Toyota=1, Audi=2) as numerical values, not category. > > >>>> > > >>>> The tree splits at 0.5 and 1.5. Am I doing one-hot-encoding wrong? How does the sklearn know internally 0 vs. 1 is categorical, not numerical? > > >>>> > > >>>> In R for instance, you do as.factor(), which explicitly states the data type. > > >>>> > > >>>> Thank you! > > >>>> > > >>>> > > >>>> On Wed, Sep 18, 2019 at 11:13 AM Andreas Mueller wrote: > > >>>> > > >>>> > > >>>> On 9/15/19 8:16 AM, Guillaume Lema?tre wrote: > > >>>>> > > >>>>> > > >>>>> On Sat, 14 Sep 2019 at 20:59, C W wrote: > > >>>>> Thanks, Guillaume. > > >>>>> Column transformer looks pretty neat. I've also heard though, this pipeline can be tedious to set up? 
Specifying what you want for every feature is a pain. > > >>>>> > > >>>>> It would be interesting for us which part of the pipeline is tedious to set up to know if we can improve something there. > > >>>>> Do you mean, that you would like to automatically detect of which type of feature (categorical/numerical) and apply a > > >>>>> default encoder/scaling such as discuss there: https://github.com/scikit-learn/scikit-learn/issues/10603#issuecomment-401155127 > > >>>>> > > >>>>> IMO, one a user perspective, it would be cleaner in some cases at the cost of applying blindly a black box > > >>>>> which might be dangerous. > > >>>> Also see https://amueller.github.io/dabl/dev/generated/dabl.EasyPreprocessor.html#dabl.EasyPreprocessor > > >>>> Which basically does that. > > >>>> > > >>>> > > >>>>> > > >>>>> > > >>>>> Jaiver, > > >>>>> Actually, you guessed right. My real data has only one numerical variable, looks more like this: > > >>>>> > > >>>>> Gender Date Income Car Attendance > > >>>>> Male 2019/3/01 10000 BMW Yes > > >>>>> Female 2019/5/02 9000 Toyota No > > >>>>> Male 2019/7/15 12000 Audi Yes > > >>>>> > > >>>>> I am predicting income using all other categorical variables. Maybe it is catboost! > > >>>>> > > >>>>> Thanks, > > >>>>> > > >>>>> M > > >>>>> > > >>>>> > > >>>>> > > >>>>> > > >>>>> > > >>>>> > > >>>>> On Sat, Sep 14, 2019 at 9:25 AM Javier L?pez wrote: > > >>>>> If you have datasets with many categorical features, and perhaps many categories, the tools in sklearn are quite limited, > > >>>>> but there are alternative implementations of boosted trees that are designed with categorical features in mind. Take a look > > >>>>> at catboost [1], which has an sklearn-compatible API. > > >>>>> > > >>>>> J > > >>>>> > > >>>>> [1] https://catboost.ai/ > > >>>>> > > >>>>> On Sat, Sep 14, 2019 at 3:40 AM C W wrote: > > >>>>> Hello all, > > >>>>> I'm very confused. Can the decision tree module handle both continuous and categorical features in the dataset? 
In this case, it's just CART (Classification and Regression Trees). > > >>>>> > > >>>>> For example, > > >>>>> Gender Age Income Car Attendance > > >>>>> Male 30 10000 BMW Yes > > >>>>> Female 35 9000 Toyota No > > >>>>> Male 50 12000 Audi Yes > > >>>>> > > >>>>> According to the documentation https://scikit-learn.org/stable/modules/tree.html#tree-algorithms-id3-c4-5-c5-0-and-cart, it can not! > > >>>>> > > >>>>> It says: "scikit-learn implementation does not support categorical variables for now". > > >>>>> > > >>>>> Is this true? If not, can someone point me to an example? If yes, what do people do? > > >>>>> > > >>>>> Thank you very much! > > >>>>> > > >>>>> > > >>>>> > > >>>>> _______________________________________________ > > >>>>> scikit-learn mailing list > > >>>>> scikit-learn at python.org > > >>>>> https://mail.python.org/mailman/listinfo/scikit-learn > > >>>>> _______________________________________________ > > >>>>> scikit-learn mailing list > > >>>>> scikit-learn at python.org > > >>>>> https://mail.python.org/mailman/listinfo/scikit-learn > > >>>>> _______________________________________________ > > >>>>> scikit-learn mailing list > > >>>>> scikit-learn at python.org > > >>>>> https://mail.python.org/mailman/listinfo/scikit-learn > > >>>>> > > >>>>> > > >>>>> -- > > >>>>> Guillaume Lemaitre > > >>>>> INRIA Saclay - Parietal team > > >>>>> Center for Data Science Paris-Saclay > > >>>>> https://glemaitre.github.io/ > > >>>>> > > >>>>> > > >>>>> _______________________________________________ > > >>>>> scikit-learn mailing list > > >>>>> > > >>>>> scikit-learn at python.org > > >>>>> https://mail.python.org/mailman/listinfo/scikit-learn > > >>>> > > >>>> _______________________________________________ > > >>>> scikit-learn mailing list > > >>>> scikit-learn at python.org > > >>>> https://mail.python.org/mailman/listinfo/scikit-learn > > >>>> _______________________________________________ > > >>>> scikit-learn mailing list > > >>>> scikit-learn at 
python.org > > >>>> https://mail.python.org/mailman/listinfo/scikit-learn > > >>> > > >>> _______________________________________________ > > >>> scikit-learn mailing list > > >>> scikit-learn at python.org > > >>> https://mail.python.org/mailman/listinfo/scikit-learn > > >>> > > >>> > > >>> _______________________________________________ > > >>> scikit-learn mailing list > > >>> > > >>> scikit-learn at python.org > > >>> https://mail.python.org/mailman/listinfo/scikit-learn > > >> _______________________________________________ > > >> scikit-learn mailing list > > >> scikit-learn at python.org > > >> https://mail.python.org/mailman/listinfo/scikit-learn > > > > > > _______________________________________________ > > > scikit-learn mailing list > > > scikit-learn at python.org > > > https://mail.python.org/mailman/listinfo/scikit-learn > > > _______________________________________________ > > > scikit-learn mailing list > > > scikit-learn at python.org > > > https://mail.python.org/mailman/listinfo/scikit-learn > > > > _______________________________________________ > > scikit-learn mailing list > > scikit-learn at python.org > > https://mail.python.org/mailman/listinfo/scikit-learn > > _______________________________________________ > > scikit-learn mailing list > > scikit-learn at python.org > > https://mail.python.org/mailman/listinfo/scikit-learn > > _______________________________________________ > scikit-learn mailing list > scikit-learn at python.org > https://mail.python.org/mailman/listinfo/scikit-learn > _______________________________________________ > scikit-learn mailing list > scikit-learn at python.org > https://mail.python.org/mailman/listinfo/scikit-learn From javaeurusd at gmail.com Sat Oct 5 12:20:37 2019 From: javaeurusd at gmail.com (Mike Smith) Date: Sat, 5 Oct 2019 09:20:37 -0700 Subject: [scikit-learn] scikit-learn Digest, Vol 43, Issue 8 In-Reply-To: References: Message-ID: Are Nearest Neighbor models better than decision trees for 
Adaboost? On Sat, Oct 5, 2019 at 9:02 AM wrote: > Send scikit-learn mailing list submissions to > scikit-learn at python.org > > To subscribe or unsubscribe via the World Wide Web, visit > https://mail.python.org/mailman/listinfo/scikit-learn > or, via email, send a message with subject or body 'help' to > scikit-learn-request at python.org > > You can reach the person managing the list at > scikit-learn-owner at python.org > > When replying, please edit your Subject line so it is more specific > than "Re: Contents of scikit-learn digest..." > > > Today's Topics: > > 1. Re: Can Scikit-learn decision tree (CART) have both > continuous and categorical features? (Sebastian Raschka) > > > ---------------------------------------------------------------------- > > Message: 1 > Date: Fri, 4 Oct 2019 22:28:46 -0500 > From: Sebastian Raschka > To: Scikit-learn mailing list > Subject: Re: [scikit-learn] Can Scikit-learn decision tree (CART) have > both continuous and categorical features? > Message-ID: > <4FC33890-94D3-4AA8-8FA9-EF1FADFD4C20 at sebastianraschka.com> > Content-Type: text/plain; charset=utf-8 > > The docs show a way such that you don't need to write it as png file using > tree.plot_tree: > https://scikit-learn.org/stable/modules/tree.html#classification > > I don't remember why, but I think I had problems with that in the past (I > think it didn't look so nice visually, but don't remember), which is why I > still stick to graphviz. For my use cases, it's not much hassle -- it used > to be a bit of a hassle to get GraphViz working, but now you can do > > conda install pydotplus > conda install graphviz > > Coincidentally, I just made an example for a lecture I was teaching on > Tue: > https://github.com/rasbt/stat479-machine-learning-fs19/blob/master/06_trees/code/06-trees_demo.ipynb > > Best, > Sebastian > > > > On Oct 4, 2019, at 10:09 PM, C W wrote: > > > > On a separate note, what do you use for plotting? 
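On the AdaBoost question above: shallow decision trees are the conventional base learner, and, as far as I can tell, nearest-neighbor models cannot be plugged in directly, because AdaBoost reweights the training samples each round and KNeighborsClassifier.fit accepts no sample_weight. A minimal sketch using the default depth-1 tree ("stump") base learner on synthetic data:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import AdaBoostClassifier
from sklearn.model_selection import cross_val_score

X, y = make_classification(n_samples=300, random_state=0)

# AdaBoostClassifier's default base learner is
# DecisionTreeClassifier(max_depth=1): fast, sample-weight aware,
# and weak enough that boosting has room to improve on it.
clf = AdaBoostClassifier(n_estimators=100, random_state=0)
score = cross_val_score(clf, X, y, cv=5).mean()
print(round(score, 3))
```

Any estimator whose fit supports sample_weight could in principle be boosted instead; trees are just the choice that reliably works well.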
> > > > I found graphviz, but you have to first save it as a png on your > computer. That's a lot work for just one plot. Is there something like a > matplotlib? > > > > Thanks! > > > > On Fri, Oct 4, 2019 at 9:42 PM Sebastian Raschka < > mail at sebastianraschka.com> wrote: > > Yeah, think of it more as a computational workaround for achieving the > same thing more efficiently (although it looks inelegant/weird)-- something > like that wouldn't be mentioned in textbooks. > > > > Best, > > Sebastian > > > > > On Oct 4, 2019, at 6:33 PM, C W wrote: > > > > > > Thanks Sebastian, I think I get it. > > > > > > It's just have never seen it this way. Quite different from what I'm > used in Elements of Statistical Learning. > > > > > > On Fri, Oct 4, 2019 at 7:13 PM Sebastian Raschka < > mail at sebastianraschka.com> wrote: > > > Not sure if there's a website for that. In any case, to explain this > differently, as discussed earlier sklearn assumes continuous features for > decision trees. So, it will use a binary threshold for splitting along a > feature attribute. In other words, it cannot do sth like > > > > > > if x == 1 then right child node > > > else left child node > > > > > > Instead, what it does is > > > > > > if x >= 0.5 then right child node > > > else left child node > > > > > > These are basically equivalent as you can see when you just plug in > values 0 and 1 for x. > > > > > > Best, > > > Sebastian > > > > > > > On Oct 4, 2019, at 5:34 PM, C W wrote: > > > > > > > > I don't understand your answer. > > > > > > > > Why after one-hot-encoding it still outputs greater than 0.5 or less > than? Does sklearn website have a working example on categorical input? > > > > > > > > Thanks! > > > > > > > > On Fri, Oct 4, 2019 at 3:48 PM Sebastian Raschka < > mail at sebastianraschka.com> wrote: > > > > Like Nicolas said, the 0.5 is just a workaround but will do the > right thing on the one-hot encoded variables, here. 
You will find that the > threshold is always at 0.5 for these variables. I.e., what it will do is to > use the following conversion: > > > > > > > > treat as car_Audi=1 if car_Audi >= 0.5 > > > > treat as car_Audi=0 if car_Audi < 0.5 > > > > > > > > or, it may be > > > > > > > > treat as car_Audi=1 if car_Audi > 0.5 > > > > treat as car_Audi=0 if car_Audi <= 0.5 > > > > > > > > (Forgot which one sklearn is using, but either way. it will be fine.) > > > > > > > > Best, > > > > Sebastian > > > > > > > > > > > >> On Oct 4, 2019, at 1:44 PM, Nicolas Hug wrote: > > > >> > > > >> > > > >>> But, decision tree is still mistaking one-hot-encoding as > numerical input and split at 0.5. This is not right. Perhaps, I'm doing > something wrong? > > > >> > > > >> You're not doing anything wrong, and neither is the tree. Trees > don't support categorical variables in sklearn, so everything is treated as > numerical. > > > >> > > > >> This is why we do one-hot-encoding: so that a set of numerical (one > hot encoded) features can be treated as if they were just one categorical > feature. > > > >> > > > >> > > > >> > > > >> Nicolas > > > >> > > > >> On 10/4/19 2:01 PM, C W wrote: > > > >>> Yes, you are right. it was 0.5 and 0.5 for split, not 1.5. So, > typo on my part. > > > >>> > > > >>> Looks like I did one-hot-encoding correctly. My new variable names > are: car_Audi, car_BMW, etc. > > > >>> > > > >>> But, decision tree is still mistaking one-hot-encoding as > numerical input and split at 0.5. This is not right. Perhaps, I'm doing > something wrong? > > > >>> > > > >>> Is there a good toy example on the sklearn website? I am only see > this: > https://scikit-learn.org/stable/auto_examples/tree/plot_tree_regression.html > . > > > >>> > > > >>> Thanks! 
> > > >>> > > > >>> > > > >>> > > > >>> On Fri, Oct 4, 2019 at 1:28 PM Sebastian Raschka < > mail at sebastianraschka.com> wrote: > > > >>> Hi, > > > >>> > > > >>>> The funny part is: the tree is taking one-hot-encoding (BMW=0, > Toyota=1, Audi=2) as numerical values, not category.The tree splits at 0.5 > and 1.5 > > > >>> > > > >>> that's not a onehot encoding then. > > > >>> > > > >>> For an Audi datapoint, it should be > > > >>> > > > >>> BMW=0 > > > >>> Toyota=0 > > > >>> Audi=1 > > > >>> > > > >>> for BMW > > > >>> > > > >>> BMW=1 > > > >>> Toyota=0 > > > >>> Audi=0 > > > >>> > > > >>> and for Toyota > > > >>> > > > >>> BMW=0 > > > >>> Toyota=1 > > > >>> Audi=0 > > > >>> > > > >>> The split threshold should then be at 0.5 for any of these > features. > > > >>> > > > >>> Based on your email, I think you were assuming that the DT does > the one-hot encoding internally, which it doesn't. In practice, it is hard > to guess what is a nominal and what is a ordinal variable, so you have to > do the onehot encoding before you give the data to the decision tree. > > > >>> > > > >>> Best, > > > >>> Sebastian > > > >>> > > > >>>> On Oct 4, 2019, at 11:48 AM, C W wrote: > > > >>>> > > > >>>> I'm getting some funny results. I am doing a regression decision > tree, the response variables are assigned to levels. > > > >>>> > > > >>>> The funny part is: the tree is taking one-hot-encoding (BMW=0, > Toyota=1, Audi=2) as numerical values, not category. > > > >>>> > > > >>>> The tree splits at 0.5 and 1.5. Am I doing one-hot-encoding > wrong? How does the sklearn know internally 0 vs. 1 is categorical, not > numerical? > > > >>>> > > > >>>> In R for instance, you do as.factor(), which explicitly states > the data type. > > > >>>> > > > >>>> Thank you! 
> > > >>>> > > > >>>> > > > >>>> On Wed, Sep 18, 2019 at 11:13 AM Andreas Mueller < > t3kcit at gmail.com> wrote: > > > >>>> > > > >>>> > > > >>>> On 9/15/19 8:16 AM, Guillaume Lema?tre wrote: > > > >>>>> > > > >>>>> > > > >>>>> On Sat, 14 Sep 2019 at 20:59, C W wrote: > > > >>>>> Thanks, Guillaume. > > > >>>>> Column transformer looks pretty neat. I've also heard though, > this pipeline can be tedious to set up? Specifying what you want for every > feature is a pain. > > > >>>>> > > > >>>>> It would be interesting for us which part of the pipeline is > tedious to set up to know if we can improve something there. > > > >>>>> Do you mean, that you would like to automatically detect of > which type of feature (categorical/numerical) and apply a > > > >>>>> default encoder/scaling such as discuss there: > https://github.com/scikit-learn/scikit-learn/issues/10603#issuecomment-401155127 > > > >>>>> > > > >>>>> IMO, one a user perspective, it would be cleaner in some cases > at the cost of applying blindly a black box > > > >>>>> which might be dangerous. > > > >>>> Also see > https://amueller.github.io/dabl/dev/generated/dabl.EasyPreprocessor.html#dabl.EasyPreprocessor > > > >>>> Which basically does that. > > > >>>> > > > >>>> > > > >>>>> > > > >>>>> > > > >>>>> Jaiver, > > > >>>>> Actually, you guessed right. My real data has only one numerical > variable, looks more like this: > > > >>>>> > > > >>>>> Gender Date Income Car Attendance > > > >>>>> Male 2019/3/01 10000 BMW Yes > > > >>>>> Female 2019/5/02 9000 Toyota No > > > >>>>> Male 2019/7/15 12000 Audi Yes > > > >>>>> > > > >>>>> I am predicting income using all other categorical variables. > Maybe it is catboost! 
> > > >>>>> > > > >>>>> Thanks, > > > >>>>> > > > >>>>> M > > > >>>>> > > > >>>>> > > > >>>>> > > > >>>>> > > > >>>>> > > > >>>>> > > > >>>>> On Sat, Sep 14, 2019 at 9:25 AM Javier L?pez > wrote: > > > >>>>> If you have datasets with many categorical features, and perhaps > many categories, the tools in sklearn are quite limited, > > > >>>>> but there are alternative implementations of boosted trees that > are designed with categorical features in mind. Take a look > > > >>>>> at catboost [1], which has an sklearn-compatible API. > > > >>>>> > > > >>>>> J > > > >>>>> > > > >>>>> [1] https://catboost.ai/ > > > >>>>> > > > >>>>> On Sat, Sep 14, 2019 at 3:40 AM C W wrote: > > > >>>>> Hello all, > > > >>>>> I'm very confused. Can the decision tree module handle both > continuous and categorical features in the dataset? In this case, it's just > CART (Classification and Regression Trees). > > > >>>>> > > > >>>>> For example, > > > >>>>> Gender Age Income Car Attendance > > > >>>>> Male 30 10000 BMW Yes > > > >>>>> Female 35 9000 Toyota No > > > >>>>> Male 50 12000 Audi Yes > > > >>>>> > > > >>>>> According to the documentation > https://scikit-learn.org/stable/modules/tree.html#tree-algorithms-id3-c4-5-c5-0-and-cart, > it can not! > > > >>>>> > > > >>>>> It says: "scikit-learn implementation does not support > categorical variables for now". > > > >>>>> > > > >>>>> Is this true? If not, can someone point me to an example? If > yes, what do people do? > > > >>>>> > > > >>>>> Thank you very much! 
> > > >>>>> _______________________________________________ > > > >>>>> scikit-learn mailing list > > > >>>>> scikit-learn at python.org > > > >>>>> https://mail.python.org/mailman/listinfo/scikit-learn > > > >>>>> -- > > > >>>>> Guillaume Lemaitre > > > >>>>> INRIA Saclay - Parietal team > > > >>>>> Center for Data Science Paris-Saclay > > > >>>>> https://glemaitre.github.io/ > > ------------------------------ > > Subject: Digest Footer > > _______________________________________________ > scikit-learn mailing list > scikit-learn at python.org > https://mail.python.org/mailman/listinfo/scikit-learn > > ------------------------------ > > End of scikit-learn Digest, Vol 43, Issue 8 > ******************************************* > -------------- next part -------------- An HTML attachment was scrubbed... URL: From tmrsg11 at gmail.com Sat Oct 5 14:50:09 2019 From: tmrsg11 at gmail.com (C W) Date: Sat, 5 Oct 2019 14:50:09 -0400 Subject: [scikit-learn] Can Scikit-learn decision tree (CART) have both continuous and categorical features? 
In-Reply-To: <4FC33890-94D3-4AA8-8FA9-EF1FADFD4C20@sebastianraschka.com> References: <5e9661ff-dfb2-cc2e-b71f-ba18024374a1@gmail.com> <7E3EE86D-4B8A-438A-B03A-8DFC8E1D8AB4@sebastianraschka.com> <7A0589D1-D990-4FD6-9D11-AA804E34F3BC@sebastianraschka.com> <4FC33890-94D3-4AA8-8FA9-EF1FADFD4C20@sebastianraschka.com> Message-ID: Thanks, great material! I got pydotplus with graphviz to work. Using the code on sklean website [1], tree.plot_tree(clf.fit(iris.data, iris.target)) gives an error: AttributeError: module 'sklearn.tree' has no attribute 'plot_tree' Both my colleague and I got the same error message. Per this post https://github.com/Microsoft/LightGBM/issues/1844, a PyPI update is needed. [1] sklearn link: https://scikit-learn.org/stable/modules/tree.html#classification On Fri, Oct 4, 2019 at 11:52 PM Sebastian Raschka wrote: > The docs show a way such that you don't need to write it as png file using > tree.plot_tree: > https://scikit-learn.org/stable/modules/tree.html#classification > > I don't remember why, but I think I had problems with that in the past (I > think it didn't look so nice visually, but don't remember), which is why I > still stick to graphviz. For my use cases, it's not much hassle -- it used > to be a bit of a hassle to get GraphViz working, but now you can do > > conda install pydotplus > conda install graphviz > > Coincidentally, I just made an example for a lecture I was teaching on > Tue: > https://github.com/rasbt/stat479-machine-learning-fs19/blob/master/06_trees/code/06-trees_demo.ipynb > > Best, > Sebastian > > > > On Oct 4, 2019, at 10:09 PM, C W wrote: > > > > On a separate note, what do you use for plotting? > > > > I found graphviz, but you have to first save it as a png on your > computer. That's a lot work for just one plot. Is there something like a > matplotlib? > > > > Thanks! 
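[Editor's note on the AttributeError reported above: `tree.plot_tree` was added in scikit-learn 0.21, so the error usually means an older install and upgrading resolves it. A minimal sketch, assuming scikit-learn >= 0.21 and matplotlib are available:]

```python
import matplotlib
matplotlib.use("Agg")  # headless backend so no display is needed
import matplotlib.pyplot as plt
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier, plot_tree  # plot_tree: sklearn >= 0.21

iris = load_iris()
clf = DecisionTreeClassifier(random_state=0).fit(iris.data, iris.target)

# Draw directly with matplotlib -- no graphviz/pydotplus/png round-trip needed.
fig, ax = plt.subplots(figsize=(10, 6))
plot_tree(clf, feature_names=iris.feature_names, filled=True, ax=ax)
fig.savefig("iris_tree.png")
```

[This sidesteps the graphviz install entirely; on older versions, `pip install -U scikit-learn` first.]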
> > > > On Fri, Oct 4, 2019 at 9:42 PM Sebastian Raschka < > mail at sebastianraschka.com> wrote: > > Yeah, think of it more as a computational workaround for achieving the > same thing more efficiently (although it looks inelegant/weird)-- something > like that wouldn't be mentioned in textbooks. > > > > Best, > > Sebastian > > > > > On Oct 4, 2019, at 6:33 PM, C W wrote: > > > > > > Thanks Sebastian, I think I get it. > > > > > > It's just have never seen it this way. Quite different from what I'm > used in Elements of Statistical Learning. > > > > > > On Fri, Oct 4, 2019 at 7:13 PM Sebastian Raschka < > mail at sebastianraschka.com> wrote: > > > Not sure if there's a website for that. In any case, to explain this > differently, as discussed earlier sklearn assumes continuous features for > decision trees. So, it will use a binary threshold for splitting along a > feature attribute. In other words, it cannot do sth like > > > > > > if x == 1 then right child node > > > else left child node > > > > > > Instead, what it does is > > > > > > if x >= 0.5 then right child node > > > else left child node > > > > > > These are basically equivalent as you can see when you just plug in > values 0 and 1 for x. > > > > > > Best, > > > Sebastian > > > > > > > On Oct 4, 2019, at 5:34 PM, C W wrote: > > > > > > > > I don't understand your answer. > > > > > > > > Why after one-hot-encoding it still outputs greater than 0.5 or less > than? Does sklearn website have a working example on categorical input? > > > > > > > > Thanks! > > > > > > > > On Fri, Oct 4, 2019 at 3:48 PM Sebastian Raschka < > mail at sebastianraschka.com> wrote: > > > > Like Nicolas said, the 0.5 is just a workaround but will do the > right thing on the one-hot encoded variables, here. You will find that the > threshold is always at 0.5 for these variables. 
I.e., what it will do is to > use the following conversion: > > > > > > > > treat as car_Audi=1 if car_Audi >= 0.5 > > > > treat as car_Audi=0 if car_Audi < 0.5 > > > > > > > > or, it may be > > > > > > > > treat as car_Audi=1 if car_Audi > 0.5 > > > > treat as car_Audi=0 if car_Audi <= 0.5 > > > > > > > > (Forgot which one sklearn is using, but either way. it will be fine.) > > > > > > > > Best, > > > > Sebastian > > > > > > > > > > > >> On Oct 4, 2019, at 1:44 PM, Nicolas Hug wrote: > > > >> > > > >> > > > >>> But, decision tree is still mistaking one-hot-encoding as > numerical input and split at 0.5. This is not right. Perhaps, I'm doing > something wrong? > > > >> > > > >> You're not doing anything wrong, and neither is the tree. Trees > don't support categorical variables in sklearn, so everything is treated as > numerical. > > > >> > > > >> This is why we do one-hot-encoding: so that a set of numerical (one > hot encoded) features can be treated as if they were just one categorical > feature. > > > >> > > > >> > > > >> > > > >> Nicolas > > > >> > > > >> On 10/4/19 2:01 PM, C W wrote: > > > >>> Yes, you are right. it was 0.5 and 0.5 for split, not 1.5. So, > typo on my part. > > > >>> > > > >>> Looks like I did one-hot-encoding correctly. My new variable names > are: car_Audi, car_BMW, etc. > > > >>> > > > >>> But, decision tree is still mistaking one-hot-encoding as > numerical input and split at 0.5. This is not right. Perhaps, I'm doing > something wrong? > > > >>> > > > >>> Is there a good toy example on the sklearn website? I am only see > this: > https://scikit-learn.org/stable/auto_examples/tree/plot_tree_regression.html > . > > > >>> > > > >>> Thanks! 
> > > >>> > > > >>> > > > >>> > > > >>> On Fri, Oct 4, 2019 at 1:28 PM Sebastian Raschka < > mail at sebastianraschka.com> wrote: > > > >>> Hi, > > > >>> > > > >>>> The funny part is: the tree is taking one-hot-encoding (BMW=0, > Toyota=1, Audi=2) as numerical values, not category.The tree splits at 0.5 > and 1.5 > > > >>> > > > >>> that's not a onehot encoding then. > > > >>> > > > >>> For an Audi datapoint, it should be > > > >>> > > > >>> BMW=0 > > > >>> Toyota=0 > > > >>> Audi=1 > > > >>> > > > >>> for BMW > > > >>> > > > >>> BMW=1 > > > >>> Toyota=0 > > > >>> Audi=0 > > > >>> > > > >>> and for Toyota > > > >>> > > > >>> BMW=0 > > > >>> Toyota=1 > > > >>> Audi=0 > > > >>> > > > >>> The split threshold should then be at 0.5 for any of these > features. > > > >>> > > > >>> Based on your email, I think you were assuming that the DT does > the one-hot encoding internally, which it doesn't. In practice, it is hard > to guess what is a nominal and what is a ordinal variable, so you have to > do the onehot encoding before you give the data to the decision tree. > > > >>> > > > >>> Best, > > > >>> Sebastian > > > >>> > > > >>>> On Oct 4, 2019, at 11:48 AM, C W wrote: > > > >>>> > > > >>>> I'm getting some funny results. I am doing a regression decision > tree, the response variables are assigned to levels. > > > >>>> > > > >>>> The funny part is: the tree is taking one-hot-encoding (BMW=0, > Toyota=1, Audi=2) as numerical values, not category. > > > >>>> > > > >>>> The tree splits at 0.5 and 1.5. Am I doing one-hot-encoding > wrong? How does the sklearn know internally 0 vs. 1 is categorical, not > numerical? > > > >>>> > > > >>>> In R for instance, you do as.factor(), which explicitly states > the data type. > > > >>>> > > > >>>> Thank you! 
> > > >>>> > > > >>>> > > > >>>> On Wed, Sep 18, 2019 at 11:13 AM Andreas Mueller < > t3kcit at gmail.com> wrote: > > > >>>> > > > >>>> > > > >>>> On 9/15/19 8:16 AM, Guillaume Lema?tre wrote: > > > >>>>> > > > >>>>> > > > >>>>> On Sat, 14 Sep 2019 at 20:59, C W wrote: > > > >>>>> Thanks, Guillaume. > > > >>>>> Column transformer looks pretty neat. I've also heard though, > this pipeline can be tedious to set up? Specifying what you want for every > feature is a pain. > > > >>>>> > > > >>>>> It would be interesting for us which part of the pipeline is > tedious to set up to know if we can improve something there. > > > >>>>> Do you mean, that you would like to automatically detect of > which type of feature (categorical/numerical) and apply a > > > >>>>> default encoder/scaling such as discuss there: > https://github.com/scikit-learn/scikit-learn/issues/10603#issuecomment-401155127 > > > >>>>> > > > >>>>> IMO, one a user perspective, it would be cleaner in some cases > at the cost of applying blindly a black box > > > >>>>> which might be dangerous. > > > >>>> Also see > https://amueller.github.io/dabl/dev/generated/dabl.EasyPreprocessor.html#dabl.EasyPreprocessor > > > >>>> Which basically does that. > > > >>>> > > > >>>> > > > >>>>> > > > >>>>> > > > >>>>> Jaiver, > > > >>>>> Actually, you guessed right. My real data has only one numerical > variable, looks more like this: > > > >>>>> > > > >>>>> Gender Date Income Car Attendance > > > >>>>> Male 2019/3/01 10000 BMW Yes > > > >>>>> Female 2019/5/02 9000 Toyota No > > > >>>>> Male 2019/7/15 12000 Audi Yes > > > >>>>> > > > >>>>> I am predicting income using all other categorical variables. > Maybe it is catboost! 
> > > >>>>> > > > >>>>> Thanks, > > > >>>>> > > > >>>>> M > > > >>>>> > > > >>>>> > > > >>>>> > > > >>>>> > > > >>>>> > > > >>>>> > > > >>>>> On Sat, Sep 14, 2019 at 9:25 AM Javier L?pez > wrote: > > > >>>>> If you have datasets with many categorical features, and perhaps > many categories, the tools in sklearn are quite limited, > > > >>>>> but there are alternative implementations of boosted trees that > are designed with categorical features in mind. Take a look > > > >>>>> at catboost [1], which has an sklearn-compatible API. > > > >>>>> > > > >>>>> J > > > >>>>> > > > >>>>> [1] https://catboost.ai/ > > > >>>>> > > > >>>>> On Sat, Sep 14, 2019 at 3:40 AM C W wrote: > > > >>>>> Hello all, > > > >>>>> I'm very confused. Can the decision tree module handle both > continuous and categorical features in the dataset? In this case, it's just > CART (Classification and Regression Trees). > > > >>>>> > > > >>>>> For example, > > > >>>>> Gender Age Income Car Attendance > > > >>>>> Male 30 10000 BMW Yes > > > >>>>> Female 35 9000 Toyota No > > > >>>>> Male 50 12000 Audi Yes > > > >>>>> > > > >>>>> According to the documentation > https://scikit-learn.org/stable/modules/tree.html#tree-algorithms-id3-c4-5-c5-0-and-cart, > it can not! > > > >>>>> > > > >>>>> It says: "scikit-learn implementation does not support > categorical variables for now". > > > >>>>> > > > >>>>> Is this true? If not, can someone point me to an example? If > yes, what do people do? > > > >>>>> > > > >>>>> Thank you very much! 
> > > >>>>> _______________________________________________ > > > >>>>> scikit-learn mailing list > > > >>>>> scikit-learn at python.org > > > >>>>> https://mail.python.org/mailman/listinfo/scikit-learn > > > >>>>> -- > > > >>>>> Guillaume Lemaitre > > > >>>>> INRIA Saclay - Parietal team > > > >>>>> Center for Data Science Paris-Saclay > > > >>>>> https://glemaitre.github.io/ > -------------- next part -------------- An HTML attachment was scrubbed... URL: From javaeurusd at gmail.com Sat Oct 5 14:55:33 2019 From: javaeurusd at gmail.com (Mike Smith) Date: Sat, 5 Oct 2019 11:55:33 -0700 Subject: [scikit-learn] scikit-learn Digest, Vol 43, Issue 10 In-Reply-To: References: Message-ID: 1. Re: Can Scikit-learn decision tree (CART) have both continuous and categorical features? (C W) What I'd ask in reply to this is if regression and classification module results can be entered into an input for one resultant output. 
On Sat, Oct 5, 2019, 11:50 AM , wrote: > Send scikit-learn mailing list submissions to > scikit-learn at python.org > > Today's Topics: > > 1. Re: Can Scikit-learn decision tree (CART) have both > continuous and categorical features? (C W) > > ---------------------------------------------------------------------- > > Message: 1 > Date: Sat, 5 Oct 2019 14:50:09 -0400 > From: C W > To: Scikit-learn mailing list > Subject: Re: [scikit-learn] Can Scikit-learn decision tree (CART) have > both continuous and categorical features? > Message-ID: > < > CAE2FW2nHDJGNky2VWk-U8fU3gqwBqWEgidzTAWnUq+NzAK68VA at mail.gmail.com> > Content-Type: text/plain; charset="utf-8" > > Thanks, great material! I got pydotplus with graphviz to work. > > Using the code on sklean website [1], tree.plot_tree(clf.fit(iris.data, > iris.target)) gives an error: > AttributeError: module 'sklearn.tree' has no attribute 'plot_tree' > > Both my colleague and I got the same error message. Per this post > https://github.com/Microsoft/LightGBM/issues/1844, a PyPI update is > needed. 
> > [1] sklearn link: > https://scikit-learn.org/stable/modules/tree.html#classification > > > On Fri, Oct 4, 2019 at 11:52 PM Sebastian Raschka < > mail at sebastianraschka.com> > wrote: > > > The docs show a way such that you don't need to write it as png file > using > > tree.plot_tree: > > https://scikit-learn.org/stable/modules/tree.html#classification > > > > I don't remember why, but I think I had problems with that in the past (I > > think it didn't look so nice visually, but don't remember), which is why > I > > still stick to graphviz. For my use cases, it's not much hassle -- it > used > > to be a bit of a hassle to get GraphViz working, but now you can do > > > > conda install pydotplus > > conda install graphviz > > > > Coincidentally, I just made an example for a lecture I was teaching on > > Tue: > > > https://github.com/rasbt/stat479-machine-learning-fs19/blob/master/06_trees/code/06-trees_demo.ipynb > > > > Best, > > Sebastian > > > > > > > On Oct 4, 2019, at 10:09 PM, C W wrote: > > > > > > On a separate note, what do you use for plotting? > > > > > > I found graphviz, but you have to first save it as a png on your > > computer. That's a lot work for just one plot. Is there something like a > > matplotlib? > > > > > > Thanks! > > > > > > On Fri, Oct 4, 2019 at 9:42 PM Sebastian Raschka < > > mail at sebastianraschka.com> wrote: > > > Yeah, think of it more as a computational workaround for achieving the > > same thing more efficiently (although it looks inelegant/weird)-- > something > > like that wouldn't be mentioned in textbooks. > > > > > > Best, > > > Sebastian > > > > > > > On Oct 4, 2019, at 6:33 PM, C W wrote: > > > > > > > > Thanks Sebastian, I think I get it. > > > > > > > > It's just have never seen it this way. Quite different from what I'm > > used in Elements of Statistical Learning. 
> ------------------------------
>
> Subject: Digest Footer
>
> _______________________________________________
> scikit-learn mailing list
> scikit-learn at python.org
> https://mail.python.org/mailman/listinfo/scikit-learn
>
> ------------------------------
>
> End of scikit-learn Digest, Vol 43, Issue 10
> ********************************************

From javaeurusd at gmail.com  Sun Oct  6 04:55:28 2019
From: javaeurusd at gmail.com (Mike Smith)
Date: Sun, 6 Oct 2019 01:55:28 -0700
Subject: [scikit-learn] scikit-learn Digest, Vol 43, Issue 11
In-Reply-To:
References:
Message-ID:

Can I call an MS Excel cell range in a function such as model.predict(),
instead of typing the data in for each element?

On Sat, Oct 5, 2019 at 11:58 AM wrote:

> Send scikit-learn mailing list submissions to
>         scikit-learn at python.org
>
> To subscribe or unsubscribe via the World Wide Web, visit
>         https://mail.python.org/mailman/listinfo/scikit-learn
> or, via email, send a message with subject or body 'help' to
>         scikit-learn-request at python.org
>
> You can reach the person managing the list at
>         scikit-learn-owner at python.org
>
> When replying, please edit your Subject line so it is more specific
> than "Re: Contents of scikit-learn digest..."
>
> Today's Topics:
>
>    1. Re: scikit-learn Digest, Vol 43, Issue 10 (Mike Smith)
>
> ----------------------------------------------------------------------
>
> Message: 1
> Date: Sat, 5 Oct 2019 11:55:33 -0700
> From: Mike Smith
> To: scikit-learn at python.org
> Subject: Re: [scikit-learn] scikit-learn Digest, Vol 43, Issue 10
> Content-Type: text/plain; charset="utf-8"
>
> 1. Re: Can Scikit-learn decision tree (CART) have both
>    continuous and categorical features? (C W)
>
> What I'd ask in reply to this is if regression and classification module
> results can be entered into an input for one resultant output.
>
> On Sat, Oct 5, 2019, 11:50 AM wrote:
>
> > Today's Topics:
> >
> >    1. Re: Can Scikit-learn decision tree (CART) have both
> >       continuous and categorical features? (C W)
> >
> > ----------------------------------------------------------------------
> >
> > Message: 1
> > Date: Sat, 5 Oct 2019 14:50:09 -0400
> > From: C W
> > To: Scikit-learn mailing list
> > Subject: Re: [scikit-learn] Can Scikit-learn decision tree (CART) have
> >         both continuous and categorical features?
> > Message-ID:
> >         <CAE2FW2nHDJGNky2VWk-U8fU3gqwBqWEgidzTAWnUq+NzAK68VA at mail.gmail.com>
> > Content-Type: text/plain; charset="utf-8"
> >
> > Thanks, great material! I got pydotplus with graphviz to work.
> >
> > Using the code on the sklearn website [1], tree.plot_tree(clf.fit(iris.data,
> > iris.target)) gives an error:
> > AttributeError: module 'sklearn.tree' has no attribute 'plot_tree'
> >
> > Both my colleague and I got the same error message. Per this post
> > https://github.com/Microsoft/LightGBM/issues/1844, a PyPI update is
> > needed.
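[For context: `tree.plot_tree` was added in scikit-learn 0.21, so the AttributeError above usually just means an older version is installed; upgrading scikit-learn resolves it. A minimal sketch of the matplotlib-based route, guarding for older versions and for a missing matplotlib (the `iris_tree.png` filename is only illustrative):]

```python
import sklearn
from sklearn import tree
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier

iris = load_iris()
clf = DecisionTreeClassifier(max_depth=2, random_state=0).fit(iris.data, iris.target)

# tree.plot_tree exists only in scikit-learn >= 0.21; older versions raise
# AttributeError: module 'sklearn.tree' has no attribute 'plot_tree'
if hasattr(tree, "plot_tree"):
    try:
        import matplotlib
        matplotlib.use("Agg")  # headless backend, so this runs without a display
        import matplotlib.pyplot as plt

        tree.plot_tree(clf, feature_names=iris.feature_names, filled=True)
        plt.savefig("iris_tree.png")  # rendered by matplotlib; no graphviz needed
    except ImportError:
        print("matplotlib is required for tree.plot_tree")
else:
    print("scikit-learn", sklearn.__version__, "predates tree.plot_tree; upgrade")
```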
> >
> > [1] sklearn link:
> > https://scikit-learn.org/stable/modules/tree.html#classification
> ------------------------------
>
> End of scikit-learn Digest, Vol 43, Issue 11
> ********************************************

From t3kcit at gmail.com  Sun Oct  6 10:10:31 2019
From: t3kcit at gmail.com (Andreas Mueller)
Date: Sun, 6 Oct 2019 16:10:31 +0200
Subject: [scikit-learn] Can Scikit-learn decision tree (CART) have both
	continuous and categorical features?
In-Reply-To: <4FC33890-94D3-4AA8-8FA9-EF1FADFD4C20@sebastianraschka.com>
References: <5e9661ff-dfb2-cc2e-b71f-ba18024374a1@gmail.com>
	<7E3EE86D-4B8A-438A-B03A-8DFC8E1D8AB4@sebastianraschka.com>
	<7A0589D1-D990-4FD6-9D11-AA804E34F3BC@sebastianraschka.com>
	<4FC33890-94D3-4AA8-8FA9-EF1FADFD4C20@sebastianraschka.com>
Message-ID: <3d6e9116-43bf-77d3-dfeb-ec6c91041748@gmail.com>

On 10/4/19 11:28 PM, Sebastian Raschka wrote:
> The docs show a way such that you don't need to write it as a png file
> using tree.plot_tree:
> https://scikit-learn.org/stable/modules/tree.html#classification
>
> I don't remember why, but I think I had problems with that in the past (I
> think it didn't look so nice visually, but don't remember), which is why I
> still stick to graphviz.

Can you give me examples that don't look as nice? I would love to improve it.

> For my use cases, it's not much hassle -- it used to be a bit of a hassle
> to get GraphViz working, but now you can do
>
> conda install pydotplus
> conda install graphviz
>
> Coincidentally, I just made an example for a lecture I was teaching on Tue:
> https://github.com/rasbt/stat479-machine-learning-fs19/blob/master/06_trees/code/06-trees_demo.ipynb
>
> Best,
> Sebastian
>
>> On Oct 4, 2019, at 10:09 PM, C W wrote:
>>
>> On a separate note, what do you use for plotting?
>>
>> I found graphviz, but you have to first save it as a png on your
>> computer. That's a lot of work for just one plot. Is there something
>> like matplotlib?
>>
>> Thanks!
>>
>> On Fri, Oct 4, 2019 at 9:42 PM Sebastian Raschka wrote:
>> Yeah, think of it more as a computational workaround for achieving the
>> same thing more efficiently (although it looks inelegant/weird) --
>> something like that wouldn't be mentioned in textbooks.
>>
>> Best,
>> Sebastian
>>
>>> On Oct 4, 2019, at 6:33 PM, C W wrote:
>>>
>>> Thanks Sebastian, I think I get it.
>>>
>>> It's just that I have never seen it this way. Quite different from what
>>> I'm used to in Elements of Statistical Learning.
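[The graphviz route Sebastian describes can be sketched as follows. Only `export_graphviz` is part of scikit-learn; rendering the DOT text to a PNG needs the optional `pydotplus` package plus the graphviz binaries (the `iris_tree.png` filename is just an example):]

```python
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier, export_graphviz

iris = load_iris()
clf = DecisionTreeClassifier(max_depth=2, random_state=0).fit(iris.data, iris.target)

# export_graphviz produces DOT source text; rendering it to an image is a
# separate step that requires graphviz (via pydotplus here).
dot_data = export_graphviz(clf, out_file=None,
                           feature_names=iris.feature_names,
                           class_names=list(iris.target_names),
                           filled=True, rounded=True)

try:
    import pydotplus  # optional; installable via the conda commands above
    pydotplus.graph_from_dot_data(dot_data).write_png("iris_tree.png")
except ImportError:
    # Without pydotplus/graphviz you still have the DOT text itself.
    print(dot_data.splitlines()[0])
```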
>>>
>>> On Fri, Oct 4, 2019 at 7:13 PM Sebastian Raschka wrote:
>>> Not sure if there's a website for that. In any case, to explain this
>>> differently: as discussed earlier, sklearn assumes continuous features
>>> for decision trees, so it will use a binary threshold for splitting
>>> along a feature attribute. In other words, it cannot do something like
>>>
>>> if x == 1 then right child node
>>> else left child node
>>>
>>> Instead, what it does is
>>>
>>> if x >= 0.5 then right child node
>>> else left child node
>>>
>>> These are basically equivalent, as you can see when you just plug in
>>> values 0 and 1 for x.
>>>
>>> Best,
>>> Sebastian
>>>
>>>> On Oct 4, 2019, at 5:34 PM, C W wrote:
>>>>
>>>> I don't understand your answer.
>>>>
>>>> Why, after one-hot-encoding, does it still split at greater or less
>>>> than 0.5? Does the sklearn website have a working example on
>>>> categorical input?
>>>>
>>>> Thanks!
>>>>
>>>> On Fri, Oct 4, 2019 at 3:48 PM Sebastian Raschka wrote:
>>>> Like Nicolas said, the 0.5 is just a workaround, but it will do the
>>>> right thing on the one-hot encoded variables here. You will find that
>>>> the threshold is always at 0.5 for these variables. I.e., what it will
>>>> do is use the following conversion:
>>>>
>>>> treat as car_Audi=1 if car_Audi >= 0.5
>>>> treat as car_Audi=0 if car_Audi < 0.5
>>>>
>>>> or, it may be
>>>>
>>>> treat as car_Audi=1 if car_Audi > 0.5
>>>> treat as car_Audi=0 if car_Audi <= 0.5
>>>>
>>>> (I forget which one sklearn is using, but either way it will be fine.)
>>>>
>>>> Best,
>>>> Sebastian
>>>>
>>>>> On Oct 4, 2019, at 1:44 PM, Nicolas Hug wrote:
>>>>>
>>>>>> But, decision tree is still mistaking one-hot-encoding as numerical
>>>>>> input and splitting at 0.5. This is not right. Perhaps I'm doing
>>>>>> something wrong?
>>>>>
>>>>> You're not doing anything wrong, and neither is the tree. Trees don't
>>>>> support categorical variables in sklearn, so everything is treated as
>>>>> numerical.
>>>>> >>>>> This is why we do one-hot-encoding: so that a set of numerical (one hot encoded) features can be treated as if they were just one categorical feature. >>>>> >>>>> >>>>> >>>>> Nicolas >>>>> >>>>> On 10/4/19 2:01 PM, C W wrote: >>>>>> Yes, you are right. it was 0.5 and 0.5 for split, not 1.5. So, typo on my part. >>>>>> >>>>>> Looks like I did one-hot-encoding correctly. My new variable names are: car_Audi, car_BMW, etc. >>>>>> >>>>>> But, decision tree is still mistaking one-hot-encoding as numerical input and split at 0.5. This is not right. Perhaps, I'm doing something wrong? >>>>>> >>>>>> Is there a good toy example on the sklearn website? I am only see this: https://scikit-learn.org/stable/auto_examples/tree/plot_tree_regression.html. >>>>>> >>>>>> Thanks! >>>>>> >>>>>> >>>>>> >>>>>> On Fri, Oct 4, 2019 at 1:28 PM Sebastian Raschka wrote: >>>>>> Hi, >>>>>> >>>>>>> The funny part is: the tree is taking one-hot-encoding (BMW=0, Toyota=1, Audi=2) as numerical values, not category.The tree splits at 0.5 and 1.5 >>>>>> that's not a onehot encoding then. >>>>>> >>>>>> For an Audi datapoint, it should be >>>>>> >>>>>> BMW=0 >>>>>> Toyota=0 >>>>>> Audi=1 >>>>>> >>>>>> for BMW >>>>>> >>>>>> BMW=1 >>>>>> Toyota=0 >>>>>> Audi=0 >>>>>> >>>>>> and for Toyota >>>>>> >>>>>> BMW=0 >>>>>> Toyota=1 >>>>>> Audi=0 >>>>>> >>>>>> The split threshold should then be at 0.5 for any of these features. >>>>>> >>>>>> Based on your email, I think you were assuming that the DT does the one-hot encoding internally, which it doesn't. In practice, it is hard to guess what is a nominal and what is a ordinal variable, so you have to do the onehot encoding before you give the data to the decision tree. >>>>>> >>>>>> Best, >>>>>> Sebastian >>>>>> >>>>>>> On Oct 4, 2019, at 11:48 AM, C W wrote: >>>>>>> >>>>>>> I'm getting some funny results. I am doing a regression decision tree, the response variables are assigned to levels. 
>>>>>>> >>>>>>> The funny part is: the tree is taking one-hot-encoding (BMW=0, Toyota=1, Audi=2) as numerical values, not category. >>>>>>> >>>>>>> The tree splits at 0.5 and 1.5. Am I doing one-hot-encoding wrong? How does the sklearn know internally 0 vs. 1 is categorical, not numerical? >>>>>>> >>>>>>> In R for instance, you do as.factor(), which explicitly states the data type. >>>>>>> >>>>>>> Thank you! >>>>>>> >>>>>>> >>>>>>> On Wed, Sep 18, 2019 at 11:13 AM Andreas Mueller wrote: >>>>>>> >>>>>>> >>>>>>> On 9/15/19 8:16 AM, Guillaume Lema?tre wrote: >>>>>>>> >>>>>>>> On Sat, 14 Sep 2019 at 20:59, C W wrote: >>>>>>>> Thanks, Guillaume. >>>>>>>> Column transformer looks pretty neat. I've also heard though, this pipeline can be tedious to set up? Specifying what you want for every feature is a pain. >>>>>>>> >>>>>>>> It would be interesting for us which part of the pipeline is tedious to set up to know if we can improve something there. >>>>>>>> Do you mean, that you would like to automatically detect of which type of feature (categorical/numerical) and apply a >>>>>>>> default encoder/scaling such as discuss there: https://github.com/scikit-learn/scikit-learn/issues/10603#issuecomment-401155127 >>>>>>>> >>>>>>>> IMO, one a user perspective, it would be cleaner in some cases at the cost of applying blindly a black box >>>>>>>> which might be dangerous. >>>>>>> Also see https://amueller.github.io/dabl/dev/generated/dabl.EasyPreprocessor.html#dabl.EasyPreprocessor >>>>>>> Which basically does that. >>>>>>> >>>>>>> >>>>>>>> >>>>>>>> >>>>>>>> Jaiver, >>>>>>>> Actually, you guessed right. My real data has only one numerical variable, looks more like this: >>>>>>>> >>>>>>>> Gender Date Income Car Attendance >>>>>>>> Male 2019/3/01 10000 BMW Yes >>>>>>>> Female 2019/5/02 9000 Toyota No >>>>>>>> Male 2019/7/15 12000 Audi Yes >>>>>>>> >>>>>>>> I am predicting income using all other categorical variables. Maybe it is catboost! 
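[Editor's note: the ColumnTransformer setup discussed above can be sketched roughly as follows. The column names come from the toy table in the thread; everything else (dropping the Date column, the passthrough remainder) is an illustrative choice, not the poster's actual code.]

```python
# Hedged sketch of a ColumnTransformer pipeline for the thread's toy data:
# one-hot encode the categorical columns, then fit a regression tree on Income.
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import OneHotEncoder
from sklearn.tree import DecisionTreeRegressor

df = pd.DataFrame({
    "Gender": ["Male", "Female", "Male"],
    "Car": ["BMW", "Toyota", "Audi"],
    "Attendance": ["Yes", "No", "Yes"],
    "Income": [10000, 9000, 12000],
})
X, y = df.drop(columns="Income"), df["Income"]

preprocess = ColumnTransformer(
    [("onehot", OneHotEncoder(handle_unknown="ignore"),
      ["Gender", "Car", "Attendance"])],
    remainder="passthrough",  # any remaining columns pass through unchanged
)
model = make_pipeline(preprocess, DecisionTreeRegressor(random_state=0))
model.fit(X, y)
print(model.predict(X))
```

Listing the categorical columns explicitly is the "tedious" part being discussed; the upside is that nothing is guessed about which columns are categorical.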
>>>>>>>> >>>>>>>> Thanks, >>>>>>>> >>>>>>>> M >>>>>>>> >>>>>>>> >>>>>>>> >>>>>>>> >>>>>>>> >>>>>>>> On Sat, Sep 14, 2019 at 9:25 AM Javier López wrote: >>>>>>>> If you have datasets with many categorical features, and perhaps many categories, the tools in sklearn are quite limited, >>>>>>>> but there are alternative implementations of boosted trees that are designed with categorical features in mind. Take a look >>>>>>>> at catboost [1], which has an sklearn-compatible API. >>>>>>>> >>>>>>>> J >>>>>>>> >>>>>>>> [1] https://catboost.ai/ >>>>>>>> >>>>>>>> On Sat, Sep 14, 2019 at 3:40 AM C W wrote: >>>>>>>> Hello all, >>>>>>>> I'm very confused. Can the decision tree module handle both continuous and categorical features in the dataset? In this case, it's just CART (Classification and Regression Trees). >>>>>>>> >>>>>>>> For example, >>>>>>>> Gender Age Income Car Attendance >>>>>>>> Male 30 10000 BMW Yes >>>>>>>> Female 35 9000 Toyota No >>>>>>>> Male 50 12000 Audi Yes >>>>>>>> >>>>>>>> According to the documentation https://scikit-learn.org/stable/modules/tree.html#tree-algorithms-id3-c4-5-c5-0-and-cart, it can not! >>>>>>>> >>>>>>>> It says: "scikit-learn implementation does not support categorical variables for now". >>>>>>>> >>>>>>>> Is this true? If not, can someone point me to an example? If yes, what do people do? >>>>>>>> >>>>>>>> Thank you very much! 
>>>>>>>> _______________________________________________ >>>>>>>> scikit-learn mailing list >>>>>>>> scikit-learn at python.org >>>>>>>> https://mail.python.org/mailman/listinfo/scikit-learn >>>>>>>> >>>>>>>> -- >>>>>>>> Guillaume Lemaitre >>>>>>>> INRIA Saclay - Parietal team >>>>>>>> Center for Data Science Paris-Saclay >>>>>>>> https://glemaitre.github.io/
From mail at sebastianraschka.com Sun Oct 6 10:40:09 2019 From: mail at sebastianraschka.com (Sebastian Raschka) Date: Sun, 6 Oct 2019 09:40:09 -0500 Subject: [scikit-learn] Can Scikit-learn decision tree (CART) have both continuous and categorical features? In-Reply-To: <3d6e9116-43bf-77d3-dfeb-ec6c91041748@gmail.com> References: <5e9661ff-dfb2-cc2e-b71f-ba18024374a1@gmail.com> <7E3EE86D-4B8A-438A-B03A-8DFC8E1D8AB4@sebastianraschka.com> <7A0589D1-D990-4FD6-9D11-AA804E34F3BC@sebastianraschka.com> <4FC33890-94D3-4AA8-8FA9-EF1FADFD4C20@sebastianraschka.com> <3d6e9116-43bf-77d3-dfeb-ec6c91041748@gmail.com> Message-ID: Sure, I just ran an example I made with graphviz via plot_tree, and it looks like there's an issue with overlapping boxes if you use class (and/or feature) names. 
I made a reproducible example here so that you can take a look: https://github.com/rasbt/bugreport/blob/master/scikit-learn/plot_tree/tree-demo-1.ipynb Happy to add this to the sklearn issue list if there's no issue filed for that yet. Best, Sebastian > On Oct 6, 2019, at 9:10 AM, Andreas Mueller wrote: > > > > On 10/4/19 11:28 PM, Sebastian Raschka wrote: >> The docs show a way such that you don't need to write it as png file using tree.plot_tree: >> https://scikit-learn.org/stable/modules/tree.html#classification >> >> I don't remember why, but I think I had problems with that in the past (I think it didn't look so nice visually, but don't remember), which is why I still stick to graphviz. > Can you give me examples that don't look as nice? I would love to improve it. > >> For my use cases, it's not much hassle -- it used to be a bit of a hassle to get GraphViz working, but now you can do >> >> conda install pydotplus >> conda install graphviz >> >> Coincidentally, I just made an example for a lecture I was teaching on Tue: https://github.com/rasbt/stat479-machine-learning-fs19/blob/master/06_trees/code/06-trees_demo.ipynb >> >> Best, >> Sebastian >> >> >>> On Oct 4, 2019, at 10:09 PM, C W wrote: >>> >>> On a separate note, what do you use for plotting? >>> >>> I found graphviz, but you have to first save it as a png on your computer. That's a lot work for just one plot. Is there something like a matplotlib? >>> >>> Thanks! >>> >>> On Fri, Oct 4, 2019 at 9:42 PM Sebastian Raschka wrote: >>> Yeah, think of it more as a computational workaround for achieving the same thing more efficiently (although it looks inelegant/weird)-- something like that wouldn't be mentioned in textbooks. >>> >>> Best, >>> Sebastian >>> >>>> On Oct 4, 2019, at 6:33 PM, C W wrote: >>>> >>>> Thanks Sebastian, I think I get it. >>>> >>>> It's just have never seen it this way. Quite different from what I'm used in Elements of Statistical Learning. 
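[Editor's note: for reference, a hedged sketch of the graphviz route mentioned above, using the iris data as a stand-in for the notebook's example. The final render step is commented out because it needs the graphviz package/binaries, e.g. from conda as described in the thread.]

```python
# Sketch of the graphviz workflow: export_graphviz emits DOT source text;
# turning that into an image is what requires graphviz to be installed.
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier, export_graphviz

iris = load_iris()
clf = DecisionTreeClassifier(max_depth=2, random_state=0).fit(iris.data, iris.target)

dot = export_graphviz(
    clf,
    feature_names=iris.feature_names,
    class_names=list(iris.target_names),
    filled=True,
    rounded=True,
)
# import graphviz; graphviz.Source(dot).render("tree")  # writes tree.pdf
print(dot.splitlines()[0])  # the DOT source starts with a digraph declaration
```

The extra file/render step here is exactly the friction `tree.plot_tree` avoids, since it draws directly onto a matplotlib figure.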
>>>> >>>> On Fri, Oct 4, 2019 at 7:13 PM Sebastian Raschka wrote: >>>> Not sure if there's a website for that. In any case, to explain this differently, as discussed earlier sklearn assumes continuous features for decision trees. So, it will use a binary threshold for splitting along a feature attribute. In other words, it cannot do sth like >>>> >>>> if x == 1 then right child node >>>> else left child node >>>> >>>> Instead, what it does is >>>> >>>> if x >= 0.5 then right child node >>>> else left child node >>>> >>>> These are basically equivalent as you can see when you just plug in values 0 and 1 for x. >>>> >>>> Best, >>>> Sebastian >>>> >>>>> On Oct 4, 2019, at 5:34 PM, C W wrote: >>>>> >>>>> I don't understand your answer. >>>>> >>>>> Why after one-hot-encoding it still outputs greater than 0.5 or less than? Does sklearn website have a working example on categorical input? >>>>> >>>>> Thanks! >>>>> >>>>> On Fri, Oct 4, 2019 at 3:48 PM Sebastian Raschka wrote: >>>>> Like Nicolas said, the 0.5 is just a workaround but will do the right thing on the one-hot encoded variables, here. You will find that the threshold is always at 0.5 for these variables. I.e., what it will do is to use the following conversion: >>>>> >>>>> treat as car_Audi=1 if car_Audi >= 0.5 >>>>> treat as car_Audi=0 if car_Audi < 0.5 >>>>> >>>>> or, it may be >>>>> >>>>> treat as car_Audi=1 if car_Audi > 0.5 >>>>> treat as car_Audi=0 if car_Audi <= 0.5 >>>>> >>>>> (Forgot which one sklearn is using, but either way. it will be fine.) >>>>> >>>>> Best, >>>>> Sebastian >>>>> >>>>> >>>>>> On Oct 4, 2019, at 1:44 PM, Nicolas Hug wrote: >>>>>> >>>>>> >>>>>>> But, decision tree is still mistaking one-hot-encoding as numerical input and split at 0.5. This is not right. Perhaps, I'm doing something wrong? >>>>>> You're not doing anything wrong, and neither is the tree. Trees don't support categorical variables in sklearn, so everything is treated as numerical. 
>>>>>> >>>>>> This is why we do one-hot-encoding: so that a set of numerical (one hot encoded) features can be treated as if they were just one categorical feature. >>>>>> >>>>>> >>>>>> >>>>>> Nicolas >>>>>> >>>>>> On 10/4/19 2:01 PM, C W wrote: >>>>>>> Yes, you are right. it was 0.5 and 0.5 for split, not 1.5. So, typo on my part. >>>>>>> >>>>>>> Looks like I did one-hot-encoding correctly. My new variable names are: car_Audi, car_BMW, etc. >>>>>>> >>>>>>> But, decision tree is still mistaking one-hot-encoding as numerical input and split at 0.5. This is not right. Perhaps, I'm doing something wrong? >>>>>>> >>>>>>> Is there a good toy example on the sklearn website? I am only see this: https://scikit-learn.org/stable/auto_examples/tree/plot_tree_regression.html. >>>>>>> >>>>>>> Thanks! >>>>>>> >>>>>>> >>>>>>> >>>>>>> On Fri, Oct 4, 2019 at 1:28 PM Sebastian Raschka wrote: >>>>>>> Hi, >>>>>>> >>>>>>>> The funny part is: the tree is taking one-hot-encoding (BMW=0, Toyota=1, Audi=2) as numerical values, not category.The tree splits at 0.5 and 1.5 >>>>>>> that's not a onehot encoding then. >>>>>>> >>>>>>> For an Audi datapoint, it should be >>>>>>> >>>>>>> BMW=0 >>>>>>> Toyota=0 >>>>>>> Audi=1 >>>>>>> >>>>>>> for BMW >>>>>>> >>>>>>> BMW=1 >>>>>>> Toyota=0 >>>>>>> Audi=0 >>>>>>> >>>>>>> and for Toyota >>>>>>> >>>>>>> BMW=0 >>>>>>> Toyota=1 >>>>>>> Audi=0 >>>>>>> >>>>>>> The split threshold should then be at 0.5 for any of these features. >>>>>>> >>>>>>> Based on your email, I think you were assuming that the DT does the one-hot encoding internally, which it doesn't. In practice, it is hard to guess what is a nominal and what is a ordinal variable, so you have to do the onehot encoding before you give the data to the decision tree. >>>>>>> >>>>>>> Best, >>>>>>> Sebastian >>>>>>> >>>>>>>> On Oct 4, 2019, at 11:48 AM, C W wrote: >>>>>>>> >>>>>>>> I'm getting some funny results. I am doing a regression decision tree, the response variables are assigned to levels. 
>>>>>>>> >>>>>>>> The funny part is: the tree is taking one-hot-encoding (BMW=0, Toyota=1, Audi=2) as numerical values, not category. >>>>>>>> >>>>>>>> The tree splits at 0.5 and 1.5. Am I doing one-hot-encoding wrong? How does the sklearn know internally 0 vs. 1 is categorical, not numerical? >>>>>>>> >>>>>>>> In R for instance, you do as.factor(), which explicitly states the data type. >>>>>>>> >>>>>>>> Thank you! >>>>>>>> >>>>>>>> >>>>>>>> On Wed, Sep 18, 2019 at 11:13 AM Andreas Mueller wrote: >>>>>>>> >>>>>>>> >>>>>>>> On 9/15/19 8:16 AM, Guillaume Lema?tre wrote: >>>>>>>>> >>>>>>>>> On Sat, 14 Sep 2019 at 20:59, C W wrote: >>>>>>>>> Thanks, Guillaume. >>>>>>>>> Column transformer looks pretty neat. I've also heard though, this pipeline can be tedious to set up? Specifying what you want for every feature is a pain. >>>>>>>>> >>>>>>>>> It would be interesting for us which part of the pipeline is tedious to set up to know if we can improve something there. >>>>>>>>> Do you mean, that you would like to automatically detect of which type of feature (categorical/numerical) and apply a >>>>>>>>> default encoder/scaling such as discuss there: https://github.com/scikit-learn/scikit-learn/issues/10603#issuecomment-401155127 >>>>>>>>> >>>>>>>>> IMO, one a user perspective, it would be cleaner in some cases at the cost of applying blindly a black box >>>>>>>>> which might be dangerous. >>>>>>>> Also see https://amueller.github.io/dabl/dev/generated/dabl.EasyPreprocessor.html#dabl.EasyPreprocessor >>>>>>>> Which basically does that. >>>>>>>> >>>>>>>> >>>>>>>>> >>>>>>>>> Jaiver, >>>>>>>>> Actually, you guessed right. My real data has only one numerical variable, looks more like this: >>>>>>>>> >>>>>>>>> Gender Date Income Car Attendance >>>>>>>>> Male 2019/3/01 10000 BMW Yes >>>>>>>>> Female 2019/5/02 9000 Toyota No >>>>>>>>> Male 2019/7/15 12000 Audi Yes >>>>>>>>> >>>>>>>>> I am predicting income using all other categorical variables. Maybe it is catboost! 
>>>>>>>>> >>>>>>>>> Thanks, >>>>>>>>> >>>>>>>>> M >>>>>>>>> >>>>>>>>> >>>>>>>>> >>>>>>>>> >>>>>>>>> >>>>>>>>> On Sat, Sep 14, 2019 at 9:25 AM Javier López wrote: >>>>>>>>> If you have datasets with many categorical features, and perhaps many categories, the tools in sklearn are quite limited, >>>>>>>>> but there are alternative implementations of boosted trees that are designed with categorical features in mind. Take a look >>>>>>>>> at catboost [1], which has an sklearn-compatible API. >>>>>>>>> >>>>>>>>> J >>>>>>>>> >>>>>>>>> [1] https://catboost.ai/ >>>>>>>>> >>>>>>>>> On Sat, Sep 14, 2019 at 3:40 AM C W wrote: >>>>>>>>> Hello all, >>>>>>>>> I'm very confused. Can the decision tree module handle both continuous and categorical features in the dataset? In this case, it's just CART (Classification and Regression Trees). >>>>>>>>> >>>>>>>>> For example, >>>>>>>>> Gender Age Income Car Attendance >>>>>>>>> Male 30 10000 BMW Yes >>>>>>>>> Female 35 9000 Toyota No >>>>>>>>> Male 50 12000 Audi Yes >>>>>>>>> >>>>>>>>> According to the documentation https://scikit-learn.org/stable/modules/tree.html#tree-algorithms-id3-c4-5-c5-0-and-cart, it can not! >>>>>>>>> >>>>>>>>> It says: "scikit-learn implementation does not support categorical variables for now". >>>>>>>>> >>>>>>>>> Is this true? If not, can someone point me to an example? If yes, what do people do? >>>>>>>>> >>>>>>>>> Thank you very much! 
From t3kcit at gmail.com Sun Oct 6 10:55:49 2019 From: t3kcit at gmail.com (Andreas Mueller) Date: Sun, 6 Oct 2019 16:55:49 +0200 Subject: [scikit-learn] Can Scikit-learn decision tree (CART) have both continuous and categorical features? In-Reply-To: References: <5e9661ff-dfb2-cc2e-b71f-ba18024374a1@gmail.com> <7E3EE86D-4B8A-438A-B03A-8DFC8E1D8AB4@sebastianraschka.com> <7A0589D1-D990-4FD6-9D11-AA804E34F3BC@sebastianraschka.com> <4FC33890-94D3-4AA8-8FA9-EF1FADFD4C20@sebastianraschka.com> <3d6e9116-43bf-77d3-dfeb-ec6c91041748@gmail.com> Message-ID: <9c0d0591-f631-bb94-9018-955f24d189d0@gmail.com> Thanks! 
I'll double check that issue. Generally you have to set the figure size to get good results. We should probably add some code to set the figure size automatically (if we create a figure?). On 10/6/19 10:40 AM, Sebastian Raschka wrote: > Sure, I just ran an example I made with graphviz via plot_tree, and it looks like there's an issue with overlapping boxes if you use class (and/or feature) names. I made a reproducible example here so that you can take a look: > https://github.com/rasbt/bugreport/blob/master/scikit-learn/plot_tree/tree-demo-1.ipynb > > Happy to add this to the sklearn issue list if there's no issue filed for that yet. > > Best, > Sebastian > >> On Oct 6, 2019, at 9:10 AM, Andreas Mueller wrote: >> >> >> >> On 10/4/19 11:28 PM, Sebastian Raschka wrote: >>> The docs show a way such that you don't need to write it as png file using tree.plot_tree: >>> https://scikit-learn.org/stable/modules/tree.html#classification >>> >>> I don't remember why, but I think I had problems with that in the past (I think it didn't look so nice visually, but don't remember), which is why I still stick to graphviz. >> Can you give me examples that don't look as nice? I would love to improve it. >> >>> For my use cases, it's not much hassle -- it used to be a bit of a hassle to get GraphViz working, but now you can do >>> >>> conda install pydotplus >>> conda install graphviz >>> >>> Coincidentally, I just made an example for a lecture I was teaching on Tue: https://github.com/rasbt/stat479-machine-learning-fs19/blob/master/06_trees/code/06-trees_demo.ipynb >>> >>> Best, >>> Sebastian >>> >>> >>>> On Oct 4, 2019, at 10:09 PM, C W wrote: >>>> >>>> On a separate note, what do you use for plotting? >>>> >>>> I found graphviz, but you have to first save it as a png on your computer. That's a lot work for just one plot. Is there something like a matplotlib? >>>> >>>> Thanks! 
>>>> >>>> On Fri, Oct 4, 2019 at 9:42 PM Sebastian Raschka wrote: >>>> Yeah, think of it more as a computational workaround for achieving the same thing more efficiently (although it looks inelegant/weird)-- something like that wouldn't be mentioned in textbooks. >>>> >>>> Best, >>>> Sebastian >>>> >>>>> On Oct 4, 2019, at 6:33 PM, C W wrote: >>>>> >>>>> Thanks Sebastian, I think I get it. >>>>> >>>>> It's just have never seen it this way. Quite different from what I'm used in Elements of Statistical Learning. >>>>> >>>>> On Fri, Oct 4, 2019 at 7:13 PM Sebastian Raschka wrote: >>>>> Not sure if there's a website for that. In any case, to explain this differently, as discussed earlier sklearn assumes continuous features for decision trees. So, it will use a binary threshold for splitting along a feature attribute. In other words, it cannot do sth like >>>>> >>>>> if x == 1 then right child node >>>>> else left child node >>>>> >>>>> Instead, what it does is >>>>> >>>>> if x >= 0.5 then right child node >>>>> else left child node >>>>> >>>>> These are basically equivalent as you can see when you just plug in values 0 and 1 for x. >>>>> >>>>> Best, >>>>> Sebastian >>>>> >>>>>> On Oct 4, 2019, at 5:34 PM, C W wrote: >>>>>> >>>>>> I don't understand your answer. >>>>>> >>>>>> Why after one-hot-encoding it still outputs greater than 0.5 or less than? Does sklearn website have a working example on categorical input? >>>>>> >>>>>> Thanks! >>>>>> >>>>>> On Fri, Oct 4, 2019 at 3:48 PM Sebastian Raschka wrote: >>>>>> Like Nicolas said, the 0.5 is just a workaround but will do the right thing on the one-hot encoded variables, here. You will find that the threshold is always at 0.5 for these variables. 
I.e., what it will do is to use the following conversion: >>>>>> >>>>>> treat as car_Audi=1 if car_Audi >= 0.5 >>>>>> treat as car_Audi=0 if car_Audi < 0.5 >>>>>> >>>>>> or, it may be >>>>>> >>>>>> treat as car_Audi=1 if car_Audi > 0.5 >>>>>> treat as car_Audi=0 if car_Audi <= 0.5 >>>>>> >>>>>> (Forgot which one sklearn is using, but either way. it will be fine.) >>>>>> >>>>>> Best, >>>>>> Sebastian >>>>>> >>>>>> >>>>>>> On Oct 4, 2019, at 1:44 PM, Nicolas Hug wrote: >>>>>>> >>>>>>> >>>>>>>> But, decision tree is still mistaking one-hot-encoding as numerical input and split at 0.5. This is not right. Perhaps, I'm doing something wrong? >>>>>>> You're not doing anything wrong, and neither is the tree. Trees don't support categorical variables in sklearn, so everything is treated as numerical. >>>>>>> >>>>>>> This is why we do one-hot-encoding: so that a set of numerical (one hot encoded) features can be treated as if they were just one categorical feature. >>>>>>> >>>>>>> >>>>>>> >>>>>>> Nicolas >>>>>>> >>>>>>> On 10/4/19 2:01 PM, C W wrote: >>>>>>>> Yes, you are right. it was 0.5 and 0.5 for split, not 1.5. So, typo on my part. >>>>>>>> >>>>>>>> Looks like I did one-hot-encoding correctly. My new variable names are: car_Audi, car_BMW, etc. >>>>>>>> >>>>>>>> But, decision tree is still mistaking one-hot-encoding as numerical input and split at 0.5. This is not right. Perhaps, I'm doing something wrong? >>>>>>>> >>>>>>>> Is there a good toy example on the sklearn website? I am only see this: https://scikit-learn.org/stable/auto_examples/tree/plot_tree_regression.html. >>>>>>>> >>>>>>>> Thanks! >>>>>>>> >>>>>>>> >>>>>>>> >>>>>>>> On Fri, Oct 4, 2019 at 1:28 PM Sebastian Raschka wrote: >>>>>>>> Hi, >>>>>>>> >>>>>>>>> The funny part is: the tree is taking one-hot-encoding (BMW=0, Toyota=1, Audi=2) as numerical values, not category.The tree splits at 0.5 and 1.5 >>>>>>>> that's not a onehot encoding then. 
>>>>>>>> >>>>>>>> For an Audi datapoint, it should be >>>>>>>> >>>>>>>> BMW=0 >>>>>>>> Toyota=0 >>>>>>>> Audi=1 >>>>>>>> >>>>>>>> for BMW >>>>>>>> >>>>>>>> BMW=1 >>>>>>>> Toyota=0 >>>>>>>> Audi=0 >>>>>>>> >>>>>>>> and for Toyota >>>>>>>> >>>>>>>> BMW=0 >>>>>>>> Toyota=1 >>>>>>>> Audi=0 >>>>>>>> >>>>>>>> The split threshold should then be at 0.5 for any of these features. >>>>>>>> >>>>>>>> Based on your email, I think you were assuming that the DT does the one-hot encoding internally, which it doesn't. In practice, it is hard to guess what is a nominal and what is a ordinal variable, so you have to do the onehot encoding before you give the data to the decision tree. >>>>>>>> >>>>>>>> Best, >>>>>>>> Sebastian >>>>>>>> >>>>>>>>> On Oct 4, 2019, at 11:48 AM, C W wrote: >>>>>>>>> >>>>>>>>> I'm getting some funny results. I am doing a regression decision tree, the response variables are assigned to levels. >>>>>>>>> >>>>>>>>> The funny part is: the tree is taking one-hot-encoding (BMW=0, Toyota=1, Audi=2) as numerical values, not category. >>>>>>>>> >>>>>>>>> The tree splits at 0.5 and 1.5. Am I doing one-hot-encoding wrong? How does the sklearn know internally 0 vs. 1 is categorical, not numerical? >>>>>>>>> >>>>>>>>> In R for instance, you do as.factor(), which explicitly states the data type. >>>>>>>>> >>>>>>>>> Thank you! >>>>>>>>> >>>>>>>>> >>>>>>>>> On Wed, Sep 18, 2019 at 11:13 AM Andreas Mueller wrote: >>>>>>>>> >>>>>>>>> >>>>>>>>> On 9/15/19 8:16 AM, Guillaume Lema?tre wrote: >>>>>>>>>> On Sat, 14 Sep 2019 at 20:59, C W wrote: >>>>>>>>>> Thanks, Guillaume. >>>>>>>>>> Column transformer looks pretty neat. I've also heard though, this pipeline can be tedious to set up? Specifying what you want for every feature is a pain. >>>>>>>>>> >>>>>>>>>> It would be interesting for us which part of the pipeline is tedious to set up to know if we can improve something there. 
>>>>>>>>>> Do you mean, that you would like to automatically detect of which type of feature (categorical/numerical) and apply a >>>>>>>>>> default encoder/scaling such as discuss there: https://github.com/scikit-learn/scikit-learn/issues/10603#issuecomment-401155127 >>>>>>>>>> >>>>>>>>>> IMO, one a user perspective, it would be cleaner in some cases at the cost of applying blindly a black box >>>>>>>>>> which might be dangerous. >>>>>>>>> Also see https://amueller.github.io/dabl/dev/generated/dabl.EasyPreprocessor.html#dabl.EasyPreprocessor >>>>>>>>> Which basically does that. >>>>>>>>> >>>>>>>>> >>>>>>>>>> >>>>>>>>>> Jaiver, >>>>>>>>>> Actually, you guessed right. My real data has only one numerical variable, looks more like this: >>>>>>>>>> >>>>>>>>>> Gender Date Income Car Attendance >>>>>>>>>> Male 2019/3/01 10000 BMW Yes >>>>>>>>>> Female 2019/5/02 9000 Toyota No >>>>>>>>>> Male 2019/7/15 12000 Audi Yes >>>>>>>>>> >>>>>>>>>> I am predicting income using all other categorical variables. Maybe it is catboost! >>>>>>>>>> >>>>>>>>>> Thanks, >>>>>>>>>> >>>>>>>>>> M >>>>>>>>>> >>>>>>>>>> >>>>>>>>>> >>>>>>>>>> >>>>>>>>>> >>>>>>>>>> >>>>>>>>>> On Sat, Sep 14, 2019 at 9:25 AM Javier L?pez wrote: >>>>>>>>>> If you have datasets with many categorical features, and perhaps many categories, the tools in sklearn are quite limited, >>>>>>>>>> but there are alternative implementations of boosted trees that are designed with categorical features in mind. Take a look >>>>>>>>>> at catboost [1], which has an sklearn-compatible API. >>>>>>>>>> >>>>>>>>>> J >>>>>>>>>> >>>>>>>>>> [1] https://catboost.ai/ >>>>>>>>>> >>>>>>>>>> On Sat, Sep 14, 2019 at 3:40 AM C W wrote: >>>>>>>>>> Hello all, >>>>>>>>>> I'm very confused. Can the decision tree module handle both continuous and categorical features in the dataset? In this case, it's just CART (Classification and Regression Trees). 
>>>>>>>>>> >>>>>>>>>> For example, >>>>>>>>>> Gender Age Income Car Attendance >>>>>>>>>> Male 30 10000 BMW Yes >>>>>>>>>> Female 35 9000 Toyota No >>>>>>>>>> Male 50 12000 Audi Yes >>>>>>>>>> >>>>>>>>>> According to the documentation https://scikit-learn.org/stable/modules/tree.html#tree-algorithms-id3-c4-5-c5-0-and-cart, it can not! >>>>>>>>>> >>>>>>>>>> It says: "scikit-learn implementation does not support categorical variables for now". >>>>>>>>>> >>>>>>>>>> Is this true? If not, can someone point me to an example? If yes, what do people do? >>>>>>>>>> >>>>>>>>>> Thank you very much! >>>>>>>>>> >>>>>>>>>> >>>>>>>>>> >>>>>>>>>> _______________________________________________ >>>>>>>>>> scikit-learn mailing list >>>>>>>>>> scikit-learn at python.org >>>>>>>>>> https://mail.python.org/mailman/listinfo/scikit-learn >>>>>>>>>> _______________________________________________ >>>>>>>>>> scikit-learn mailing list >>>>>>>>>> scikit-learn at python.org >>>>>>>>>> https://mail.python.org/mailman/listinfo/scikit-learn >>>>>>>>>> _______________________________________________ >>>>>>>>>> scikit-learn mailing list >>>>>>>>>> scikit-learn at python.org >>>>>>>>>> https://mail.python.org/mailman/listinfo/scikit-learn >>>>>>>>>> >>>>>>>>>> >>>>>>>>>> -- >>>>>>>>>> Guillaume Lemaitre >>>>>>>>>> INRIA Saclay - Parietal team >>>>>>>>>> Center for Data Science Paris-Saclay >>>>>>>>>> https://glemaitre.github.io/ >>>>>>>>>> >>>>>>>>>> >>>>>>>>>> _______________________________________________ >>>>>>>>>> scikit-learn mailing list >>>>>>>>>> >>>>>>>>>> scikit-learn at python.org >>>>>>>>>> https://mail.python.org/mailman/listinfo/scikit-learn >>>>>>>>> _______________________________________________ >>>>>>>>> scikit-learn mailing list >>>>>>>>> scikit-learn at python.org >>>>>>>>> https://mail.python.org/mailman/listinfo/scikit-learn >>>>>>>>> _______________________________________________ >>>>>>>>> scikit-learn mailing list >>>>>>>>> scikit-learn at python.org >>>>>>>>> 
From mail at sebastianraschka.com Sun Oct 6 11:11:24 2019 From: mail at sebastianraschka.com (Sebastian Raschka) Date: Sun, 6 Oct 2019 10:11:24 -0500 Subject: [scikit-learn] Can Scikit-learn decision tree (CART) have both continuous and categorical features? In-Reply-To: <9c0d0591-f631-bb94-9018-955f24d189d0@gmail.com> References: <5e9661ff-dfb2-cc2e-b71f-ba18024374a1@gmail.com> <7E3EE86D-4B8A-438A-B03A-8DFC8E1D8AB4@sebastianraschka.com> <7A0589D1-D990-4FD6-9D11-AA804E34F3BC@sebastianraschka.com> <4FC33890-94D3-4AA8-8FA9-EF1FADFD4C20@sebastianraschka.com> <3d6e9116-43bf-77d3-dfeb-ec6c91041748@gmail.com> <9c0d0591-f631-bb94-9018-955f24d189d0@gmail.com> Message-ID: <32449B7F-691B-4FF9-A7CC-A784B19A3852@sebastianraschka.com>

You are right, changing the figure size would fix the issue (updated the notebook). In practice, I think the issue becomes choosing a good aspect ratio such that a) the general proportions of the plot look ok and b) the proportions of the boxes wrt the arrows look ok. It's all possible for a user to do, but for my use cases (e.g., making a quick graphic for a presentation / meeting) it was just quicker with graphviz. On the other hand, I would prefer/recommend the plot_tree func just because it is based on matplotlib ...

In any case, I haven't had a chance to look at the plot_tree func, but I guess this could potentially be relatively easy to address. I guess it would just require finding and setting a good default value for

a) the XOR case, where a user provides either feature names or class label names;
b) the AND case, where a user provides both feature names and class label names.

> On Oct 6, 2019, at 9:55 AM, Andreas Mueller wrote:
>
> Thanks!
> I'll double check that issue.
Generally you have to set the figure size to get good results.
> We should probably add some code to set the figure size automatically (if we create a figure?).
>
> On 10/6/19 10:40 AM, Sebastian Raschka wrote:
>> Sure, I just ran an example I made with graphviz via plot_tree, and it looks like there's an issue with overlapping boxes if you use class (and/or feature) names. I made a reproducible example here so that you can take a look:
>> https://github.com/rasbt/bugreport/blob/master/scikit-learn/plot_tree/tree-demo-1.ipynb
>>
>> Happy to add this to the sklearn issue list if there's no issue filed for that yet.
>>
>> Best,
>> Sebastian
>>
>>> On Oct 6, 2019, at 9:10 AM, Andreas Mueller wrote:
>>>
>>> On 10/4/19 11:28 PM, Sebastian Raschka wrote:
>>>> The docs show a way such that you don't need to write it as a png file, using tree.plot_tree:
>>>> https://scikit-learn.org/stable/modules/tree.html#classification
>>>>
>>>> I don't remember why, but I think I had problems with that in the past (I think it didn't look so nice visually, but I don't remember), which is why I still stick to graphviz.
>>> Can you give me examples that don't look as nice? I would love to improve it.
>>>
>>>> For my use cases, it's not much hassle -- it used to be a bit of a hassle to get GraphViz working, but now you can do
>>>>
>>>> conda install pydotplus
>>>> conda install graphviz
>>>>
>>>> Coincidentally, I just made an example for a lecture I was teaching on Tue: https://github.com/rasbt/stat479-machine-learning-fs19/blob/master/06_trees/code/06-trees_demo.ipynb
>>>>
>>>> Best,
>>>> Sebastian
>>>>
>>>>> On Oct 4, 2019, at 10:09 PM, C W wrote:
>>>>>
>>>>> On a separate note, what do you use for plotting?
>>>>>
>>>>> I found graphviz, but you have to first save it as a png on your computer. That's a lot of work for just one plot. Is there something like matplotlib?
>>>>>
>>>>> Thanks!
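A minimal sketch of the figure-size workaround discussed above (assuming scikit-learn >= 0.21, which provides sklearn.tree.plot_tree, and matplotlib; the iris example and the particular figsize are illustrative, not from the thread): passing an explicit, generously sized Axes avoids the overlapping-box problem without going through graphviz.

```python
# Render a fitted tree with plot_tree on an explicitly sized figure.
import matplotlib
matplotlib.use("Agg")  # render off-screen; drop this line in a notebook
import matplotlib.pyplot as plt
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier, plot_tree

iris = load_iris()
clf = DecisionTreeClassifier(max_depth=3, random_state=0)
clf.fit(iris.data, iris.target)

# A generously sized figure keeps the node boxes from overlapping when
# both feature names and class names are shown.
fig, ax = plt.subplots(figsize=(16, 8))
plot_tree(clf, feature_names=iris.feature_names,
          class_names=list(iris.target_names), filled=True, ax=ax)
fig.savefig("tree.png")
```

No intermediate .dot/.png round-trip through graphviz is needed; the tree is drawn directly on the matplotlib Axes.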
>>>>>
>>>>> On Fri, Oct 4, 2019 at 9:42 PM Sebastian Raschka wrote:
>>>>> Yeah, think of it more as a computational workaround for achieving the same thing more efficiently (although it looks inelegant/weird) -- something like that wouldn't be mentioned in textbooks.
>>>>>
>>>>> Best,
>>>>> Sebastian
>>>>>
>>>>>> On Oct 4, 2019, at 6:33 PM, C W wrote:
>>>>>>
>>>>>> Thanks Sebastian, I think I get it.
>>>>>>
>>>>>> It's just that I have never seen it this way. Quite different from what I'm used to from Elements of Statistical Learning.
>>>>>>
>>>>>> On Fri, Oct 4, 2019 at 7:13 PM Sebastian Raschka wrote:
>>>>>> Not sure if there's a website for that. In any case, to explain this differently: as discussed earlier, sklearn assumes continuous features for decision trees. So, it will use a binary threshold for splitting along a feature attribute. In other words, it cannot do something like
>>>>>>
>>>>>> if x == 1 then right child node
>>>>>> else left child node
>>>>>>
>>>>>> Instead, what it does is
>>>>>>
>>>>>> if x >= 0.5 then right child node
>>>>>> else left child node
>>>>>>
>>>>>> These are basically equivalent, as you can see when you just plug in the values 0 and 1 for x.
>>>>>>
>>>>>> Best,
>>>>>> Sebastian
>>>>>>
>>>>>>> On Oct 4, 2019, at 5:34 PM, C W wrote:
>>>>>>>
>>>>>>> I don't understand your answer.
>>>>>>>
>>>>>>> Why, after one-hot-encoding, does it still output greater than 0.5 or less than? Does the sklearn website have a working example on categorical input?
>>>>>>>
>>>>>>> Thanks!
>>>>>>>
>>>>>>> On Fri, Oct 4, 2019 at 3:48 PM Sebastian Raschka wrote:
>>>>>>> Like Nicolas said, the 0.5 is just a workaround, but it will do the right thing on the one-hot encoded variables here. You will find that the threshold is always at 0.5 for these variables.
I.e., what it will do is to use the following conversion:
>>>>>>>
>>>>>>> treat as car_Audi=1 if car_Audi >= 0.5
>>>>>>> treat as car_Audi=0 if car_Audi < 0.5
>>>>>>>
>>>>>>> or, it may be
>>>>>>>
>>>>>>> treat as car_Audi=1 if car_Audi > 0.5
>>>>>>> treat as car_Audi=0 if car_Audi <= 0.5
>>>>>>>
>>>>>>> (Forgot which one sklearn is using, but either way, it will be fine.)
>>>>>>>
>>>>>>> Best,
>>>>>>> Sebastian
>>>>>>>
>>>>>>>> On Oct 4, 2019, at 1:44 PM, Nicolas Hug wrote:
>>>>>>>>
>>>>>>>>> But, decision tree is still mistaking one-hot-encoding as numerical input and splitting at 0.5. This is not right. Perhaps I'm doing something wrong?
>>>>>>>> You're not doing anything wrong, and neither is the tree. Trees don't support categorical variables in sklearn, so everything is treated as numerical.
>>>>>>>>
>>>>>>>> This is why we do one-hot-encoding: so that a set of numerical (one-hot encoded) features can be treated as if they were just one categorical feature.
>>>>>>>>
>>>>>>>> Nicolas
>>>>>>>>
>>>>>>>> On 10/4/19 2:01 PM, C W wrote:
>>>>>>>>> Yes, you are right. It was 0.5 and 0.5 for the split, not 1.5. So, typo on my part.
>>>>>>>>>
>>>>>>>>> Looks like I did one-hot-encoding correctly. My new variable names are: car_Audi, car_BMW, etc.
>>>>>>>>>
>>>>>>>>> But, the decision tree is still mistaking one-hot-encoding as numerical input and splitting at 0.5. This is not right. Perhaps I'm doing something wrong?
>>>>>>>>>
>>>>>>>>> Is there a good toy example on the sklearn website? I am only seeing this: https://scikit-learn.org/stable/auto_examples/tree/plot_tree_regression.html.
>>>>>>>>>
>>>>>>>>> Thanks!
>>>>>>>>>
>>>>>>>>> On Fri, Oct 4, 2019 at 1:28 PM Sebastian Raschka wrote:
>>>>>>>>> Hi,
>>>>>>>>>
>>>>>>>>>> The funny part is: the tree is taking one-hot-encoding (BMW=0, Toyota=1, Audi=2) as numerical values, not category. The tree splits at 0.5 and 1.5
>>>>>>>>> that's not a one-hot encoding then.
>>>>>>>>>
>>>>>>>>> For an Audi datapoint, it should be
>>>>>>>>>
>>>>>>>>> BMW=0
>>>>>>>>> Toyota=0
>>>>>>>>> Audi=1
>>>>>>>>>
>>>>>>>>> for BMW
>>>>>>>>>
>>>>>>>>> BMW=1
>>>>>>>>> Toyota=0
>>>>>>>>> Audi=0
>>>>>>>>>
>>>>>>>>> and for Toyota
>>>>>>>>>
>>>>>>>>> BMW=0
>>>>>>>>> Toyota=1
>>>>>>>>> Audi=0
>>>>>>>>>
>>>>>>>>> The split threshold should then be at 0.5 for any of these features.
>>>>>>>>>
>>>>>>>>> Based on your email, I think you were assuming that the DT does the one-hot encoding internally, which it doesn't. In practice, it is hard to guess what is a nominal and what is an ordinal variable, so you have to do the one-hot encoding before you give the data to the decision tree.
>>>>>>>>>
>>>>>>>>> Best,
>>>>>>>>> Sebastian
>>>>>>>>>
>>>>>>>>>> On Oct 4, 2019, at 11:48 AM, C W wrote:
>>>>>>>>>>
>>>>>>>>>> I'm getting some funny results. I am doing a regression decision tree; the response variables are assigned to levels.
>>>>>>>>>>
>>>>>>>>>> The funny part is: the tree is taking one-hot-encoding (BMW=0, Toyota=1, Audi=2) as numerical values, not category.
>>>>>>>>>>
>>>>>>>>>> The tree splits at 0.5 and 1.5. Am I doing one-hot-encoding wrong? How does sklearn know internally that 0 vs. 1 is categorical, not numerical?
>>>>>>>>>>
>>>>>>>>>> In R, for instance, you do as.factor(), which explicitly states the data type.
>>>>>>>>>>
>>>>>>>>>> Thank you!
>>>>>>>>>>
>>>>>>>>>> On Wed, Sep 18, 2019 at 11:13 AM Andreas Mueller wrote:
>>>>>>>>>>
>>>>>>>>>> On 9/15/19 8:16 AM, Guillaume Lemaître wrote:
>>>>>>>>>>> On Sat, 14 Sep 2019 at 20:59, C W wrote:
>>>>>>>>>>> Thanks, Guillaume.
>>>>>>>>>>> Column transformer looks pretty neat. I've also heard, though, that this pipeline can be tedious to set up? Specifying what you want for every feature is a pain.
>>>>>>>>>>>
>>>>>>>>>>> It would be interesting for us to know which part of the pipeline is tedious to set up, so we can see if we can improve something there.
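The 0.5 thresholds Sebastian describes are easy to verify. A small sketch with made-up toy data (the car/income columns are illustrative, not from the thread): one-hot encode a categorical column with pandas, fit a regression tree, and inspect the learned split thresholds.

```python
# Every split the tree learns on a 0/1 dummy column lands at 0.5,
# the midpoint between the only two observed values.
import pandas as pd
from sklearn.tree import DecisionTreeRegressor

df = pd.DataFrame({
    "car": ["BMW", "Toyota", "Audi", "BMW", "Audi", "Toyota"],
    "income": [10000, 9000, 12000, 10500, 11800, 9200],
})

# One column per category: car_Audi, car_BMW, car_Toyota
X = pd.get_dummies(df[["car"]])
tree = DecisionTreeRegressor(random_state=0)
tree.fit(X, df["income"])

# Internal nodes have a real left child; leaves are marked with -1.
internal = tree.tree_.children_left != -1
print(sorted(set(tree.tree_.threshold[internal])))  # -> [0.5]
```

The tree treats each dummy strictly as a number, but since the column only ever holds 0 or 1, "x >= 0.5" is exactly the category-membership test.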
>>>>>>>>>>> Do you mean that you would like to automatically detect which type of feature (categorical/numerical) and apply a
>>>>>>>>>>> default encoder/scaling, such as discussed there: https://github.com/scikit-learn/scikit-learn/issues/10603#issuecomment-401155127
>>>>>>>>>>>
>>>>>>>>>>> IMO, from a user perspective, it would be cleaner in some cases, at the cost of blindly applying a black box,
>>>>>>>>>>> which might be dangerous.
>>>>>>>>>> Also see https://amueller.github.io/dabl/dev/generated/dabl.EasyPreprocessor.html#dabl.EasyPreprocessor
>>>>>>>>>> which basically does that.
>>>>>>>>>>
>>>>>>>>>>> Javier,
>>>>>>>>>>> Actually, you guessed right. My real data has only one numerical variable, and looks more like this:
>>>>>>>>>>>
>>>>>>>>>>> Gender Date Income Car Attendance
>>>>>>>>>>> Male 2019/3/01 10000 BMW Yes
>>>>>>>>>>> Female 2019/5/02 9000 Toyota No
>>>>>>>>>>> Male 2019/7/15 12000 Audi Yes
>>>>>>>>>>>
>>>>>>>>>>> I am predicting income using all other categorical variables. Maybe it is catboost!
>>>>>>>>>>>
>>>>>>>>>>> Thanks,
>>>>>>>>>>>
>>>>>>>>>>> M
>>>>>>>>>>>
>>>>>>>>>>> On Sat, Sep 14, 2019 at 9:25 AM Javier López wrote:
>>>>>>>>>>> If you have datasets with many categorical features, and perhaps many categories, the tools in sklearn are quite limited,
>>>>>>>>>>> but there are alternative implementations of boosted trees that are designed with categorical features in mind. Take a look
>>>>>>>>>>> at catboost [1], which has an sklearn-compatible API.
>>>>>>>>>>>
>>>>>>>>>>> J
>>>>>>>>>>>
>>>>>>>>>>> [1] https://catboost.ai/
>>>>>>>>>>>
>>>>>>>>>>> On Sat, Sep 14, 2019 at 3:40 AM C W wrote:
>>>>>>>>>>> Hello all,
>>>>>>>>>>> I'm very confused. Can the decision tree module handle both continuous and categorical features in the dataset? In this case, it's just CART (Classification and Regression Trees).
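For a mixed-type table like the one in the question, the usual sklearn recipe is the ColumnTransformer mentioned earlier in the thread. A sketch with made-up data and column names (one possible setup, not the thread's exact code): one-hot encode the categorical columns, pass any numerical ones through, and feed the result to the tree.

```python
# One-hot encode categorical columns, pass numeric ones through,
# then fit a tree on the combined matrix.
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import OneHotEncoder
from sklearn.tree import DecisionTreeRegressor

df = pd.DataFrame({
    "gender": ["Male", "Female", "Male", "Female"],
    "car": ["BMW", "Toyota", "Audi", "BMW"],
    "attendance": ["Yes", "No", "Yes", "Yes"],
    "age": [30, 35, 50, 41],
    "income": [10000, 9000, 12000, 9500],
})

categorical = ["gender", "car", "attendance"]
numerical = ["age"]

preprocess = ColumnTransformer(
    [("onehot", OneHotEncoder(handle_unknown="ignore"), categorical)],
    remainder="passthrough",  # age is forwarded unchanged
)
model = make_pipeline(preprocess, DecisionTreeRegressor(random_state=0))
model.fit(df[categorical + numerical], df["income"])
print(model.predict(df[categorical + numerical]))
```

This is the per-feature wiring C W found tedious; tools like dabl's EasyPreprocessor (linked above) try to infer the column types instead.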
>>>>>>>>>>>
>>>>>>>>>>> For example,
>>>>>>>>>>> Gender Age Income Car Attendance
>>>>>>>>>>> Male 30 10000 BMW Yes
>>>>>>>>>>> Female 35 9000 Toyota No
>>>>>>>>>>> Male 50 12000 Audi Yes
>>>>>>>>>>>
>>>>>>>>>>> According to the documentation https://scikit-learn.org/stable/modules/tree.html#tree-algorithms-id3-c4-5-c5-0-and-cart, it can not!
>>>>>>>>>>>
>>>>>>>>>>> It says: "scikit-learn implementation does not support categorical variables for now".
>>>>>>>>>>>
>>>>>>>>>>> Is this true? If not, can someone point me to an example? If yes, what do people do?
>>>>>>>>>>>
>>>>>>>>>>> Thank you very much!
>>>>>>>>>>>
>>>>>>>>>>> _______________________________________________
>>>>>>>>>>> scikit-learn mailing list
>>>>>>>>>>> scikit-learn at python.org
>>>>>>>>>>> https://mail.python.org/mailman/listinfo/scikit-learn
>>>>>>>>>>>
>>>>>>>>>>> --
>>>>>>>>>>> Guillaume Lemaitre
>>>>>>>>>>> INRIA Saclay - Parietal team
>>>>>>>>>>> Center for Data Science Paris-Saclay
>>>>>>>>>>> https://glemaitre.github.io/
From stuart at stuartreynolds.net Sun Oct 6 18:16:59 2019 From: stuart at stuartreynolds.net (Stuart Reynolds) Date: Sun, 6 Oct 2019 15:16:59 -0700 Subject: [scikit-learn] scikit-learn Digest, Vol 43, Issue 11 In-Reply-To: References: Message-ID:

Pandas has a read_excel function that can load data from an excel spreadsheet:
https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.read_excel.html

On Sun, Oct 6, 2019 at 1:57 AM Mike Smith wrote:
> Can I call an MSExcel cell range in a function such as model.predict(), instead of typing the data in for each element?
>
> On Sat, Oct 5, 2019 at 11:58 AM wrote:
>>
>> Today's Topics:
>>
>> 1.
Re: scikit-learn Digest, Vol 43, Issue 10 (Mike Smith)
>>
>> ----------------------------------------------------------------------
>>
>> Message: 1
>> Date: Sat, 5 Oct 2019 11:55:33 -0700
>> From: Mike Smith
>> To: scikit-learn at python.org
>> Subject: Re: [scikit-learn] scikit-learn Digest, Vol 43, Issue 10
>> Content-Type: text/plain; charset="utf-8"
>>
>> 1. Re: Can Scikit-learn decision tree (CART) have both continuous and categorical features? (C W)
>>
>> What I'd ask in reply to this is whether regression and classification module results can be entered into an input for one resultant output.
>>
>> On Sat, Oct 5, 2019, 11:50 AM, wrote:
>> >
>> > Message: 1
>> > Date: Sat, 5 Oct 2019 14:50:09 -0400
>> > From: C W
>> > To: Scikit-learn mailing list
>> > Subject: Re: [scikit-learn] Can Scikit-learn decision tree (CART) have both continuous and categorical features?
>> > Content-Type: text/plain; charset="utf-8"
>> >
>> > Thanks, great material! I got pydotplus with graphviz to work.
>> > >> > Using the code on the sklearn website [1], tree.plot_tree(clf.fit(iris.data, iris.target)) gives an error:
>> > AttributeError: module 'sklearn.tree' has no attribute 'plot_tree'
>> >
>> > Both my colleague and I got the same error message. Per this post
>> > https://github.com/Microsoft/LightGBM/issues/1844, a PyPI update is needed.
>> >
>> > [1] sklearn link:
>> > https://scikit-learn.org/stable/modules/tree.html#classification
>> >
>> > On Fri, Oct 4, 2019 at 11:52 PM Sebastian Raschka <mail at sebastianraschka.com> wrote:
>> > > The docs show a way such that you don't need to write it as a png file, using tree.plot_tree:
>> > > https://scikit-learn.org/stable/modules/tree.html#classification
>> > >
>> > > I don't remember why, but I think I had problems with that in the past (I think it didn't look so nice visually, but I don't remember), which is why I still stick to graphviz. For my use cases, it's not much hassle -- it used to be a bit of a hassle to get GraphViz working, but now you can do
>> > >
>> > > conda install pydotplus
>> > > conda install graphviz
>> > >
>> > > Coincidentally, I just made an example for a lecture I was teaching on Tue:
>> > > https://github.com/rasbt/stat479-machine-learning-fs19/blob/master/06_trees/code/06-trees_demo.ipynb
>> > >
>> > > Best,
>> > > Sebastian
>> > >
>> > > > On Oct 4, 2019, at 10:09 PM, C W wrote:
>> > > >
>> > > > On a separate note, what do you use for plotting?
>> > > >
>> > > > I found graphviz, but you have to first save it as a png on your computer. That's a lot of work for just one plot. Is there something like matplotlib?
>> > > >
>> > > > Thanks!
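That AttributeError usually just means the installed scikit-learn predates 0.21, the release that added sklearn.tree.plot_tree. A quick sanity check (the pip command in the comment is one way to upgrade; conda users would use conda update instead):

```python
# sklearn.tree.plot_tree exists only in scikit-learn >= 0.21; on older
# releases the import below raises an ImportError and an upgrade is
# needed, e.g. via:  pip install --upgrade scikit-learn
import sklearn
print(sklearn.__version__)

from sklearn.tree import plot_tree  # fails on scikit-learn < 0.21
```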
>> > > > >> > > > On Fri, Oct 4, 2019 at 9:42 PM Sebastian Raschka < >> > > mail at sebastianraschka.com> wrote: >> > > > Yeah, think of it more as a computational workaround for achieving >> the >> > > same thing more efficiently (although it looks inelegant/weird)-- >> > something >> > > like that wouldn't be mentioned in textbooks. >> > > > >> > > > Best, >> > > > Sebastian >> > > > >> > > > > On Oct 4, 2019, at 6:33 PM, C W wrote: >> > > > > >> > > > > Thanks Sebastian, I think I get it. >> > > > > >> > > > > It's just have never seen it this way. Quite different from what >> I'm >> > > used in Elements of Statistical Learning. >> > > > > >> > > > > On Fri, Oct 4, 2019 at 7:13 PM Sebastian Raschka < >> > > mail at sebastianraschka.com> wrote: >> > > > > Not sure if there's a website for that. In any case, to explain >> this >> > > differently, as discussed earlier sklearn assumes continuous features >> for >> > > decision trees. So, it will use a binary threshold for splitting >> along a >> > > feature attribute. In other words, it cannot do sth like >> > > > > >> > > > > if x == 1 then right child node >> > > > > else left child node >> > > > > >> > > > > Instead, what it does is >> > > > > >> > > > > if x >= 0.5 then right child node >> > > > > else left child node >> > > > > >> > > > > These are basically equivalent as you can see when you just plug >> in >> > > values 0 and 1 for x. >> > > > > >> > > > > Best, >> > > > > Sebastian >> > > > > >> > > > > > On Oct 4, 2019, at 5:34 PM, C W wrote: >> > > > > > >> > > > > > I don't understand your answer. >> > > > > > >> > > > > > Why after one-hot-encoding it still outputs greater than 0.5 or >> > less >> > > than? Does sklearn website have a working example on categorical >> input? >> > > > > > >> > > > > > Thanks! 
>> > > > > > >> > > > > > On Fri, Oct 4, 2019 at 3:48 PM Sebastian Raschka < >> > > mail at sebastianraschka.com> wrote: >> > > > > > Like Nicolas said, the 0.5 is just a workaround but will do the >> > > right thing on the one-hot encoded variables, here. You will find that >> > the >> > > threshold is always at 0.5 for these variables. I.e., what it will do >> is >> > to >> > > use the following conversion: >> > > > > > >> > > > > > treat as car_Audi=1 if car_Audi >= 0.5 >> > > > > > treat as car_Audi=0 if car_Audi < 0.5 >> > > > > > >> > > > > > or, it may be >> > > > > > >> > > > > > treat as car_Audi=1 if car_Audi > 0.5 >> > > > > > treat as car_Audi=0 if car_Audi <= 0.5 >> > > > > > >> > > > > > (Forgot which one sklearn is using, but either way. it will be >> > fine.) >> > > > > > >> > > > > > Best, >> > > > > > Sebastian >> > > > > > >> > > > > > >> > > > > >> On Oct 4, 2019, at 1:44 PM, Nicolas Hug >> wrote: >> > > > > >> >> > > > > >> >> > > > > >>> But, decision tree is still mistaking one-hot-encoding as >> > > numerical input and split at 0.5. This is not right. Perhaps, I'm >> doing >> > > something wrong? >> > > > > >> >> > > > > >> You're not doing anything wrong, and neither is the tree. Trees >> > > don't support categorical variables in sklearn, so everything is >> treated >> > as >> > > numerical. >> > > > > >> >> > > > > >> This is why we do one-hot-encoding: so that a set of numerical >> > (one >> > > hot encoded) features can be treated as if they were just one >> categorical >> > > feature. >> > > > > >> >> > > > > >> >> > > > > >> >> > > > > >> Nicolas >> > > > > >> >> > > > > >> On 10/4/19 2:01 PM, C W wrote: >> > > > > >>> Yes, you are right. it was 0.5 and 0.5 for split, not 1.5. So, >> > > typo on my part. >> > > > > >>> >> > > > > >>> Looks like I did one-hot-encoding correctly. My new variable >> > names >> > > are: car_Audi, car_BMW, etc. 
>> > > > > >>> >> > > > > >>> But, decision tree is still mistaking one-hot-encoding as >> > > numerical input and split at 0.5. This is not right. Perhaps, I'm >> doing >> > > something wrong? >> > > > > >>> >> > > > > >>> Is there a good toy example on the sklearn website? I am only >> see >> > > this: >> > > >> > >> https://scikit-learn.org/stable/auto_examples/tree/plot_tree_regression.html >> > > . >> > > > > >>> >> > > > > >>> Thanks! >> > > > > >>> >> > > > > >>> >> > > > > >>> >> > > > > >>> On Fri, Oct 4, 2019 at 1:28 PM Sebastian Raschka < >> > > mail at sebastianraschka.com> wrote: >> > > > > >>> Hi, >> > > > > >>> >> > > > > >>>> The funny part is: the tree is taking one-hot-encoding >> (BMW=0, >> > > Toyota=1, Audi=2) as numerical values, not category.The tree splits at >> > 0.5 >> > > and 1.5 >> > > > > >>> >> > > > > >>> that's not a onehot encoding then. >> > > > > >>> >> > > > > >>> For an Audi datapoint, it should be >> > > > > >>> >> > > > > >>> BMW=0 >> > > > > >>> Toyota=0 >> > > > > >>> Audi=1 >> > > > > >>> >> > > > > >>> for BMW >> > > > > >>> >> > > > > >>> BMW=1 >> > > > > >>> Toyota=0 >> > > > > >>> Audi=0 >> > > > > >>> >> > > > > >>> and for Toyota >> > > > > >>> >> > > > > >>> BMW=0 >> > > > > >>> Toyota=1 >> > > > > >>> Audi=0 >> > > > > >>> >> > > > > >>> The split threshold should then be at 0.5 for any of these >> > > features. >> > > > > >>> >> > > > > >>> Based on your email, I think you were assuming that the DT >> does >> > > the one-hot encoding internally, which it doesn't. In practice, it is >> > hard >> > > to guess what is a nominal and what is a ordinal variable, so you >> have to >> > > do the onehot encoding before you give the data to the decision tree. >> > > > > >>> >> > > > > >>> Best, >> > > > > >>> Sebastian >> > > > > >>> >> > > > > >>>> On Oct 4, 2019, at 11:48 AM, C W wrote: >> > > > > >>>> >> > > > > >>>> I'm getting some funny results. 
>>>> Can the decision tree module handle both
>>>> continuous and categorical features in the dataset? In this case,
>>>> it's just CART (Classification and Regression Trees).
>>>>
>>>> For example,
>>>> Gender Age Income Car Attendance
>>>> Male 30 10000 BMW Yes
>>>> Female 35 9000 Toyota No
>>>> Male 50 12000 Audi Yes
>>>>
>>>> According to the documentation
>>>> https://scikit-learn.org/stable/modules/tree.html#tree-algorithms-id3-c4-5-c5-0-and-cart ,
>>>> it can not!
>>>>
>>>> It says: "scikit-learn implementation does not support categorical
>>>> variables for now".
>>>>
>>>> Is this true? If not, can someone point me to an example? If yes,
>>>> what do people do?
>>>>
>>>> Thank you very much!
_______________________________________________
scikit-learn mailing list
scikit-learn at python.org
https://mail.python.org/mailman/listinfo/scikit-learn

------------------------------

End of scikit-learn Digest, Vol 43, Issue 10
********************************************

------------------------------

End of scikit-learn Digest, Vol 43, Issue 11
********************************************

-------------- next part --------------
An HTML attachment was scrubbed...
URL:

From hemanth.genie at gmail.com  Tue Oct 8 07:47:14 2019
From: hemanth.genie at gmail.com (Hemanth Kota)
Date: Tue, 8 Oct 2019 17:17:14 +0530
Subject: [scikit-learn] Regarding design decision for putting Data Scaler and Feature Transformers under same module
Message-ID:

Hi Team,

I'm a beginner with the sklearn library. I have a doubt regarding the
reason for putting data scalers like StandardScaler, RobustScaler, etc.
and feature transformers like QuantileTransformer and PolynomialFeatures
in the same preprocessing module. What relationship made them go
together?

Thanks
Hemanth
-------------- next part --------------
An HTML attachment was scrubbed...
URL:

From g.lemaitre58 at gmail.com  Tue Oct 8 07:54:28 2019
From: g.lemaitre58 at gmail.com (Guillaume Lemaître)
Date: Tue, 08 Oct 2019 13:54:28 +0200
Subject: [scikit-learn] Regarding design decision for putting Data Scaler and Feature Transformers under same module
In-Reply-To:
Message-ID:

An HTML attachment was scrubbed...
URL:

From hemanth.genie at gmail.com  Tue Oct 8 07:59:29 2019
From: hemanth.genie at gmail.com (Hemanth Kota)
Date: Tue, 8 Oct 2019 17:29:29 +0530
Subject: [scikit-learn] Regarding design decision for putting Data Scaler and Feature Transformers under same module
In-Reply-To:
References:
Message-ID:

Simple reason. Thanks

Hemanth

On Tue, Oct 8, 2019, 5:26 PM Guillaume Lemaître wrote:

> You apply them all before using any machine learning algorithm. They are
> preprocessing methods.
>
> Sent from my phone - sorry to be brief and potential misspell.
> *From:* hemanth.genie at gmail.com
> *Sent:* 8 October 2019 14:49
> *To:* scikit-learn at python.org
> *Reply to:* scikit-learn at python.org
> *Subject:* [scikit-learn] Regarding design decision for putting Data
> Scaler and Feature Transformers under same module
>
> Hi Team,
>
> I'm a beginner with the sklearn library.
> I have a doubt regarding the reason for putting data scalers like
> StandardScaler, RobustScaler, etc. and feature transformers like
> QuantileTransformer and PolynomialFeatures in the same preprocessing
> module. What relationship made them go together?
>
> Thanks
> Hemanth
>
> _______________________________________________
> scikit-learn mailing list
> scikit-learn at python.org
> https://mail.python.org/mailman/listinfo/scikit-learn

-------------- next part --------------
An HTML attachment was scrubbed...
URL:

From benoit.presles at u-bourgogne.fr  Tue Oct 8 13:19:57 2019
From: benoit.presles at u-bourgogne.fr (Benoît Presles)
Date: Tue, 8 Oct 2019 19:19:57 +0200
Subject: [scikit-learn] logistic regression results are not stable between solvers
Message-ID:

Dear scikit-learn users,

I am using logistic regression to make some predictions. On my own data,
I do not get the same results between solvers. I managed to reproduce
this issue on synthetic data (see the code below).
All solvers seem to converge (n_iter_ < max_iter), so why do I get
different results?
If results between solvers are not stable, which one to choose?

Best regards,
Ben

------------------------------------------

Here is the code I used to generate synthetic data:

from sklearn.datasets import make_classification
from sklearn.model_selection import StratifiedShuffleSplit
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression
#
RANDOM_SEED = 2
#
X_sim, y_sim = make_classification(n_samples=200,
                                   n_features=45,
                                   n_informative=10,
                                   n_redundant=0,
                                   n_repeated=0,
                                   n_classes=2,
                                   n_clusters_per_class=1,
                                   random_state=RANDOM_SEED,
                                   shuffle=False)
#
sss = StratifiedShuffleSplit(n_splits=10, test_size=0.2,
                             random_state=RANDOM_SEED)
for train_index_split, test_index_split in sss.split(X_sim, y_sim):
    X_split_train, X_split_test = X_sim[train_index_split], X_sim[test_index_split]
    y_split_train, y_split_test = y_sim[train_index_split], y_sim[test_index_split]
    ss = StandardScaler()
    X_split_train = ss.fit_transform(X_split_train)
    X_split_test = ss.transform(X_split_test)
    #
    classifier_lbfgs = LogisticRegression(fit_intercept=True, max_iter=20000000,
                                          verbose=1, random_state=RANDOM_SEED,
                                          C=1e9, solver='lbfgs')
    classifier_lbfgs.fit(X_split_train, y_split_train)
    print('classifier lbfgs iter:', classifier_lbfgs.n_iter_)
    classifier_saga = LogisticRegression(fit_intercept=True, max_iter=20000000,
                                         verbose=1, random_state=RANDOM_SEED,
                                         C=1e9, solver='saga')
    classifier_saga.fit(X_split_train, y_split_train)
    print('classifier saga iter:', classifier_saga.n_iter_)
    #
    y_pred_lbfgs = classifier_lbfgs.predict(X_split_test)
    y_pred_saga = classifier_saga.predict(X_split_test)
    #
    if (y_pred_lbfgs==y_pred_saga).all() == False:
        print('lbfgs does not give the same results as saga :-( !')
        exit()

From t3kcit at gmail.com  Tue Oct 8 13:51:22 2019
From: t3kcit at gmail.com (Andreas Mueller)
Date: Tue, 8 Oct 2019 19:51:22 +0200
Subject: [scikit-learn] logistic regression results are not stable between solvers
In-Reply-To:
References:
Message-ID:

I'm pretty sure SAGA is not converging. Unless you scale the data, SAGA
is very slow to converge.

On 10/8/19 7:19 PM, Benoît Presles wrote:
> Dear scikit-learn users,
>
> I am using logistic regression to make some predictions. On my own
> data, I do not get the same results between solvers. I managed to
> reproduce this issue on synthetic data (see the code below).
> All solvers seem to converge (n_iter_ < max_iter), so why do I get
> different results?
> If results between solvers are not stable, which one to choose?
>
> Best regards,
> Ben
>
> [...]
>
> _______________________________________________
> scikit-learn mailing list
> scikit-learn at python.org
> https://mail.python.org/mailman/listinfo/scikit-learn

From benoit.presles at u-bourgogne.fr  Tue Oct 8 14:19:50 2019
From: benoit.presles at u-bourgogne.fr (Benoît Presles)
Date: Tue, 8 Oct 2019 20:19:50 +0200
Subject: [scikit-learn] logistic regression results are not stable between solvers
In-Reply-To:
References:
Message-ID: <1F1286A9-D63A-4E78-8474-B24C7FCB8B4B@u-bourgogne.fr>

As you can notice in the code below, I do scale the data. I do not get
any convergence warning, and moreover I always have n_iter_ < max_iter.

> Le 8 oct. 2019 à 19:51, Andreas Mueller a écrit :
>
> I'm pretty sure SAGA is not converging. Unless you scale the data,
> SAGA is very slow to converge.
>
>> On 10/8/19 7:19 PM, Benoît Presles wrote:
>> Dear scikit-learn users,
>>
>> I am using logistic regression to make some predictions. On my own
>> data, I do not get the same results between solvers. I managed to
>> reproduce this issue on synthetic data (see the code below).
>>
>> Best regards,
>> Ben
>>
>> [...]
>>
>> _______________________________________________
>> scikit-learn mailing list
>> scikit-learn at python.org
>> https://mail.python.org/mailman/listinfo/scikit-learn
>
>
_______________________________________________
> scikit-learn mailing list
> scikit-learn at python.org
> https://mail.python.org/mailman/listinfo/scikit-learn

From benoit.presles at u-bourgogne.fr  Wed Oct 9 13:21:53 2019
From: benoit.presles at u-bourgogne.fr (Benoît Presles)
Date: Wed, 9 Oct 2019 19:21:53 +0200
Subject: [scikit-learn] logistic regression results are not stable between solvers
In-Reply-To: <1F1286A9-D63A-4E78-8474-B24C7FCB8B4B@u-bourgogne.fr>
References: <1F1286A9-D63A-4E78-8474-B24C7FCB8B4B@u-bourgogne.fr>
Message-ID:

Dear scikit-learn users,

Do you think it is a bug in scikit-learn?

Best regards,
Ben

Le 08/10/2019 à 20:19, Benoît Presles a écrit :
> As you can notice in the code below, I do scale the data. I do not get
> any convergence warning, and moreover I always have n_iter_ < max_iter.
>
>> Le 8 oct. 2019 à 19:51, Andreas Mueller a écrit :
>>
>> I'm pretty sure SAGA is not converging. Unless you scale the data,
>> SAGA is very slow to converge.
>>
>>> On 10/8/19 7:19 PM, Benoît Presles wrote:
>>> Dear scikit-learn users,
>>>
>>> I am using logistic regression to make some predictions. On my own
>>> data, I do not get the same results between solvers. I managed to
>>> reproduce this issue on synthetic data (see the code below).
>>> All solvers seem to converge (n_iter_ < max_iter), so why do I get
>>> different results?
>>> If results between solvers are not stable, which one to choose?
>>>
>>> Best regards,
>>> Ben
>>>
>>> [...]
>>>
>>> _______________________________________________
>>> scikit-learn mailing list
>>> scikit-learn at python.org
>>>
https://mail.python.org/mailman/listinfo/scikit-learn
>> _______________________________________________
>> scikit-learn mailing list
>> scikit-learn at python.org
>> https://mail.python.org/mailman/listinfo/scikit-learn
> _______________________________________________
> scikit-learn mailing list
> scikit-learn at python.org
> https://mail.python.org/mailman/listinfo/scikit-learn

From g.lemaitre58 at gmail.com  Wed Oct 9 14:25:11 2019
From: g.lemaitre58 at gmail.com (Guillaume Lemaître)
Date: Wed, 09 Oct 2019 21:25:11 +0300
Subject: [scikit-learn] logistic regression results are not stable between solvers
In-Reply-To:
Message-ID:

Could you generate more samples, set penalty to none, reduce the
tolerance, and check the coefficients instead of the predictions? This
is just to be sure that this is not only a numerical error.

Sent from my phone - sorry to be brief and potential misspell.

Original Message

From: benoit.presles at u-bourgogne.fr
Sent: 8 October 2019 20:27
To: scikit-learn at python.org
Reply to: scikit-learn at python.org
Subject: [scikit-learn] logistic regression results are not stable between solvers

Dear scikit-learn users,

I am using logistic regression to make some predictions. On my own data,
I do not get the same results between solvers. I managed to reproduce
this issue on synthetic data (see the code below).
All solvers seem to converge (n_iter_ < max_iter), so why do I get
different results?
If results between solvers are not stable, which one to choose?

Best regards,
Ben

------------------------------------------

[...]

_______________________________________________
scikit-learn mailing list
scikit-learn at python.org
https://mail.python.org/mailman/listinfo/scikit-learn

From benoit.presles at u-bourgogne.fr  Wed Oct 9 15:44:46 2019
From: benoit.presles at u-bourgogne.fr (Benoît Presles)
Date: Wed, 9 Oct 2019 21:44:46 +0200
Subject: [scikit-learn] logistic regression results are not stable between solvers
In-Reply-To:
References:
Message-ID: <5591ab4c-6a15-2910-c592-0c019b1a6600@u-bourgogne.fr>

Dear scikit-learn users,

I did what you suggested (see the code below) and I still do not get the
same results between solvers. I do not have the same predictions and I
do not have the same coefficients.

Best regards,
Ben

Here is the new source code:

from sklearn.datasets import make_classification
from sklearn.model_selection import StratifiedShuffleSplit
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression
#
RANDOM_SEED = 2
#
X_sim, y_sim = make_classification(n_samples=400,
                                   n_features=45,
                                   n_informative=10,
                                   n_redundant=0,
                                   n_repeated=0,
                                   n_classes=2,
                                   n_clusters_per_class=1,
                                   random_state=RANDOM_SEED,
                                   shuffle=False)
#
sss = StratifiedShuffleSplit(n_splits=10, test_size=0.2,
                             random_state=RANDOM_SEED)
for train_index_split, test_index_split in sss.split(X_sim, y_sim):
    X_split_train, X_split_test = X_sim[train_index_split], X_sim[test_index_split]
    y_split_train, y_split_test = y_sim[train_index_split], y_sim[test_index_split]
    ss = StandardScaler()
    X_split_train = ss.fit_transform(X_split_train)
    X_split_test = ss.transform(X_split_test)
    #
    classifier_lbfgs = LogisticRegression(fit_intercept=True, max_iter=20000000,
                                          verbose=0, random_state=RANDOM_SEED,
                                          C=1e9, solver='lbfgs',
                                          penalty='none', tol=1e-6)
    classifier_lbfgs.fit(X_split_train, y_split_train)
    print('classifier lbfgs iter:', classifier_lbfgs.n_iter_)
    print(classifier_lbfgs.coef_)
    classifier_saga = LogisticRegression(fit_intercept=True, max_iter=20000000,
                                         verbose=0, random_state=RANDOM_SEED,
                                         C=1e9, solver='saga',
                                         penalty='none', tol=1e-6)
    classifier_saga.fit(X_split_train, y_split_train)
    print('classifier saga iter:', classifier_saga.n_iter_)
    print(classifier_saga.coef_)
    #
    y_pred_lbfgs = classifier_lbfgs.predict(X_split_test)
    y_pred_saga = classifier_saga.predict(X_split_test)
    #
    if (y_pred_lbfgs==y_pred_saga).all() == False:
        print('lbfgs does not give the same results as saga :-( !')
        exit(1)

Le 09/10/2019 à 20:25, Guillaume Lemaître a écrit :
> Could you generate more samples, set penalty to none, reduce the
> tolerance, and check the coefficients instead of the predictions? This
> is just to be sure that this is not only a numerical error.
>
> Sent from my phone - sorry to be brief and potential misspell.
>
> Original Message
>
> From: benoit.presles at u-bourgogne.fr
> Sent: 8 October 2019 20:27
> To: scikit-learn at python.org
> Reply to: scikit-learn at python.org
> Subject: [scikit-learn] logistic regression results are not stable between solvers
>
> Dear scikit-learn users,
>
> I am using logistic regression to make some predictions. On my own data,
> I do not get the same results between solvers. I managed to reproduce
> this issue on synthetic data (see the code below).
> All solvers seem to converge (n_iter_ < max_iter), so why do I get
> different results?
> If results between solvers are not stable, which one to choose?
>
> Best regards,
> Ben
>
> [...]
>
> _______________________________________________
> scikit-learn mailing list
> scikit-learn at python.org
> https://mail.python.org/mailman/listinfo/scikit-learn

From seralouk at hotmail.com  Wed Oct 9 16:10:44 2019
From: seralouk at hotmail.com (serafim loukas)
Date: Wed, 9 Oct 2019 20:10:44 +0000
Subject: [scikit-learn] logistic regression results are not stable between solvers
In-Reply-To: <5591ab4c-6a15-2910-c592-0c019b1a6600@u-bourgogne.fr>
References: <5591ab4c-6a15-2910-c592-0c019b1a6600@u-bourgogne.fr>
Message-ID: <44B72247-308C-42A4-B4E1-DFD1BDFC5058@hotmail.com>

The predictions across solvers are exactly the same when I run the code.
I am using version 0.21.3. What is yours?

In [13]: import sklearn

In [14]: sklearn.__version__
Out[14]: '0.21.3'

Serafeim

On 9 Oct 2019, at 21:44, Benoît Presles wrote:

> (y_pred_lbfgs==y_pred_saga).all() == False

-------------- next part --------------
An HTML attachment was scrubbed...
URL:

From rth.yurchak at gmail.com  Wed Oct 9 17:20:46 2019
From: rth.yurchak at gmail.com (Roman Yurchak)
Date: Wed, 9 Oct 2019 23:20:46 +0200
Subject: [scikit-learn] logistic regression results are not stable between solvers
In-Reply-To: <44B72247-308C-42A4-B4E1-DFD1BDFC5058@hotmail.com>
References: <5591ab4c-6a15-2910-c592-0c019b1a6600@u-bourgogne.fr> <44B72247-308C-42A4-B4E1-DFD1BDFC5058@hotmail.com>
Message-ID: <586c6024-9bef-3ab8-513d-547913808039@gmail.com>

Ben,

I can confirm your results with penalty='none' and C=1e9. In both cases,
you are running a mostly unpenalized logistic regression.
Usually that's less numerically stable than with a small regularization,
depending on the data collinearity.

Running that same code with
- larger penalty (smaller C values)
- or larger number of samples
yields for me the same coefficients (up to some tolerance).

You can also see that SAGA convergence is not good by the fact that it
needs 196000 epochs/iterations to converge.

Actually, I have often seen convergence issues with SAG on small
datasets (in unit tests), not fully sure why.

--
Roman

On 09/10/2019 22:10, serafim loukas wrote:
> The predictions across solvers are exactly the same when I run the code.
> I am using version 0.21.3. What is yours?
>
> In [13]: import sklearn
>
> In [14]: sklearn.__version__
> Out[14]: '0.21.3'
>
> Serafeim
>
>> On 9 Oct 2019, at 21:44, Benoît Presles wrote:
>>
>> (y_pred_lbfgs==y_pred_saga).all() == False
>
> _______________________________________________
> scikit-learn mailing list
> scikit-learn at python.org
> https://mail.python.org/mailman/listinfo/scikit-learn

From g.lemaitre58 at gmail.com  Wed Oct 9 17:36:07 2019
From: g.lemaitre58 at gmail.com (Guillaume Lemaître)
Date: Wed, 9 Oct 2019 23:36:07 +0200
Subject: [scikit-learn] logistic regression results are not stable between solvers
In-Reply-To: <586c6024-9bef-3ab8-513d-547913808039@gmail.com>
References: <5591ab4c-6a15-2910-c592-0c019b1a6600@u-bourgogne.fr> <44B72247-308C-42A4-B4E1-DFD1BDFC5058@hotmail.com> <586c6024-9bef-3ab8-513d-547913808039@gmail.com>
Message-ID:

I slightly changed the bench such that it uses a pipeline, and plotted
the coefficients:

https://gist.github.com/glemaitre/8fcc24bdfc7dc38ca0c09c56e26b9386

I only see one of the 10 splits where SAGA is not converging; otherwise
the coefficients look very close (I don't attach the figure here, but
they can be plotted using the snippet).
So apart from this second split, the other differences seem to be
numerical instability.
Where I do have some concern is the convergence rate of SAGA, but I have no intuition as to whether this is normal or not.

On Wed, 9 Oct 2019 at 23:22, Roman Yurchak wrote:
> Ben,
>
> I can confirm your results with penalty='none' and C=1e9. In both cases,
> you are running a mostly unpenalized logisitic regression. Usually
> that's less numerically stable than with a small regularization,
> depending on the data collinearity.
>
> Running that same code with
> - larger penalty (smaller C values)
> - or larger number of samples
> yields for me the same coefficients (up to some tolerance).
>
> You can also see that SAGA convergence is not good by the fact that it
> needs 196000 epochs/iterations to converge.
>
> Actually, I have often seen convergence issues with SAG on small
> datasets (in unit tests), not fully sure why.
>
> --
> Roman
>
> On 09/10/2019 22:10, serafim loukas wrote:
> > The predictions across solver are exactly the same when I run the code.
> > I am using 0.21.3 version. What is yours?
> >
> > In [13]: import sklearn
> >
> > In [14]: sklearn.__version__
> > Out[14]: '0.21.3'
> >
> > Serafeim
> >
> >> On 9 Oct 2019, at 21:44, Benoît Presles >> > wrote:
> >>
> >> (y_pred_lbfgs==y_pred_saga).all() == False
> >
> > _______________________________________________
> > scikit-learn mailing list
> > scikit-learn at python.org
> > https://mail.python.org/mailman/listinfo/scikit-learn
> >
>
> _______________________________________________
> scikit-learn mailing list
> scikit-learn at python.org
> https://mail.python.org/mailman/listinfo/scikit-learn

--
Guillaume Lemaitre
Scikit-learn @ Inria Foundation
https://glemaitre.github.io/

-------------- next part --------------
An HTML attachment was scrubbed...
URL: 

From g.lemaitre58 at gmail.com Wed Oct 9 17:37:57 2019
From: g.lemaitre58 at gmail.com (=?UTF-8?Q?Guillaume_Lema=C3=AEtre?=)
Date: Wed, 9 Oct 2019 23:37:57 +0200
Subject: [scikit-learn] logistic regression results are not stable between solvers
In-Reply-To: 
References: <5591ab4c-6a15-2910-c592-0c019b1a6600@u-bourgogne.fr> <44B72247-308C-42A4-B4E1-DFD1BDFC5058@hotmail.com> <586c6024-9bef-3ab8-513d-547913808039@gmail.com>
Message-ID: 

Uhm, actually, increasing to 10000 samples solves the convergence issue. SAGA is most probably not designed to work with such a small sample size.

On Wed, 9 Oct 2019 at 23:36, Guillaume Lemaître wrote:
> I slightly change the bench such that it uses pipeline and plotted the
> coefficient:
>
> https://gist.github.com/glemaitre/8fcc24bdfc7dc38ca0c09c56e26b9386
>
> I only see one of the 10 splits where SAGA is not converging, otherwise
> the coefficients look very close (I don't attach the figure here but
> they can be plotted using the snippet).
> So apart from this second split, the other differences seems to be
> numerical instability.
>
> Where I have some concern is regarding the convergence rate of SAGA but
> I have no intuition to know if this is normal or not.
>
> On Wed, 9 Oct 2019 at 23:22, Roman Yurchak wrote:
>
>> Ben,
>>
>> I can confirm your results with penalty='none' and C=1e9. In both cases,
>> you are running a mostly unpenalized logisitic regression. Usually
>> that's less numerically stable than with a small regularization,
>> depending on the data collinearity.
>>
>> Running that same code with
>> - larger penalty (smaller C values)
>> - or larger number of samples
>> yields for me the same coefficients (up to some tolerance).
>>
>> You can also see that SAGA convergence is not good by the fact that it
>> needs 196000 epochs/iterations to converge.
>>
>> Actually, I have often seen convergence issues with SAG on small
>> datasets (in unit tests), not fully sure why.
>> >> -- >> Roman >> >> On 09/10/2019 22:10, serafim loukas wrote: >> > The predictions across solver are exactly the same when I run the code. >> > I am using 0.21.3 version. What is yours? >> > >> > >> > In [13]: import sklearn >> > >> > In [14]: sklearn.__version__ >> > Out[14]: '0.21.3' >> > >> > >> > Serafeim >> > >> > >> > >> >> On 9 Oct 2019, at 21:44, Beno?t Presles > >> > wrote: >> >> >> >> (y_pred_lbfgs==y_pred_saga).all() == False >> > >> > >> > _______________________________________________ >> > scikit-learn mailing list >> > scikit-learn at python.org >> > https://mail.python.org/mailman/listinfo/scikit-learn >> > >> >> _______________________________________________ >> scikit-learn mailing list >> scikit-learn at python.org >> https://mail.python.org/mailman/listinfo/scikit-learn >> > > > -- > Guillaume Lemaitre > Scikit-learn @ Inria Foundation > https://glemaitre.github.io/ > -- Guillaume Lemaitre Scikit-learn @ Inria Foundation https://glemaitre.github.io/ -------------- next part -------------- An HTML attachment was scrubbed... URL: From g.lemaitre58 at gmail.com Wed Oct 9 17:39:05 2019 From: g.lemaitre58 at gmail.com (=?UTF-8?Q?Guillaume_Lema=C3=AEtre?=) Date: Wed, 9 Oct 2019 23:39:05 +0200 Subject: [scikit-learn] logistic regression results are not stable between solvers In-Reply-To: References: <5591ab4c-6a15-2910-c592-0c019b1a6600@u-bourgogne.fr> <44B72247-308C-42A4-B4E1-DFD1BDFC5058@hotmail.com> <586c6024-9bef-3ab8-513d-547913808039@gmail.com> Message-ID: Ups I did not see the answer of Roman. Sorry about that. It is coming back to the same conclusion :) On Wed, 9 Oct 2019 at 23:37, Guillaume Lema?tre wrote: > Uhm actually increasing to 10000 samples solve the convergence issue. > SAGA is not designed to work with a so small sample size most probably. 
> > On Wed, 9 Oct 2019 at 23:36, Guillaume Lema?tre > wrote: > >> I slightly change the bench such that it uses pipeline and plotted the >> coefficient: >> >> https://gist.github.com/glemaitre/8fcc24bdfc7dc38ca0c09c56e26b9386 >> >> I only see one of the 10 splits where SAGA is not converging, otherwise >> the coefficients >> look very close (I don't attach the figure here but they can be plotted >> using the snippet). >> So apart from this second split, the other differences seems to be >> numerical instability. >> >> Where I have some concern is regarding the convergence rate of SAGA but I >> have no >> intuition to know if this is normal or not. >> >> On Wed, 9 Oct 2019 at 23:22, Roman Yurchak wrote: >> >>> Ben, >>> >>> I can confirm your results with penalty='none' and C=1e9. In both cases, >>> you are running a mostly unpenalized logisitic regression. Usually >>> that's less numerically stable than with a small regularization, >>> depending on the data collinearity. >>> >>> Running that same code with >>> - larger penalty ( smaller C values) >>> - or larger number of samples >>> yields for me the same coefficients (up to some tolerance). >>> >>> You can also see that SAGA convergence is not good by the fact that it >>> needs 196000 epochs/iterations to converge. >>> >>> Actually, I have often seen convergence issues with SAG on small >>> datasets (in unit tests), not fully sure why. >>> >>> -- >>> Roman >>> >>> On 09/10/2019 22:10, serafim loukas wrote: >>> > The predictions across solver are exactly the same when I run the code. >>> > I am using 0.21.3 version. What is yours? 
>>> > >>> > >>> > In [13]: import sklearn >>> > >>> > In [14]: sklearn.__version__ >>> > Out[14]: '0.21.3' >>> > >>> > >>> > Serafeim >>> > >>> > >>> > >>> >> On 9 Oct 2019, at 21:44, Beno?t Presles < >>> benoit.presles at u-bourgogne.fr >>> >> > wrote: >>> >> >>> >> (y_pred_lbfgs==y_pred_saga).all() == False >>> > >>> > >>> > _______________________________________________ >>> > scikit-learn mailing list >>> > scikit-learn at python.org >>> > https://mail.python.org/mailman/listinfo/scikit-learn >>> > >>> >>> _______________________________________________ >>> scikit-learn mailing list >>> scikit-learn at python.org >>> https://mail.python.org/mailman/listinfo/scikit-learn >>> >> >> >> -- >> Guillaume Lemaitre >> Scikit-learn @ Inria Foundation >> https://glemaitre.github.io/ >> > > > -- > Guillaume Lemaitre > Scikit-learn @ Inria Foundation > https://glemaitre.github.io/ > -- Guillaume Lemaitre Scikit-learn @ Inria Foundation https://glemaitre.github.io/ -------------- next part -------------- An HTML attachment was scrubbed... URL: From benoit.presles at u-bourgogne.fr Thu Oct 10 07:14:49 2019 From: benoit.presles at u-bourgogne.fr (=?UTF-8?Q?Beno=c3=aet_Presles?=) Date: Thu, 10 Oct 2019 13:14:49 +0200 Subject: [scikit-learn] logistic regression results are not stable between solvers In-Reply-To: References: <5591ab4c-6a15-2910-c592-0c019b1a6600@u-bourgogne.fr> <44B72247-308C-42A4-B4E1-DFD1BDFC5058@hotmail.com> <586c6024-9bef-3ab8-513d-547913808039@gmail.com> Message-ID: <4d4dc37d-ed57-b512-fcdf-45693ff9e489@u-bourgogne.fr> An HTML attachment was scrubbed... 
URL: 

From t3kcit at gmail.com Fri Oct 11 09:42:58 2019
From: t3kcit at gmail.com (Andreas Mueller)
Date: Fri, 11 Oct 2019 15:42:58 +0200
Subject: [scikit-learn] logistic regression results are not stable between solvers
In-Reply-To: <4d4dc37d-ed57-b512-fcdf-45693ff9e489@u-bourgogne.fr>
References: <5591ab4c-6a15-2910-c592-0c019b1a6600@u-bourgogne.fr> <44B72247-308C-42A4-B4E1-DFD1BDFC5058@hotmail.com> <586c6024-9bef-3ab8-513d-547913808039@gmail.com> <4d4dc37d-ed57-b512-fcdf-45693ff9e489@u-bourgogne.fr>
Message-ID: 

On 10/10/19 1:14 PM, Benoît Presles wrote:
>
> Thanks for your answers.
>
> On my real data, I do not have so many samples. I have a bit more than
> 200 samples in total and I also would like to get some results with
> unpenalized logisitic regression.
> What do you suggest? Should I switch to the lbfgs solver?
Yes.
> Am I sure that with this solver I will not have any convergence issue
> and always get the good result? Indeed, I did not get any convergence
> warning with saga, so I thought everything was fine. I noticed some
> issues only when I decided to test several solvers. Without comparing
> the results across solvers, how to be sure that the optimisation goes
> well? Shouldn't scikit-learn warn the user somehow if it is not the case?
We should attempt to warn in the SAGA solver if it doesn't converge.
That it doesn't raise a convergence warning should probably be
considered a bug.
It uses the maximum weight change as a stopping criterion right now.
We could probably compute the dual objective once in the end to see if
we converged, right? Or is that not possible with SAGA? If not, we might
want to caution that no convergence warning will be raised.
>
> At last, I was using saga because I also wanted to do some feature
> selection by using l1 penalty which is not supported by lbfgs...
You can use liblinear then.
>
> Best regards,
> Ben
>
> Le 09/10/2019 à 23:39, Guillaume Lemaître a écrit :
>> Ups I did not see the answer of Roman.
Sorry about that. It is coming >> back to the same conclusion :) >> >> On Wed, 9 Oct 2019 at 23:37, Guillaume Lema?tre >> > wrote: >> >> Uhm actually increasing to 10000 samples solve the convergence issue. >> SAGA is not designed to work with a so small sample size most >> probably. >> >> On Wed, 9 Oct 2019 at 23:36, Guillaume Lema?tre >> > wrote: >> >> I slightly change the bench such that it uses pipeline and >> plotted the coefficient: >> >> https://gist.github.com/glemaitre/8fcc24bdfc7dc38ca0c09c56e26b9386 >> >> I only see one of the 10 splits where SAGA is not converging, >> otherwise the coefficients >> look very close (I don't attach the figure here but they can >> be plotted using the snippet). >> So apart from this second split, the other differences seems >> to be numerical instability. >> >> Where I have some concern is regarding the convergence rate >> of SAGA but I have no >> intuition to know if this is normal or not. >> >> On Wed, 9 Oct 2019 at 23:22, Roman Yurchak >> > wrote: >> >> Ben, >> >> I can confirm your results with penalty='none' and C=1e9. >> In both cases, >> you are running a mostly unpenalized logisitic >> regression. Usually >> that's less numerically stable than with a small >> regularization, >> depending on the data collinearity. >> >> Running that same code with >> ? - larger penalty ( smaller C values) >> ? - or larger number of samples >> ? yields for me the same coefficients (up to some tolerance). >> >> You can also see that SAGA convergence is not good by the >> fact that it >> needs 196000 epochs/iterations to converge. >> >> Actually, I have often seen convergence issues with SAG >> on small >> datasets (in unit tests), not fully sure why. >> >> -- >> Roman >> >> On 09/10/2019 22:10, serafim loukas wrote: >> > The predictions across solver are exactly the same when >> I run the code. >> > I am using 0.21.3 version. What is yours? 
>> >
>> >
>> > In [13]: import sklearn
>> >
>> > In [14]: sklearn.__version__
>> > Out[14]: '0.21.3'
>> >
>> >
>> > Serafeim
>> >
>> >
>> >> On 9 Oct 2019, at 21:44, Benoît Presles < >> benoit.presles at u-bourgogne.fr >> >> > wrote:
>> >>
>> >> (y_pred_lbfgs==y_pred_saga).all() == False
>> >
>> >
>> > _______________________________________________
>> > scikit-learn mailing list
>> > scikit-learn at python.org
>> > https://mail.python.org/mailman/listinfo/scikit-learn
>> >
>>
>> _______________________________________________
>> scikit-learn mailing list
>> scikit-learn at python.org
>> https://mail.python.org/mailman/listinfo/scikit-learn
>>
>> --
>> Guillaume Lemaitre
>> Scikit-learn @ Inria Foundation
>> https://glemaitre.github.io/
>>
>> --
>> Guillaume Lemaitre
>> Scikit-learn @ Inria Foundation
>> https://glemaitre.github.io/
>>
>> --
>> Guillaume Lemaitre
>> Scikit-learn @ Inria Foundation
>> https://glemaitre.github.io/
>>
>> _______________________________________________
>> scikit-learn mailing list
>> scikit-learn at python.org
>> https://mail.python.org/mailman/listinfo/scikit-learn
>
> _______________________________________________
> scikit-learn mailing list
> scikit-learn at python.org
> https://mail.python.org/mailman/listinfo/scikit-learn

-------------- next part --------------
An HTML attachment was scrubbed...
URL: 

From javaeurusd at gmail.com Fri Oct 11 13:04:55 2019
From: javaeurusd at gmail.com (Mike Smith)
Date: Fri, 11 Oct 2019 10:04:55 -0700
Subject: [scikit-learn] scikit supervised learning order
In-Reply-To: 
References: 
Message-ID: 

I see that the list of regressors at https://scikit-learn.org/stable/supervised_learning.html#supervised-learning seems to be ordered from simplest to most complex. For example, 1.6 K-Nearest Neighbors and 1.10 Decision Trees are clearly more basic models than 1.11 Ensemble methods, and worse results are expected from 1.6 and 1.10 than from 1.11.

So, does this mean that the ordering 1.1-1.17 signifies that the models in 1.1 are weaker than those in 1.17, in that order? And that 1.17 Neural Networks is presumed to give the best results?

-------------- next part --------------
An HTML attachment was scrubbed...
URL: 

From javaeurusd at gmail.com Fri Oct 11 13:10:32 2019
From: javaeurusd at gmail.com (Mike Smith)
Date: Fri, 11 Oct 2019 10:10:32 -0700
Subject: [scikit-learn] Is scikit-learn implying neural nets are the best regressor?
In-Reply-To: 
References: 
Message-ID: 

In other words, according to that arrangement, is scikit-learn implying that section 1.17 is the best regressor of those listed, 1.1 to 1.17? If yes, I'd like to know whether I can run the model on a standard PC's CPU and RAM and still expect good results, or whether I need cloud hardware. If I should expect good results on a PC, then scikit-learn is saying that GPU power is obsolete, since certain scikit-learn models that are not designed for GPUs perform better than ML that is designed for GPUs. Is this true? How much hardware is a practical expectation for running the best scikit-learn models and getting the best results?
-------------- next part --------------
An HTML attachment was scrubbed...
URL: 

From gael.varoquaux at normalesup.org Fri Oct 11 13:34:33 2019
From: gael.varoquaux at normalesup.org (Gael Varoquaux)
Date: Fri, 11 Oct 2019 13:34:33 -0400
Subject: [scikit-learn] Is scikit-learn implying neural nets are the best regressor?
In-Reply-To: 
References: 
Message-ID: <20191011173433.bbywiqnwjjpvsi4r@phare.normalesup.org>

On Fri, Oct 11, 2019 at 10:10:32AM -0700, Mike Smith wrote:
> In other words, according to that arrangement, is scikit-learn implying that
> section 1.17 is the best regressor out of the listed, 1.1 to 1.17?

No. First, they are not ordered by complexity (Naive Bayes is arguably simpler than Gaussian Processes). Second, complexity does not imply better prediction.

> If I should expect good results on a pc, scikit says that needing gpu power is
> obsolete, since certain scikit models perform better (than ml designed for gpu)
> that are not designed for gpu, for that reason. Is this true?

Where do you see this written? I think that you are looking for overly simple stories that are not true.

> How much hardware is a practical expectation for running the best
> scikit models and getting the best results?

This is too vague a question for which there is no answer.
Ga?l > On Fri, Oct 11, 2019 at 9:02 AM wrote: > Send scikit-learn mailing list submissions to > ? ? ? ? scikit-learn at python.org > To subscribe or unsubscribe via the World Wide Web, visit > ? ? ? ? https://mail.python.org/mailman/listinfo/scikit-learn > or, via email, send a message with subject or body 'help' to > ? ? ? ? scikit-learn-request at python.org > You can reach the person managing the list at > ? ? ? ? scikit-learn-owner at python.org > When replying, please edit your Subject line so it is more specific > than "Re: Contents of scikit-learn digest..." > Today's Topics: > ? ?1. Re: logistic regression results are not stable between > ? ? ? solvers (Andreas Mueller) > ---------------------------------------------------------------------- > Message: 1 > Date: Fri, 11 Oct 2019 15:42:58 +0200 > From: Andreas Mueller > To: scikit-learn at python.org > Subject: Re: [scikit-learn] logistic regression results are not stable > ? ? ? ? between solvers > Message-ID: > Content-Type: text/plain; charset="utf-8"; Format="flowed" > On 10/10/19 1:14 PM, Beno?t Presles wrote: > > Thanks for your answers. > > On my real data, I do not have so many samples. I have a bit more than > > 200 samples in total and I also would like to get some results with > > unpenalized logisitic regression. > > What do you suggest? Should I switch to the lbfgs solver? > Yes. > > Am I sure that with this solver I will not have any convergence issue > > and always get the good result? Indeed, I did not get any convergence > > warning with saga, so I thought everything was fine. I noticed some > > issues only when I decided to test several solvers. Without comparing > > the results across solvers, how to be sure that the optimisation goes > > well? Shouldn't scikit-learn warn the user somehow if it is not the case? > We should attempt to warn in the SAGA solver if it doesn't converge. > That it doesn't raise a convergence warning should probably be > considered a bug. 
> It uses the maximum weight change as a stopping criterion right now.
> We could probably compute the dual objective once in the end to see if
> we converged, right? Or is that not possible with SAGA? If not, we might
> want to caution that no convergence warning will be raised.
> > At last, I was using saga because I also wanted to do some feature
> > selection by using l1 penalty which is not supported by lbfgs...
> You can use liblinear then.
> > Best regards,
> > Ben
> > Le 09/10/2019 à 23:39, Guillaume Lemaître a écrit :
> >> Ups, I did not see the answer of Roman. Sorry about that. It is coming
> >> back to the same conclusion :)
> >> On Wed, 9 Oct 2019 at 23:37, Guillaume Lemaître wrote:
> >>     Uhm, actually increasing to 10000 samples solves the convergence issue.
> >>     SAGA is most probably not designed to work with such a small sample size.
> >>     On Wed, 9 Oct 2019 at 23:36, Guillaume Lemaître wrote:
> >>         I slightly changed the bench such that it uses a pipeline and
> >>         plotted the coefficients:
> >>         https://gist.github.com/glemaitre/8fcc24bdfc7dc38ca0c09c56e26b9386
> >>         I only see one of the 10 splits where SAGA is not converging;
> >>         otherwise the coefficients look very close (I don't attach the
> >>         figure here but they can be plotted using the snippet).
> >>         So apart from this second split, the other differences seem
> >>         to be numerical instability.
> >>         Where I have some concern is regarding the convergence rate
> >>         of SAGA, but I have no intuition to know if this is normal or not.
> >>         On Wed, 9 Oct 2019 at 23:22, Roman Yurchak wrote:
> >>             Ben,
> >>             I can confirm your results with penalty='none' and C=1e9.
> >>             In both cases, you are running a mostly unpenalized logistic
> >>             regression. Usually
> >>             that's less numerically stable than with a small
> >>             regularization, depending on the data collinearity.
> >>             Running that same code with
> >>               - a larger penalty (smaller C values)
> >>               - or a larger number of samples
> >>             yields for me the same coefficients (up to some tolerance).
> >>             You can also see that SAGA convergence is not good by the
> >>             fact that it needs 196000 epochs/iterations to converge.
> >>             Actually, I have often seen convergence issues with SAG
> >>             on small datasets (in unit tests), not fully sure why.
> >>             --
> >>             Roman
> >>             On 09/10/2019 22:10, serafim loukas wrote:
> >>             > The predictions across solvers are exactly the same when
> >>             > I run the code.
> >>             > I am using the 0.21.3 version. What is yours?
> >>             >
> >>             > In [13]: import sklearn
> >>             > In [14]: sklearn.__version__
> >>             > Out[14]: '0.21.3'
> >>             >
> >>             > Serafeim
> >>             >
> >>             >> On 9 Oct 2019, at 21:44, Benoît Presles wrote:
> >>             >>
> >>             >> (y_pred_lbfgs==y_pred_saga).all() == False
> >>             >
> >>             > _______________________________________________
> >>             > scikit-learn mailing list
> >>             > scikit-learn at python.org
> >>             > https://mail.python.org/mailman/listinfo/scikit-learn
> >>             _______________________________________________
> >>             scikit-learn mailing list
> >>             scikit-learn at python.org
> >>             https://mail.python.org/mailman/listinfo/scikit-learn
> >>         --
> >>         Guillaume Lemaitre
> >>         Scikit-learn @ Inria Foundation
> >>         https://glemaitre.github.io/
> >>     --
> >>     Guillaume Lemaitre
> >>     Scikit-learn @ Inria Foundation
> >>     https://glemaitre.github.io/
> >> --
> >> Guillaume Lemaitre
> >> Scikit-learn @ Inria Foundation
> >> https://glemaitre.github.io/
> >> _______________________________________________
> >> scikit-learn mailing list
> >> scikit-learn at python.org
> >> https://mail.python.org/mailman/listinfo/scikit-learn
> > _______________________________________________
> > scikit-learn mailing list
> > scikit-learn at python.org
> > https://mail.python.org/mailman/listinfo/scikit-learn
> -------------- next part --------------
> An HTML attachment was scrubbed...
> URL: a7052cd9/attachment-0001.html>
> ------------------------------
> Subject: Digest Footer
> _______________________________________________
> scikit-learn mailing list
> scikit-learn at python.org
> https://mail.python.org/mailman/listinfo/scikit-learn
> ------------------------------
> End of scikit-learn Digest, Vol 43, Issue 21
> ********************************************
> _______________________________________________
> scikit-learn mailing list
> scikit-learn at python.org
> https://mail.python.org/mailman/listinfo/scikit-learn

--
Gael Varoquaux
Research Director, INRIA    Visiting professor, McGill
http://gael-varoquaux.info  http://twitter.com/GaelVaroquaux

From javaeurusd at gmail.com Sat Oct 12 17:04:12 2019
From: javaeurusd at gmail.com (Mike Smith)
Date: Sat, 12 Oct 2019 14:04:12 -0700
Subject: [scikit-learn] scikit-learn Digest, Vol 43, Issue 24
In-Reply-To: References:
Message-ID:

"... > If I should expect good results on a pc, scikit says that needing gpu power is
> obsolete, since certain scikit models perform better (than ml designed for gpu)
> that are not designed for gpu, for that reason. Is this true?"
Where do you see this written? I think that you are looking for overly
simple stories that are not true."

Gael, see the below from the scikit-learn FAQ. You can also find this
yourself at the main FAQ:

[image: 2019-10-12 14_00_05-Frequently Asked Questions — scikit-learn
0.21.3 documentation.png]

On Sat, Oct 12, 2019 at 9:03 AM wrote:

> [...]

-------------- next part --------------
An HTML attachment was scrubbed...
URL:
-------------- next part --------------
A non-text attachment was scrubbed...
Name: 2019-10-12 14_00_05-Frequently Asked Questions — scikit-learn 0.21.3 documentation.png
Type: image/png
Size: 26245 bytes
Desc: not available
URL:

From javaeurusd at gmail.com Sat Oct 12 17:06:51 2019
From: javaeurusd at gmail.com (Mike Smith)
Date: Sat, 12 Oct 2019 14:06:51 -0700
Subject: [scikit-learn] scikit-learn Digest, Vol 43, Issue 24
In-Reply-To: References:
Message-ID:

Gael, simply because you're not able to or willing to answer the question
doesn't mean there is no practical answer for it.

"... > How much hardware is a practical expectation for running the best
> scikit models and getting the best results?
This is too vague a question for which there is no answer.

Gaël"

On Sat, Oct 12, 2019 at 9:03 AM wrote:

> [...]

-------------- next part --------------
An HTML attachment was scrubbed...
URL:

From javaeurusd at gmail.com Sat Oct 12 17:08:14 2019
From: javaeurusd at gmail.com (Mike Smith)
Date: Sat, 12 Oct 2019 14:08:14 -0700
Subject: [scikit-learn] scikit-learn Digest, Vol 43, Issue 25
In-Reply-To: References:
Message-ID:

"Second, complexity does not
> imply better prediction."

Complexity doesn't imply prediction? Perhaps you're having a translation error.
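The quoted claim is not a translation error; it is easy to check empirically that a more complex model is not automatically a better predictor. A minimal sketch, assuming an illustrative (nearly linear) synthetic dataset and arbitrary hyperparameters:

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score
from sklearn.neural_network import MLPRegressor

rng = np.random.RandomState(0)

# 100 samples of an almost perfectly linear relationship.
X = rng.randn(100, 5)
y = X @ np.array([1.0, -2.0, 0.5, 0.0, 3.0]) + 0.1 * rng.randn(100)

simple = LinearRegression()
complex_model = MLPRegressor(hidden_layer_sizes=(100, 100), max_iter=5000,
                             random_state=0)

# Mean cross-validated R^2 for each model.
r2_simple = cross_val_score(simple, X, y, cv=5).mean()
r2_complex = cross_val_score(complex_model, X, y, cv=5).mean()
print(f"LinearRegression mean R^2: {r2_simple:.3f}")
print(f"MLPRegressor mean R^2:     {r2_complex:.3f}")
# On data like this, the simpler model typically matches or beats the
# far more complex neural network.
```

Which model predicts best depends on the data, which is why the documentation's ordering of regressors carries no ranking.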
On Sat, Oct 12, 2019 at 2:04 PM wrote: > Send scikit-learn mailing list submissions to > scikit-learn at python.org > > To subscribe or unsubscribe via the World Wide Web, visit > https://mail.python.org/mailman/listinfo/scikit-learn > or, via email, send a message with subject or body 'help' to > scikit-learn-request at python.org > > You can reach the person managing the list at > scikit-learn-owner at python.org > > When replying, please edit your Subject line so it is more specific > than "Re: Contents of scikit-learn digest..." > > > Today's Topics: > > 1. Re: scikit-learn Digest, Vol 43, Issue 24 (Mike Smith) > > > ---------------------------------------------------------------------- > > Message: 1 > Date: Sat, 12 Oct 2019 14:04:12 -0700 > From: Mike Smith > To: scikit-learn at python.org > Subject: Re: [scikit-learn] scikit-learn Digest, Vol 43, Issue 24 > Message-ID: > 4LRy2NJvjwvVr4RgobQ at mail.gmail.com> > Content-Type: text/plain; charset="utf-8" > > "... > If I should expect good results on a pc, scikit says that needing > gpu power is > > obsolete, since certain scikit models perform better (than ml designed > for gpu) > > that are not designed for gpu, for that reason. Is this true?" > > Where do you see this written? I think that you are looking for overly > simple stories that you are not true." > > Gael, see the below from the scikit-learn FAQ. You can also find this > yourself at the main FAQ: > > [image: 2019-10-12 14_00_05-Frequently Asked Questions ? 
scikit-learn > 0.21.3 documentation.png] > > > On Sat, Oct 12, 2019 at 9:03 AM wrote: > > > Send scikit-learn mailing list submissions to > > scikit-learn at python.org > > > > To subscribe or unsubscribe via the World Wide Web, visit > > https://mail.python.org/mailman/listinfo/scikit-learn > > or, via email, send a message with subject or body 'help' to > > scikit-learn-request at python.org > > > > You can reach the person managing the list at > > scikit-learn-owner at python.org > > > > When replying, please edit your Subject line so it is more specific > > than "Re: Contents of scikit-learn digest..." > > > > > > Today's Topics: > > > > 1. Re: Is scikit-learn implying neural nets are the best > > regressor? (Gael Varoquaux) > > > > > > ---------------------------------------------------------------------- > > > > Message: 1 > > Date: Fri, 11 Oct 2019 13:34:33 -0400 > > From: Gael Varoquaux > > To: Scikit-learn mailing list > > Subject: Re: [scikit-learn] Is scikit-learn implying neural nets are > > the best regressor? > > Message-ID: <20191011173433.bbywiqnwjjpvsi4r at phare.normalesup.org> > > Content-Type: text/plain; charset=iso-8859-1 > > > > On Fri, Oct 11, 2019 at 10:10:32AM -0700, Mike Smith wrote: > > > In other words, according to that arrangement, is scikit-learn implying > > that > > > section 1.17 is the best regressor out of the listed, 1.1 to 1.17? > > > > No. > > > > First they are not ordered in order of complexity (Naive Bayes is > > arguably simpler than Gaussian Processes). Second complexity does not > > imply better prediction. > > > > > If I should expect good results on a pc, scikit says that needing gpu > > power is > > > obsolete, since certain scikit models perform better (than ml designed > > for gpu) > > > that are not designed for gpu, for that reason. Is this true? > > > > Where do you see this written? I think that you are looking for overly > > simple stories that you are not true. 
> > > How much hardware is a practical expectation for running the best
> > > scikit models and getting the best results?
> >
> > This is too vague a question for which there is no answer.
> >
> > Gaël
> >
> > > On Fri, Oct 11, 2019 at 9:02 AM > wrote:
> > > Send scikit-learn mailing list submissions to
> > > scikit-learn at python.org
> > > To subscribe or unsubscribe via the World Wide Web, visit
> > > https://mail.python.org/mailman/listinfo/scikit-learn
> > > or, via email, send a message with subject or body 'help' to
> > > scikit-learn-request at python.org
> > > You can reach the person managing the list at
> > > scikit-learn-owner at python.org
> > > When replying, please edit your Subject line so it is more specific
> > > than "Re: Contents of scikit-learn digest..."
> > >
> > > Today's Topics:
> > > 1. Re: logistic regression results are not stable between
> > > solvers (Andreas Mueller)
> > >
> > > ----------------------------------------------------------------------
> > >
> > > Message: 1
> > > Date: Fri, 11 Oct 2019 15:42:58 +0200
> > > From: Andreas Mueller
> > > To: scikit-learn at python.org
> > > Subject: Re: [scikit-learn] logistic regression results are not stable
> > > between solvers
> > > Message-ID:
> > > Content-Type: text/plain; charset="utf-8"; Format="flowed"
> > >
> > > On 10/10/19 1:14 PM, Benoît Presles wrote:
> > > > Thanks for your answers.
> > > > On my real data, I do not have so many samples. I have a bit more than
> > > > 200 samples in total and I also would like to get some results with
> > > > unpenalized logistic regression.
> > > > What do you suggest? Should I switch to the lbfgs solver?
> > > Yes.
> > > > Am I sure that with this solver I will not have any convergence issue
> > > > and always get the correct result? Indeed, I did not get any convergence
> > > > warning with saga, so I thought everything was fine. I noticed some
> > > > issues only when I decided to test several solvers. Without comparing
> > > > the results across solvers, how can I be sure that the optimisation goes
> > > > well? Shouldn't scikit-learn warn the user somehow if it is not the case?
> > > We should attempt to warn in the SAGA solver if it doesn't converge.
> > > That it doesn't raise a convergence warning should probably be
> > > considered a bug.
> > > It uses the maximum weight change as a stopping criterion right now.
> > > We could probably compute the dual objective once in the end to see if
> > > we converged, right? Or is that not possible with SAGA? If not, we might
> > > want to caution that no convergence warning will be raised.
> > > > At last, I was using saga because I also wanted to do some feature
> > > > selection by using the l1 penalty, which is not supported by lbfgs...
> > > You can use liblinear then.
> > > >
> > > > Best regards,
> > > > Ben
> > > >
> > > > Le 09/10/2019 à 23:39, Guillaume Lemaître a écrit :
> > > >> Oops, I did not see the answer of Roman. Sorry about that. It is coming
> > > >> back to the same conclusion :)
> > > >>
> > > >> On Wed, 9 Oct 2019 at 23:37, Guillaume Lemaître wrote:
> > > >> Uhm, actually increasing to 10000 samples solves the convergence issue.
> > > >> SAGA is most probably not designed to work with such a small sample size.
> > > >>
> > > >> On Wed, 9 Oct 2019 at 23:36, Guillaume Lemaître wrote:
> > > >> I slightly changed the bench such that it uses a pipeline and
> > > >> plotted the coefficients:
> > > >> https://gist.github.com/glemaitre/8fcc24bdfc7dc38ca0c09c56e26b9386
> > > >> I only see one of the 10 splits where SAGA is not converging;
> > > >> otherwise the coefficients look very close (I don't attach the figure
> > > >> here but they can be plotted using the snippet).
> > > >> So apart from this second split, the other differences seem
> > > >> to be numerical instability.
> > > >> Where I have some concern is regarding the convergence rate
> > > >> of SAGA, but I have no intuition to know if this is normal or not.
> > > >>
> > > >> On Wed, 9 Oct 2019 at 23:22, Roman Yurchak wrote:
> > > >> Ben,
> > > >> I can confirm your results with penalty='none' and C=1e9. In both cases,
> > > >> you are running a mostly unpenalized logistic regression. Usually
> > > >> that's less numerically stable than with a small regularization,
> > > >> depending on the data collinearity.
> > > >> Running that same code with
> > > >> - larger penalty (smaller C values)
> > > >> - or larger number of samples
> > > >> yields for me the same coefficients (up to some tolerance).
> > > >> You can also see that SAGA convergence is not good by the fact that it
> > > >> needs 196000 epochs/iterations to converge.
> > > >> Actually, I have often seen convergence issues with SAG on small
> > > >> datasets (in unit tests), not fully sure why.
> > > >> --
> > > >> Roman
> > > >>
> > > >> On 09/10/2019 22:10, serafim loukas wrote:
> > > >> > The predictions across solvers are exactly the same when I run the code.
> > > >> > I am using the 0.21.3 version. What is yours?
> > > >> >
> > > >> > In [13]: import sklearn
> > > >> > In [14]: sklearn.__version__
> > > >> > Out[14]: '0.21.3'
> > > >> >
> > > >> > Serafeim
> > > >> >
> > > >> >> On 9 Oct 2019, at 21:44, Benoît Presles wrote:
> > > >> >> (y_pred_lbfgs==y_pred_saga).all() == False
> > > >> >
> > > >> > _______________________________________________
> > > >> > scikit-learn mailing list
> > > >> > scikit-learn at python.org
> > > >> > https://mail.python.org/mailman/listinfo/scikit-learn
> > > >>
> > > >> --
> > > >> Guillaume Lemaitre
> > > >> Scikit-learn @ Inria Foundation
> > > >> https://glemaitre.github.io/
> > > >
> > > > _______________________________________________
> > > > scikit-learn mailing list
> > > > scikit-learn at python.org
> > > > https://mail.python.org/mailman/listinfo/scikit-learn
> > >
> > > -------------- next part --------------
> > > An HTML attachment was scrubbed...
> > > URL: < > > http://mail.python.org/pipermail/scikit-learn/attachments/20191011/ > > > a7052cd9/attachment-0001.html> > > > > > ------------------------------ > > > > > Subject: Digest Footer > > > > > _______________________________________________ > > > scikit-learn mailing list > > > scikit-learn at python.org > > > https://mail.python.org/mailman/listinfo/scikit-learn > > > > > > > ------------------------------ > > > > > End of scikit-learn Digest, Vol 43, Issue 21 > > > ******************************************** > > > > > > > _______________________________________________ > > > scikit-learn mailing list > > > scikit-learn at python.org > > > https://mail.python.org/mailman/listinfo/scikit-learn > > > > > > -- > > Gael Varoquaux > > Research Director, INRIA Visiting professor, McGill > > http://gael-varoquaux.info > http://twitter.com/GaelVaroquaux > > > > > > ------------------------------ > > > > Subject: Digest Footer > > > > _______________________________________________ > > scikit-learn mailing list > > scikit-learn at python.org > > https://mail.python.org/mailman/listinfo/scikit-learn > > > > > > ------------------------------ > > > > End of scikit-learn Digest, Vol 43, Issue 24 > > ******************************************** > > > -------------- next part -------------- > An HTML attachment was scrubbed... > URL: < > http://mail.python.org/pipermail/scikit-learn/attachments/20191012/6959d075/attachment.html > > > -------------- next part -------------- > A non-text attachment was scrubbed... > Name: 2019-10-12 14_00_05-Frequently Asked Questions ? 
scikit-learn 0.21.3
> documentation.png
> Type: image/png
> Size: 26245 bytes
> Desc: not available
> URL: <
> http://mail.python.org/pipermail/scikit-learn/attachments/20191012/6959d075/attachment.png
> >
>
> ------------------------------
>
> Subject: Digest Footer
>
> _______________________________________________
> scikit-learn mailing list
> scikit-learn at python.org
> https://mail.python.org/mailman/listinfo/scikit-learn
>
> ------------------------------
>
> End of scikit-learn Digest, Vol 43, Issue 25
> ********************************************
-------------- next part --------------
An HTML attachment was scrubbed...
URL: 

From jbbrown at kuhp.kyoto-u.ac.jp Sun Oct 13 06:40:11 2019
From: jbbrown at kuhp.kyoto-u.ac.jp (Brown J.B.)
Date: Sun, 13 Oct 2019 19:40:11 +0900
Subject: [scikit-learn] scikit-learn Digest, Vol 43, Issue 25
In-Reply-To: References: Message-ID: 

Please show respect and refinement when addressing the contributors and users of scikit-learn.

Gael's statement is perfect -- complexity does not imply better prediction. The choice of estimator (and algorithm) depends on the structure of the model desired for the data presented. Estimator superiority cannot be proven in a context- and/or data-agnostic fashion.

J.B.

On Sun, 13 Oct 2019 at 06:13, Mike Smith wrote:

> "Second, complexity does not imply better prediction."
>
> Complexity doesn't imply prediction? Perhaps you're having a translation error.
>
> On Sat, Oct 12, 2019 at 2:04 PM wrote:
>
>> [...]
>
_______________________________________________
scikit-learn mailing list
scikit-learn at python.org
https://mail.python.org/mailman/listinfo/scikit-learn
-------------- next part --------------
An HTML attachment was scrubbed...
URL: 

From nelle.varoquaux at gmail.com Mon Oct 14 04:36:26 2019
From: nelle.varoquaux at gmail.com (Nelle Varoquaux)
Date: Mon, 14 Oct 2019 10:36:26 +0200
Subject: [scikit-learn] Announcement -- scikit-image 0.16.1 released
Message-ID: 

Hi All,

On behalf of the scikit-image team, I am pleased to announce that
scikit-image 0.16.1 has been released (0.16.0 was never released due to
necessary last-minute fixes). This release contains many bug fixes and new
features! Please note that we have dropped support for Python 3.5.

Announcement: scikit-image 0.16.1
=================================

We're happy to announce the release of scikit-image v0.16.1!

scikit-image is an image processing toolbox for SciPy that includes
algorithms for segmentation, geometric transformations, color space
manipulation, analysis, filtering, morphology, feature detection, and more.
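As a taste of what such a pipeline looks like in practice, here is a minimal sketch using long-standing scikit-image API (the sample image and Otsu threshold predate this release; this is illustration, not part of the announcement itself):

```python
# Illustrative sketch: segment a grayscale sample image with Otsu
# thresholding using scikit-image's bundled test data.
from skimage import data, filters

image = data.camera()                   # bundled 8-bit grayscale photo
thresh = filters.threshold_otsu(image)  # global threshold value
mask = image > thresh                   # boolean foreground mask
print(image.shape, thresh, mask.mean())
```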
For more information, examples, and documentation, please visit our
website: https://scikit-image.org

Starting from this release, scikit-image will follow the recently
introduced NumPy deprecation policy, `NEP 29
<https://github.com/numpy/numpy/blob/master/doc/neps/nep-0029-deprecation_policy.rst>`__.
Accordingly, scikit-image 0.16 drops support for Python 3.5. This release
of scikit-image officially supports Python 3.6 and 3.7.

Special thanks to Matthias Bussonnier for `Frappuccino `__, which helped us
catch all API changes and nail down the APIs for new features.

New Features
------------
- New `skimage.metrics` module containing simple metrics (MSE, NRMSE,
  PSNR) and segmentation metrics (adapted Rand error, variation of
  information) (#4025)
- n-dimensional TV-L1 optical flow algorithm for registration --
  `skimage.registration.optical_flow_tvl1` (#3983)
- Draw a line in an n-dimensional array -- `skimage.draw.line_nd` (#2043)
- 2D Farid & Simoncelli edge filters -- `skimage.filters.farid`,
  `skimage.filters.farid_h`, and `skimage.filters.farid_v` (#3775)
- 2D majority voting filter assigning to each pixel the most commonly
  occurring value within its neighborhood -- `skimage.filters.majority`
  (#3836, #3839)
- Multi-level threshold "multi-Otsu" method, a thresholding algorithm used
  to separate the pixels of an input image into several classes by
  maximizing the variances between classes --
  `skimage.filters.threshold_multiotsu` (#3872, #4174)
- New example data -- `skimage.data.shepp_logan_phantom`,
  `skimage.data.colorwheel`, `skimage.data.brick`, `skimage.data.grass`,
  `skimage.data.roughwall`, `skimage.data.cell` (#3958, #3966)
- Compute and format image region properties as a table --
  `skimage.measure.regionprops_table` (#3959)
- Convert a polygon into a mask -- `skimage.draw.poly2mask` (#3971, #3977)
- Visual image comparison helper `skimage.util.compare_images`, which
  returns an image showing the difference between two input images (#4089)
- `skimage.transform.warp_polar` to remap an image into polar or log-polar
  coordinates (#4097)

Improvements
------------
- RANSAC: new option to set initial samples selected for initialization (#2992)
- Better repr and str for `skimage.transform.ProjectiveTransform` (#3525,
  #3967)
- Better error messages and data type stability in
  `skimage.segmentation.relabel_sequential` (#3740)
- Improved compatibility with dask arrays in some image thresholding
  methods (#3823)
- `skimage.io.ImageCollection` can now receive lists of patterns (#3928)
- Speed up `skimage.feature.peak_local_max` (#3984)
- Better error message for an incorrect value of the keyword argument
  ``kind`` in `skimage.color.label2rgb` (#4055)
- All functions from `skimage.draw` now support multi-channel 2D images
  (#4134)

API Changes
-----------
- Deprecated subpackage ``skimage.novice`` has been removed.
- Default value of ``multichannel`` parameters has been set to False in
  `skimage.transform.rescale`, `skimage.transform.pyramid_reduce`,
  `skimage.transform.pyramid_laplacian`,
  `skimage.transform.pyramid_gaussian`, and
  `skimage.transform.pyramid_expand`. Guessing is no longer performed for
  3D arrays.
- Deprecated argument ``visualise`` has been removed from
  `skimage.feature.hog`. Use ``visualize`` instead.
- `skimage.transform.seam_carve` has been completely removed from the
  library due to licensing restrictions.
- Parameter ``as_grey`` has been removed from `skimage.data.load` and
  `skimage.io.imread`. Use ``as_gray`` instead.
- Parameter ``min_size`` has been removed from
  `skimage.morphology.remove_small_holes`. Use ``area_threshold`` instead.
- Deprecated ``correct_mesh_orientation`` in `skimage.measure` has been
  removed.
- `skimage.measure._regionprops` has been completely switched to using
  row-column coordinates. The old x-y interface is no longer available.
- Default value of the ``behavior`` parameter has been set to ``ndimage``
  in `skimage.filters.median`.
- Parameter ``flatten`` in `skimage.io.imread` has been removed in favor
  of ``as_gray``.
- Parameters ``Hxx, Hxy, Hyy`` have been removed from
  `skimage.feature.corner.hessian_matrix_eigvals` in favor of ``H_elems``.
- Default value of the ``order`` parameter has been set to ``rc`` in
  `skimage.feature.hessian_matrix`.
- ``skimage.util.img_as_*`` functions no longer raise precision and/or
  loss warnings.

Bugfixes
--------
- Corrected error with scales attribute in ORB.detect_and_extract (#2835).
  The scales attribute wasn't taking into account the mask, and thus was
  using an incorrect array size.
- Correct for bias in the inverse Radon transform
  (`skimage.transform.iradon`) (#3067). Fixed by using the ramp filter
  equation in the spatial domain as described in the reference.
- Fix a rounding issue that caused a rotated image to have a different
  size than the input (`skimage.transform.rotate`) (#3173)
- RANSAC uses random subsets of the original data and not bootstraps.
  (#3901, #3915)
- Canny now produces the same output regardless of dtype (#3919)
- Geometry Transforms: avoid division by zero & some degenerate cases
  (#3926)
- Fixed float32 support in denoise_bilateral and denoise_tv_bregman (#3936)
- Fixed computation of Meijering filter and avoid ZeroDivisionError (#3957)
- Fixed `skimage.filters.threshold_li` to prevent being stuck on
  stationary points, and thus at local minima or maxima (#3966)
- Edited `skimage.exposure.rescale_intensity` to return the input image
  instead of NaNs when all values are 0 (#4015)
- Fixed `skimage.morphology.medial_axis`. A wrong indentation in Cython
  caused the function to not behave as intended. (#4060)
- Fixed `skimage.restoration.denoise_bilateral` by correcting the padding
  in the Gaussian filter (#4080)
- Fixed `skimage.measure.find_contours` when the input image contains NaN.
  Contours intersecting NaN will be left open (#4150)
- Fixed `skimage.feature.blob_log` and `skimage.feature.blob_dog` for 3D
  images and anisotropic data (#4162)
- Fixed `skimage.exposure.adjust_gamma`, `skimage.exposure.adjust_log`,
  and `skimage.exposure.adjust_sigmoid` such that when provided with a
  1 by 1 ndarray, they return 1 by 1 ndarrays and not single-number floats
  (#4169)

Deprecations
------------
- Parameter ``neighbors`` in `skimage.measure.convex_hull_object` has been
  deprecated in favor of ``connectivity`` and will be removed in version
  0.18.0.
- The following functions are deprecated in favor of the
  `skimage.metrics` module (#4025):
  - `skimage.measure.compare_mse`
  - `skimage.measure.compare_nrmse`
  - `skimage.measure.compare_psnr`
  - `skimage.measure.compare_ssim`
- The function `skimage.color.guess_spatial_dimensions` is deprecated and
  will be removed in 0.18 (#4031)
- The argument ``bc`` in `skimage.segmentation.active_contour` is
  deprecated.
- The function `skimage.data.load` is deprecated and will be removed in
  0.18 (#4061)
- The function `skimage.transform.match_histograms` is deprecated in favor
  of `skimage.exposure.match_histograms` (#4107)
- The parameter ``neighbors`` of `skimage.morphology.convex_hull_object`
  is deprecated.
- The `skimage.transform.radon` function will convert input images of
  integer type to float by default in 0.18. To preserve the current
  behaviour, set the new argument ``preserve_range`` to True. (#4131)

Documentation improvements
--------------------------
- DOC: Improve the documentation of transform.resize with respect to the
  anti_aliasing_sigma parameter (#3911)
- Fix URL for stain deconvolution reference (#3862)
- Fix doc for denoise gaussian (#3869)
- DOC: various enhancements (cross links, gallery, ref...), mainly for
  corner detection (#3996)
- [DOC] clarify that the inertia_tensor may be nD in documentation (#4013)
- [DOC] How to test and write benchmarks (#4016)
- Spellcheck @CONTRIBUTING.txt (#4008)
- Spellcheck @doc/examples/segmentation/plot_watershed.py (#4009)
- Spellcheck @doc/examples/segmentation/plot_thresholding.py (#4010)
- Spellcheck @skimage/morphology/binary.py (#4011)
- Spellcheck @skimage/morphology/extrema.py (#4012)
- docs update for downscale_local_mean and N-dimensional images (#4079)
- Remove fancy language from 0.15 release notes (#3827)
- Documentation formatting / compilation fixes (#3838)
- Remove duplicated section in INSTALL.txt. (#3876)
- ENH: doc of ridge functions (#3933)
- Fix docstring for Threshold Niblack (#3917)
- adding docs to circle_perimeter_aa (#4155)
- Update link to NumPy docstring standard in Contribution Guide (replaces
  #4191) (#4192)
- DOC: Improve downscale_local_mean() docstring (#4180)
- DOC: enhance the result display in ransac gallery example (#4109)
- Gallery: use fstrings for better readability (#4110)
- MNT: Document stacklevel parameter in contribution guide (#4066)
- Fix minor typo (#3988)
- MIN: docstring improvements in canny functions (#3920)
- Minor docstring fixes for #4150 (#4184)
- Fix `full` parameter description in compare_ssim (#3860)
- State Bradley threshold equivalence in Niblack docstring (#3891)
- Add plt.show() to example code for consistency. (#3908)
- CC0 is not equivalent to public domain. Fix the note of the horse image
  (#3931)
- Update the joblib link in tutorial_parallelization.rst (#3943)
- Fix plot_edge_filter.py references (#3946)
- Add missing argument to docstring of PaintTool (#3970)
- Improving documentation and tests for directional filters (#3956)
- Added new thorough examples on the inner workings of
  ``skimage.filters.threshold_li`` (#3966)
- matplotlib: remove interpolation=nearest, none in our examples (#4002)
- fix URL encoding for wikipedia references in filters.rank.entropy and
  filters.rank.shannon_entropy docstring (#4007)
- Fixup integer division in examples (#4032)
- Update the links in the installation guide (#4118)
- Gallery hough line transform (#4124)
- Cross-linking between function documentation should now be much
  improved! (#4188)
- Better documentation of the ``num_peaks`` of
  `skimage.feature.corner_peaks` (#4195)

Other Pull Requests
-------------------
- Add benchmark suite for exposure module (#3312)
- Remove precision and sign loss warnings from ``skimage.util.img_as_*``
  (#3575)
- Propose SKIPs and add mission/vision/values, governance (#3585)
- Use user-installed tifffile if available (#3650)
- Simplify benchmarks pinnings (#3711)
- Add project_urls to setup for PyPI and other services (#3834)
- Address deprecations for 0.16 release (#3841)
- Followup deprecations for 0.16 (#3851)
- Build and test the docs in Azure (#3873)
- Pin numpydoc to pre-0.8 to fix dev docs formatting (#3893)
- Change all HTTP links to HTTPS (#3896)
- Skip extra deps on OSX (#3898)
- Add location for Sphinx 2.0.1 search results; clean up templates (#3899)
- Fix CSS styling of Sphinx 2.0.1 + numpydoc 0.9 rendered docs (#3900)
- Travis CI: The sudo: tag is deprecated in Travis (#4164)
- MNT Preparing the 0.16 release (#4204)
- FIX generate_release_note when contributor_set contains None (#4205)
- Specify that travis should use Ubuntu xenial (16.04) not trusty (14.04)
  (#4082)
- MNT: set stack level accordingly in lab2xyz (#4067)
- MNT: fixup stack level
for filters ridges (#4068) - MNT: remove unused import `deprecated` from filters.thresholding (#4069) - MNT: Set stacklevel correctly in io matplotlib plugin (#4070) - MNT: set stacklevel accordingly in felzenszwalb_cython (#4071) - MNT: Set stacklevel accordingly in img_as_* (convert) (#4072) - MNT: set stacklevel accordingly in util.shape (#4073) - MNT: remove extreneous matplotlib warning (#4074) - Suppress warnings in tests for viewer (#4017) - Suppress warnings in test suite regarding measure.label (#4018) - Suppress warnings in test_rank due to type conversion (#4019) - Add todo item for imread plugin testing (#3907) - Remove matplotlib agg warning when using the sphinx gallery. (#3897) - Forward-port release notes for 0.14.4 (#4137) - Add tests for pathological arrays in threshold_li (#4143) - setup.py: Fail gracefully when NumPy is not installed (#4181) - Drop Python 3.5 support (#4102) - Force imageio reader to return NumPy arrays (#3837) - Fixing connecting to GitHub with SSH info. (#3875) - Small fix to an error message of `skimage.measure.regionprops` (#3884) - Unify skeletonize and skeletonize 3D APIs (#3904) - Add location for Sphinx 2.0.1 search results; clean up templates (#3910) - Pin numpy version forward (#3925) - Replacing pyfits with Astropy to read FITS (#3930) - Add warning for future dtype kwarg removal (#3932) - MAINT: cleanup regionprop add PYTHONOPTIMIZE=2 to travis array (#3934) - Adding complexity and new tests for filters.threshold_multiotsu (#3935) - Fixup dtype kwarg warning in certain image plugins (#3948) - don't cast integer to float before using it as integer in numpy logspace (#3949) - avoid low contrast image save in a doctest. 
(#3953) - MAINT: Remove unused _convert_input from filters._gaussian (#4001) - Set minimum version for imread so that it compiles from source on linux in test builds (#3960) - Cleanup plugin utilization in data.load and testsuite (#3961) - Select minimum imageio such that it is compatible with pathlib (#3969) - Remove pytest-faulthandler from test dependencies (#3987) - Fix tifffile and __array_function__ failures in our CI (#3992) - MAINT: Do not use assert in code, raise an exception instead. (#4006) - Enable packagers to disable failures on warnings. (#4021) - Fix numpy 117 rc and dask in thresholding filters (#4022) - silence r,c warnings when property does not depend on r,c (#4027) - remove warning filter, fix doc wrt r,c (#4028) - Import Iterable from collections.abc (#4033) - Import Iterable from collections.abc in vendored tifffile code (#4034) - Correction of typos after #4025 (#4036) - Rename internal function called assert_* -> check_* (#4037) - Improve import time (#4039) - Remove .meeseeksdev.yml (#4045) - Fix mpl deprecation on grid() (#4049) - Fix gallery after deprecation from #4025 (#4050) - fix mpl future deprecation normed -> density (#4053) - Add shape= to circle perimeter in hough_circle example (#4047) - Critical: address internal warnings in test suite related to metrics 4025 (#4063) - Use functools instead of a real function for the internal warn function (#4062) - Test rank capture warnings in threadsafe manner (#4064) - Make use of FFTs more consistent across the library (#4084) - Fixup region props test (#4099) - Turn single backquotes to double backquotes in filters (#4127) - Refactor radon transform module (#4136) - Fix broken import of rgb2gray in benchmark suite (#4176) - Fix doc building issues with SKIPs (#4182) - Remove several __future__ imports (#4198) - Restore deprecated coordinates arg to regionprops (#4144) - Refactor/optimize threshold_multiotsu (#4167) - Remove Python2-specific code (#4170) - `view_as_windows` incorrectly 
assumes that a contiguous array is needed (#4171) - Handle case in which NamedTemporaryFile fails (#4172) - Fix incorrect resolution date on SKIP1 (#4183) - API updates before 0.16 (#4187) - Fix conversion to float32 dtype (#4193) Contributors to this release ---------------------------- - Abhishek Arya - Alexandre de Siqueira - Alexis Mignon - Anthony Carapetis - Bastian Eichenberger - Bharat Raghunathan - Christian Clauss - Clement Ng - David Breuer - David Haberth?r - Dominik Kutra - Dominik Straub - Egor Panfilov - Emmanuelle Gouillart - Etienne Landur? - Fran?ois Boulogne - Genevieve Buckley - Gregory R. Lee - Hadrien Mary - Hamdi Sahloul - Holly Gibbs - Huang-Wei Chang - i3v (i3v) - Jarrod Millman - Jirka Borovec - Johan Jeppsson - Johannes Sch?nberger - Jon Crall - Josh Warner - Juan Nunez-Iglesias - Kaligule (Kaligule) - kczimm (kczimm) - Lars Grueter - Shachar Ben Harim - Luis F. de Figueiredo - Mark Harfouche - Mars Huang - Dave Mellert - Nelle Varoquaux - Ollin Boer Bohan - Patrick J Zager - Riadh Fezzani - Ryan Avery - Srinath Kailasa - Stefan van der Walt - Stuart Berg - Uwe Schmidt Reviewers for this release -------------------------- - Alexandre de Siqueira - Anthony Carapetis - Bastian Eichenberger - Clement Ng - David Breuer - Egor Panfilov - Emmanuelle Gouillart - Etienne Landur? - Fran?ois Boulogne - Genevieve Buckley - Gregory R. Lee - Hadrien Mary - Hamdi Sahloul - Holly Gibbs - Jarrod Millman - Jirka Borovec - Johan Jeppsson - Johannes Sch?nberger - Jon Crall - Josh Warner - jrmarsha - Juan Nunez-Iglesias - kczimm - Lars Grueter - leGIT-bot - Mark Harfouche - Mars Huang - Dave Mellert - Paul M?ller - Phil Starkey - Ralf Gommers - Riadh Fezzani - Ryan Avery - Sebastian Berg - Stefan van der Walt - Uwe Schmidt -------------- next part -------------- An HTML attachment was scrubbed... 
URL:

From glennmschultz at me.com Mon Oct 14 13:55:02 2019
From: glennmschultz at me.com (Glenn Schultz)
Date: Mon, 14 Oct 2019 17:55:02 -0000
Subject: [scikit-learn] using numpy repeat
Message-ID: <119372af-8543-4f9c-ae56-4c6552e04f1c@me.com>

I am trying to repeat an array 3 times using the following:

    numpy.repeat(numpy.linspace(-.5, 3, 8), 3, axis=0)

although this repeats each element 3 times sequentially. I am trying to repeat the whole array:
-.5 0 .5 ... -.5 0 .5

Any suggestions to accomplish this are appreciated. I am relatively sure I can do this with numpy, I just can't put my finger on how this is done by reviewing the numpy docs.

Thanks,
Glenn

-------------- next part --------------
An HTML attachment was scrubbed...
URL:

From niourf at gmail.com Mon Oct 14 13:56:41 2019
From: niourf at gmail.com (Nicolas Hug)
Date: Mon, 14 Oct 2019 13:56:41 -0400
Subject: [scikit-learn] using numpy repeat
In-Reply-To: <119372af-8543-4f9c-ae56-4c6552e04f1c@me.com>
References: <119372af-8543-4f9c-ae56-4c6552e04f1c@me.com>
Message-ID: <517bd570-a5d2-513e-98b4-1e149cd6145d@gmail.com>

You're looking for np.tile. It's one of the first google results and it's also linked in the doc of np.repeat.

This mailing-list is for questions related to scikit-learn. I think your question would be more appropriate for e.g. stack-overflow.

On 10/14/19 1:55 PM, Glenn Schultz via scikit-learn wrote:
> I am trying to repeat an array 3 times using the following
>
> numpy.repeat(numpy.linspace(-.5, 3, 8), 3, axis=0)
>
> although this repeats each element 3 times sequentially. I am trying to
> repeat the array
> -.5 0 .5 ... -.5 0 .5
>
> any suggestions to accomplish this are appreciated. I am relatively
> sure I can do this with numpy I just can't put my finger on how this
> is done by reviewing the numpy docs.
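To make the distinction concrete, here is a quick sketch (illustrative, not part of the original thread) contrasting the two calls:

```python
import numpy as np

a = np.linspace(-.5, 3, 8)  # array([-0.5, 0., 0.5, 1., 1.5, 2., 2.5, 3.])

# np.repeat duplicates each element in place:
# -0.5, -0.5, -0.5, 0.0, 0.0, 0.0, ...
elementwise = np.repeat(a, 3)

# np.tile concatenates copies of the whole array, which is what was asked for:
# -0.5, 0.0, 0.5, ..., 3.0, -0.5, 0.0, 0.5, ..., 3.0
whole_array = np.tile(a, 3)

print(elementwise[:6])
print(whole_array[:10])
```

Both results have 24 elements; the difference is only the order in which the copies are interleaved.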
>
> Thanks,
> Glenn
>
> _______________________________________________
> scikit-learn mailing list
> scikit-learn at python.org
> https://mail.python.org/mailman/listinfo/scikit-learn

-------------- next part --------------
An HTML attachment was scrubbed...
URL:

From daniele at grinta.net Mon Oct 14 19:12:42 2019
From: daniele at grinta.net (Daniele Nicolodi)
Date: Mon, 14 Oct 2019 17:12:42 -0600
Subject: [scikit-learn] Text categorization with prediction likelihood
Message-ID: <6245523b-9f22-c4b4-5ed5-eca561713e57@grinta.net>

Hello,

I don't have any formal education on predictive models, thus I hope my questions are not too naive and that the terminology I use is correct enough to make me understood.

I'm trying to implement simple text categorization of phrases of a few words (the specific application is categorization of bank transactions from payee names). Following the documentation I easily implemented a solution based on the TF-IDF vectorizer and C-Support Vector Machine classification. However, for some input phrases the classification prediction does not work that well. I have a couple of (probably very basic) questions:

- are my choices of algorithms the best to target this problem? Is there something else I can try to experiment with to see if I can get better results?

- is there a way to obtain the prediction likelihood such that I could mark "bad" predictions for further inspection? I haven't found an (easy) way to do that in the documentation.

Thank you in advance.

Cheers,
Dan

From gael.varoquaux at normalesup.org Wed Oct 16 10:02:40 2019
From: gael.varoquaux at normalesup.org (Gael Varoquaux)
Date: Wed, 16 Oct 2019 16:02:40 +0200
Subject: [scikit-learn] scikit-learn Digest, Vol 43, Issue 25
In-Reply-To:
References:
Message-ID: <20191016140240.3c227ljcvdplzvmb@phare.normalesup.org>

On Sun, Oct 13, 2019 at 07:40:11PM +0900, Brown J.B.
via scikit-learn wrote: > Please, respect and refinement when addressing the contributors and users of > scikit-learn. I believe that Mike simply misread. It's something that happens (it happens a lot to me). No harm on my side, and thanks for clarifying my overly short reply. G > Gael's statement is perfect -- complexity does not imply better prediction. > The choice of estimator (and algorithm) depends on the structure of the model > desired for the data presented. > Estimator superiority cannot be proven in a context- and/or data-agnostic > fashion. > J.B. > 2019?10?13?(?) 6:13 Mike Smith : > "Second complexity does not > > imply better prediction.?"? > Complexity doesn't imply prediction? Perhaps you're having a translation > error. > On Sat, Oct 12, 2019 at 2:04 PM wrote: > Send scikit-learn mailing list submissions to > ? ? ? ? scikit-learn at python.org > To subscribe or unsubscribe via the World Wide Web, visit > ? ? ? ? https://mail.python.org/mailman/listinfo/scikit-learn > or, via email, send a message with subject or body 'help' to > ? ? ? ? scikit-learn-request at python.org > You can reach the person managing the list at > ? ? ? ? scikit-learn-owner at python.org > When replying, please edit your Subject line so it is more specific > than "Re: Contents of scikit-learn digest..." > Today's Topics: > ? ?1. Re: scikit-learn Digest, Vol 43, Issue 24 (Mike Smith) > ---------------------------------------------------------------------- > Message: 1 > Date: Sat, 12 Oct 2019 14:04:12 -0700 > From: Mike Smith > To: scikit-learn at python.org > Subject: Re: [scikit-learn] scikit-learn Digest, Vol 43, Issue 24 > Message-ID: > ? ? ? ? 4LRy2NJvjwvVr4RgobQ at mail.gmail.com> > Content-Type: text/plain; charset="utf-8" > "...? > If I should expect good results on a pc, scikit says that > needing > gpu power is > > obsolete, since certain scikit models perform better (than ml > designed > for gpu) > > that are not designed for gpu, for that reason. Is this true?" 
> Where do you see this written? I think that you are looking for overly > simple stories that you are not true." > Gael, see the below from the scikit-learn FAQ. You can also find this > yourself at the main FAQ: > [image: 2019-10-12 14_00_05-Frequently Asked Questions ? scikit-learn > 0.21.3 documentation.png] > On Sat, Oct 12, 2019 at 9:03 AM > wrote: > > Send scikit-learn mailing list submissions to > >? ? ? ? ?scikit-learn at python.org > > To subscribe or unsubscribe via the World Wide Web, visit > >? ? ? ? ?https://mail.python.org/mailman/listinfo/scikit-learn > > or, via email, send a message with subject or body 'help' to > >? ? ? ? ?scikit-learn-request at python.org > > You can reach the person managing the list at > >? ? ? ? ?scikit-learn-owner at python.org > > When replying, please edit your Subject line so it is more specific > > than "Re: Contents of scikit-learn digest..." > > Today's Topics: > >? ? 1. Re: Is scikit-learn implying neural nets are the best > >? ? ? ?regressor? (Gael Varoquaux) > ---------------------------------------------------------------------- > > Message: 1 > > Date: Fri, 11 Oct 2019 13:34:33 -0400 > > From: Gael Varoquaux > > To: Scikit-learn mailing list > > Subject: Re: [scikit-learn] Is scikit-learn implying neural nets are > >? ? ? ? ?the best regressor? > > Message-ID: <20191011173433.bbywiqnwjjpvsi4r at phare.normalesup.org> > > Content-Type: text/plain; charset=iso-8859-1 > > On Fri, Oct 11, 2019 at 10:10:32AM -0700, Mike Smith wrote: > > > In other words, according to that arrangement, is scikit-learn > implying > > that > > > section 1.17 is the best regressor out of the listed, 1.1 to 1.17? > > No. > > First they are not ordered in order of complexity (Naive Bayes is > > arguably simpler than Gaussian Processes). Second complexity does not > > imply better prediction. 
> > > If I should expect good results on a pc, scikit says that needing > gpu > > power is > > > obsolete, since certain scikit models perform better (than ml > designed > > for gpu) > > > that are not designed for gpu, for that reason. Is this true? > > Where do you see this written? I think that you are looking for > overly > > simple stories that you are not true. > > > How much hardware is a practical expectation for running the best > > > scikit models and getting the best results? > > This is too vague a question for which there is no answer. > > Ga?l > > > On Fri, Oct 11, 2019 at 9:02 AM > wrote: > > >? ? ?Send scikit-learn mailing list submissions to > > >? ? ?? ? ? ? scikit-learn at python.org > > >? ? ?To subscribe or unsubscribe via the World Wide Web, visit > > >? ? ?? ? ? ? https://mail.python.org/mailman/listinfo/scikit-learn > > >? ? ?or, via email, send a message with subject or body 'help' to > > >? ? ?? ? ? ? scikit-learn-request at python.org > > >? ? ?You can reach the person managing the list at > > >? ? ?? ? ? ? scikit-learn-owner at python.org > > >? ? ?When replying, please edit your Subject line so it is more > specific > > >? ? ?than "Re: Contents of scikit-learn digest..." > > >? ? ?Today's Topics: > > >? ? ?? ?1. Re: logistic regression results are not stable between > > >? ? ?? ? ? solvers (Andreas Mueller) > >? > ---------------------------------------------------------------------- > > >? ? ?Message: 1 > > >? ? ?Date: Fri, 11 Oct 2019 15:42:58 +0200 > > >? ? ?From: Andreas Mueller > > >? ? ?To: scikit-learn at python.org > > >? ? ?Subject: Re: [scikit-learn] logistic regression results are not > > stable > > >? ? ?? ? ? ? between solvers > > >? ? ?Message-ID: > > >? ? ?Content-Type: text/plain; charset="utf-8"; Format="flowed" > > >? ? ?On 10/10/19 1:14 PM, Beno?t Presles wrote: > > >? ? ?> Thanks for your answers. > > >? ? ?> On my real data, I do not have so many samples. I have a bit > more > > than > > >? ? 
?> 200 samples in total and I also would like to get some > results with > > >? ? ?> unpenalized logisitic regression. > > >? ? ?> What do you suggest? Should I switch to the lbfgs solver? > > >? ? ?Yes. > > >? ? ?> Am I sure that with this solver I will not have any > convergence > > issue > > >? ? ?> and always get the good result? Indeed, I did not get any > > convergence > > >? ? ?> warning with saga, so I thought everything was fine. I > noticed some > > >? ? ?> issues only when I decided to test several solvers. Without > > comparing > > >? ? ?> the results across solvers, how to be sure that the > optimisation > > goes > > >? ? ?> well? Shouldn't scikit-learn warn the user somehow if it is > not > > the case? > > >? ? ?We should attempt to warn in the SAGA solver if it doesn't > converge. > > >? ? ?That it doesn't raise a convergence warning should probably be > > >? ? ?considered a bug. > > >? ? ?It uses the maximum weight change as a stopping criterion right > now. > > >? ? ?We could probably compute the dual objective once in the end to > see > > if > > >? ? ?we converged, right? Or is that not possible with SAGA? If not, > we > > might > > >? ? ?want to caution that no convergence warning will be raised. > > >? ? ?> At last, I was using saga because I also wanted to do some > feature > > >? ? ?> selection by using l1 penalty which is not supported by > lbfgs... > > >? ? ?You can use liblinear then. > > >? ? ?> Best regards, > > >? ? ?> Ben > > >? ? ?> Le 09/10/2019 ? 23:39, Guillaume Lema?tre a ?crit?: > > >? ? ?>> Ups I did not see the answer of Roman. Sorry about that. It > is > > coming > > >? ? ?>> back to the same conclusion :) > > >? ? ?>> On Wed, 9 Oct 2019 at 23:37, Guillaume Lema?tre > > >? ? ?>> > > wrote: > > >? ? ?>>? ? ?Uhm actually increasing to 10000 samples solve the > convergence > > >? ? ?issue. > > >? ? ?>>? ? ?SAGA is not designed to work with a so small sample size > most > > >? ? ?>>? ? ?probably. > > >? ? ?>>? ? 
?On Wed, 9 Oct 2019 at 23:36, Guillaume Lema?tre > > >? ? ?>>? ? ?> > > wrote: > > >? ? ?>>? ? ? ? ?I slightly change the bench such that it uses > pipeline and > > >? ? ?>>? ? ? ? ?plotted the coefficient: > > >? ? ?>>? ? ? ? ?https://gist.github.com/glemaitre/ > > >? ? ?8fcc24bdfc7dc38ca0c09c56e26b9386 > > >? ? ?>>? ? ? ? ?I only see one of the 10 splits where SAGA is not > > converging, > > >? ? ?>>? ? ? ? ?otherwise the coefficients > > >? ? ?>>? ? ? ? ?look very close (I don't attach the figure here but > they > > can > > >? ? ?>>? ? ? ? ?be plotted using the snippet). > > >? ? ?>>? ? ? ? ?So apart from this second split, the other > differences > > seems > > >? ? ?>>? ? ? ? ?to be numerical instability. > > >? ? ?>>? ? ? ? ?Where I have some concern is regarding the > convergence > > rate > > >? ? ?>>? ? ? ? ?of SAGA but I have no > > >? ? ?>>? ? ? ? ?intuition to know if this is normal or not. > > >? ? ?>>? ? ? ? ?On Wed, 9 Oct 2019 at 23:22, Roman Yurchak > > >? ? ?>>? ? ? ? ? > wrote: > > >? ? ?>>? ? ? ? ? ? ?Ben, > > >? ? ?>>? ? ? ? ? ? ?I can confirm your results with penalty='none' > and > > C=1e9. > > >? ? ?>>? ? ? ? ? ? ?In both cases, > > >? ? ?>>? ? ? ? ? ? ?you are running a mostly unpenalized logisitic > > >? ? ?>>? ? ? ? ? ? ?regression. Usually > > >? ? ?>>? ? ? ? ? ? ?that's less numerically stable than with a small > > >? ? ?>>? ? ? ? ? ? ?regularization, > > >? ? ?>>? ? ? ? ? ? ?depending on the data collinearity. > > >? ? ?>>? ? ? ? ? ? ?Running that same code with > > >? ? ?>>? ? ? ? ? ? ?? - larger penalty ( smaller C values) > > >? ? ?>>? ? ? ? ? ? ?? - or larger number of samples > > >? ? ?>>? ? ? ? ? ? ?? yields for me the same coefficients (up to > some > > >? ? ?tolerance). > > >? ? ?>>? ? ? ? ? ? ?You can also see that SAGA convergence is not > good by > > the > > >? ? ?>>? ? ? ? ? ? ?fact that it > > >? ? ?>>? ? ? ? ? ? ?needs 196000 epochs/iterations to converge. > > >? ? ?>>? ? ? ? ? ? 
?Actually, I have often seen convergence issues > with > > SAG > > >? ? ?>>? ? ? ? ? ? ?on small > > >? ? ?>>? ? ? ? ? ? ?datasets (in unit tests), not fully sure why. > > >? ? ?>>? ? ? ? ? ? ?-- > > >? ? ?>>? ? ? ? ? ? ?Roman > > >? ? ?>>? ? ? ? ? ? ?On 09/10/2019 22:10, serafim loukas wrote: > > >? ? ?>>? ? ? ? ? ? ?> The predictions across solver are exactly the > same > > when > > >? ? ?>>? ? ? ? ? ? ?I run the code. > > >? ? ?>>? ? ? ? ? ? ?> I am using 0.21.3 version. What is yours? > > >? ? ?>>? ? ? ? ? ? ?> > > >? ? ?>>? ? ? ? ? ? ?> > > >? ? ?>>? ? ? ? ? ? ?> In [13]: import sklearn > > >? ? ?>>? ? ? ? ? ? ?> > > >? ? ?>>? ? ? ? ? ? ?> In [14]: sklearn.__version__ > > >? ? ?>>? ? ? ? ? ? ?> Out[14]: '0.21.3' > > >? ? ?>>? ? ? ? ? ? ?> > > >? ? ?>>? ? ? ? ? ? ?> > > >? ? ?>>? ? ? ? ? ? ?> Serafeim > > >? ? ?>>? ? ? ? ? ? ?> > > >? ? ?>>? ? ? ? ? ? ?> > > >? ? ?>>? ? ? ? ? ? ?> > > >? ? ?>>? ? ? ? ? ? ?>> On 9 Oct 2019, at 21:44, Beno?t Presles > > >? ? ?>>? ? ? ? ? ? ? > >? ? ?>>? ? ? ? ? ? ? > > >? ? ?>>? ? ? ? ? ? ?>> > >? ? ?>>? ? ? ? ? ? ?>> wrote: > > >? ? ?>>? ? ? ? ? ? ?>> > > >? ? ?>>? ? ? ? ? ? ?>> (y_pred_lbfgs==y_pred_saga).all() == False > > >? ? ?>>? ? ? ? ? ? ?> > > >? ? ?>>? ? ? ? ? ? ?> > > >? ? ?>>? ? ? ? ? ? ?> > _______________________________________________ > > >? ? ?>>? ? ? ? ? ? ?> scikit-learn mailing list > > >? ? ?>>? ? ? ? ? ? ?> scikit-learn at python.org > scikit-learn at python.org> > > >? ? ?>>? ? ? ? ? ? ?> > > https://mail.python.org/mailman/listinfo/scikit-learn > > >? ? ?>>? ? ? ? ? ? ?> > > >? ? ?>>? ? ? ? ? ? ?_______________________________________________ > > >? ? ?>>? ? ? ? ? ? ?scikit-learn mailing list > > >? ? ?>>? ? ? ? ? ? ?scikit-learn at python.org > scikit-learn at python.org> > > >? ? ?>>? ? ? ? ? ? ?https://mail.python.org/mailman/listinfo/ > scikit-learn > > >? ? ?>>? ? ? ? ?-- > > >? ? ?>>? ? ? ? ?Guillaume Lemaitre > > >? ? ?>>? ? ? ? ?Scikit-learn @ Inria Foundation > > >? ? ?>>? ? ? ? 
?https://glemaitre.github.io/ > > >? ? ?>>? ? ?-- > > >? ? ?>>? ? ?Guillaume Lemaitre > > >? ? ?>>? ? ?Scikit-learn @ Inria Foundation > > >? ? ?>>? ? ?https://glemaitre.github.io/ > > >? ? ?>> -- > > >? ? ?>> Guillaume Lemaitre > > >? ? ?>> Scikit-learn @ Inria Foundation > > >? ? ?>> https://glemaitre.github.io/ > > >? ? ?>> _______________________________________________ > > >? ? ?>> scikit-learn mailing list > > >? ? ?>> scikit-learn at python.org > > >? ? ?>> https://mail.python.org/mailman/listinfo/scikit-learn > > >? ? ?> _______________________________________________ > > >? ? ?> scikit-learn mailing list > > >? ? ?> scikit-learn at python.org > > >? ? ?> https://mail.python.org/mailman/listinfo/scikit-learn > > >? ? ?-------------- next part -------------- > > >? ? ?An HTML attachment was scrubbed... > > >? ? ?URL: < > > http://mail.python.org/pipermail/scikit-learn/attachments/20191011/ > > >? ? ?a7052cd9/attachment-0001.html> > > >? ? ?------------------------------ > > >? ? ?Subject: Digest Footer > > >? ? ?_______________________________________________ > > >? ? ?scikit-learn mailing list > > >? ? ?scikit-learn at python.org > > >? ? ?https://mail.python.org/mailman/listinfo/scikit-learn > > >? ? ?------------------------------ > > >? ? ?End of scikit-learn Digest, Vol 43, Issue 21 > > >? ? ?******************************************** > > > _______________________________________________ > > > scikit-learn mailing list > > > scikit-learn at python.org > > > https://mail.python.org/mailman/listinfo/scikit-learn > > -- > >? ? ?Gael Varoquaux > >? ? ?Research Director, INRIA? ? ? ? ? ? ? Visiting professor, McGill > >? ? ?http://gael-varoquaux.info? ? ? ? ? ? 
http://twitter.com/ > GaelVaroquaux > > ------------------------------ > > Subject: Digest Footer > > _______________________________________________ > > scikit-learn mailing list > > scikit-learn at python.org > > https://mail.python.org/mailman/listinfo/scikit-learn > > ------------------------------ > > End of scikit-learn Digest, Vol 43, Issue 24 > > ******************************************** > -------------- next part -------------- > An HTML attachment was scrubbed... > URL: 20191012/6959d075/attachment.html> > -------------- next part -------------- > A non-text attachment was scrubbed... > Name: 2019-10-12 14_00_05-Frequently Asked Questions ? scikit-learn > 0.21.3 documentation.png > Type: image/png > Size: 26245 bytes > Desc: not available > URL: 20191012/6959d075/attachment.png> > ------------------------------ > Subject: Digest Footer > _______________________________________________ > scikit-learn mailing list > scikit-learn at python.org > https://mail.python.org/mailman/listinfo/scikit-learn > ------------------------------ > End of scikit-learn Digest, Vol 43, Issue 25 > ******************************************** > _______________________________________________ > scikit-learn mailing list > scikit-learn at python.org > https://mail.python.org/mailman/listinfo/scikit-learn > _______________________________________________ > scikit-learn mailing list > scikit-learn at python.org > https://mail.python.org/mailman/listinfo/scikit-learn -- Gael Varoquaux Research Director, INRIA Visiting professor, McGill http://gael-varoquaux.info http://twitter.com/GaelVaroquaux From jandre at lnec.pt Wed Oct 16 10:16:50 2019 From: jandre at lnec.pt (=?UTF-8?B?Sm/Do28gQW5kcsOp?=) Date: Wed, 16 Oct 2019 15:16:50 +0100 Subject: [scikit-learn] scikit-learn Digest, Vol 43, Issue 25 In-Reply-To: <20191016140240.3c227ljcvdplzvmb@phare.normalesup.org> References: <20191016140240.3c227ljcvdplzvmb@phare.normalesup.org> Message-ID: Dear Scikit-learn, This is my first 
message in this community!

I make it because I think "model complexity" and "model prediction" are two separate "properties", which cannot in principle be directly compared. This is because one variable is missing, which is the data.

If the initial data set corresponds to the entire true range of possible data, then I would say complex models will "model" the variable being studied with a prediction accuracy equal to or better than any other "less complex" model. If the data set is not representative, then you might overfit with more complex models and there is a chance that simpler models will predict better for unseen sets of data. Therefore, the quality of the data is critical to judge how good your model will be.

Hope this helps.

João

João André
Civil Engineer, M.Sc., Ph.D.
Structures Department
National Laboratory for Civil Engineering
LNEC, Av. Brasil 101, 1700-066 Lisbon, Portugal
Web: http://www.lnec.pt/
Skype ID: jpcgandre
Phone: (+351) 218 443 355

On Wed, 16 Oct 2019 at 15:05, Gael Varoquaux wrote:

> On Sun, Oct 13, 2019 at 07:40:11PM +0900, Brown J.B. via scikit-learn wrote:
> > Please, respect and refinement when addressing the contributors and
> > users of scikit-learn.
>
> I believe that Mike simply misread. It's something that happens (it
> happens a lot to me).
>
> No harm on my side, and thanks for clarifying my overly short reply.
>
> G
>
> > Gael's statement is perfect -- complexity does not imply better prediction.
> > The choice of estimator (and algorithm) depends on the structure of the
> > model desired for the data presented.
> > Estimator superiority cannot be proven in a context- and/or data-agnostic
> > fashion.
>
> > J.B.
>
> > 2019年10月13日(日) 6:13 Mike Smith :
>
> > > "Second complexity does not
> > > imply better prediction."
>
> > > Complexity doesn't imply prediction? Perhaps you're having a
> > > translation error.
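João's point about unrepresentative data can be demonstrated with a toy, standard-library sketch (illustrative only, not from this thread): on a small sample of pure noise, a 1-nearest-neighbour "model" that memorises the training set achieves a perfect training error, yet generalises worse than the simplest possible model, a constant mean predictor.

```python
import random

random.seed(0)

def make_data(n):
    # y is pure noise: there is no signal for any model to learn
    xs = [random.uniform(0, 1) for _ in range(n)]
    ys = [random.gauss(0, 1) for _ in range(n)]
    return xs, ys

def mse(preds, ys):
    return sum((p - y) ** 2 for p, y in zip(preds, ys)) / len(ys)

def knn_predict(train_x, train_y, xs):
    # "Complex" model: 1-nearest neighbour, memorises the training set
    return [train_y[min(range(len(train_x)), key=lambda i: abs(train_x[i] - x))]
            for x in xs]

def mean_predict(train_y, xs):
    # "Simple" model: always predict the training mean
    m = sum(train_y) / len(train_y)
    return [m] * len(xs)

train_x, train_y = make_data(30)    # small, possibly unrepresentative training set
test_x, test_y = make_data(1000)    # "unseen" data

print("1-NN train MSE:", mse(knn_predict(train_x, train_y, train_x), train_y))
print("1-NN test  MSE:", mse(knn_predict(train_x, train_y, test_x), test_y))
print("mean test  MSE:", mse(mean_predict(train_y, test_x), test_y))
```

The memorising model's training error is exactly zero, yet its test error is roughly twice that of the mean predictor: with independent noise of variance σ², the expected squared error of 1-NN approaches 2σ² while that of the constant mean approaches σ².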
> > > On Sat, Oct 12, 2019 at 2:04 PM > wrote: > > > Send scikit-learn mailing list submissions to > > scikit-learn at python.org > > > To subscribe or unsubscribe via the World Wide Web, visit > > https://mail.python.org/mailman/listinfo/scikit-learn > > or, via email, send a message with subject or body 'help' to > > scikit-learn-request at python.org > > > You can reach the person managing the list at > > scikit-learn-owner at python.org > > > When replying, please edit your Subject line so it is more > specific > > than "Re: Contents of scikit-learn digest..." > > > > Today's Topics: > > > 1. Re: scikit-learn Digest, Vol 43, Issue 24 (Mike Smith) > > > > > ---------------------------------------------------------------------- > > > Message: 1 > > Date: Sat, 12 Oct 2019 14:04:12 -0700 > > From: Mike Smith > > To: scikit-learn at python.org > > Subject: Re: [scikit-learn] scikit-learn Digest, Vol 43, Issue 24 > > Message-ID: > > > 4LRy2NJvjwvVr4RgobQ at mail.gmail.com> > > Content-Type: text/plain; charset="utf-8" > > > "... > If I should expect good results on a pc, scikit says that > > needing > > gpu power is > > > obsolete, since certain scikit models perform better (than ml > > designed > > for gpu) > > > that are not designed for gpu, for that reason. Is this true?" > > > Where do you see this written? I think that you are looking for > overly > > simple stories that you are not true." > > > Gael, see the below from the scikit-learn FAQ. You can also find > this > > yourself at the main FAQ: > > > [image: 2019-10-12 14_00_05-Frequently Asked Questions ? 
> scikit-learn > > 0.21.3 documentation.png] > > > > On Sat, Oct 12, 2019 at 9:03 AM > > > wrote: > > > > Send scikit-learn mailing list submissions to > > > scikit-learn at python.org > > > > To subscribe or unsubscribe via the World Wide Web, visit > > > https://mail.python.org/mailman/listinfo/scikit-learn > > > or, via email, send a message with subject or body 'help' to > > > scikit-learn-request at python.org > > > > You can reach the person managing the list at > > > scikit-learn-owner at python.org > > > > When replying, please edit your Subject line so it is more > specific > > > than "Re: Contents of scikit-learn digest..." > > > > > Today's Topics: > > > > 1. Re: Is scikit-learn implying neural nets are the best > > > regressor? (Gael Varoquaux) > > > > > > ---------------------------------------------------------------------- > > > > Message: 1 > > > Date: Fri, 11 Oct 2019 13:34:33 -0400 > > > From: Gael Varoquaux > > > To: Scikit-learn mailing list > > > Subject: Re: [scikit-learn] Is scikit-learn implying neural > nets are > > > the best regressor? > > > Message-ID: < > 20191011173433.bbywiqnwjjpvsi4r at phare.normalesup.org> > > > Content-Type: text/plain; charset=iso-8859-1 > > > > On Fri, Oct 11, 2019 at 10:10:32AM -0700, Mike Smith wrote: > > > > In other words, according to that arrangement, is > scikit-learn > > implying > > > that > > > > section 1.17 is the best regressor out of the listed, 1.1 to > 1.17? > > > > No. > > > > First they are not ordered in order of complexity (Naive Bayes > is > > > arguably simpler than Gaussian Processes). Second complexity > does not > > > imply better prediction. > > > > > If I should expect good results on a pc, scikit says that > needing > > gpu > > > power is > > > > obsolete, since certain scikit models perform better (than ml > > designed > > > for gpu) > > > > that are not designed for gpu, for that reason. Is this true? > > > > Where do you see this written? 
I think that you are looking for > > overly > > > simple stories that are not true. > > > > > How much hardware is a practical expectation for running the > best > > > > scikit models and getting the best results? > > > > This is too vague a question for which there is no answer. > > > > Gaël > > > > > On Fri, Oct 11, 2019 at 9:02 AM < > scikit-learn-request at python.org> > > wrote: > > > > > Send scikit-learn mailing list submissions to > > > > scikit-learn at python.org > > > > > To subscribe or unsubscribe via the World Wide Web, visit > > > > > https://mail.python.org/mailman/listinfo/scikit-learn > > > > or, via email, send a message with subject or body > 'help' to > > > > scikit-learn-request at python.org > > > > > You can reach the person managing the list at > > > > scikit-learn-owner at python.org > > > > > When replying, please edit your Subject line so it is > more > > specific > > > > than "Re: Contents of scikit-learn digest..." > > > > > > Today's Topics: > > > > > 1. Re: logistic regression results are not stable > between > > > > solvers (Andreas Mueller) > > > > > > > > > ---------------------------------------------------------------------- > > > > > Message: 1 > > > > Date: Fri, 11 Oct 2019 15:42:58 +0200 > > > > From: Andreas Mueller > > > > To: scikit-learn at python.org > > > > Subject: Re: [scikit-learn] logistic regression results > are not > > > stable > > > > between solvers > > > > Message-ID: < > d55949d6-3355-f892-f6b3-030edf1c7947 at gmail.com> > > > > Content-Type: text/plain; charset="utf-8"; > Format="flowed" > > > > > > > On 10/10/19 1:14 PM, Benoît Presles wrote: > > > > > > Thanks for your answers. > > > > > > On my real data, I do not have so many samples. I have > a bit > > more > > > than > > > > > 200 samples in total and I also would like to get some > > results with > > > > > unpenalized logistic regression. > > > > > What do you suggest?
Should I switch to the lbfgs > solver? > > > > Yes. > > > > > Am I sure that with this solver I will not have any > > convergence > > > issue > > > > > and always get the good result? Indeed, I did not get > any > > > convergence > > > > > warning with saga, so I thought everything was fine. I > > noticed some > > > > > issues only when I decided to test several solvers. > Without > > > comparing > > > > > the results across solvers, how to be sure that the > > optimisation > > > goes > > > > > well? Shouldn't scikit-learn warn the user somehow if > it is > > not > > > the case? > > > > We should attempt to warn in the SAGA solver if it > doesn't > > converge. > > > > That it doesn't raise a convergence warning should > probably be > > > > considered a bug. > > > > It uses the maximum weight change as a stopping > criterion right > > now. > > > > We could probably compute the dual objective once in the > end to > > see > > > if > > > > we converged, right? Or is that not possible with SAGA? > If not, > > we > > > might > > > > want to caution that no convergence warning will be > raised. > > > > > > > At last, I was using saga because I also wanted to do > some > > feature > > > > > selection by using l1 penalty which is not supported by > > lbfgs... > > > > You can use liblinear then. > > > > > > > > Best regards, > > > > > Ben > > > > > > > Le 09/10/2019 à 23:39, Guillaume Lemaître a écrit : > > > > >> Ups I did not see the answer of Roman. Sorry about > that. It > > is > > > coming > > > > >> back to the same conclusion :) > > > > > >> On Wed, 9 Oct 2019 at 23:37, Guillaume Lemaître > > > > >> g.lemaitre58 at gmail.com>> > > wrote: > > > > > >> Uhm actually increasing to 10000 samples solves the > > convergence > > > > issue. > > > > >> SAGA is not designed to work with such a small > sample size > > most > > > > >> probably. > > > > > >> On Wed, 9 Oct 2019 at 23:36, Guillaume Lemaître > > > > >>
g.lemaitre58 at gmail.com>> > > > wrote: > > > > > >> I slightly changed the bench such that it uses > pipeline and > > > > >> plotted the coefficients: > > > > > >> https://gist.github.com/glemaitre/ > > > > 8fcc24bdfc7dc38ca0c09c56e26b9386 > > > > > >> I only see one of the 10 splits where SAGA is > not > > > converging, > > > > >> otherwise the coefficients > > > > >> look very close (I don't attach the figure > here but > > they > > > can > > > > >> be plotted using the snippet). > > > > >> So apart from this second split, the other > > differences > > > seem > > > > >> to be numerical instability. > > > > > >> Where I have some concern is regarding the > > convergence > > > rate > > > > >> of SAGA but I have no > > > > >> intuition to know if this is normal or not. > > > > > >> On Wed, 9 Oct 2019 at 23:22, Roman Yurchak > > > > >> rth.yurchak at gmail.com > > > > wrote: > > > > > >> Ben, > > > > > >> I can confirm your results with > penalty='none' > > and > > > C=1e9. > > > > >> In both cases, > > > > >> you are running a mostly unpenalized > logistic > > > > >> regression. Usually > > > > >> that's less numerically stable than with > a small > > > > >> regularization, > > > > >> depending on the data collinearity. > > > > > >> Running that same code with > > > > >> - larger penalty (smaller C values) > > > > >> - or larger number of samples > > > > >> yields for me the same coefficients (up > to > > some > > > > tolerance). > > > > > >> You can also see that SAGA convergence is > not > > good by > > > the > > > > >> fact that it > > > > >> needs 196000 epochs/iterations to > converge. > > > > > >>
Actually, I have often seen convergence > issues > > with > > > SAG > > > > >> on small > > > > >> datasets (in unit tests), not fully sure > why. > > > > > >> -- > > > > >> Roman > > > > > >> On 09/10/2019 22:10, serafim loukas wrote: > > > > >> > The predictions across solvers are > exactly the > > same > > > when > > > > >> I run the code. > > > > >> > I am using 0.21.3 version. What is > yours? > > > > >> > > > > > >> > > > > > >> > In [13]: import sklearn > > > > >> > > > > > >> > In [14]: sklearn.__version__ > > > > >> > Out[14]: '0.21.3' > > > > >> > > > > > >> > > > > > >> > Serafeim > > > > >> > > > > > >> > > > > > >> > > > > > >> >> On 9 Oct 2019, at 21:44, Benoît Presles > > > > >> > > > >> > > > > >> >> > > > >> >> > wrote: > > > > >> >> > > > > >> >> (y_pred_lbfgs==y_pred_saga).all() == > False > > > > >> > > > > > >> > > > > > >> > > > _______________________________________________ > > > > >> > scikit-learn mailing list > > > > >> > scikit-learn at python.org > > scikit-learn at python.org> > > > > >> > > > > https://mail.python.org/mailman/listinfo/scikit-learn > > > > >> > > > > > > >> > _______________________________________________ > > > > >> scikit-learn mailing list > > > > >> scikit-learn at python.org > > scikit-learn at python.org> > > > > >> https://mail.python.org/mailman/listinfo/ > > scikit-learn > > > > > > > >> -- > > > > >> Guillaume Lemaitre > > > > >> Scikit-learn @ Inria Foundation > > > > >> https://glemaitre.github.io/ > > > > > > > >>
?-- > > > > >>? ? ?Guillaume Lemaitre > > > > >>? ? ?Scikit-learn @ Inria Foundation > > > > >>? ? ?https://glemaitre.github.io/ > > > > > > > >> -- > > > > >> Guillaume Lemaitre > > > > >> Scikit-learn @ Inria Foundation > > > > >> https://glemaitre.github.io/ > > > > > >> _______________________________________________ > > > > >> scikit-learn mailing list > > > > >> scikit-learn at python.org > > > > >> https://mail.python.org/mailman/listinfo/scikit-learn > > > > > > _______________________________________________ > > > > > scikit-learn mailing list > > > > > scikit-learn at python.org > > > > > https://mail.python.org/mailman/listinfo/scikit-learn > > > > > -------------- next part -------------- > > > > An HTML attachment was scrubbed... > > > > URL: < > > > > http://mail.python.org/pipermail/scikit-learn/attachments/20191011/ > > > > a7052cd9/attachment-0001.html> > > > > > ------------------------------ > > > > > Subject: Digest Footer > > > > > _______________________________________________ > > > > scikit-learn mailing list > > > > scikit-learn at python.org > > > > https://mail.python.org/mailman/listinfo/scikit-learn > > > > > > ------------------------------ > > > > > End of scikit-learn Digest, Vol 43, Issue 21 > > > > ******************************************** > > > > > > _______________________________________________ > > > > scikit-learn mailing list > > > > scikit-learn at python.org > > > > https://mail.python.org/mailman/listinfo/scikit-learn > > > > > -- > > > Gael Varoquaux > > > Research Director, INRIA Visiting professor, > McGill > > > http://gael-varoquaux.info http://twitter.com/ > > GaelVaroquaux > > > > > ------------------------------ > > > > Subject: Digest Footer > > > > _______________________________________________ > > > scikit-learn mailing list > > > scikit-learn at python.org > > > https://mail.python.org/mailman/listinfo/scikit-learn > > > > > ------------------------------ > > > > End of scikit-learn Digest, Vol 43, Issue 24 
> > > ******************************************** > > > -------------- next part -------------- > > An HTML attachment was scrubbed... > > URL: > 20191012/6959d075/attachment.html> > > -------------- next part -------------- > > A non-text attachment was scrubbed... > > Name: 2019-10-12 14_00_05-Frequently Asked Questions ? > scikit-learn > > 0.21.3 documentation.png > > Type: image/png > > Size: 26245 bytes > > Desc: not available > > URL: > 20191012/6959d075/attachment.png> > > > ------------------------------ > > > Subject: Digest Footer > > > _______________________________________________ > > scikit-learn mailing list > > scikit-learn at python.org > > https://mail.python.org/mailman/listinfo/scikit-learn > > > > ------------------------------ > > > End of scikit-learn Digest, Vol 43, Issue 25 > > ******************************************** > > > _______________________________________________ > > scikit-learn mailing list > > scikit-learn at python.org > > https://mail.python.org/mailman/listinfo/scikit-learn > > > > _______________________________________________ > > scikit-learn mailing list > > scikit-learn at python.org > > https://mail.python.org/mailman/listinfo/scikit-learn > > > -- > Gael Varoquaux > Research Director, INRIA Visiting professor, McGill > http://gael-varoquaux.info http://twitter.com/GaelVaroquaux > _______________________________________________ > scikit-learn mailing list > scikit-learn at python.org > https://mail.python.org/mailman/listinfo/scikit-learn > -------------- next part -------------- An HTML attachment was scrubbed... URL: From kaltenb at stanford.edu Wed Oct 16 19:34:59 2019 From: kaltenb at stanford.edu (Kristen M. Altenburger) Date: Wed, 16 Oct 2019 23:34:59 +0000 Subject: [scikit-learn] Weighted Random Forest vs. 
"class_weight" in RandomForestClassifier Message-ID: Hi All, Posted the same question on StackExchange [link] but also circulating here to see if someone knows :) I am confused whether the "class_weight" parameter in Python's sklearn's Random Forest Classifier (https://scikit-learn.org/stable/modules/generated/sklearn.ensemble.RandomForestClassifier.html) is equivalent to Chen/Breiman's notion of "Weighted Random Forest" described in Section 2.3 (https://statistics.berkeley.edu/sites/default/files/tech-reports/666.pdf). In short, "Weighted Random Forest" will "...assign a weight to each class, with the minority class given larger weight (i.e., higher misclassification cost). The class weights are incorporated into the RF algorithm in two places. In the tree induction procedure, class weights are used to weight the Gini criterion for finding splits. In the terminal nodes of each tree, class weights are again taken into consideration. The class prediction of each terminal node is determined by ?weighted majority vote?; i.e., the weighted vote of a class is the weight for that class times the number of cases for that class at the terminal node. The final class prediction for RF is then determined by aggregatting the weighted vote from each individual tree, where the weights are average weights in the terminal nodes." Question: I can't tell from the Python source code for RandomForestClassifier, is class_weight used to weight the Gini criterion for finding splits? And if not, can anyone recommend code that implements Weighted Random Forest? Thanks! Thanks! Kristen http://kaltenburger.github.io/ -------------- next part -------------- An HTML attachment was scrubbed... URL: From ahmadqassemi at gmail.com Sat Oct 19 14:43:45 2019 From: ahmadqassemi at gmail.com (ahmad qassemi) Date: Sat, 19 Oct 2019 11:43:45 -0700 Subject: [scikit-learn] question Message-ID: Dear Mr/Mrs, I'm a PhD student in DS. 
I'm trying to use your provided code on *Spectral CoClustering *and *Spectral Biclustering* to bi-cluster my data matrix ( https://scikit-learn.org/stable/modules/biclustering.html). Since my data has complex values, i.e., matrix elements are complex, your modules don't work on my data. It seems that the reason is your K-means' code doesn't work with complex numbers. I will really appreciate it if you take some time and tell me how should I apply your codes on my complex data. Thanks a lot in advance. Sincerely, Ahmad Qassemi -------------- next part -------------- An HTML attachment was scrubbed... URL: From vaggi.federico at gmail.com Sat Oct 19 18:48:40 2019 From: vaggi.federico at gmail.com (federico vaggi) Date: Sat, 19 Oct 2019 15:48:40 -0700 Subject: [scikit-learn] question In-Reply-To: References: Message-ID: Your options are to either pick a clustering algorithm that supports a pre-computed distance matrix, or, find some kind of projection from C -> R, embed your data in R, then cluster your embedded data and transfer the labels back to C. On Sat, Oct 19, 2019 at 11:44 AM ahmad qassemi wrote: > Dear Mr/Mrs, > > I'm a PhD student in DS. I'm trying to use your provided code on *Spectral > CoClustering *and *Spectral Biclustering* to bi-cluster my data matrix ( > https://scikit-learn.org/stable/modules/biclustering.html). Since my data > has complex values, i.e., matrix elements are complex, your modules don't > work on my data. It seems that the reason is your K-means' code doesn't > work with complex numbers. I will really appreciate it if you take some > time and tell me how should I apply your codes on my complex data. Thanks a > lot in advance. > > Sincerely, > Ahmad Qassemi > _______________________________________________ > scikit-learn mailing list > scikit-learn at python.org > https://mail.python.org/mailman/listinfo/scikit-learn > -------------- next part -------------- An HTML attachment was scrubbed... 
URL: From fernando.wittmann at gmail.com Sun Oct 20 09:54:58 2019 From: fernando.wittmann at gmail.com (Fernando Marcos Wittmann) Date: Sun, 20 Oct 2019 10:54:58 -0300 Subject: [scikit-learn] question In-Reply-To: References: Message-ID: What about converting into two columns? One with the real projection and the other with the complex projection? On Sat, Oct 19, 2019, 3:44 PM ahmad qassemi wrote: > Dear Mr/Mrs, > > I'm a PhD student in DS. I'm trying to use your provided code on *Spectral > CoClustering *and *Spectral Biclustering* to bi-cluster my data matrix ( > https://scikit-learn.org/stable/modules/biclustering.html). Since my data > has complex values, i.e., matrix elements are complex, your modules don't > work on my data. It seems that the reason is your K-means' code doesn't > work with complex numbers. I will really appreciate it if you take some > time and tell me how should I apply your codes on my complex data. Thanks a > lot in advance. > > Sincerely, > Ahmad Qassemi > _______________________________________________ > scikit-learn mailing list > scikit-learn at python.org > https://mail.python.org/mailman/listinfo/scikit-learn > -------------- next part -------------- An HTML attachment was scrubbed... URL: From seralouk at hotmail.com Sun Oct 20 10:07:37 2019 From: seralouk at hotmail.com (serafim loukas) Date: Sun, 20 Oct 2019 14:07:37 +0000 Subject: [scikit-learn] question In-Reply-To: References: Message-ID: I would take the magnitude. Otherwise you will have to modify the source code to make it work with complex values. Bests, Makis On Oct 20, 2019, at 15:55, Fernando Marcos Wittmann wrote: ? What about converting into two columns? One with the real projection and the other with the complex projection? On Sat, Oct 19, 2019, 3:44 PM ahmad qassemi > wrote: Dear Mr/Mrs, I'm a PhD student in DS. 
I'm trying to use your provided code on Spectral CoClustering and Spectral Biclustering to bi-cluster my data matrix (https://scikit-learn.org/stable/modules/biclustering.html). Since my data has complex values, i.e., matrix elements are complex, your modules don't work on my data. It seems that the reason is your K-means' code doesn't work with complex numbers. I will really appreciate it if you take some time and tell me how should I apply your codes on my complex data. Thanks a lot in advance. Sincerely, Ahmad Qassemi _______________________________________________ scikit-learn mailing list scikit-learn at python.org https://mail.python.org/mailman/listinfo/scikit-learn _______________________________________________ scikit-learn mailing list scikit-learn at python.org https://mail.python.org/mailman/listinfo/scikit-learn -------------- next part -------------- An HTML attachment was scrubbed... URL: From ahmadqassemi at gmail.com Sun Oct 20 11:08:03 2019 From: ahmadqassemi at gmail.com (ahmad qassemi) Date: Sun, 20 Oct 2019 11:08:03 -0400 Subject: [scikit-learn] question In-Reply-To: References: Message-ID:

Thanks a lot guys for your great hints. I've tried using only the magnitude or only the phase, but neither works in my case; I need to consider both simultaneously to get a correct result.

I've also considered converting into two columns (imaginary + real columns). But the problem is that after bi-clustering, imaginary columns and their corresponding real columns can end up in different clusters, and the question arises of how to assign them to the same cluster. In other words, for each complex value the real and imaginary parts would most likely land in different clusters, and it's not easy to bring them back into the same cluster.

What do you think? Is it possible to modify the scikit-learn code to work with complex values? Or ...?
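One way to keep each real/imaginary pair together by construction, sketched below with plain KMeans rather than the spectral co-/bi-clustering algorithms themselves (an assumption; the spectral variants would need their own adaptation): embed the complex matrix into the reals by stacking real and imaginary parts, and when clustering columns, describe each original complex column by the concatenation of its real and imaginary values, so the pair is clustered as a single object and can never be split:

```python
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
X = rng.normal(size=(20, 5)) + 1j * rng.normal(size=(20, 5))  # toy complex data

# Embed C^5 into R^10: each complex feature contributes (real, imag).
X_real = np.hstack([X.real, X.imag])

# Row clustering: every row keeps both parts of every feature, so the
# real/imag pairing cannot be separated across clusters.
row_labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X_real)

# Column clustering: represent each ORIGINAL complex column by the
# concatenation of its real and imaginary parts, then cluster those
# representations -- one label per complex column, pair kept intact.
col_features = np.hstack([X.real.T, X.imag.T])  # shape (5, 40)
col_labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(col_features)

print(row_labels.shape, col_labels.shape)  # (20,) (5,)
```

This sidesteps the reassignment problem entirely: there is only one label per complex column, so nothing needs to be stitched back together afterwards.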
On Sun, 20 Oct 2019 at 10:09, serafim loukas wrote: > I would take the magnitude. > Otherwise you will have to modify the source code to make it work with > complex values. > > Bests, > Makis > > On Oct 20, 2019, at 15:55, Fernando Marcos Wittmann < > fernando.wittmann at gmail.com> wrote: > > > What about converting into two columns? One with the real projection and > the other with the complex projection? > > On Sat, Oct 19, 2019, 3:44 PM ahmad qassemi > wrote: > >> Dear Mr/Mrs, >> >> I'm a PhD student in DS. I'm trying to use your provided code on *Spectral >> CoClustering *and *Spectral Biclustering* to bi-cluster my data matrix ( >> https://scikit-learn.org/stable/modules/biclustering.html). Since my >> data has complex values, i.e., matrix elements are complex, your modules >> don't work on my data. It seems that the reason is your K-means' code >> doesn't work with complex numbers. I will really appreciate it if you take >> some time and tell me how should I apply your codes on my complex data. >> Thanks a lot in advance. >> >> Sincerely, >> Ahmad Qassemi >> _______________________________________________ >> scikit-learn mailing list >> scikit-learn at python.org >> https://mail.python.org/mailman/listinfo/scikit-learn >> > _______________________________________________ > scikit-learn mailing list > scikit-learn at python.org > https://mail.python.org/mailman/listinfo/scikit-learn > > _______________________________________________ > scikit-learn mailing list > scikit-learn at python.org > https://mail.python.org/mailman/listinfo/scikit-learn > -------------- next part -------------- An HTML attachment was scrubbed... URL: From jason at refinerynet.com Mon Oct 21 20:27:38 2019 From: jason at refinerynet.com (Jason Wolosonovich) Date: Mon, 21 Oct 2019 17:27:38 -0700 Subject: [scikit-learn] Sparse Input for HistGradientBoostingClassifier Message-ID: Hi!
I'm getting an error when trying to use the HistGradientBoostingClassifier by feeding it the output from CountVectorizer and then TfidfTransformer. The error is: TypeError: A sparse matrix was passed, but dense data is required. Use X.toarray() to convert to a dense numpy array. I haven't opened an issue yet because I wanted to get more clarification on whether this just isn't implemented yet or if there is some reason inherent to histogram based boosting that prevents sparse inputs from being used. Making the array dense in my case causes me to run out of memory. Thanks in advance! -Jason -------------- next part -------------- An HTML attachment was scrubbed... URL: From thomasjpfan at gmail.com Mon Oct 21 21:50:14 2019 From: thomasjpfan at gmail.com (thomasjpfan at gmail.com) Date: Mon, 21 Oct 2019 21:50:14 -0400 Subject: [scikit-learn] Sparse Input for HistGradientBoostingClassifier In-Reply-To: References: Message-ID: Currently, it is not implemented. Feel free to open an issue regarding sparse support for HistGradientBoosting. Thomas > On Oct 21, 2019, at 9:00 PM, Jason Wolosonovich wrote: > > ? > Hi! > > I'm getting an error when trying to use the HistGradientBoostingClassifier by feeding it the output from CountVectorizer and then TfidfTransformer. The error is: > > TypeError: A sparse matrix was passed, but dense data is required. Use X.toarray() to convert to a dense numpy array. > > I haven't opened an issue yet because I wanted to get more clarification on whether this just isn't implemented yet or if there is some reason inherent to histogram based boosting that prevents sparse inputs from being used. > > Making the array dense in my case causes me to run out of memory. Thanks in advance! > > -Jason > > _______________________________________________ > scikit-learn mailing list > scikit-learn at python.org > https://mail.python.org/mailman/listinfo/scikit-learn -------------- next part -------------- An HTML attachment was scrubbed... 
URL: From geoffrey.bolmier at gmail.com Tue Oct 22 05:32:43 2019 From: geoffrey.bolmier at gmail.com (Geoffrey Bolmier) Date: Tue, 22 Oct 2019 11:32:43 +0200 Subject: [scikit-learn] Decision tree results sometimes different with scaled data Message-ID: <04CA54E6-6894-4104-BB66-9A9FE89EED7F@getmailspring.com>

Hi all,

First, let me thank you for the great job you guys are doing developing and maintaining such a popular library!

As we all know, decision trees are not affected by scaled data, because splits don't take into account distances between two values within a feature.

However, I experienced a strange behavior using sklearn's decision tree algorithm: sometimes the results of the model differ depending on whether the input data has been scaled or not.

To illustrate my point I ran experiments on the iris dataset consisting of:

- perform a train/test split
- fit the training set and predict the test set
- fit and predict again with standardized inputs (removing the mean and scaling to unit variance)
- compare both models' predictions

Experiments have been run 10,000 times with different random seeds (cf. traceback and code to reproduce it at the end). Results showed that a bit more than 10% of the time there is at least one different prediction. Fortunately, when that is the case, only a few predictions differ, 1 or 2 most of the time. I checked the inputs causing different predictions and they are not the same from run to run.

I'm worried the rate of different predictions could be larger for other datasets... Do you have an idea where this comes from, maybe floating point errors, or am I doing something wrong?
Cheers,
Geoffrey

------------------------------------------------------------
Traceback:
------------------------------------------------------------
Error rate: 12.22%

Seed: 241862
All pred equal: False
Not scale data confusion matrix:
[[16 0 0]
 [ 0 17 0]
 [ 0 4 13]]
Scale data confusion matrix:
[[16 0 0]
 [ 0 15 2]
 [ 0 4 13]]
------------------------------------------------------------
Code:
------------------------------------------------------------
import numpy as np

from sklearn.datasets import load_iris
from sklearn.metrics import confusion_matrix
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.tree import DecisionTreeClassifier


X, y = load_iris(return_X_y=True)


def run_experiment(X, y, seed):
    X_train, X_test, y_train, y_test = train_test_split(
        X,
        y,
        stratify=y,
        test_size=0.33,
        random_state=seed
    )

    scaler = StandardScaler()

    X_train_scaled = scaler.fit_transform(X_train)
    X_test_scaled = scaler.transform(X_test)

    clf = DecisionTreeClassifier(random_state=seed)
    clf_scaled = DecisionTreeClassifier(random_state=seed)

    clf.fit(X_train, y_train)
    clf_scaled.fit(X_train_scaled, y_train)

    pred = clf.predict(X_test)
    pred_scaled = clf_scaled.predict(X_test_scaled)

    err = 0 if all(pred == pred_scaled) else 1

    return err, y_test, pred, pred_scaled


n_err, n_run, seed_err = 0, 10000, None

for _ in range(n_run):
    seed = np.random.randint(10000000)
    err, _, _, _ = run_experiment(X, y, seed)
    n_err += err

    # keep aside last seed causing an error
    seed_err = seed if err == 1 else seed_err


print(f'Error rate: {round(n_err / n_run * 100, 2)}%', end='\n\n')

_, y_test, pred, pred_scaled = run_experiment(X, y, seed_err)

print(f'Seed: {seed_err}')
print(f'All pred equal: {all(pred == pred_scaled)}')
print(f'Not scale data confusion matrix:\n{confusion_matrix(y_test, pred)}')
print(f'Scale data confusion matrix:\n{confusion_matrix(y_test, pred_scaled)}')
-------------- next part -------------- An HTML attachment was scrubbed...
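The scale dependence is easy to observe directly on the fitted trees: scikit-learn places each split threshold at the midpoint between two neighbouring sample values, so the threshold lives in the units of the (possibly scaled) feature, and points near a boundary can end up on different sides once floating-point rounding is involved. A minimal sketch inspecting the learned thresholds (tree_.threshold is public scikit-learn API):

```python
import numpy as np
from sklearn.preprocessing import StandardScaler
from sklearn.tree import DecisionTreeClassifier

X = np.array([[1.0], [2.0], [3.0], [4.0]])
y = np.array([0, 0, 1, 1])

tree_raw = DecisionTreeClassifier(random_state=0).fit(X, y)

X_scaled = StandardScaler().fit_transform(X)
tree_scaled = DecisionTreeClassifier(random_state=0).fit(X_scaled, y)

# The root split threshold is the midpoint between the two straddling
# sample values, so it moves with any rescaling of the feature.
t_raw = tree_raw.tree_.threshold[0]      # midpoint of 2.0 and 3.0 -> 2.5
t_scaled = tree_scaled.tree_.threshold[0]  # midpoint of the standardized values
print(t_raw, t_scaled)
```

Here the two trees still predict identically, but because the thresholds are recomputed in the scaled units rather than mapped exactly from the unscaled ones, samples lying extremely close to a threshold can flip sides after scaling, which is consistent with the 1-2 differing predictions observed above.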
URL: From g.lemaitre58 at gmail.com Tue Oct 22 05:49:55 2019 From: g.lemaitre58 at gmail.com (=?ISO-8859-1?Q?Guillaume_Lema=EEtre?=) Date: Tue, 22 Oct 2019 11:49:55 +0200 Subject: [scikit-learn] Decision tree results sometimes different with scaled data In-Reply-To: <04CA54E6-6894-4104-BB66-9A9FE89EED7F@getmailspring.com> Message-ID: <94p1c1979ovfis13811raa5n.1571737795840@gmail.com> An HTML attachment was scrubbed... URL: From podkanowicz.bartosz at gmail.com Wed Oct 23 04:53:46 2019 From: podkanowicz.bartosz at gmail.com (Bartosz Podkanowicz) Date: Wed, 23 Oct 2019 10:53:46 +0200 Subject: [scikit-learn] New contribution Message-ID: Hi all,

I am not sure if this is the right place for asking this question.

In the next 3 months I would like to contribute to scikit-learn (some code, bug fixes, tests). Unfortunately I have found that many issues labelled as "good first issue" and "help wanted" have people already working on them or open pull requests, like #15076, #14781, #14934.

I am not sure if starting to contribute by reviewing pull requests is a good idea.

Can someone guide me on how to find issues/enhancements/fixes that I can start working on? Or maybe I should start with reviewing some pull requests?

I suppose that I missed something when reading the contributing guide.

Thank you for your answers.

Kind Regards,
Bartosz Podkanowicz -------------- next part -------------- An HTML attachment was scrubbed... URL: From adrin.jalali at gmail.com Wed Oct 23 11:41:15 2019 From: adrin.jalali at gmail.com (Adrin) Date: Wed, 23 Oct 2019 17:41:15 +0200 Subject: [scikit-learn] New contribution In-Reply-To: References: Message-ID: Hi Bartosz, Glad to hear you're interested in contributing to scikit-learn. As you've observed, our labels are not up to date and we're still working on it. However, there are a few hints I can give for you to find places to start with: - Some old "good first issue"s require many separate PRs, and they're still open.
Those are the ones which usually touch many files/classes. You can try to start from there, which also would help you since you'll have other accepted PRs as a template. - Some older PRs are labeled as "stalled", and you can find some easy ones and address the comments and continue their work. You can also mix your search to find the ones which are "stalled" and labelled as "sprint", which are usually easy ones, but abandoned by the original author. - "Watch" the repo for a while, and see how the activity goes on the repo, and you may be able to find a recently reported issue interesting before others claim it. I hope these hints help, and hope to see your contributions soon :) Best, Adrin. On Wed, Oct 23, 2019 at 10:55 AM Bartosz Podkanowicz < podkanowicz.bartosz at gmail.com> wrote: > Hi all, > > I am not sure if it is right place for asking this question. > > In the next 3 months I would like to contribute to scikit-learn (some > code, bug fixs, tests). Unfortunately I have found that many issues > labelled as "good first issue" and "help wanted" has people already working > on it or some pull requests like #15076, #14781, #14934. > > I am not sure if starting contibuting with reviewing pull requests is good > idea. > > Can someone guide me how to find issues/enhancements/fixes that I can > start working on? or maybe I should start with reviewing some pull requests? > > I suppose that I missed something when reading contributing guide. > > Thank you for answers. > > Kind Regards, > Bartosz Podkanowicz > _______________________________________________ > scikit-learn mailing list > scikit-learn at python.org > https://mail.python.org/mailman/listinfo/scikit-learn > -------------- next part -------------- An HTML attachment was scrubbed... 
URL: From alexandre.gramfort at inria.fr Thu Oct 24 08:09:01 2019 From: alexandre.gramfort at inria.fr (Alexandre Gramfort) Date: Thu, 24 Oct 2019 14:09:01 +0200 Subject: [scikit-learn] Decision tree results sometimes different with scaled data In-Reply-To: <94p1c1979ovfis13811raa5n.1571737795840@gmail.com> References: <04CA54E6-6894-4104-BB66-9A9FE89EED7F@getmailspring.com> <94p1c1979ovfis13811raa5n.1571737795840@gmail.com> Message-ID: Another reason is that we take as threshold the midpoint between sample values, which is not invariant to arbitrary scaling of the features. Alex On Tue, Oct 22, 2019 at 11:56 AM Guillaume Lemaître wrote: > Even with the same random state, it can happen that several features will > lead to a best split and this split is chosen randomly (even with the seed > fixed - this is reported as an issue I think). Therefore, the rest of the > tree could be different, leading to different predictions. > > Another possibility is that we compute the difference between the current > threshold and the next to be tried and only check the entropy if it is > larger than a specific value (I would need to check the source code). After > scaling, it could happen that 2 feature values become too close to be > considered as a potential split, which will make a difference between scaled > and unscaled features. But this diff should be really small. > > This is what I can think of off the top of my head. > > Sent from my phone - sorry to be brief and potential misspell. > *From:* geoffrey.bolmier at gmail.com > *Sent:* 22 October 2019 11:34 > *To:* scikit-learn at python.org > *Reply to:* scikit-learn at python.org > *Subject:* [scikit-learn] Decision tree results sometimes different with > scaled data > > Hi all, > > First, let me thank you for the great job your guys are doing developing > and maintaining such a popular library!
> > As we all know, decision trees are not affected by scaled data, because > splits don't take into account distances between two values within a > feature. > > However, I experienced a strange behavior using the sklearn decision tree > algorithm. Sometimes the results of the model differ depending on whether the input > data has been scaled or not. > > To illustrate my point I ran experiments on the iris dataset consisting of: > > - perform a train/test split > - fit the training set and predict the test set > - fit and predict again with standardized inputs (removing the mean > and scaling to unit variance) > - compare both models' predictions > > Experiments were run 10,000 times with different random seeds (cf. > traceback and code to reproduce it at the end). > Results showed that in a bit more than 10% of runs we find at least > one different prediction. Fortunately, when that happens, only a few > predictions differ, 1 or 2 most of the time. I checked the inputs causing > different predictions, and they are not the same from run to run. > > I'm worried that the rate of different predictions could be larger for other > datasets... > Do you have an idea where this comes from, maybe due to floating point errors, > or am I doing something wrong? 
> > Cheers, > Geoffrey > > > ------------------------------------------------------------ > Traceback: > ------------------------------------------------------------ > Error rate: 12.22% > > Seed: 241862 > All pred equal: False > Not scale data confusion matrix: > [[16 0 0] > [ 0 17 0] > [ 0 4 13]] > Scale data confusion matrix: > [[16 0 0] > [ 0 15 2] > [ 0 4 13]] > ------------------------------------------------------------ > Code: > ------------------------------------------------------------ > import numpy as np > > from sklearn.datasets import load_iris > from sklearn.metrics import confusion_matrix > from sklearn.model_selection import train_test_split > from sklearn.preprocessing import StandardScaler > from sklearn.tree import DecisionTreeClassifier > > > X, y = load_iris(return_X_y=True) > > def run_experiment(X, y, seed): > X_train, X_test, y_train, y_test = train_test_split( > X, > y, > stratify=y, > test_size=0.33, > random_state=seed > ) > > scaler = StandardScaler() > > X_train_scaled = scaler.fit_transform(X_train) > X_test_scaled = scaler.transform(X_test) > > clf = DecisionTreeClassifier(random_state=seed) > clf_scaled = DecisionTreeClassifier(random_state=seed) > > clf.fit(X_train, y_train) > clf_scaled.fit(X_train_scaled, y_train) > > pred = clf.predict(X_test) > pred_scaled = clf_scaled.predict(X_test_scaled) > > err = 0 if all(pred == pred_scaled) else 1 > > return err, y_test, pred, pred_scaled > > > n_err, n_run, seed_err = 0, 10000, None > > for _ in range(n_run): > seed = np.random.randint(10000000) > err, _, _, _ = run_experiment(X, y, seed) > n_err += err > > # keep aside last seed causing an error > seed_err = seed if err == 1 else seed_err > > > print(f'Error rate: {round(n_err / n_run * 100, 2)}%', end='\n\n') > > _, y_test, pred, pred_scaled = run_experiment(X, y, seed_err) > > print(f'Seed: {seed_err}') > print(f'All pred equal: {all(pred == pred_scaled)}') > print(f'Not scale data confusion matrix:\n{confusion_matrix(y_test, > 
pred)}') > print(f'Scale data confusion matrix:\n{confusion_matrix(y_test, > pred_scaled)}') > [image: Sent from Mailspring] > _______________________________________________ > scikit-learn mailing list > scikit-learn at python.org > https://mail.python.org/mailman/listinfo/scikit-learn > -------------- next part -------------- An HTML attachment was scrubbed... URL: From adrin.jalali at gmail.com Thu Oct 24 11:10:26 2019 From: adrin.jalali at gmail.com (Adrin) Date: Thu, 24 Oct 2019 17:10:26 +0200 Subject: [scikit-learn] Reminder: Monday October 28th meeting Message-ID: Hi Scikit-learn people, This is a reminder that we'll be having our monthly call on Monday. Please put your thoughts and important topics you have in mind on the project board: https://github.com/scikit-learn/scikit-learn/projects/15 We'll be meeting on https://appear.in/amueller As usual, it'd be nice to have them on the board before the weekend :) See you on Monday, Adrin. -------------- next part -------------- An HTML attachment was scrubbed... URL: From Wong.WingMei at UOBgroup.com Thu Oct 24 22:01:19 2019 From: Wong.WingMei at UOBgroup.com (WONG Wing Mei) Date: Fri, 25 Oct 2019 02:01:19 +0000 Subject: [scikit-learn] scikit-learn Digest, Vol 43, Issue 38 In-Reply-To: References: Message-ID: <132529746EA4F64D8BBA524718094D5B9970034B@ntxmbpsg02.SG.UOBNET.COM> Can I ask whether we can use sample weight in gradient boosting? And how to do it? 
-----Original Message----- From: scikit-learn [mailto:scikit-learn-bounces+wong.wingmei=uobgroup.com at python.org] On Behalf Of scikit-learn-request at python.org Sent: Friday, October 25, 2019 12:00 AM To: scikit-learn at python.org Subject: scikit-learn Digest, Vol 43, Issue 38 [...] UOB EMAIL DISCLAIMER Any person receiving this email and any attachment(s) contained, shall treat the information as confidential and not misuse, copy, disclose, distribute or retain the information in any way that amounts to a breach of confidentiality. If you are not the intended recipient, please delete all copies of this email from your computer system. As the integrity of this message cannot be guaranteed, neither UOB nor any entity in the UOB Group shall be responsible for the contents. Any opinion in this email may not necessarily represent the opinion of UOB or any entity in the UOB Group. From adrin.jalali at gmail.com Fri Oct 25 03:39:09 2019 From: adrin.jalali at gmail.com (Adrin) Date: Fri, 25 Oct 2019 09:39:09 +0200 Subject: [scikit-learn] scikit-learn Digest, Vol 43, Issue 38 In-Reply-To: <132529746EA4F64D8BBA524718094D5B9970034B@ntxmbpsg02.SG.UOBNET.COM> References: <132529746EA4F64D8BBA524718094D5B9970034B@ntxmbpsg02.SG.UOBNET.COM> Message-ID: Hi, it's in the making: https://github.com/scikit-learn/scikit-learn/pull/14696 On Fri, Oct 25, 2019 at 4:23 AM WONG Wing Mei wrote: > Can I ask whether we can use sample weight in gradient boosting? And how > to do it? > > [...] > _______________________________________________ > scikit-learn mailing list > scikit-learn at python.org > https://mail.python.org/mailman/listinfo/scikit-learn > -------------- next part -------------- An HTML attachment was scrubbed... URL: From niourf at gmail.com Fri Oct 25 07:31:38 2019 From: niourf at gmail.com (Nicolas Hug) Date: Fri, 25 Oct 2019 07:31:38 -0400 Subject: [scikit-learn] scikit-learn Digest, Vol 43, Issue 38 In-Reply-To: References: <132529746EA4F64D8BBA524718094D5B9970034B@ntxmbpsg02.SG.UOBNET.COM> Message-ID: It's in the making for the new histogram-based GB estimators, but the other GB estimators like GradientBoostingRegressor and GradientBoostingClassifier already support sample_weight. 
Just pass the weights in the fit method: https://scikit-learn.org/stable/modules/generated/sklearn.ensemble.GradientBoostingClassifier.html#sklearn.ensemble.GradientBoostingClassifier.fit On 10/25/19 3:39 AM, Adrin wrote: > Hi, > > it's in the making: > https://github.com/scikit-learn/scikit-learn/pull/14696 > > On Fri, Oct 25, 2019 at 4:23 AM WONG Wing Mei wrote: > > Can I ask whether we can use sample weight in gradient boosting? > And how to do it? > > [...] > > _______________________________________________ > scikit-learn mailing list > scikit-learn at python.org > https://mail.python.org/mailman/listinfo/scikit-learn -------------- next part -------------- An HTML attachment was scrubbed... 
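Nicolas's answer - that GradientBoostingClassifier and GradientBoostingRegressor already accept sample_weight in fit - can be sketched as follows (the iris data and the weighting scheme here are made up purely for illustration):

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.ensemble import GradientBoostingClassifier

X, y = load_iris(return_X_y=True)

# Hypothetical weighting: make class 2 count twice as much as the others.
sample_weight = np.where(y == 2, 2.0, 1.0)

# Weights go straight into fit(); nothing else changes.
clf = GradientBoostingClassifier(n_estimators=50, random_state=0)
clf.fit(X, y, sample_weight=sample_weight)

print(clf.score(X, y))
```

The same pattern works for GradientBoostingRegressor; at the time of this thread, sample_weight support for the new histogram-based estimators was still being added in PR #14696.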
URL:

From joel.nothman at gmail.com  Sat Oct 26 08:17:14 2019
From: joel.nothman at gmail.com (Joel Nothman)
Date: Sat, 26 Oct 2019 23:17:14 +1100
Subject: [scikit-learn] Reminder: Monday October 28th meeting
In-Reply-To:
References:
Message-ID:

Reminder: time is 12:00Z.
https://www.timeanddate.com/worldclock/meetingdetails.html?year=2019&month=10&day=28&hour=12&min=0&sec=0&p1=240&p2=33&p3=37&p4=179&p5=195

On Fri., 25 Oct. 2019, 2:15 am Adrin, wrote:

> Hi Scikit-learn people,
>
> This is a reminder that we'll be having our monthly call on Monday.
>
> Please put your thoughts and important topics you have in mind on
> the project board:
> https://github.com/scikit-learn/scikit-learn/projects/15
>
> We'll be meeting on https://appear.in/amueller
>
> As usual, it'd be nice to have them on the board before the weekend :)
>
> See you on Monday,
> Adrin.
> _______________________________________________
> scikit-learn mailing list
> scikit-learn at python.org
> https://mail.python.org/mailman/listinfo/scikit-learn
>
-------------- next part --------------
An HTML attachment was scrubbed...
URL:

From geobgeo at yahoo.com  Mon Oct 28 17:00:53 2019
From: geobgeo at yahoo.com (Bulbul Ahmmed)
Date: Mon, 28 Oct 2019 21:00:53 +0000 (UTC)
Subject: [scikit-learn] Can we say stochastic gradient descent as an ML model?
References: <563232411.3485155.1572296453964.ref@mail.yahoo.com>
Message-ID: <563232411.3485155.1572296453964@mail.yahoo.com>

Dear Scikit Learn Community!

Scikit-learn puts stochastic gradient descent (SGD) as an ML model under
the umbrella of linear models. I know SGD is an optimization algorithm. My
question is: can we say SGD is an ML model? Thanks,

Best Regards,
Bulbul
-------------- next part --------------
An HTML attachment was scrubbed...
URL:

From vaggi.federico at gmail.com  Mon Oct 28 17:05:07 2019
From: vaggi.federico at gmail.com (federico vaggi)
Date: Mon, 28 Oct 2019 14:05:07 -0700
Subject: [scikit-learn] Can we say stochastic gradient descent as an ML model?
In-Reply-To: <563232411.3485155.1572296453964@mail.yahoo.com>
References: <563232411.3485155.1572296453964.ref@mail.yahoo.com>
 <563232411.3485155.1572296453964@mail.yahoo.com>
Message-ID:

In this case, SGD just means a linear model that is fit using stochastic
gradient descent instead of batch gradient methods.

If you want more control over the combination of model / loss function /
optimization algorithm, http://contrib.scikit-learn.org/lightning/ is
better oriented toward that specific use case.

On Mon, Oct 28, 2019 at 2:01 PM Bulbul Ahmmed via scikit-learn <
scikit-learn at python.org> wrote:

> Dear Scikit Learn Community!
>
> Scikit-learn puts stochastic gradient descent (SGD) as an ML model under
> the umbrella of linear models. I know SGD is an optimization algorithm. My
> question is: can we say SGD is an ML model? Thanks,
>
> Best Regards,
> Bulbul
> _______________________________________________
> scikit-learn mailing list
> scikit-learn at python.org
> https://mail.python.org/mailman/listinfo/scikit-learn
>
-------------- next part --------------
An HTML attachment was scrubbed...
URL:

From mail at sebastianraschka.com  Mon Oct 28 17:07:28 2019
From: mail at sebastianraschka.com (Sebastian Raschka)
Date: Mon, 28 Oct 2019 16:07:28 -0500
Subject: [scikit-learn] Can we say stochastic gradient descent as an ML model?
In-Reply-To: <563232411.3485155.1572296453964@mail.yahoo.com>
References: <563232411.3485155.1572296453964.ref@mail.yahoo.com>
 <563232411.3485155.1572296453964@mail.yahoo.com>
Message-ID: <6DE5D65B-E210-4332-A1F8-35BC6AD30886@sebastianraschka.com>

Hi Bulbul,

I would rather say SGD is a method for optimizing the objective function of
certain ML models, i.e., for minimizing their loss function and thereby
learning their parameters.

Best,
Sebastian

> On Oct 28, 2019, at 4:00 PM, Bulbul Ahmmed via scikit-learn wrote:
>
> Dear Scikit Learn Community!
>
> Scikit-learn puts stochastic gradient descent (SGD) as an ML model under
> the umbrella of linear models. I know SGD is an optimization algorithm. My
> question is: can we say SGD is an ML model? Thanks,
>
> Best Regards,
> Bulbul
> _______________________________________________
> scikit-learn mailing list
> scikit-learn at python.org
> https://mail.python.org/mailman/listinfo/scikit-learn

From geobgeo at yahoo.com  Mon Oct 28 17:11:11 2019
From: geobgeo at yahoo.com (Bulbul Ahmmed)
Date: Mon, 28 Oct 2019 21:11:11 +0000 (UTC)
Subject: [scikit-learn] Can we say stochastic gradient descent as an ML model?
In-Reply-To:
References: <563232411.3485155.1572296453964.ref@mail.yahoo.com>
 <563232411.3485155.1572296453964@mail.yahoo.com>
Message-ID: <1228510236.3478546.1572297071046@mail.yahoo.com>

Thanks, Federico.

Bulbul Ahmmed
Graduate Teaching Assistant | Geology
Baylor University, Waco, TX 76706

On Monday, October 28, 2019, 03:06:15 PM MDT, federico vaggi wrote:

In this case, SGD just means a linear model that is fit using stochastic
gradient descent instead of batch gradient methods.

If you want more control over the combination of model / loss function /
optimization algorithm, http://contrib.scikit-learn.org/lightning/ is
better oriented toward that specific use case.

On Mon, Oct 28, 2019 at 2:01 PM Bulbul Ahmmed via scikit-learn wrote:

Dear Scikit Learn Community!
Scikit-learn puts stochastic gradient descent (SGD) as an ML model under
the umbrella of linear models. I know SGD is an optimization algorithm. My
question is: can we say SGD is an ML model? Thanks,

Best Regards,
Bulbul
_______________________________________________
scikit-learn mailing list
scikit-learn at python.org
https://mail.python.org/mailman/listinfo/scikit-learn
_______________________________________________
scikit-learn mailing list
scikit-learn at python.org
https://mail.python.org/mailman/listinfo/scikit-learn
-------------- next part --------------
An HTML attachment was scrubbed...
URL:

From pahome.chen at mirlab.org  Thu Oct 31 04:57:39 2019
From: pahome.chen at mirlab.org (lampahome)
Date: Thu, 31 Oct 2019 16:57:39 +0800
Subject: [scikit-learn] Is there possible to combine multiple patterns in one regression model?
Message-ID:

I have an idea to predict the usage of every block of one disk, and I
found that the pattern of blocks is related to time.

Ex: block indexes 0~100 have high access times at 00:00, 12:00, and
18:00, each for 10 minutes. Other block indexes 1000~1100 have high
access times at 05:00, 14:00, and 20:00, each for 10 minutes.

From the above examples, I assume that some blocks follow one pattern,
other blocks follow another pattern, etc.

But I have 100,000 blocks, and I can get features like access_times,
blk_ID, and timestamp (e.g. 00:00~23:59). Is it possible to feed them
all into one regression model and still predict well?

thx
-------------- next part --------------
An HTML attachment was scrubbed...
URL:

From ayoub.abozer at gmail.com  Thu Oct 31 19:32:25 2019
From: ayoub.abozer at gmail.com (Ayoub Abozer)
Date: Fri, 1 Nov 2019 01:32:25 +0200
Subject: [scikit-learn] Fwd: CutEncoder - simple suggestion for sklearn.preprocessing
In-Reply-To:
References:
Message-ID:

---------- Forwarded message ---------
From: Ayoub Abozer
Date: Fri, Nov 1, 2019 at 1:27 AM
Subject: CutEncoder - simple suggestion for sklearn.preprocessing
To:

Hello.
Please take a look at my kaggle notebook. I have a simple suggestion for
a new encoder for sklearn.preprocessing.

https://www.kaggle.com/ayoubabozer/cutencoder/notebook?scriptVersionId=22834623

Thanks,
Ayoub
-------------- next part --------------
An HTML attachment was scrubbed...
URL:

From ayoub.abozer at gmail.com  Thu Oct 31 19:33:48 2019
From: ayoub.abozer at gmail.com (Ayoub Abozer)
Date: Fri, 1 Nov 2019 01:33:48 +0200
Subject: [scikit-learn] CutEncoder - simple suggestion for sklearn.preprocessing
Message-ID:

Hello.

Please take a look at my kaggle notebook. I have a simple suggestion for
a new encoder for sklearn.preprocessing.

https://www.kaggle.com/ayoubabozer/cutencoder/notebook?scriptVersionId=22834623

Thanks,
Ayoub
-------------- next part --------------
An HTML attachment was scrubbed...
URL:

From joel.nothman at gmail.com  Thu Oct 31 19:44:57 2019
From: joel.nothman at gmail.com (Joel Nothman)
Date: Fri, 1 Nov 2019 10:44:57 +1100
Subject: [scikit-learn] CutEncoder - simple suggestion for sklearn.preprocessing
In-Reply-To:
References:
Message-ID:

Why is this preferable to KBinsDiscretizer?

Where the bin edges are fixed, FunctionTransformer can be used with
pandas.cut.
-------------- next part --------------
An HTML attachment was scrubbed...
URL:

From ayoub.abozer at gmail.com  Thu Oct 31 19:57:07 2019
From: ayoub.abozer at gmail.com (Ayoub Abozer)
Date: Fri, 1 Nov 2019 01:57:07 +0200
Subject: [scikit-learn] CutEncoder - simple suggestion for sklearn.preprocessing
In-Reply-To:
References:
Message-ID:

Sorry, I did not know it before :(.

Thanks.

On Fri, Nov 1, 2019 at 1:46 AM Joel Nothman wrote:

>
> Why is this preferable to KBinsDiscretizer?
>
> Where the bin edges are fixed, FunctionTransformer can be used with
> pandas.cut.
> _______________________________________________
> scikit-learn mailing list
> scikit-learn at python.org
> https://mail.python.org/mailman/listinfo/scikit-learn
>
-------------- next part --------------
An HTML attachment was scrubbed...
URL:

From ayoub.abozer at gmail.com  Thu Oct 31 19:59:39 2019
From: ayoub.abozer at gmail.com (Ayoub Abozer)
Date: Fri, 1 Nov 2019 01:59:39 +0200
Subject: [scikit-learn] CutEncoder - simple suggestion for sklearn.preprocessing
In-Reply-To:
References:
Message-ID:

I thought I would finally add something to scikit-learn :)

On Fri, Nov 1, 2019 at 1:57 AM Ayoub Abozer wrote:

> Sorry, I did not know it before :(.
>
> Thanks.
>
> On Fri, Nov 1, 2019 at 1:46 AM Joel Nothman wrote:
>
>>
>> Why is this preferable to KBinsDiscretizer?
>>
>> Where the bin edges are fixed, FunctionTransformer can be used with
>> pandas.cut.
>> _______________________________________________
>> scikit-learn mailing list
>> scikit-learn at python.org
>> https://mail.python.org/mailman/listinfo/scikit-learn
>>
-------------- next part --------------
An HTML attachment was scrubbed...
URL:

From joel.nothman at gmail.com  Thu Oct 31 22:28:46 2019
From: joel.nothman at gmail.com (Joel Nothman)
Date: Fri, 1 Nov 2019 13:28:46 +1100
Subject: [scikit-learn] CutEncoder - simple suggestion for sklearn.preprocessing
In-Reply-To:
References:
Message-ID:

There is plenty to be contributed! But this one was solved a couple of
years ago ;)

On Fri, 1 Nov 2019 at 11:01, Ayoub Abozer wrote:

> I thought I would finally add something to scikit-learn :)
>
> On Fri, Nov 1, 2019 at 1:57 AM Ayoub Abozer wrote:
>
>> Sorry, I did not know it before :(.
>>
>> Thanks.
>>
>> On Fri, Nov 1, 2019 at 1:46 AM Joel Nothman wrote:
>>
>>>
>>> Why is this preferable to KBinsDiscretizer?
>>>
>>> Where the bin edges are fixed, FunctionTransformer can be used with
>>> pandas.cut.
>>> _______________________________________________
>>> scikit-learn mailing list
>>> scikit-learn at python.org
>>> https://mail.python.org/mailman/listinfo/scikit-learn
>>>
>> _______________________________________________
> scikit-learn mailing list
> scikit-learn at python.org
> https://mail.python.org/mailman/listinfo/scikit-learn
>
-------------- next part --------------
An HTML attachment was scrubbed...
URL:
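[Editor's sketch] Joel's suggestion in the CutEncoder thread above, wrapping pandas.cut in a FunctionTransformer when the bin edges are fixed, can be sketched as follows. The bin edges, column values, and the helper name cut_column are made-up for illustration; none of them come from the original thread.

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import FunctionTransformer

# Fixed, hand-chosen bin edges (hypothetical values for illustration).
EDGES = [0, 18, 35, 65, 120]

def cut_column(X):
    """Map a single numeric column to integer bin codes via pandas.cut."""
    codes = pd.cut(np.asarray(X).ravel(), bins=EDGES, labels=False)
    return np.asarray(codes).reshape(-1, 1)

# Wrapping pandas.cut in FunctionTransformer lets the fixed-edge binning
# sit inside a scikit-learn Pipeline like any other transformer.
binner = FunctionTransformer(cut_column)

ages = np.array([[4.0], [22.0], [40.0], [90.0]])
print(binner.fit_transform(ages).ravel())  # one bin index per row
```

Where the edges should instead be learned from the data (uniform or quantile), KBinsDiscretizer with encode='ordinal' gives a comparable result without the custom function.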