[scikit-learn] Can Scikit-learn decision tree (CART) have both continuous and categorical features?

Nicolas Hug niourf at gmail.com
Fri Oct 4 14:44:04 EDT 2019


> But the decision tree is still treating the one-hot-encoded columns 
> as numerical input and splitting at 0.5. This does not seem right. 
> Perhaps I'm doing something wrong?

You're not doing anything wrong, and neither is the tree. Trees don't 
support categorical variables in sklearn, so everything is treated as 
numerical.

This is exactly why we one-hot encode: the set of 0/1 numerical (one 
hot encoded) columns, taken together, behaves like the single 
categorical feature it encodes. A split at 0.5 on one of those columns 
simply separates one category from all the others.
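
For example, here is a minimal sketch (the toy data and column name 
are made up) showing that every split on a one-hot column lands at 0.5:

    import pandas as pd
    from sklearn.preprocessing import OneHotEncoder
    from sklearn.tree import DecisionTreeRegressor, export_text

    # Hypothetical toy data: one categorical feature, numeric target.
    X = pd.DataFrame({"car": ["BMW", "Toyota", "Audi", "BMW"]})
    y = [10000, 9000, 12000, 11000]

    # One 0/1 column per category: car_Audi, car_BMW, car_Toyota.
    enc = OneHotEncoder(sparse=False)
    X_ohe = enc.fit_transform(X)

    tree = DecisionTreeRegressor().fit(X_ohe, y)
    # Every threshold in the printed tree is 0.5.
    print(export_text(tree, feature_names=list(enc.get_feature_names(["car"]))))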


Nicolas

On 10/4/19 2:01 PM, C W wrote:
> Yes, you are right. It was 0.5 and 0.5 for the splits, not 1.5. A typo 
> on my part.
>
> Looks like I did one-hot-encoding correctly. My new variable names 
> are: car_Audi, car_BMW, etc.
>
> But the decision tree is still treating the one-hot-encoded columns 
> as numerical input and splitting at 0.5. This does not seem right. 
> Perhaps I'm doing something wrong?
>
> Is there a good toy example on the sklearn website? I only see 
> this: 
> https://scikit-learn.org/stable/auto_examples/tree/plot_tree_regression.html
>
> Thanks!
>
>
>
> On Fri, Oct 4, 2019 at 1:28 PM Sebastian Raschka 
> <mail at sebastianraschka.com> wrote:
>
>     Hi,
>
>>     The funny part is: the tree is taking the one-hot encoding (BMW=0,
>>     Toyota=1, Audi=2) as numerical values, not categories. The tree
>>     splits at 0.5 and 1.5.
>
>     That's not a one-hot encoding then.
>
>     For an Audi datapoint, it should be
>
>     BMW=0
>     Toyota=0
>     Audi=1
>
>     for a BMW
>
>     BMW=1
>     Toyota=0
>     Audi=0
>
>     and for a Toyota
>
>     BMW=0
>     Toyota=1
>     Audi=0
>
>     The split threshold should then be at 0.5 for any of these features.
>
>     Based on your email, I think you were assuming that the DT does
>     the one-hot encoding internally, which it doesn't. In practice, it
>     is hard to guess which variables are nominal and which are ordinal,
>     so you have to do the one-hot encoding before you give the data to
>     the decision tree.
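>
>     For example, a rough sketch with pandas (the column name is made
>     up):
>
>         import pandas as pd
>
>         df = pd.DataFrame({"car": ["Audi", "BMW", "Toyota"]})
>         # One 0/1 column per level: car_Audi, car_BMW, car_Toyota
>         print(pd.get_dummies(df, columns=["car"]))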
>
>     Best,
>     Sebastian
>
>>     On Oct 4, 2019, at 11:48 AM, C W <tmrsg11 at gmail.com> wrote:
>>
>>     I'm getting some funny results. I am fitting a regression decision
>>     tree, and the categorical variables are coded as levels.
>>
>>     The funny part is: the tree is taking the one-hot encoding (BMW=0,
>>     Toyota=1, Audi=2) as numerical values, not categories.
>>
>>     The tree splits at 0.5 and 1.5. Am I doing one-hot encoding
>>     wrong? How does sklearn know internally that 0 vs. 1 is
>>     categorical, not numerical?
>>
>>     In R, for instance, you use as.factor(), which explicitly declares
>>     the data type.
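>>
>>     I guess the closest pandas analogue is something like:
>>
>>         df["car"] = df["car"].astype("category")
>>
>>     but sklearn does not seem to use that dtype information.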
>>
>>     Thank you!
>>
>>
>>     On Wed, Sep 18, 2019 at 11:13 AM Andreas Mueller
>>     <t3kcit at gmail.com> wrote:
>>
>>
>>
>>         On 9/15/19 8:16 AM, Guillaume Lemaître wrote:
>>>
>>>
>>>         On Sat, 14 Sep 2019 at 20:59, C W <tmrsg11 at gmail.com> wrote:
>>>
>>>             Thanks, Guillaume.
>>>             ColumnTransformer looks pretty neat. I've heard, though,
>>>             that this pipeline can be tedious to set up: specifying
>>>             what you want for every feature is a pain.
>>>
>>>
>>>         It would be interesting for us to know which part of the
>>>         pipeline is tedious to set up, so we can see whether we can
>>>         improve something there. Do you mean that you would like to
>>>         automatically detect the type of each feature
>>>         (categorical/numerical) and apply a default encoder/scaler,
>>>         as discussed here:
>>>         https://github.com/scikit-learn/scikit-learn/issues/10603#issuecomment-401155127
>>>
>>>         IMO, from a user perspective, it would be cleaner in some
>>>         cases, at the cost of blindly applying a black box, which
>>>         might be dangerous.
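>>>
>>>         For reference, the explicit (non-automatic) setup is only a
>>>         few lines (a sketch, with made-up column names):
>>>
>>>             from sklearn.compose import ColumnTransformer
>>>             from sklearn.pipeline import make_pipeline
>>>             from sklearn.preprocessing import OneHotEncoder
>>>             from sklearn.tree import DecisionTreeRegressor
>>>
>>>             # One-hot encode the categorical columns, pass the
>>>             # numerical ones through unchanged.
>>>             preprocess = ColumnTransformer([
>>>                 ("cat", OneHotEncoder(), ["gender", "car"]),
>>>                 ("num", "passthrough", ["age"]),
>>>             ])
>>>             model = make_pipeline(preprocess, DecisionTreeRegressor())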
>>         Also see
>>         https://amueller.github.io/dabl/dev/generated/dabl.EasyPreprocessor.html#dabl.EasyPreprocessor
>>         which basically does that.
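>>
>>         Roughly (a sketch; assumes df is a pandas DataFrame of raw
>>         features):
>>
>>             from dabl import EasyPreprocessor
>>
>>             # Detects feature types and applies default preprocessing.
>>             X_trans = EasyPreprocessor().fit_transform(df)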
>>
>>
>>>
>>>             Jaiver,
>>>             Actually, you guessed right. My real data has only one
>>>             numerical variable, looks more like this:
>>>
>>>             Gender  Date       Income  Car     Attendance
>>>             Male    2019/3/01  10000   BMW     Yes
>>>             Female  2019/5/02  9000    Toyota  No
>>>             Male    2019/7/15  12000   Audi    Yes
>>>
>>>             I am predicting income using all the other, categorical
>>>             variables. Maybe catboost is the answer!
>>>
>>>             Thanks,
>>>
>>>             M
>>>
>>>             On Sat, Sep 14, 2019 at 9:25 AM Javier López
>>>             <jlopez at ende.cc> wrote:
>>>
>>>                 If you have datasets with many categorical features,
>>>                 and perhaps many categories, the tools in sklearn
>>>                 are quite limited,
>>>                 but there are alternative implementations of boosted
>>>                 trees that are designed with categorical features in
>>>                 mind. Take a look
>>>                 at catboost [1], which has an sklearn-compatible API.
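>>>
>>>                 Roughly (a sketch; assumes X is a DataFrame
>>>                 containing the columns named below):
>>>
>>>                     from catboost import CatBoostRegressor
>>>
>>>                     # cat_features marks the categorical columns,
>>>                     # so no one-hot encoding is needed.
>>>                     model = CatBoostRegressor(
>>>                         cat_features=["gender", "car"], verbose=0)
>>>                     model.fit(X, y)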
>>>
>>>                 J
>>>
>>>                 [1] https://catboost.ai/
>>>
>>>                 On Sat, Sep 14, 2019 at 3:40 AM C W
>>>                 <tmrsg11 at gmail.com> wrote:
>>>
>>>                     Hello all,
>>>                     I'm very confused. Can the decision tree module
>>>                     handle both continuous and categorical features
>>>                     in the dataset? In this case, it's just CART
>>>                     (Classification and Regression Trees).
>>>
>>>                     For example,
>>>                     Gender  Age  Income  Car     Attendance
>>>                     Male    30   10000   BMW     Yes
>>>                     Female  35   9000    Toyota  No
>>>                     Male    50   12000   Audi    Yes
>>>
>>>                     According to the documentation
>>>                     https://scikit-learn.org/stable/modules/tree.html#tree-algorithms-id3-c4-5-c5-0-and-cart,
>>>                     it cannot!
>>>
>>>                     It says: "scikit-learn implementation does not
>>>                     support categorical variables for now".
>>>
>>>                     Is this true? If not, can someone point me to an
>>>                     example? If yes, what do people do?
>>>
>>>                     Thank you very much!
>>>
>>>
>>>         -- 
>>>         Guillaume Lemaitre
>>>         INRIA Saclay - Parietal team
>>>         Center for Data Science Paris-Saclay
>>>         https://glemaitre.github.io/