[scikit-learn] Can Scikit-learn decision tree (CART) have both continuous and categorical features?

C W tmrsg11 at gmail.com
Fri Oct 4 14:01:23 EDT 2019


Yes, you are right. It was 0.5 and 0.5 for the splits, not 1.5. So, a typo on
my part.

Looks like I did one-hot-encoding correctly. My new variable names are:
car_Audi, car_BMW, etc.

But the decision tree is still treating the one-hot-encoded columns as
numerical input and splitting at 0.5. This does not seem right. Perhaps I'm
doing something wrong?
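
For what it's worth, here is a minimal sketch of the behavior described above, assuming pandas' get_dummies for the encoding and made-up income values (neither comes from my real data). On a 0/1 column the only possible cut point is the midpoint between 0 and 1, so a split at 0.5 on car_Audi is the same as asking "is the car an Audi?":

```python
import pandas as pd
from sklearn.tree import DecisionTreeRegressor

# Toy data modeled on the table in this thread (values are made up).
df = pd.DataFrame({
    "car": ["BMW", "Toyota", "Audi", "BMW", "Toyota", "Audi"],
    "income": [10000, 9000, 12000, 10500, 9200, 11800],
})

# One-hot encode: one 0/1 column per brand (car_Audi, car_BMW, car_Toyota).
X = pd.get_dummies(df[["car"]], prefix="car", dtype=float)
y = df["income"]

tree = DecisionTreeRegressor(random_state=0).fit(X, y)

# Internal nodes have a feature index >= 0; leaves use a negative marker.
is_split = tree.tree_.feature >= 0
split_thresholds = tree.tree_.threshold[is_split]
split_columns = [X.columns[i] for i in tree.tree_.feature[is_split]]

# Each split tests one brand column against 0.5, i.e. "brand == X or not".
print(split_columns)
print(split_thresholds)
```

So the 0.5 threshold is expected with one-hot columns: the tree still treats them as numbers, but a cut at 0.5 on a 0/1 column behaves like a yes/no membership test.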

Is there a good toy example on the sklearn website? I only see this one:
https://scikit-learn.org/stable/auto_examples/tree/plot_tree_regression.html
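
The kind of toy example I have in mind might look like the sketch below. It assumes the ColumnTransformer approach suggested earlier in this thread, combined with OneHotEncoder and DecisionTreeRegressor. The column names follow my example table, the fourth row and the new_row values are made up for illustration, and the choice of handle_unknown="ignore" is just one reasonable setting, not a recommendation from the docs:

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import OneHotEncoder
from sklearn.tree import DecisionTreeRegressor

# Toy rows in the shape of the table from this thread (values are made up).
df = pd.DataFrame({
    "Gender": ["Male", "Female", "Male", "Female"],
    "Age": [30, 35, 50, 41],
    "Car": ["BMW", "Toyota", "Audi", "BMW"],
    "Attendance": ["Yes", "No", "Yes", "No"],
    "Income": [10000, 9000, 12000, 9500],
})
X = df.drop(columns="Income")
y = df["Income"]

categorical = ["Gender", "Car", "Attendance"]

# One-hot encode the categorical columns; pass the numeric Age column through.
preprocess = ColumnTransformer(
    [("onehot", OneHotEncoder(handle_unknown="ignore"), categorical)],
    remainder="passthrough",
)

# The encoding happens inside the pipeline, before the tree sees the data.
model = make_pipeline(preprocess, DecisionTreeRegressor(random_state=0))
model.fit(X, y)

# Hypothetical unseen row, in the same raw (string) format as the training data.
new_row = pd.DataFrame(
    {"Gender": ["Female"], "Age": [33], "Car": ["Audi"], "Attendance": ["Yes"]}
)
prediction = model.predict(new_row)
print(prediction)
```

With this setup the raw strings never reach the tree directly, and only the columns listed in `categorical` need to be named explicitly.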

Thanks!



On Fri, Oct 4, 2019 at 1:28 PM Sebastian Raschka <mail at sebastianraschka.com>
wrote:

> Hi,
>
> The funny part is: the tree is taking one-hot-encoding (BMW=0, Toyota=1,
> Audi=2) as numerical values, not category. The tree splits at 0.5 and 1.5
>
>
> That's not a one-hot encoding then.
>
> For an Audi datapoint, it should be
>
> BMW=0
> Toyota=0
> Audi=1
>
> for BMW
>
> BMW=1
> Toyota=0
> Audi=0
>
> and for Toyota
>
> BMW=0
> Toyota=1
> Audi=0
>
> The split threshold should then be at 0.5 for any of these features.
>
> Based on your email, I think you were assuming that the DT does the
> one-hot encoding internally, which it doesn't. In practice, it is hard to
> guess what is a nominal and what is an ordinal variable, so you have to do
> the one-hot encoding before you give the data to the decision tree.
>
> Best,
> Sebastian
>
> On Oct 4, 2019, at 11:48 AM, C W <tmrsg11 at gmail.com> wrote:
>
> I'm getting some funny results. I am doing a regression decision tree, the
> response variables are assigned to levels.
>
> The funny part is: the tree is taking one-hot-encoding (BMW=0, Toyota=1,
> Audi=2) as numerical values, not category.
>
> The tree splits at 0.5 and 1.5. Am I doing one-hot-encoding wrong? How
> does sklearn know internally that 0 vs. 1 is categorical, not numerical?
>
> In R for instance, you do as.factor(), which explicitly states the data
> type.
>
> Thank you!
>
>
> On Wed, Sep 18, 2019 at 11:13 AM Andreas Mueller <t3kcit at gmail.com> wrote:
>
>>
>>
>> On 9/15/19 8:16 AM, Guillaume Lemaître wrote:
>>
>>
>>
>> On Sat, 14 Sep 2019 at 20:59, C W <tmrsg11 at gmail.com> wrote:
>>
>>> Thanks, Guillaume.
>>> Column transformer looks pretty neat. I've also heard though, this
>>> pipeline can be tedious to set up? Specifying what you want for every
>>> feature is a pain.
>>>
>>
>> It would be interesting for us to know which part of the pipeline is
>> tedious to set up, so we can see whether something can be improved there.
>> Do you mean that you would like to automatically detect the type of each
>> feature (categorical/numerical) and apply a
>> default encoder/scaling, as discussed there:
>> https://github.com/scikit-learn/scikit-learn/issues/10603#issuecomment-401155127
>>
>> IMO, from a user perspective, it would be cleaner in some cases, at the
>> cost of blindly applying a black box,
>> which might be dangerous.
>>
>> Also see
>> https://amueller.github.io/dabl/dev/generated/dabl.EasyPreprocessor.html#dabl.EasyPreprocessor
>> Which basically does that.
>>
>>
>>
>>
>>>
>>> Javier,
>>> Actually, you guessed right. My real data has only one numerical
>>> variable and looks more like this:
>>>
>>> Gender  Date       Income  Car     Attendance
>>> Male    2019/3/01  10000   BMW     Yes
>>> Female  2019/5/02   9000   Toyota  No
>>> Male    2019/7/15  12000   Audi    Yes
>>>
>>> I am predicting income using all other categorical variables. Maybe it
>>> is catboost!
>>>
>>> Thanks,
>>>
>>> M
>>>
>>>
>>>
>>>
>>>
>>>
>>> On Sat, Sep 14, 2019 at 9:25 AM Javier López <jlopez at ende.cc> wrote:
>>>
>>>> If you have datasets with many categorical features, and perhaps many
>>>> categories, the tools in sklearn are quite limited,
>>>> but there are alternative implementations of boosted trees that are
>>>> designed with categorical features in mind. Take a look
>>>> at catboost [1], which has an sklearn-compatible API.
>>>>
>>>> J
>>>>
>>>> [1] https://catboost.ai/
>>>>
>>>> On Sat, Sep 14, 2019 at 3:40 AM C W <tmrsg11 at gmail.com> wrote:
>>>>
>>>>> Hello all,
>>>>> I'm very confused. Can the decision tree module handle both continuous
>>>>> and categorical features in the dataset? In this case, it's just CART
>>>>> (Classification and Regression Trees).
>>>>>
>>>>> For example,
>>>>> Gender  Age  Income  Car     Attendance
>>>>> Male    30   10000   BMW     Yes
>>>>> Female  35    9000   Toyota  No
>>>>> Male    50   12000   Audi    Yes
>>>>>
>>>>> According to the documentation
>>>>> https://scikit-learn.org/stable/modules/tree.html#tree-algorithms-id3-c4-5-c5-0-and-cart,
>>>>> it can not!
>>>>>
>>>>> It says: "scikit-learn implementation does not support categorical
>>>>> variables for now".
>>>>>
>>>>> Is this true? If not, can someone point me to an example? If yes, what
>>>>> do people do?
>>>>>
>>>>> Thank you very much!
>>>>>
>>>>>
>>>>>
>>>>> _______________________________________________
>>>>> scikit-learn mailing list
>>>>> scikit-learn at python.org
>>>>> https://mail.python.org/mailman/listinfo/scikit-learn
>>>>>
>>
>>
>> --
>> Guillaume Lemaitre
>> INRIA Saclay - Parietal team
>> Center for Data Science Paris-Saclay
>> https://glemaitre.github.io/
>>