[scikit-learn] A necessary feature for Decision trees

Julio Antonio Soto de Vicente julio at esbet.es
Thu Jan 4 10:02:48 EST 2018


Hi Yang Li,

I have to agree with you. Bitstrings and/or one-hot encoding are just hacks that should not be necessary for decision tree learners.

There is work in progress on an implementation for native handling of categorical features in trees; please take a look at https://github.com/scikit-learn/scikit-learn/pull/4899

Cheers!

--
Julio

> On Jan 4, 2018, at 9:06, 李扬 <sky188133882 at 163.com> wrote:
> 
> Dear J.B.,
> 
> Thanks for your advice!
> 
> Yes, I have considered using a bitstring or a sequential numbering, but the problem is the algorithm, not the representation of the categorical data.
> Take the regression tree as an example: the algorithm in sklearn finds a split value for a feature, and chooses the best split by minimizing the impurity of the child nodes.
> However, finding a threshold split on a categorical feature is not very meaningful even if you encode it as a continuous value, and the resulting split partially depends on how you happen to order the category values, which is not very convincing.
> Instead, in the CART algorithm, you should separate a subset of the categories from the rest, compute the impurity of the two resulting sets, and then pick the partition with the minimal impurity.
> Obviously, this partitioning cannot be done by the current algorithm, which simply applies the threshold split for continuous values.
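> 
> For illustration, here is a rough, self-contained sketch (my own toy code, not scikit-learn's implementation; all names are made up) of that exhaustive partition search with a variance impurity for regression:
> 
>     from itertools import combinations
>     import numpy as np
> 
>     def best_categorical_split(x, y):
>         """Try every two-way partition of the categories in x and return
>         the subset giving the lowest weighted variance of y in the children."""
>         categories = np.unique(x)
>         best_impurity, best_subset = np.inf, None
>         # Up to len//2, every proper subset (or its complement) is enumerated once.
>         for size in range(1, len(categories) // 2 + 1):
>             for subset in combinations(categories, size):
>                 left = np.isin(x, subset)
>                 right = ~left
>                 impurity = (left.sum() * y[left].var() +
>                             right.sum() * y[right].var()) / len(y)
>                 if impurity < best_impurity:
>                     best_impurity, best_subset = impurity, set(subset)
>         return best_subset, best_impurity
> 
>     x = np.array(["Facebook", "Twitter", "Google", "Twitter", "Google", "Facebook"])
>     y = np.array([1.0, 5.0, 5.2, 4.8, 5.1, 0.9])
>     print(best_categorical_split(x, y))  # ({'Facebook'}, ...) is the purest partition here
> 
> With q categories this enumerates 2^(q-1) - 1 partitions, so it only stays cheap for a modest number of categories, but it is the separation the CART description calls for.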
> 
> One more possible shortcoming is that a categorical feature cannot be properly visualized: when drawing the tree graph, it is hard to read anything meaningful from a node that simply thresholds an encoded categorical feature.
> 
> Thank you for your time!
> Best wishes.
> 
> 
> 
> 
> --
> Best regards!
> 
> Yang Li  +86 188 1821 2371
> Shanghai Jiao Tong University
> School of Electronic, Information and Electrical Engineering F1203026
> 800 Dongchuan Road, Minhang District, Shanghai 200240
> 
> 
> 
> At 2018-01-04 15:30:34, "Brown J.B. via scikit-learn" <scikit-learn at python.org> wrote:
> Dear Yang Li,
> 
> > Neither the classification tree nor the regression tree supports categorical features. That means the decision tree models can only accept continuous features.
> 
> Consider either manually encoding your categories in bitstrings (e.g., "Facebook" = 001, "Twitter" = 010, "Google" = 100), or using OneHotEncoder to do the same thing for you automatically.
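> 
> A minimal sketch of the OneHotEncoder route (older scikit-learn versions expect numeric input, so the strings are integer-encoded with LabelEncoder first):
> 
>     import numpy as np
>     from sklearn.preprocessing import LabelEncoder, OneHotEncoder
> 
>     sites = np.array(["Facebook", "Twitter", "Google", "Twitter"])
>     # LabelEncoder sorts alphabetically: Facebook=0, Google=1, Twitter=2
>     codes = LabelEncoder().fit_transform(sites)
>     onehot = OneHotEncoder(sparse=False).fit_transform(codes.reshape(-1, 1))
>     print(onehot)   # one indicator column per site, e.g. "Facebook" -> [1, 0, 0]
> 
> The resulting indicator columns can then be fed to DecisionTreeClassifier or DecisionTreeRegressor like any other continuous features.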
> 
> Cheers,
> J.B.
> 
> 
> 
> _______________________________________________
> scikit-learn mailing list
> scikit-learn at python.org
> https://mail.python.org/mailman/listinfo/scikit-learn

