[scikit-learn] Classifiers for dataset with categorical features

Sebastian Raschka se.raschka at gmail.com
Fri Jul 21 14:57:57 EDT 2017


> Traditionally, tree-based methods are very good when it comes to categorical variables and can handle them appropriately. There is a current WIP PR to add this support to sklearn.

I think it's also important to distinguish between nominal and ordinal features; it can make a huge difference imho. For instance, treating ordinal variables like continuous variables probably makes more sense than one-hot encoding them. Looking forward to the PR :)
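
For illustration, something along these lines (just a rough sketch with made-up column names, using pandas):

    import pandas as pd

    # toy frame with one nominal and one ordinal column (hypothetical values)
    df = pd.DataFrame({
        "country": ["US", "DE", "JP", "US"],      # nominal: no inherent order
        "education": ["BS", "PhD", "MS", "BS"],   # ordinal: BS < MS < PhD
    })

    # nominal -> one-hot encode
    X_country = pd.get_dummies(df["country"], prefix="country")

    # ordinal -> integer codes that preserve the order
    levels = ["BS", "MS", "PhD"]
    df["education_code"] = pd.Categorical(
        df["education"], categories=levels, ordered=True
    ).codes

    # combine into one feature matrix
    X = pd.concat([X_country, df[["education_code"]]], axis=1)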

> On Jul 21, 2017, at 2:52 PM, Sebastian Raschka <se.raschka at gmail.com> wrote:
> 
> Just to throw some additional ideas in here. Based on a conversation with a colleague some time ago, I think learning classifier systems (https://en.wikipedia.org/wiki/Learning_classifier_system) are particularly useful when working with large, sparse binary vectors (like those from a one-hot encoding). I'm really not that familiar with LCSs, though, and only know the basics (I read through the first chapters of the Intro to Learning Classifier Systems draft; the print version will be out later this year).
> Also, I once saw an interesting poster on a Set Covering Machine algorithm, which they benchmarked against SVMs, random forests, and the like for categorical (genomics) data. It looked promising.
> 
> Best,
> Sebastian
> 
> 
>> On Jul 21, 2017, at 2:37 PM, Raga Markely <raga.markely at gmail.com> wrote:
>> 
>> Thank you, Jacob. Appreciate it.
>> 
>> Regarding 'perform better', I was referring to better accuracy, precision, recall, F1 score, etc.
>> 
>> Thanks,
>> Raga
>> 
>> On Fri, Jul 21, 2017 at 2:27 PM, Jacob Schreiber <jmschreiber91 at gmail.com> wrote:
>> Traditionally, tree-based methods are very good when it comes to categorical variables and can handle them appropriately. There is a current WIP PR to add this support to sklearn. I'm not exactly sure what you mean by "perform better", though. Estimators that ignore the categorical aspect of these variables and treat them as discrete will likely perform worse than those that treat them appropriately.
>> 
>> On Fri, Jul 21, 2017 at 8:11 AM, Raga Markely <raga.markely at gmail.com> wrote:
>> Hello,
>> 
>> I am wondering if there are some classifiers that perform better for datasets with categorical features (converted into a sparse input matrix with pd.get_dummies())? The categorical features are nominal (order doesn't matter, e.g. country, occupation, etc.).
>> 
>> If you could provide me some references (papers, books, website, etc), that would be great.
>> 
>> Thank you very much!
>> Raga
>> 
>> 
>> 
>> _______________________________________________
>> scikit-learn mailing list
>> scikit-learn at python.org
>> https://mail.python.org/mailman/listinfo/scikit-learn


