[scikit-learn] Why does sklearn require one-hot-encoding for categorical features? Can we have a "factor" data type?

Fernando Marcos Wittmann fernando.wittmann at gmail.com
Wed May 6 09:36:55 EDT 2020


That's an excellent discussion! I've always wondered how other tools like R
handle natively categorical variables. LightGBM has a scikit-learn
API which handles categorical features when you pass their column names (or
indices):
```
import lightgbm

lgb = lightgbm.LGBMClassifier()
lgb.fit(X, y, feature_name=..., categorical_feature=...)
```

Where:

- feature_name (list of strings or 'auto', optional (default='auto')) –
Feature names. If 'auto' and data is a pandas DataFrame, the data column
names are used.

- categorical_feature (list of strings or int, or 'auto', optional
(default='auto')) – Categorical features. If list of int, interpreted as
indices. If list of strings, interpreted as feature names (need to specify
feature_name as well). If ‘auto’ and data is pandas DataFrame, pandas
unordered categorical columns are used. All values in categorical features
should be less than int32 max value (2147483647).
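To illustrate the 'auto' path described above, here is a minimal sketch (the column names and data are made up for illustration) of how a pandas column is marked as an unordered categorical, which is what LightGBM's 'auto' mode looks for:

```python
import pandas as pd

# Toy frame with one categorical and one numeric column.
df = pd.DataFrame({
    "color": ["red", "blue", "red", "green"],
    "size": [1.0, 2.5, 3.0, 0.5],
})

# Mark the column as an unordered pandas categorical; with
# categorical_feature='auto', LightGBM would pick up such columns
# without any one-hot encoding.
df["color"] = df["color"].astype("category")

print(df["color"].dtype)            # category
print(list(df["color"].cat.codes))  # integer codes: [2, 0, 2, 1]
```

Internally pandas stores such a column as integer codes over a list of categories, which is essentially the "factor" representation being discussed.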


As a suggestion, Scikit-Learn could add a `categorical_feature` parameter
to its tree-based estimators so that they work the same way.
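In the meantime, one workaround (a sketch with made-up data, not an official recipe) is to integer-encode the categorical columns with OrdinalEncoder inside a pipeline when using a tree-based estimator, since trees only split on thresholds and do not strictly need one-hot encoding:

```python
import pandas as pd
from sklearn.compose import make_column_transformer
from sklearn.ensemble import RandomForestClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import OrdinalEncoder

# Toy data, names chosen for illustration only.
X = pd.DataFrame({
    "color": ["red", "blue", "red", "green"],
    "size": [1.0, 2.5, 3.0, 0.5],
})
y = [0, 1, 0, 1]

# Integer-encode only the categorical column; numeric columns pass through.
preprocess = make_column_transformer(
    (OrdinalEncoder(), ["color"]),
    remainder="passthrough",
)
model = make_pipeline(preprocess, RandomForestClassifier(random_state=0))
model.fit(X, y)
print(model.predict(X))
```

For a linear model you would keep OneHotEncoder in the transformer instead, since an arbitrary integer ordering would be read as a magnitude.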

On Fri, May 1, 2020 at 12:54 PM C W <tmrsg11 at gmail.com> wrote:

> Thank you for the link, Guillaume. In my particular case, I am working on
> random forest classification.
>
> The notebook seems great. I will have to go through it in detail. I'm
> still fairly new at using sklearn.
>
> Thank you for everyone's quick response, always feeling loved on here! :)
>
>
>
> On Fri, May 1, 2020 at 4:00 AM Guillaume Lemaître <g.lemaitre58 at gmail.com>
> wrote:
>
>> OrdinalEncoder is the equivalent of pd.factorize and will work within the
>> scikit-learn ecosystem.
>>
>> However, be aware that you should not just swap OneHotEncoder for
>> OrdinalEncoder at will.
>> It depends on your machine learning pipeline.
>>
>> As mentioned by Gael, tree-based algorithms will be fine with
>> OrdinalEncoder. If you have a linear model,
>> then you need to use OneHotEncoder if the categories do not have any
>> order.
>>
>> I will just refer to one notebook that we taught in EuroScipy last year:
>>
>> https://github.com/lesteve/euroscipy-2019-scikit-learn-tutorial/blob/master/rendered_notebooks/02_basic_preprocessing.ipynb
>>
>> On Fri, 1 May 2020 at 05:11, C W <tmrsg11 at gmail.com> wrote:
>>
>>> Hermes,
>>>
>>> That's an interesting function. Does it work with sklearn after
>>> factorize?  Is there any example? Thanks!
>>>
>>> On Thu, Apr 30, 2020 at 6:51 PM Hermes Morales <
>>> paisanohermes at hotmail.com> wrote:
>>>
>>>> Perhaps pd.factorize could help?
>>>>
>>>> Get Outlook for Android <https://aka.ms/ghei36>
>>>>
>>>> ------------------------------
>>>> *From:* scikit-learn <scikit-learn-bounces+paisanohermes=
>>>> hotmail.com at python.org> on behalf of Gael Varoquaux <
>>>> gael.varoquaux at normalesup.org>
>>>> *Sent:* Thursday, April 30, 2020 5:12:06 PM
>>>> *To:* Scikit-learn mailing list <scikit-learn at python.org>
>>>> *Subject:* Re: [scikit-learn] Why does sklearn require
>>>> one-hot-encoding for categorical features? Can we have a "factor" data type?
>>>>
>>>> On Thu, Apr 30, 2020 at 03:55:00PM -0400, C W wrote:
>>>> > I've used R and Stata software; neither needs such a transformation.
>>>> > They have a data type called "factors", which is different from
>>>> > "numeric".
>>>>
>>>> > My problem with OHE:
>>>> > One-hot encoding results in a large number of features. This blows up
>>>> > quickly, and I have to fight the curse of dimensionality with PCA
>>>> > reduction. That's not cool!
>>>>
>>>> Most statistical models still do one-hot encoding under the hood. So R
>>>> and Stata do it too.
>>>>
>>>> Typically, tree-based models can be adapted to work directly on
>>>> categorical data. Ours don't. It's work in progress.
>>>>
>>>> G
>>>> _______________________________________________
>>>> scikit-learn mailing list
>>>> scikit-learn at python.org
>>>> https://mail.python.org/mailman/listinfo/scikit-learn
>>
>>
>> --
>> Guillaume Lemaitre
>> Scikit-learn @ Inria Foundation
>> https://glemaitre.github.io/

