[scikit-learn] Why does sklearn require one-hot-encoding for categorical features? Can we have a "factor" data type?
Gael Varoquaux
gael.varoquaux at normalesup.org
Thu Apr 30 16:12:06 EDT 2020
On Thu, Apr 30, 2020 at 03:55:00PM -0400, C W wrote:
> I've used R and Stata software, none needs such transformation. They have a
> data type called "factors", which is different from "numeric".
> My problem with OHE:
> One-hot-encoding results in large number of features. This really blows up
> quickly. And I have to fight curse of dimensionality with PCA reduction. That's
> not cool!
Most statistical models still do one-hot encoding under the hood. So R
and Stata do it too.
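
To illustrate, here is a minimal sketch in plain Python of the kind of
expansion that happens when a factor enters a linear model. The function
name dummy_code and the toy data are hypothetical, made up for this
example; it is not code from R, Stata, or scikit-learn. It drops the first
level as a reference, which is what R's default treatment contrasts do.

```python
def dummy_code(values, drop_first=True):
    """Expand one categorical column into 0/1 indicator columns.

    Hypothetical helper for illustration only. With drop_first=True the
    first level (in sorted order) is the reference and gets no column.
    """
    levels = sorted(set(values))
    kept = levels[1:] if drop_first else levels
    rows = [[1 if v == level else 0 for level in kept] for v in values]
    return kept, rows


colors = ["red", "green", "blue", "green"]

# Reference-level coding: "blue" is the baseline and has no column.
columns, matrix = dummy_code(colors)

# Full one-hot encoding: one column per level.
full_columns, full_matrix = dummy_code(colors, drop_first=False)
```

Either way, a factor with k levels becomes k or k - 1 numeric columns
before the model is fit; the factor type only hides that step from the user.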
Typically, tree-based models can be adapted to work directly on
categorical data. Ours don't. It's work in progress.
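
For a rough idea of what "working directly on categorical data" can mean,
here is a sketch of one classic approach for regression trees: order the
levels by their mean target value, then only consider splits along that
ordering. This is an assumption about how such a split could look, not
scikit-learn's implementation; best_categorical_split and the toy data are
hypothetical.

```python
from collections import defaultdict


def best_categorical_split(categories, targets):
    """Return (cost, left_levels, right_levels) for the best binary split.

    Hypothetical helper: levels are sorted by mean target, and each cut
    point in that ordering is scored by the summed squared error of the
    two sides. Returns None if there is only one level.
    """
    sums = defaultdict(float)
    counts = defaultdict(int)
    for c, y in zip(categories, targets):
        sums[c] += y
        counts[c] += 1
    ordered = sorted(sums, key=lambda c: sums[c] / counts[c])

    def sse(levels):
        # Squared error around the mean of the samples in these levels.
        ys = [y for c, y in zip(categories, targets) if c in levels]
        mean = sum(ys) / len(ys)
        return sum((y - mean) ** 2 for y in ys)

    best = None
    for i in range(1, len(ordered)):
        left, right = set(ordered[:i]), set(ordered[i:])
        cost = sse(left) + sse(right)
        if best is None or cost < best[0]:
            best = (cost, left, right)
    return best


city = ["a", "b", "c", "a", "b", "c"]
price = [1.0, 5.0, 1.2, 0.8, 5.2, 1.0]
cost, left, right = best_categorical_split(city, price)
```

With k levels this scans k - 1 candidate splits on a single feature,
instead of creating k separate indicator columns.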
G