[scikit-learn] Discrepancy in "Feature importances with a forest of trees" documentation

Serafeim Loukas seralouk at hotmail.com
Fri Oct 28 05:18:05 EDT 2022


Dear Scikit-learn community,


I have been reading the example at https://scikit-learn.org/stable/auto_examples/ensemble/plot_forest_importances.html#feature-importance-based-on-mean-decrease-in-impurity about the feature importances that can be assessed after fitting a tree-based model (e.g. RandomForestClassifier).

However, I have noticed a discrepancy that I would like to mention. If a one-hot-encoding step is used before model fitting, the `.feature_importances_` attribute includes one importance per level of each transformed categorical feature (e.g. for gender, we get two importances, one for males and one for females, respectively).

When I apply the `permutation_importance` function (https://scikit-learn.org/stable/modules/generated/sklearn.inspection.permutation_importance.html#sklearn.inspection.permutation_importance), though, the outputs correspond to the original, non-transformed features. To illustrate this, I include a toy example in .py format.
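A minimal sketch of the kind of setup I mean (the data, column names, and pipeline below are my own assumptions for illustration, not the attached script):

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.ensemble import RandomForestClassifier
from sklearn.inspection import permutation_importance
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder

rng = np.random.RandomState(0)
# Hypothetical toy data: one categorical column, one numeric column.
X = pd.DataFrame({
    "gender": rng.choice(["male", "female"], size=200),
    "age": rng.randint(18, 65, size=200),
})
y = ((X["gender"] == "male") ^ (X["age"] > 40)).astype(int)

pre = ColumnTransformer(
    [("ohe", OneHotEncoder(), ["gender"])], remainder="passthrough"
)
model = Pipeline([("pre", pre), ("rf", RandomForestClassifier(random_state=0))])
model.fit(X, y)

# .feature_importances_ lives on the fitted forest, i.e. on the
# one-hot-encoded design matrix: one value per encoded column
# (gender_female, gender_male, age) -> 3 importances.
mdi = model.named_steps["rf"].feature_importances_
print(len(mdi))  # 3

# permutation_importance shuffles the columns of X as passed in,
# i.e. *before* the pipeline's transform step: one value per
# original column (gender, age) -> 2 importances.
perm = permutation_importance(model, X, y, n_repeats=5, random_state=0)
print(len(perm.importances_mean))  # 2
```

So the two APIs report importances at different stages of the pipeline, which is the discrepancy I am referring to.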

Best,
Makis




-------------- next part --------------
A non-text attachment was scrubbed...
Name: Untitled.py
Type: text/x-python-script
Size: 2410 bytes
Desc: Untitled.py
URL: <https://mail.python.org/pipermail/scikit-learn/attachments/20221028/90a35519/attachment.bin>

