[scikit-learn] feature importance calculation in gradient boosting

Olga Lyashevska o.lyashevskaya at gmail.com
Tue Apr 18 10:19:11 EDT 2017


Hi,

I would like to understand how feature importances are calculated in 
gradient boosting regression.

I know that these are the relevant functions:
https://github.com/scikit-learn/scikit-learn/blob/master/sklearn/ensemble/gradient_boosting.py#L1165
https://github.com/scikit-learn/scikit-learn/blob/fc2f24927fc37d7e42917369f17de045b14c59b5/sklearn/tree/_tree.pyx#L1056
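
If I read that code correctly, the importance of a feature is (roughly) the total
impurity decrease of all splits made on that feature, weighted by the number of samples
reaching each split, summed over the trees and normalized. Here is a minimal sketch in
plain Python of what I think a single tree contributes, assuming the Tree attributes
feature, children_left, children_right, impurity and weighted_n_node_samples mean what
I think they do:

import numpy as np

def tree_feature_importances(tree, n_features):
    # Credit each split's weighted impurity decrease to the feature it splits on.
    importances = np.zeros(n_features)
    n = tree.weighted_n_node_samples
    for node in range(tree.node_count):
        left = tree.children_left[node]
        right = tree.children_right[node]
        if left == -1:          # leaf node: no split, nothing to credit
            continue
        importances[tree.feature[node]] += (
            n[node] * tree.impurity[node]
            - n[left] * tree.impurity[left]
            - n[right] * tree.impurity[right]
        )
    importances /= n[0]         # scale by the weighted sample count at the root
    total = importances.sum()
    if total > 0:
        importances /= total    # normalize so the importances sum to 1
    return importances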

From the literature and elsewhere I understand that Gini impurity is 
calculated. What is this exactly, and how does it relate to the 'gain' vs 
'frequency' importance measures implemented in XGBoost?
http://xgboost.readthedocs.io/en/latest/R-package/discoverYourData.html
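
If I understand that page correctly, 'gain' is the analogous quantity (the total
improvement in the objective contributed by splits on a feature), while 'frequency'
(also called 'weight') simply counts how often a feature is used to split. A small
sketch of how I would pull both out of the Python xgboost API, on made-up data just
for illustration:

import numpy as np
import xgboost as xgb

rng = np.random.RandomState(0)
X = rng.rand(500, 4)
y = X[:, 0] + 2 * X[:, 1] + rng.normal(scale=0.1, size=500)

dtrain = xgb.DMatrix(X, label=y, feature_names=['f0', 'f1', 'f2', 'f3'])
bst = xgb.train({'objective': 'reg:linear', 'max_depth': 3}, dtrain,
                num_boost_round=100)

# 'gain'  : total loss reduction from splits on each feature
# 'weight': how many times each feature is used in a split ('frequency')
print(bst.get_score(importance_type='gain'))
print(bst.get_score(importance_type='weight'))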

My problem is that when I fit exactly the same model in sklearn and in gbm (the R 
package), I get different variable importance plots. One of the variables, which was 
generated randomly (keeping all other variables real), appears to be very important 
in sklearn and very unimportant in gbm. How is it possible that a completely random 
variable gets the highest importance?
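
For what it's worth, the sklearn side of my comparison looks roughly like the sketch
below (synthetic data stands in for my real dataset, and the last column stands in for
the randomly generated variable):

import numpy as np
from sklearn.ensemble import GradientBoostingRegressor

rng = np.random.RandomState(42)
n = 1000
X_real = rng.rand(n, 3)
y = X_real[:, 0] + 2 * X_real[:, 1] + rng.normal(scale=0.1, size=n)

# append a purely random column that has no relation to y
X = np.column_stack([X_real, rng.rand(n)])

model = GradientBoostingRegressor(n_estimators=500, max_depth=3)
model.fit(X, y)

for name, imp in zip(['x0', 'x1', 'x2', 'x_random'], model.feature_importances_):
    print(name, round(imp, 3))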


Many thanks,
Olga

