[scikit-learn] feature importance calculation in gradient boosting

urvesh patel urvesh.patel11 at gmail.com
Thu Apr 20 00:51:49 EDT 2017


I believe your random variable, by chance, has some predictive power. In R,
use the Information package and check the information value of that randomly
created variable. If it is > 0.05, then it has good predictive power.
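(Not the Information/IV calculation itself, but here is a rough Python
analogue of that sanity check: fit a model on the random column alone and
look at its cross-validated score. The data below is a toy stand-in; in
practice X, y and rand_col would be your own training data and the index of
the randomly generated column.)

    import numpy as np
    from sklearn.ensemble import GradientBoostingRegressor
    from sklearn.model_selection import cross_val_score

    # toy stand-in for the real data: column 3 is pure noise
    rng = np.random.RandomState(0)
    X = rng.normal(size=(500, 4))
    y = X[:, 0] + 0.5 * X[:, 1] + rng.normal(scale=0.1, size=500)
    rand_col = 3

    # cross-validated R^2 of a model that only sees the random column;
    # a score near (or below) zero means no standalone predictive power
    scores = cross_val_score(GradientBoostingRegressor(random_state=0),
                             X[:, [rand_col]], y, cv=5, scoring='r2')
    print(scores.mean())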
On Tue, Apr 18, 2017 at 7:47 AM Olga Lyashevska <o.lyashevskaya at gmail.com>
wrote:

> Hi,
>
> I would like to understand how feature importances are calculated in
> gradient boosting regression.
>
> I know that these are the relevant functions:
>
> https://github.com/scikit-learn/scikit-learn/blob/master/sklearn/ensemble/gradient_boosting.py#L1165
>
> https://github.com/scikit-learn/scikit-learn/blob/fc2f24927fc37d7e42917369f17de045b14c59b5/sklearn/tree/_tree.pyx#L1056
>
> From the literature and elsewhere I understand that Gini impurity is
> calculated. What exactly is this, and how does it relate to the 'gain' vs
> 'frequency' importance implemented in XGBoost?
> http://xgboost.readthedocs.io/en/latest/R-package/discoverYourData.html
>
> My problem is that when I fit exactly the same model in sklearn and gbm
> (the R package) I get different variable importance plots. One of the
> variables, which was generated randomly (keeping all other variables real),
> appears to be very important in sklearn and very unimportant in gbm. How is
> it possible that a completely random variable gets the highest importance?
>
>
> Many thanks,
> Olga
> _______________________________________________
> scikit-learn mailing list
> scikit-learn at python.org
> https://mail.python.org/mailman/listinfo/scikit-learn
>
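As for the quoted question of what scikit-learn actually computes: the
feature_importances_ of the tree ensemble are impurity-based. For every
split, the decrease in weighted node impurity is credited to the feature
that was split on, summed over the tree, normalised, and then averaged over
all boosting stages. A minimal sketch of that computation, using the public
tree_ arrays of a fitted model (the toy data and names are illustrative
only, and it assumes every tree makes at least one split):

    import numpy as np
    from sklearn.datasets import make_regression
    from sklearn.ensemble import GradientBoostingRegressor

    def tree_importances(tree):
        # weighted impurity decrease per feature for one fitted tree
        t = tree.tree_
        imp = np.zeros(t.n_features)
        for node in range(t.node_count):
            left, right = t.children_left[node], t.children_right[node]
            if left == -1:  # leaf node: no split, nothing to credit
                continue
            decrease = (t.weighted_n_node_samples[node] * t.impurity[node]
                        - t.weighted_n_node_samples[left] * t.impurity[left]
                        - t.weighted_n_node_samples[right] * t.impurity[right])
            imp[t.feature[node]] += decrease
        return imp / imp.sum()

    X, y = make_regression(n_samples=500, n_features=5, random_state=0)
    gbr = GradientBoostingRegressor(random_state=0).fit(X, y)

    # average the per-tree importances over all boosting stages
    manual = np.mean([tree_importances(t) for t in gbr.estimators_.ravel()],
                     axis=0)

    print(manual)
    print(gbr.feature_importances_)  # should closely match the manual sums

This corresponds to XGBoost's 'gain'-type importance rather than
'frequency' (which only counts how often a feature is used to split).
Impurity-based importance tends to favour features with many possible split
points, which is one way a continuous random column can end up looking
important.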