[scikit-learn] Query about use of standard deviation on tree feature_importances_ in demo plot_forest_importances.html

Ian Ozsvald ian at ianozsvald.com
Fri Jun 23 13:12:24 EDT 2017


Hi all. I'm looking at the code behind one of the tree ensemble demos:
http://scikit-learn.org/stable/auto_examples/ensemble/plot_forest_importances.html
and I'm unsure about the error bars.

They are calculated as the standard deviation of the
feature_importances_ attribute across the individual trees. Can we
depend on the per-tree importances being Normally distributed? If not,
I wonder whether the plot tells enough of the story to be genuinely
useful.
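For reference, this is roughly how the demo computes those bars - a
minimal sketch, with placeholder data and forest parameters rather
than the demo's exact settings:

    import numpy as np
    from sklearn.datasets import make_classification
    from sklearn.ensemble import ExtraTreesClassifier

    # Toy data and a small forest (placeholder parameters).
    X, y = make_classification(n_samples=1000, n_features=10,
                               n_informative=3, random_state=0)
    forest = ExtraTreesClassifier(n_estimators=250, random_state=0)
    forest.fit(X, y)

    # Impurity-based importances, averaged over the trees...
    importances = forest.feature_importances_
    # ...and the demo's error bars: std of the per-tree importances.
    std = np.std([tree.feature_importances_
                  for tree in forest.estimators_], axis=0)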

I don't have a strong belief about the likely distribution of
feature_importances_, and I haven't dug into how the importances are
calculated (frankly, I'm a bit lost here). I do know that on a Random
Forest regression case I'm working on I can see both unimodal and
bimodal per-feature importance distributions - this came up in a
discussion on the yellowbrick sklearn visualisation package:
https://github.com/DistrictDataLabs/yellowbrick/pull/195
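One way to check the shape directly, rather than assuming Normality,
is to histogram the per-tree importances for a single feature (a
sketch reusing the forest above; feature index 0 is arbitrary):

    import matplotlib.pyplot as plt

    # Shape (n_trees, n_features): one importance vector per tree.
    per_tree = np.array([tree.feature_importances_
                         for tree in forest.estimators_])
    plt.hist(per_tree[:, 0], bins=30)
    plt.xlabel("importance of feature 0 across trees")
    plt.ylabel("number of trees")
    plt.show()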

I don't know what is "normal" for feature importance distributions, or
whether they look different between classification tasks (as in the
plot_forest_importances demo) and regression tasks. Maybe I've got an
outlier in my task? If I use the provided demo code then my error bars
can extend below zero, which feels unhelpful given that feature
importances are non-negative by construction.
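If the symmetric std bars are the problem, one alternative (just an
idea, not what the demo does) would be asymmetric error bars built
from percentiles of the per-tree importances, which by construction
stay within the observed range:

    # Median bar heights with 5th/95th percentile whiskers (the
    # percentiles are arbitrary choices); med - lo and hi - med are
    # non-negative by construction, so the bars cannot go below zero.
    med = np.median(per_tree, axis=0)
    lo, hi = np.percentile(per_tree, [5, 95], axis=0)
    indices = np.argsort(med)[::-1]
    yerr = np.vstack([med - lo, hi - med])[:, indices]
    plt.bar(range(per_tree.shape[1]), med[indices], yerr=yerr)
    plt.show()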

Does anyone have an opinion? Perhaps more importantly: is a visual
indication of the spread of feature importances in an ensemble
actually a useful thing to plot? Does it have diagnostic value?

I saw Sebastian Raschka's reference to Gilles Louppe et al.'s NIPS
paper on variable importances (in this list, 2016-05-17); I'll dig
into that if nobody has a strong opinion. BTW Sebastian - thanks for
writing your book.

Cheers, Ian.

-- 
Ian Ozsvald (Data Scientist, PyDataLondon co-chair)
ian at IanOzsvald.com

http://IanOzsvald.com
http://ModelInsight.io
http://twitter.com/IanOzsvald
