[scikit-learn] Random Forest max_features and bootstrap construction parameters interpretation

Brown J.B. jbbrown at kuhp.kyoto-u.ac.jp
Mon Jun 5 10:46:27 EDT 2017


Dear community,

This is a question regarding how to interpret the documentation and
semantics of the random forest constructors.

In forest.py (of version 0.17, which I am still using), the documentation
regarding the number of features to consider states, on lines 742-745 of
the source code, that the search may effectively inspect more than
`max_features` features when determining the features to pick from in order
to split a node.
It also states that this behavior is tree-specific.

Am I correct in:

Interpretation #1 - For bootstrap=True, sampling with replacement occurs
over the available training instances, meaning that the subsample presented
to a particular tree will have some probability of containing duplicates
and therefore will not cover the full input training set; but for
bootstrap=False, the entire dataset will be presented to each tree?
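If it helps frame the question, here is a minimal sketch of the resampling
described in Interpretation #1 (an illustration with NumPy, not
scikit-learn's internal code): drawing n indices with replacement from n
training instances and counting how many distinct instances one tree would
actually see.

```python
import numpy as np

# Illustrative only: mimic a bootstrap=True draw for one tree.
rng = np.random.RandomState(0)
n_samples = 6000

# Sample n_samples indices *with replacement* from the training set.
indices = rng.randint(0, n_samples, n_samples)
unique_fraction = np.unique(indices).size / float(n_samples)

# With replacement, a tree sees roughly 1 - 1/e (about 63.2%) of the
# distinct training instances; the rest of its sample are duplicates.
print(round(unique_fraction, 3))
```

The ~63.2% figure is the expected fraction of distinct instances in a
bootstrap sample as n grows large, which matches the "overlaps and
therefore not the full input training set" reading above.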

Interpretation #2 - In particular, given the documentation's statement that
"The sub-sample size is always the same as the original input sample
size...", it seems to me that bootstrap=False provides the entire training
dataset to each decision tree, and that which feature is randomly selected
first from the features given determines what the tree will become.
That would suggest that, if bootstrap=False and the number of trees is high
but the feature dimensionality is very low, there is a high possibility
that multiple copies of the same tree will emerge from the forest.
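A quick way to probe Interpretation #2 is the extreme case: a single
feature. The sketch below (a hypothetical illustration, with made-up data)
fits a forest with bootstrap=False on one-dimensional, perfectly separable
data; since every tree then sees the same data and there is only one
feature to randomize over, the trees should come out identical.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

# Hypothetical toy data: one feature, labels determined by a threshold.
rng = np.random.RandomState(0)
X = rng.rand(200, 1)
y = (X[:, 0] > 0.5).astype(int)

# bootstrap=False: every tree trains on the full dataset.
forest = RandomForestClassifier(n_estimators=10, bootstrap=False,
                                random_state=0).fit(X, y)

# Compare the split thresholds of each tree's internal structure.
thresholds = [tuple(est.tree_.threshold) for est in forest.estimators_]
print(len(set(thresholds)))  # number of distinct tree structures
```

In the general low-dimensional case the trees need not be byte-identical
(per-split feature shuffling and tie-breaking can still differ), but the
intuition that diversity collapses as dimensionality shrinks holds.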

Interpretation #3 - The feature subset is not subsampled per tree; rather,
all features are presented for the subsampled training data provided to a
tree?  For example, if the dimensionality is 400 on a 6000-input training
dataset that has been randomly subsampled (with bootstrap=True) to yield
4700 unique training samples, then the tree builder will consider all 400
dimensions/features with respect to the 4700 samples, picking at most
`max_features` features (out of 400) for building splits in the tree?  So
by default (sqrt/auto), there would be at most 20 splits in the tree?
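One empirical check on the last part of Interpretation #3 (a sketch with
made-up data, smaller than the 6000x400 example but with the same
dimensionality): if max_features limited the whole tree, no tree could use
more than sqrt(400) = 20 distinct features; if instead it limits the
candidates examined *per split*, a deep tree can use far more.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

# Hypothetical data: 400 features, noisy labels so the tree grows deep.
rng = np.random.RandomState(0)
X = rng.rand(1000, 400)
y = rng.randint(0, 2, 1000)

forest = RandomForestClassifier(n_estimators=1, max_features="sqrt",
                                random_state=0).fit(X, y)
tree = forest.estimators_[0].tree_

# tree.feature holds the feature index used at each internal node;
# leaves are marked with a negative sentinel value.
used = np.unique(tree.feature[tree.feature >= 0])
print(used.size)  # distinct features actually used across all splits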

Confirmations, denials, and corrections to my interpretations are _highly_
welcome.

As always, my great thanks to the community.

With kind regards,
J.B. Brown
Kyoto University Graduate School of Medicine