[scikit-learn] Random Forest with Bootstrapping

Mon Oct 3 15:05:55 EDT 2016

Hi,

Thank you for the reply. Please bear with me for a while.

Where did this number, 0.632, come from? I have no background in statistics
(which seems to be what is needed here!). Or let me rephrase my query: what
is bootstrap sampling all about? I searched the web but didn't find a
satisfactory explanation.

Thanks
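
A note on the 0.632 figure from the quoted reply below: a single random draw
from n rows misses a given row with probability 1 - 1/n, so n independent
draws with replacement miss it with probability (1 - 1/n)^n, which tends to
1/e (about 0.368) as n grows. The remaining 1 - 1/e (about 0.632) is the
chance that a row appears at least once. The following is a minimal sketch
that checks this by simulation; the helper name unique_fraction, the value of
n, and the seed are illustrative choices, not anything from the thread.

```python
import math
import random


def unique_fraction(n, rng):
    """Draw one bootstrap sample of size n; return the fraction of distinct rows in it."""
    drawn = {rng.randrange(n) for _ in range(n)}  # n draws with replacement
    return len(drawn) / n


rng = random.Random(0)  # fixed seed so the run is repeatable
n = 10_000
simulated = sum(unique_fraction(n, rng) for _ in range(20)) / 20
exact = 1 - (1 - 1 / n) ** n   # expected fraction for this particular n
limit = 1 - math.exp(-1)       # value approached as n grows

print(f"simulated: {simulated:.3f}")
print(f"exact:     {exact:.3f}")
print(f"limit:     {limit:.3f}")
```

The rows that a given draw never picks are that tree's out-of-bag rows, which
is why a bootstrap sample of the same size as the dataset still leaves data
over for the OOB estimate.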

On Tue, Oct 4, 2016 at 12:02 AM, Sebastian Raschka <se.raschka at gmail.com>
wrote:

> > From whatever little knowledge I gained last night about Random Forests,
> > each tree is trained with a sub-sample of the original dataset (usually
> > with replacement)?
>
> Yes, that should be correct!
>
> > Now, what I am not able to understand is - if the entire dataset is used
> > to train each of the trees, then how does the classifier estimate the OOB
> > error? None of the entries of the dataset is OOB for any of the trees.
> > (Pardon me if all this sounds BS)
>
> If you take an n-size bootstrap sample, where n is the number of samples
> in your dataset, you have asymptotically 0.632 * n unique samples in your
> bootstrap set. In other words, about 0.368 * n samples are not used for
> growing the respective tree, and these are what the OOB estimate is
> computed on. As far as I understand, the random forest OOB score is then
> computed as the average OOB of each tree (correct me if I am wrong!).
>
> Best,
> Sebastian
>
> > On Oct 3, 2016, at 2:25 PM, Ibrahim Dalal via scikit-learn
> > <scikit-learn at python.org> wrote:
> >
> > Dear Developers,
> >
> > From whatever little knowledge I gained last night about Random Forests,
> > each tree is trained with a sub-sample of the original dataset (usually
> > with replacement)?
> >
> > (Note: Please do correct me if I am not making any sense.)
> >
> > RandomForestClassifier has an option of 'bootstrap'. The API states the
> > following:
> >
> > The sub-sample size is always the same as the original input sample size
> > but the samples are drawn with replacement if bootstrap=True (default).
> >
> > Now, what I am not able to understand is - if the entire dataset is used
> > to train each of the trees, then how does the classifier estimate the OOB
> > error? None of the entries of the dataset is OOB for any of the trees.
> > (Pardon me if all this sounds BS)
> >
> > Help this mere mortal.
> >
> > Thanks
> > _______________________________________________
> > scikit-learn mailing list
> > scikit-learn at python.org
> > https://mail.python.org/mailman/listinfo/scikit-learn
>
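
To see the OOB estimate discussed in the thread in practice, here is a
minimal sketch using RandomForestClassifier with oob_score=True. The
synthetic dataset from make_classification and all parameter values are
assumptions made for illustration only. One hedged remark on the averaging
question raised above: as far as the scikit-learn implementation can be read,
each row is predicted only by the trees whose bootstrap sample missed it, and
the score is computed on those aggregated predictions rather than as a mean
of separate per-tree scores. Treat that as a reading of the library, not as
something confirmed in this thread.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Synthetic two-class data, used only for illustration.
X, y = make_classification(n_samples=500, n_features=10, random_state=0)

clf = RandomForestClassifier(
    n_estimators=100,
    bootstrap=True,   # each tree is fit on an n-size sample drawn with replacement
    oob_score=True,   # also score each row using only trees that never saw it
    random_state=0,
)
clf.fit(X, y)

print("OOB accuracy estimate:", clf.oob_score_)
print("Shape of per-row OOB class probabilities:", clf.oob_decision_function_.shape)
```

With bootstrap=False every tree is fit on the full dataset, so no rows are
left out of bag and the OOB estimate is not available.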