[scikit-learn] Problem with nested cross-validation example?

Daniel Homola daniel.homola11 at imperial.ac.uk
Tue Nov 29 06:01:04 EST 2016


Sorry, should've done that.

Thanks for the PR. To me, it isn't the concept of nested CV that needs 
a more detailed explanation, but its implementation in scikit-learn.

I think it's not at all obvious to a newcomer (heck, I've been using it 
on and off for years and even I got confused) that the clf GridSearchCV 
object carries its inner CV object into the cross_val_score function, 
which has its own outer CV object. Unless you know that in scikit-learn 
the CV object of an estimator is *NOT* overridden by the 
cross_val_score function's cv parameter, and that the combination 
instead results in a nested CV, you simply cannot work out why this 
example works. I think this is the confusing bit. Do you want me to add 
comments that highlight this issue?
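
To make the point concrete, here is a rough sketch of what I believe the two-liner amounts to when the outer loop is written out by hand. The parameter grid, the fold counts and the random seeds are my own placeholder choices, not the values from the example, and the explicit loop is only my reading of what cross_val_score does with a GridSearchCV estimator, not a copy of its internals.

```python
import numpy as np
from sklearn.base import clone
from sklearn.datasets import load_iris
from sklearn.model_selection import GridSearchCV, KFold, cross_val_score
from sklearn.svm import SVC

X_iris, y_iris = load_iris(return_X_y=True)

# Placeholder grid and splitters (assumptions made for this sketch).
p_grid = {"C": [1, 10, 100], "gamma": [0.01, 0.1]}
inner_cv = KFold(n_splits=4, shuffle=True, random_state=0)
outer_cv = KFold(n_splits=4, shuffle=True, random_state=1)

# The grid search keeps its own splitter: inner_cv is stored on clf.
clf = GridSearchCV(estimator=SVC(kernel="rbf"), param_grid=p_grid, cv=inner_cv)

# Compact form: outer_cv only decides the outer train/test splits.
# It does not replace the cv stored on clf.
nested_scores = cross_val_score(clf, X=X_iris, y=y_iris, cv=outer_cv)

# Explicit form of the same idea.
manual_scores = []
for train_idx, test_idx in outer_cv.split(X_iris, y_iris):
    search = clone(clf)  # fresh, unfitted copy for each outer split
    # Inner loop: the grid search splits the outer training data with
    # inner_cv, picks the best parameters and refits on all of it.
    search.fit(X_iris[train_idx], y_iris[train_idx])
    # Outer loop: the held-out fold is only used for scoring.
    manual_scores.append(search.score(X_iris[test_idx], y_iris[test_idx]))
manual_scores = np.array(manual_scores)

print(nested_scores)
print(manual_scores)
```

With fixed seeds I would expect both arrays to match, which is the behaviour that is hard to see from the short version alone.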


On 29/11/16 10:48, Joel Nothman wrote:
> Wait an hour for the docs to build and you won't get artifact not 
> found :)
>
> If you'd looked at the PR diff, you'd see I've modified the 
> description to refer directly to GridSearchCV and cross_val_score:
>
>     In the inner loop (here executed by |GridSearchCV|), the score is
>     approximately maximized by fitting a model to each training set,
>     and then directly maximized in selecting (hyper)parameters over
>     the validation set. In the outer loop (here in |cross_val_score|), ...
>
>
> Further comments in the code are welcome.
>
> On 29 November 2016 at 21:42, Albert Thomas
> <albertthomas88 at gmail.com> wrote:
>
>     I also get "artifact not found". And I agree with Daniel.
>
>     Once you decompose what the code is doing you realize that it does
>     the job. The simplicity of the code to perform nested cross
>     validation using scikit learn objects is impressive but I guess it
>     also makes it less obvious. So making the example clearer by
>     explaining what the code does or by adding a few comments can be
>     useful for others.
>
>     Albert
>
>     On Tue, 29 Nov 2016 at 11:19, Daniel Homola
>     <daniel.homola11 at imperial.ac.uk> wrote:
>
>         Hi Joel,
>
>         Thanks a lot for the answer.
>
>         "Each train/test split in cross_val_score holds out test data.
>         GridSearchCV then splits each train set into (inner-)train and
>         validation sets. "
>
>         I know this is what nested CV is supposed to do, but the code
>         is doing an excellent job of obscuring this. I'll try to add
>         some clarification as comments later today.
>
>         Cheers,
>
>         d
>
>
>         On 29/11/16 00:07, Joel Nothman wrote:
>>         If that clarifies, please offer changes to the example (as a
>>         pull request) that make this clearer.
>>
>>         On 29 November 2016 at 11:06, Joel Nothman
>>         <joel.nothman at gmail.com> wrote:
>>
>>             Briefly:
>>
>>             clf = GridSearchCV(estimator=svr, param_grid=p_grid, cv=inner_cv)
>>             nested_score = cross_val_score(clf, X=X_iris, y=y_iris, cv=outer_cv)
>>
>>
>>             Each train/test split in cross_val_score holds out test
>>             data. GridSearchCV then splits each train set into
>>             (inner-)train and validation sets. There is no leakage of
>>             test set knowledge from the outer loop into the grid
>>             search optimisation; no leakage of validation set
>>             knowledge into the SVR optimisation. The outer test data
>>             are reused as training data, but within each split are
>>             only used to measure generalisation error.
>>
>>             Is that clear?
>>
>>             On 29 November 2016 at 10:30, Daniel Homola
>>             <dani.homola at gmail.com> wrote:
>>
>>                 Dear all,
>>
>>
>>                 I was wondering if the following example code is valid:
>>
>>                 http://scikit-learn.org/stable/auto_examples/model_selection/plot_nested_cross_validation_iris.html
>>
>>                 My understanding is that the point of nested
>>                 cross-validation is to prevent any data leakage from
>>                 the inner grid-search/param optimization CV loop into
>>                 the outer model evaluation CV loop. This could be
>>                 achieved if the outer CV loop's test data is
>>                 completely separated from the inner loop's CV, as
>>                 shown here:
>>
>>                 https://mlr-org.github.io/mlr-tutorial/release/html/img/nested_resampling.png
>>
>>
>>                 The code in the above example however doesn't seem to
>>                 achieve this in any way.
>>
>>
>>                 Am I missing something here?
>>
>>
>>                 Thanks a lot,
>>
>>                 dh
>>
>>
>>                 _______________________________________________
>>                 scikit-learn mailing list
>>                 scikit-learn at python.org <mailto:scikit-learn at python.org>
>>                 https://mail.python.org/mailman/listinfo/scikit-learn
>>                 <https://mail.python.org/mailman/listinfo/scikit-learn>
>>
>>
>>
>>
>>
>
>