[scikit-learn] Problem with nested cross-validation example?
Daniel Homola
daniel.homola11 at imperial.ac.uk
Tue Nov 29 06:01:04 EST 2016
Sorry, should've done that.
Thanks for the PR. To me it isn't the actual concept of nested CV that
needs more detailed explanation but the implementation in scikit-learn.
I think it's not obvious at all for a newcomer (heck, I've been using it
for years on and off and even I got confused) that the clf GridSearch
object will carry its inner CV object into the cross_val_score
function, which has its own outer CV object. Unless you know that in
scikit-learn the CV object of an estimator is *NOT* overridden by the
cross_val_score function's cv parameter, but that the two instead
combine into a nested CV, you simply cannot work out why this example
works. This is the confusing bit, I think. Do you want me to add
comments that highlight this issue?
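
To make the point concrete, here is a minimal sketch of what the two-line
example does when the outer loop is written out by hand. The estimator
(an SVC on iris), the parameter grid, the fold counts and the random
seeds are assumptions chosen for illustration, not necessarily what the
example in the docs uses.

```python
import numpy as np
from sklearn.base import clone
from sklearn.datasets import load_iris
from sklearn.model_selection import GridSearchCV, KFold, cross_val_score
from sklearn.svm import SVC

X, y = load_iris(return_X_y=True)

# Hypothetical grid and splitters, for illustration only.
p_grid = {"C": [1, 10, 100], "gamma": [0.01, 0.1]}
inner_cv = KFold(n_splits=4, shuffle=True, random_state=0)
outer_cv = KFold(n_splits=4, shuffle=True, random_state=1)

# The grid search keeps its own (inner) CV object.
clf = GridSearchCV(estimator=SVC(kernel="rbf"), param_grid=p_grid, cv=inner_cv)

# Compact form: cv=outer_cv only controls the outer train/test splits.
# It does not replace the cv stored inside clf.
nested_scores = cross_val_score(clf, X=X, y=y, cv=outer_cv)

# The same procedure with the outer loop spelled out.
manual_scores = []
for train_idx, test_idx in outer_cv.split(X, y):
    search = clone(clf)
    # Inner loop: the grid search splits the outer training set again
    # with inner_cv and never sees the outer test indices.
    search.fit(X[train_idx], y[train_idx])
    # Outer loop: the refit best model is scored on the held-out data.
    manual_scores.append(search.score(X[test_idx], y[test_idx]))
manual_scores = np.array(manual_scores)

print(nested_scores)
print(manual_scores)
```

With fixed seeds on both splitters, the two arrays should match, which is
what I mean by the inner CV being carried along rather than replaced.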
On 29/11/16 10:48, Joel Nothman wrote:
> Wait an hour for the docs to build and you won't get artifact not
> found :)
>
> If you'd looked at the PR diff, you'd see I've modified the
> description to refer directly to GridSearchCV and cross_val_score:
>
> In the inner loop (here executed by |GridSearchCV|), the score is
> approximately maximized by fitting a model to each training set,
> and then directly maximized in selecting (hyper)parameters over
> the validation set. In the outer loop (here in |cross_val_score|), ...
>
>
> Further comments in the code are welcome.
>
> On 29 November 2016 at 21:42, Albert Thomas <albertthomas88 at gmail.com
> <mailto:albertthomas88 at gmail.com>> wrote:
>
> I also get "artifact not found". And I agree with Daniel.
>
> Once you decompose what the code is doing you realize that it does
> the job. The simplicity of the code to perform nested cross
> validation using scikit learn objects is impressive but I guess it
> also makes it less obvious. So making the example clearer by
> explaining what the code does or by adding a few comments can be
> useful for others.
>
> Albert
>
> On Tue, 29 Nov 2016 at 11:19, Daniel Homola
> <daniel.homola11 at imperial.ac.uk
> <mailto:daniel.homola11 at imperial.ac.uk>> wrote:
>
> Hi Joel,
>
> Thanks a lot for the answer.
>
> "Each train/test split in cross_val_score holds out test data.
> GridSearchCV then splits each train set into (inner-)train and
> validation sets. "
>
> I know this is what nested CV is supposed to do but the code is
> doing an excellent job at obscuring this. I'll try and add
> some clarification in as comments later today.
>
> Cheers,
>
> d
>
>
> On 29/11/16 00:07, Joel Nothman wrote:
>> If that clarifies, please offer changes to the example (as a
>> pull request) that make this clearer.
>>
>> On 29 November 2016 at 11:06, Joel Nothman
>> <joel.nothman at gmail.com <mailto:joel.nothman at gmail.com>> wrote:
>>
>> Briefly:
>>
>> clf = GridSearchCV(estimator=svr, param_grid=p_grid, cv=inner_cv)
>> nested_score = cross_val_score(clf, X=X_iris, y=y_iris, cv=outer_cv)
>>
>>
>> Each train/test split in cross_val_score holds out test
>> data. GridSearchCV then splits each train set into
>> (inner-)train and validation sets. There is no leakage of
>> test set knowledge from the outer loop into the grid
>> search optimisation; no leakage of validation set
>> knowledge into the SVR optimisation. The outer test data
>> are reused as training data, but within each split are
>> only used to measure generalisation error.
>>
>> Is that clear?
>>
>> On 29 November 2016 at 10:30, Daniel Homola
>> <dani.homola at gmail.com <mailto:dani.homola at gmail.com>> wrote:
>>
>> Dear all,
>>
>>
>> I was wondering if the following example code is valid:
>>
>> http://scikit-learn.org/stable/auto_examples/model_selection/plot_nested_cross_validation_iris.html
>>
>> My understanding is that the point of nested
>> cross-validation is to prevent any data leakage from
>> the inner grid-search/param optimization CV loop into
>> the outer model evaluation CV loop. This could be
>> achieved if the outer CV loop's test data is
>> completely separated from the inner loop's CV, as
>> shown here:
>>
>> https://mlr-org.github.io/mlr-tutorial/release/html/img/nested_resampling.png
>>
>>
>> The code in the above example however doesn't seem to
>> achieve this in any way.
>>
>>
>> Am I missing something here?
>>
>>
>> Thanks a lot,
>>
>> dh
>>
>>
>> _______________________________________________
>> scikit-learn mailing list
>> scikit-learn at python.org <mailto:scikit-learn at python.org>
>> https://mail.python.org/mailman/listinfo/scikit-learn
>> <https://mail.python.org/mailman/listinfo/scikit-learn>