[scikit-learn] Scores in Cross Validation

Raga Markely raga.markely at gmail.com
Thu Jan 26 20:06:06 EST 2017


Got it.. thank you for the clarification, Sebastian & Guillaume..
appreciate it!

Best,
Raga

On Thu, Jan 26, 2017 at 6:41 PM, Guillaume Lemaître <g.lemaitre58 at gmail.com>
wrote:

> I didn't express myself well. When I wrote:
>
> > model selection via k-fold on the training set
>
> I meant the training/validation set :D
>
> On 27 January 2017 at 00:37, Sebastian Raschka <se.raschka at gmail.com>
> wrote:
>
>> > Furthermore, a training, validation, and testing set should be used
>> > when setting up parameters.
>>
>> Usually, it's better to use a training set and a separate test set, and
>> do model selection via k-fold CV on the training set. Then you do the
>> final model evaluation on the test set that you haven't touched before.
>> I often use the "training, validation, and testing" approach as well,
>> though, especially when working with large datasets and for early
>> stopping on neural nets.
>>
>> Best,
>> Sebastian
>>
>>
>> > On Jan 26, 2017, at 1:19 PM, Raga Markely <raga.markely at gmail.com>
>> wrote:
>> >
>> > Thank you, Guillaume.
>> >
>> > 1. I agree with you - that's what I have been learning and makes
>> > sense.. I was a bit surprised when I read the paper today..
>> >
>> > 2. Ah.. thank you.. I've got to change my glasses :P
>> >
>> > Best,
>> > Raga
>> >
>> > On Thu, Jan 26, 2017 at 12:05 PM, Guillaume Lemaître
>> > <g.lemaitre58 at gmail.com> wrote:
>> >
>> > 1. You should not evaluate an estimator on the data that were used to
>> > train it. During training, you minimize the classification error or
>> > loss on those data and fit them as well as possible, so evaluating on
>> > an unseen testing set is what gives you an idea of how well your
>> > estimator generalizes to your problem. Furthermore, a training,
>> > validation, and testing set should be used when setting up parameters:
>> > the validation set is used to set the parameters, and the testing set
>> > is used to evaluate your best estimator.
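>> >
>> > For illustration, a minimal sketch of such a three-way split; the
>> > split ratios, dataset, estimator, and parameter values are arbitrary
>> > assumptions:
>> >
>> >     from sklearn.datasets import load_iris
>> >     from sklearn.model_selection import train_test_split
>> >     from sklearn.svm import SVC
>> >
>> >     X, y = load_iris(return_X_y=True)
>> >
>> >     # Carve off the final testing set first, then split the rest
>> >     # into training and validation sets.
>> >     X_rest, X_test, y_rest, y_test = train_test_split(
>> >         X, y, test_size=0.2, random_state=0)
>> >     X_train, X_val, y_train, y_val = train_test_split(
>> >         X_rest, y_rest, test_size=0.25, random_state=0)
>> >
>> >     # The validation set picks the parameter...
>> >     scores = {C: SVC(C=C).fit(X_train, y_train).score(X_val, y_val)
>> >               for C in [0.1, 1.0, 10.0]}
>> >     best_C = max(scores, key=scores.get)
>> >
>> >     # ...and the testing set is used once, to evaluate the winner.
>> >     final = SVC(C=best_C).fit(X_train, y_train)
>> >     print(best_C, final.score(X_test, y_test))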
>> >
>> > That is why, when using GridSearchCV, fit will train the estimator
>> > using a training and a validation set (following a given CV strategy).
>> > Finally, predict will be performed on another, unseen testing set.
>> >
>> > The bottom line is that using the training data to select parameters
>> > will not ensure that you are selecting the best parameters for your
>> > problem.
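>> >
>> > A quick way to see this, sketched with an arbitrary dataset and an
>> > overfitting-prone estimator: the score on the data used for fitting
>> > tends to keep improving with model complexity even when the held-out
>> > score does not.
>> >
>> >     from sklearn.datasets import load_iris
>> >     from sklearn.model_selection import cross_val_score
>> >     from sklearn.tree import DecisionTreeClassifier
>> >
>> >     X, y = load_iris(return_X_y=True)
>> >
>> >     for depth in [1, 3, None]:
>> >         tree = DecisionTreeClassifier(max_depth=depth, random_state=0)
>> >         # Score on the same data used for fitting: optimistic.
>> >         train_score = tree.fit(X, y).score(X, y)
>> >         # Score on held-out folds: a fairer basis for selection.
>> >         cv_score = cross_val_score(tree, X, y, cv=5).mean()
>> >         print(depth, train_score, cv_score)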
>> >
>> > 2. The function is called in _fit_and_score, e.g. on lines 260 and 263.
>> >
>> > On 26 January 2017 at 17:02, Raga Markely <raga.markely at gmail.com>
>> > wrote:
>> >
>> > > Hello,
>> > >
>> > > I have 2 questions regarding cross_val_score.
>> > >
>> > > 1. Do the scores returned by cross_val_score correspond to only the
>> > > test set or the whole data set (training and test sets)? I tried to
>> > > look at the source code, and it looks like it returns the score of
>> > > only the test set (line 145: "return_train_score=False") - I am not
>> > > sure if I am reading the code properly, though..
>> > > https://github.com/scikit-learn/scikit-learn/blob/14031f6/sklearn/model_selection/_validation.py#L36
>> > >
>> > > I came across the paper below, and the authors use the score of the
>> > > whole dataset when they perform the repeated nested loop, grid
>> > > search CV, etc.. e.g. see algorithms 1 (line 1c) and 2 (line 2d) on
>> > > page 3.
>> > > https://jcheminf.springeropen.com/articles/10.1186/1758-2946-6-10
>> > > I wonder what the pros and cons are of using the accuracy score of
>> > > the whole dataset vs just the test set.. any thoughts?
>> > >
>> > > 2. On line 283 of the cross_val_score source code, there is a
>> > > function _score. However, I can't find where this function is
>> > > called. Could you let me know where this function is called?
>> > >
>> > > Thank you very much!
>> > > Raga
>> >
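>> > As an aside on question 1: each value that cross_val_score returns is
>> > computed on the held-out fold of the corresponding split only. A
>> > minimal sketch, with an arbitrary dataset and estimator:
>> >
>> >     from sklearn.datasets import load_iris
>> >     from sklearn.linear_model import LogisticRegression
>> >     from sklearn.model_selection import cross_val_score
>> >
>> >     X, y = load_iris(return_X_y=True)
>> >
>> >     # One accuracy per CV split, each measured on that split's
>> >     # held-out test fold, never on the folds used for fitting.
>> >     scores = cross_val_score(LogisticRegression(), X, y, cv=5)
>> >     print(scores)        # 5 test-fold accuracies
>> >     print(scores.mean())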
>> >
>> > --
>> > Guillaume Lemaitre
>> > INRIA Saclay - Ile-de-France
>> > Equipe PARIETAL
>> >
>> > guillaume.lemaitre at inria.fr ---
>> >
>> > https://glemaitre.github.io/
>>
>>
>
>
>
> --
> Guillaume Lemaitre
> INRIA Saclay - Ile-de-France
> Equipe PARIETAL
> guillaume.lemaitre at inria.fr ---
> https://glemaitre.github.io/
>
> _______________________________________________
> scikit-learn mailing list
> scikit-learn at python.org
> https://mail.python.org/mailman/listinfo/scikit-learn
>
>

