[scikit-learn] How is linear regression in scikit-learn done? Do you need train and test split?

Matthieu Brucher matthieu.brucher at gmail.com
Wed Jun 5 02:43:28 EDT 2019


Hi CW,

It's not about the concept of a black box; none of the algorithms in
sklearn is a black box. The question is about model validity: is linear
regression a valid representation of your data? That's what the train/test
split answers. You may think it is, but only this process will answer it properly.
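
For instance, with your own feature matrix X and target y, a minimal
sketch of that process looks like this:

    from sklearn.linear_model import LinearRegression
    from sklearn.model_selection import train_test_split

    # Hold out 25% of the data; the model never sees it during fitting.
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=0.25, random_state=0)

    model = LinearRegression().fit(X_train, y_train)

    # If R^2 on the held-out data is much worse than on the training data,
    # a linear model is probably not a valid representation of your data.
    print("train R^2:", model.score(X_train, y_train))
    print("test  R^2:", model.score(X_test, y_test))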

Matthieu

On Wed, Jun 5, 2019 at 01:46, C W <tmrsg11 at gmail.com> wrote:

> Thank you all for the replies.
>
> I agree that prediction accuracy is great for evaluating black-box ML
> models, especially advanced models like neural networks, or not-so-black
> models like LASSO, since they are hard to interpret directly.
>
> Linear regression is not a black box. I view prediction accuracy as
> overkill for interpretable models, especially when you can use R-squared,
> coefficient significance, etc.
>
> Prediction accuracy also does not tell you which feature is important.
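>
> For a plain linear model, something like statsmodels gives those
> diagnostics directly (X and y below stand in for your own design matrix
> and response):
>
>     import statsmodels.api as sm
>
>     # X, y stand in for your own design matrix and response.
>     ols = sm.OLS(y, sm.add_constant(X)).fit()
>     print(ols.summary())   # R-squared plus per-coefficient t-statistics
>     print(ols.rsquared)    # in-sample R-squared
>     print(ols.pvalues)     # which coefficients are significant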
>
> What do you guys think? Thank you!
>
> On Mon, Jun 3, 2019 at 11:43 AM Andreas Mueller <t3kcit at gmail.com> wrote:
>
>> This classic paper on statistical practice (Breiman's "Two Cultures")
>> might be helpful for understanding the different viewpoints:
>>
>> https://projecteuclid.org/euclid.ss/1009213726
>>
>>
>> On 6/3/19 12:19 AM, Brown J.B. via scikit-learn wrote:
>>
>>> As far as I understand: Holding out a test set is recommended if you
>>> aren't entirely sure that the assumptions of the model are held (gaussian
>>> error on a linear fit; independent and identically distributed samples).
>>> The model evaluation approach in predictive ML, using held-out data, relies
>>> only on the weaker assumption that the metric you have chosen, when applied
>>> to the test set you have held out, forms a reasonable measure of
>>> generalised / real-world performance. (Of course this assumption, too,
>>> often does not hold in practice, but it is the primary one, in my opinion,
>>> that ML practitioners need to be careful of.)
>>>
>>
>> Dear CW,
>> As Joel has said, holding out a test set will help you evaluate the
>> validity of model assumptions, and his last point (reasonable measure of
>> generalised performance) is absolutely essential for understanding the
>> capabilities and limitations of ML.
>>
>> To add to your checklist for interpreting ML papers properly, be cautious
>> about reports of high performance obtained with 5/10-fold or Leave-One-Out
>> cross-validation on large datasets, where "large" depends on the nature of
>> the problem setting.
>> Results are also highly dependent on the distributions of the underlying
>> independent variables (e.g., 60000 datapoints all with near-identical
>> distributions may yield phenomenal performance in cross-validation and be
>> almost non-predictive in truly unknown/prospective situations).
>> Even at 500 datapoints, if independent variable distributions look
>> similar (with similar endpoints), then when each model is trained on 80% of
>> that data, the remaining 20% will certainly be predictable, and repeating
>> that five times will yield statistics that seem impressive.
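>>
>> A toy sketch of that failure mode, with a single feature and a mildly
>> non-linear ground truth:
>>
>>     import numpy as np
>>     from sklearn.linear_model import LinearRegression
>>     from sklearn.model_selection import cross_val_score
>>
>>     rng = np.random.RandomState(0)
>>
>>     # 500 points whose single feature comes from the same narrow interval.
>>     X = rng.uniform(0, 1, size=(500, 1))
>>     y = np.exp(X.ravel()) + rng.normal(0, 0.05, size=500)
>>
>>     # 5-fold CV: every held-out 20% looks just like the 80% used for
>>     # fitting, so a straight line scores close to a perfect R^2.
>>     print(cross_val_score(LinearRegression(), X, y, cv=5).mean())
>>
>>     # "Prospective" data from a region never seen during training:
>>     # the same straight line is now badly wrong (negative R^2).
>>     X_new = rng.uniform(3, 4, size=(200, 1))
>>     y_new = np.exp(X_new.ravel()) + rng.normal(0, 0.05, size=200)
>>     print(LinearRegression().fit(X, y).score(X_new, y_new))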
>>
>> So, again, while problem context completely dictates ML experiment
>> design, metric selection, and interpretation of outcome, my personal rule
>> of thumb is to do no more than 2-fold cross-validation (50% train, 50%
>> predict) once I have 100+ datapoints.
>> More extreme still, try 33% for training and 67% for validation (or
>> even 20/80).
>> If your model still reports good statistics, then you can believe that
>> the patterns in the training data extrapolate well to the ones in the
>> external validation data.
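>>
>> In scikit-learn terms, that rule of thumb would look something like this
>> (X and y being your own data):
>>
>>     from sklearn.linear_model import LinearRegression
>>     from sklearn.model_selection import ShuffleSplit, cross_val_score
>>
>>     # Train on 33% of the data and predict the remaining 67%,
>>     # repeated over a few random partitions.
>>     splitter = ShuffleSplit(n_splits=5, train_size=0.33, test_size=0.67,
>>                             random_state=0)
>>     scores = cross_val_score(LinearRegression(), X, y, cv=splitter)
>>     print(scores.mean(), scores.std())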
>>
>> Hope this helps,
>> J.B.
>>
>>
>>
>>
> _______________________________________________
> scikit-learn mailing list
> scikit-learn at python.org
> https://mail.python.org/mailman/listinfo/scikit-learn
>


-- 
Quantitative researcher, Ph.D.
Blog: http://blog.audio-tk.com/
LinkedIn: http://www.linkedin.com/in/matthieubrucher