[scikit-learn] anti-correlated predictions by SVR

Tue Sep 26 12:56:12 EDT 2017

I used a classification example for didactic purposes. My hypothesis still holds for regression: splitting the data creates anti-correlations between the train and test sets (a depletion effect).

Basically, you shouldn't work with datasets that small.
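
A minimal sketch of the depletion effect, under stated assumptions: the features are pure noise, and the SVR is swapped for scikit-learn's dummy baselines (most-frequent class, training mean) so that the effect is exact rather than noisy. The function name loo_depletion_demo and its parameters are hypothetical, chosen only for this illustration.

```python
import numpy as np
from sklearn.dummy import DummyClassifier, DummyRegressor
from sklearn.model_selection import LeaveOneOut, cross_val_predict


def loo_depletion_demo(n_samples=20, seed=0):
    """Show how leave-one-out splitting anti-correlates train and test.

    Illustrative only: baseline estimators stand in for the SVR.
    n_samples is assumed to be even so the classes are balanced.
    """
    rng = np.random.RandomState(seed)
    X = rng.normal(size=(n_samples, 3))  # noise features, no real signal

    # Classification: 10 samples of class 0 and 10 of class 1.
    # Leaving one sample out makes its class the minority in the
    # training fold, so a majority-class predictor is always wrong.
    y_cls = np.array([0, 1] * (n_samples // 2))
    pred_cls = cross_val_predict(
        DummyClassifier(strategy="most_frequent"), X, y_cls, cv=LeaveOneOut()
    )
    accuracy = float(np.mean(pred_cls == y_cls))

    # Regression: the training mean is (sum(y) - y_i) / (n - 1), which
    # goes down as the held-out y_i goes up. The held-out predictions
    # are therefore a decreasing linear function of the true values.
    y_reg = rng.normal(size=n_samples)
    pred_reg = cross_val_predict(
        DummyRegressor(strategy="mean"), X, y_reg, cv=LeaveOneOut()
    )
    pearson_r = float(np.corrcoef(y_reg, pred_reg)[0, 1])

    return accuracy, pearson_r


if __name__ == "__main__":
    acc, r = loo_depletion_demo()
    print("LOO accuracy of majority-class baseline:", acc)
    print("LOO Pearson R of mean baseline:", r)
```

With these baselines the leave-one-out accuracy is 0 and the Pearson R is -1 (up to floating-point error), even though the data contain no signal at all. A real SVR on 16 observations would behave less cleanly, but the same mechanism can push its held-out correlations below zero.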

Gaël

Sent from my phone, please excuse typos and briefness

On Sep 26, 2017, at 18:51, Thomas Evangelidis <tevang3 at gmail.com> wrote:
>I have very small training sets (10-50 observations). Currently, I am
>working with 16 observations for training and 25 for validation (external
>test set). And I am doing regression, not classification (hence the SVR
>instead of SVC).
>
>
>On 26 September 2017 at 18:21, Gael Varoquaux
><gael.varoquaux at normalesup.org> wrote:
>
>> Hypothesis: you have a very small dataset, and when you leave out data
>> you create a distribution shift between the train and the test sets. A
>> simplified example: 20 samples, 10 of class a, 10 of class b. A
>> leave-one-out cross-validation will create a training set of 10 samples
>> of one class and 9 samples of the other, and the test sample belongs to
>> the class that is in the minority in the training set.
>>
>> G
>>
>> On Tue, Sep 26, 2017 at 06:10:39PM +0200, Thomas Evangelidis wrote:
>> > Greetings,
>>
>> > I don't know if anyone has encountered this before, but sometimes I get
>> > anti-correlated predictions from the SVR that I am training. Namely, the
>> > Pearson's R and Kendall's tau are negative when I compare the predictions
>> > on the external test set with the true values. However, the SVR
>> > predictions on the training set have positive correlations with the
>> > experimental values, and hence I can't think of a way to know in advance
>> > whether the trained SVR will produce anti-correlated predictions, so that
>> > I could change their sign and avoid the disaster. Here is an example of
>> > what I mean:
>>
>> > Training set predictions: R=0.452422, tau=0.333333
>> > External test set predictions: R=-0.537420, tau=-0.300000
>>
>> > Obviously, in a real-case scenario where I wouldn't have the external
>> > test set, I would have used the worst observations instead of the best
>> > ones. Does anybody have any idea how I could prevent this?
>>
>> > Thanks in advance,
>> > Thomas
>> --
>>     Gael Varoquaux
>>     Researcher, INRIA Parietal
>>     NeuroSpin/CEA Saclay , Bat 145, 91191 Gif-sur-Yvette France
>>     Phone:  ++ 33-1-69-08-79-68
>>     http://gael-varoquaux.info            http://twitter.com/GaelVaroquaux
>> _______________________________________________
>> scikit-learn mailing list
>> scikit-learn at python.org
>> https://mail.python.org/mailman/listinfo/scikit-learn
>>
>
>
>
>-- 
>
>======================================================================
>
>Dr Thomas Evangelidis
>
>Post-doctoral Researcher
>CEITEC - Central European Institute of Technology
>Masaryk University
>Kamenice 5/A35/2S049,
>62500 Brno, Czech Republic
>
>email: tevang at pharm.uoa.gr
>
>          tevang3 at gmail.com
>
>
>website: https://sites.google.com/site/thomasevangelidishomepage/
>
>
>------------------------------------------------------------------------
>