[scikit-learn] Support Vector Machines: Sensitive to Single Datapoints?

Sylvain Takerkart Sylvain.Takerkart at univ-amu.fr
Fri Dec 22 06:20:55 EST 2017


Hello,

Yes, Gael's paper points out some fundamental issues! In your case, the
practical question is which cross-validation scheme you used. If you
originally used StratifiedKFold, try re-running your experiments with
StratifiedShuffleSplit and a large number of splits. Increasing the number
of splits should reduce the discrepancy you observe between the two mean
accuracies. But as Gael says, the small sample size places fundamental
limits on what you can measure.
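
For instance, something along these lines (a minimal sketch on synthetic
stand-in data; swap in your own X, y and estimator):

    from sklearn.datasets import make_classification
    from sklearn.model_selection import StratifiedShuffleSplit, cross_val_score
    from sklearn.svm import SVC

    # Stand-in for the real data: 48 samples, 24 per class.
    X, y = make_classification(n_samples=48, n_features=100, random_state=0)

    # Many random stratified splits average out the variability that any
    # single K-fold partition is subject to.
    cv = StratifiedShuffleSplit(n_splits=1000, test_size=2, random_state=0)
    scores = cross_val_score(SVC(kernel='linear', C=1), X, y, cv=cv)
    print(scores.mean(), scores.std())

With test_size=2, each split tests on one exemplar per class, and the
spread of the scores gives you a feel for the split-to-split variability.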

Sylvain

On Tue, Dec 19, 2017 at 10:35 PM, Gael Varoquaux <
gael.varoquaux at normalesup.org> wrote:

> With so few data points, there is a huge uncertainty in the estimation of
> the prediction accuracy with cross-validation. This isn't a problem with
> the method; it is a basic limitation of the small amount of data. I've
> written a paper on this problem in the specific context of neuroimaging:
> https://www.sciencedirect.com/science/article/pii/S1053811917305311
> (preprint: https://hal.inria.fr/hal-01545002/).
>
> I expect that what you are seeing is sampling noise: the result has
> confidence intervals larger than 10%.
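
To put a number on Gael's point: with 48 test predictions in total and a
true accuracy at chance, a back-of-the-envelope binomial calculation (a
rough normal-approximation sketch) already gives a 95% interval of about
+/- 14 percentage points:

    import numpy as np

    # 95% half-width of a binomial proportion estimated from n
    # predictions (normal approximation), at true accuracy p.
    n, p = 48, 0.5
    print(1.96 * np.sqrt(p * (1 - p) / n))  # ~0.14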
>
> Gaël
>
>
> On Tue, Dec 19, 2017 at 04:27:53PM -0500, Taylor, Johnmark wrote:
> > Hello,
>
> > I am a researcher in fMRI and am using SVMs to analyze brain data. I
> > am decoding between two classes, with 24 exemplars per class. I am
> > comparing two different cross-validation methods for my data: in one,
> > I train on 23 exemplars from each class and test on the remaining
> > exemplar from each class; in the other, I train on 22 exemplars from
> > each class and test on the remaining two from each class. (In case it
> > matters, the data is structured into different neuroimaging "runs",
> > with each "run" containing several "blocks"; the first
> > cross-validation method leaves out one block at a time, the second
> > leaves out one run at a time.)
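
(For reference, both of these schemes can be written with scikit-learn's
LeaveOneGroupOut, passing block or run labels as the groups. A minimal
sketch on synthetic stand-in data, with hypothetical run labels:)

    import numpy as np
    from sklearn.datasets import make_classification
    from sklearn.model_selection import LeaveOneGroupOut, cross_val_score
    from sklearn.svm import SVC

    # Stand-in for the fMRI data: 48 samples, 2 classes.
    X, y = make_classification(n_samples=48, random_state=0)
    # Hypothetical labels: 12 runs of 4 samples each; pass block labels
    # instead to get the leave-one-block-out scheme.
    runs = np.repeat(np.arange(12), 4)

    scores = cross_val_score(SVC(C=1), X, y, cv=LeaveOneGroupOut(),
                             groups=runs)
    print(scores.mean())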
>
> > Now, I would have thought that these two CV methods would give very
> > similar results, since the vast majority of the training data is the
> > same; the only difference is two additional points. However, they
> > yield very different results: training on 23 per class gives 60%
> > decoding accuracy (averaged across several subjects, and
> > statistically significantly greater than chance), while training on
> > 22 per class gives chance (50%) decoding. Leaving aside the
> > particulars of fMRI: is it unusual for single points (amounting to
> > less than 5% of the data) to have such a big influence on SVM
> > decoding? I am using a cost parameter of C=1. I must say it is
> > counterintuitive to me that just a couple of points out of two dozen
> > could make such a big difference.
>
> > Thank you very much, and cheers,
>
> > JohnMark
>
>
> --
>     Gael Varoquaux
>     Senior Researcher, INRIA Parietal
>     NeuroSpin/CEA Saclay , Bat 145, 91191 Gif-sur-Yvette France
>     Phone:  ++ 33-1-69-08-79-68
>     http://gael-varoquaux.info            http://twitter.com/GaelVaroquaux



-- 
Sylvain Takerkart

Institut des Neurosciences de la Timone (INT)
UMR 7289 CNRS-AMU
Marseille, France
tél: +33 (0)4 91 324 007
http://www.int.univ-amu.fr/_TAKERKART-Sylvain_?lang=en