[SciPy-User] ks_2samp is not giving the same results as ks.test in R

Thu Nov 1 21:14:25 EDT 2012

On Thu, Nov 1, 2012 at 8:28 PM, Peng Yu <pengyu.ut at gmail.com> wrote:
> Hi,
>
> The ks_2samp function does not give the same answer as ks.test in R.
> Does anybody know why they are different? Does ks_2samp compute
> something different?
>
> helium:~/linux/test/python/man/library/scipy/stats/ks_2samp$ Rscript main.R
>> ks.test(1:5, 11:15)
>
>         Two-sample Kolmogorov-Smirnov test
>
> data:  1:5 and 11:15
> D = 1, p-value = 0.007937
> alternative hypothesis: two-sided
>
>> ks.test(1:5, 11:15, alternative='less')
>
>         Two-sample Kolmogorov-Smirnov test
>
> data:  1:5 and 11:15
> D^- = 0, p-value = 1
> alternative hypothesis: the CDF of x lies below that of y
>
>> ks.test(1:5, 11:15, alternative='greater')
>
>         Two-sample Kolmogorov-Smirnov test
>
> data:  1:5 and 11:15
> D^+ = 1, p-value = 0.006738
> alternative hypothesis: the CDF of x lies above that of y
>
>>
>>
> helium:~/linux/test/python/man/library/scipy/stats/ks_2samp$ ./main.py
> (1.0, 0.0037813540593701006)
> helium:~/linux/test/python/man/library/scipy/stats/ks_2samp$ cat main.py
> #!/usr/bin/env python
>
> from scipy.stats import ks_2samp
> print(ks_2samp([1,2,3,4,5], [11,12,13,14,15]))

By default, R uses an "exact" distribution for small samples when there
are no ties.
With ties or with a large sample, R uses the asymptotic distribution
instead.
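
To make "exact" concrete: with no ties, every assignment of the pooled
values to the two groups is equally likely under the null hypothesis, so
for tiny samples the two-sided p-value can be found by brute-force
enumeration. The following is only an illustrative sketch (the helper
names are made up here, and this is not claimed to be how R implements
it), but for 1:5 vs 11:15 it gives 2/252, about 0.007937, which agrees
with the R output quoted above.

```python
from itertools import combinations


def ks_stat_scaled(x, y):
    """Return D * len(x) * len(y) as an integer (avoids float comparisons)."""
    n1, n2 = len(x), len(y)
    best = 0
    for t in sorted(set(x) | set(y)):
        c1 = sum(v <= t for v in x)  # n1 * ECDF of x at t
        c2 = sum(v <= t for v in y)  # n2 * ECDF of y at t
        best = max(best, abs(c1 * n2 - c2 * n1))
    return best


def exact_two_sided_pvalue(x, y):
    """Permutation p-value P(D >= observed); only feasible for tiny samples."""
    pooled = list(x) + list(y)
    n1 = len(x)
    observed = ks_stat_scaled(x, y)
    hits = total = 0
    indices = range(len(pooled))
    for chosen in combinations(indices, n1):
        chosen_set = set(chosen)
        a = [pooled[i] for i in chosen]
        b = [pooled[i] for i in indices if i not in chosen_set]
        total += 1
        if ks_stat_scaled(a, b) >= observed:
            hits += 1
    return hits / total


p_exact = exact_two_sided_pvalue([1, 2, 3, 4, 5], [11, 12, 13, 14, 15])
print(p_exact)
```

Only the two fully separated splits reach D = 1, which is where the
2/252 comes from.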

If I read the function correctly, then scipy.stats is using a small
sample approximation by Stephens. (But I would have to look up the
formula to verify this.)
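
A sketch of what such a small-sample correction could look like,
assuming it evaluates the limiting Kolmogorov survival function at
(en + 0.12 + 0.11/en) * D with en = sqrt(n1*n2/(n1+n2)). The constants
and the helper names below are assumptions for illustration and are not
confirmed anywhere in this thread; the series is written out in plain
Python so the snippet needs nothing beyond the standard library. With
D = 1 and n1 = n2 = 5 it gives about 0.00378, in line with the ks_2samp
output quoted above.

```python
from math import exp, sqrt


def kolmogorov_sf(x, terms=100):
    """Survival function of the limiting Kolmogorov distribution.

    Uses the series 2 * sum_{k>=1} (-1)**(k-1) * exp(-2 * k**2 * x**2),
    which converges quickly for the moderate-to-large x used here.
    """
    if x <= 0:
        return 1.0
    total = 0.0
    for k in range(1, terms + 1):
        total += (-1) ** (k - 1) * exp(-2.0 * k * k * x * x)
    return min(1.0, max(0.0, 2.0 * total))


def corrected_two_sided_pvalue(d, n1, n2):
    """Two-sided p-value with the assumed small-sample correction."""
    en = sqrt(n1 * n2 / (n1 + n2))
    return kolmogorov_sf((en + 0.12 + 0.11 / en) * d)


p_small = corrected_two_sided_pvalue(1.0, 5, 5)
print(p_small)
```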

In the example below, with a somewhat larger sample and no ties, our
approximation (p = 0.0790) is closer to R's "exact" p-value (0.0761)
than R's asymptotic p-value with exact=FALSE (0.1038) is.

>  ks.test(1:25, (10:30)-0.5, exact=FALSE)

        Two-sample Kolmogorov-Smirnov test

data:  1:25 and (10:30) - 0.5
D = 0.36, p-value = 0.1038
alternative hypothesis: two-sided

>  ks.test(1:25, (10:30)-0.5, exact=TRUE)

        Two-sample Kolmogorov-Smirnov test

data:  1:25 and (10:30) - 0.5
D = 0.36, p-value = 0.07608
alternative hypothesis: two-sided

>>> import numpy as np
>>> from scipy import stats
>>> stats.ks_2samp(np.arange(1.,26), np.arange(10,31.)-0.5)
(0.35999999999999999, 0.078993426961291274)
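
Enumerating all C(46, 21) group assignments is impractical at this
sample size, but with no ties an exact two-sided p-value can still be
obtained by counting lattice paths: each ordering of the pooled sample
is a monotone path from (0, 0) to (m, n), and D stays below the observed
value exactly when the path never reaches a point with
|i*n - j*m| >= D*m*n. The sketch below uses that idea with hypothetical
helper names; it is an illustration under the no-ties assumption, not a
copy of R's code, and its result for this example can be compared with
the exact=TRUE value above.

```python
from math import comb


def ks_stat_scaled(x, y):
    """Return D * len(x) * len(y) as an exact integer."""
    n1, n2 = len(x), len(y)
    best = 0
    for t in sorted(set(x) | set(y)):
        c1 = sum(v <= t for v in x)
        c2 = sum(v <= t for v in y)
        best = max(best, abs(c1 * n2 - c2 * n1))
    return best


def exact_pvalue_by_paths(t_scaled, m, n):
    """P(D >= observed) for sample sizes m and n, assuming no ties."""
    if t_scaled <= 0:
        return 1.0
    # row[j] = number of admissible paths ending at (i, j); start with i = 0.
    row = [1 if j * m < t_scaled else 0 for j in range(n + 1)]
    for i in range(1, m + 1):
        new = [0] * (n + 1)
        for j in range(n + 1):
            if abs(i * n - j * m) >= t_scaled:
                continue  # this point already has D >= observed
            new[j] = row[j] + (new[j - 1] if j > 0 else 0)
        row = new
    # row[n] counts orderings with D < observed (exact integer arithmetic).
    return 1.0 - row[n] / comb(m + n, m)


x = list(range(1, 26))                 # 1:25
y = [v - 0.5 for v in range(10, 31)]   # (10:30) - 0.5
t_large = ks_stat_scaled(x, y)
p_large = exact_pvalue_by_paths(t_large, len(x), len(y))
print(t_large / (len(x) * len(y)), p_large)
```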

For the one-sample kstest I used (when I rewrote stats.kstest) an
approximation that is closer to the exact distribution than the
asymptotic distribution is, but it is also not exact.

It would be good to have better small sample approximations or exact
distributions, but I worked on this in scipy.stats when I barely had
any idea about goodness-of-fit tests.
Also, ks_2samp never got the enhancement for one-sided alternatives.
(In statsmodels I have been working so far only on one sample tests,
but not on two-sample tests.)
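
For anyone who wants to experiment with the one-sided case, here is a
minimal sketch. It assumes the usual large-sample result that the
one-sided p-value is exp(-2 * n1*n2/(n1+n2) * D^2); the function names
are hypothetical and nothing here is part of scipy.stats. For 1:5 vs
11:15 this gives exp(-5), about 0.006738, for D^+ and 1 for D^-, the
same values as in the R output quoted at the top.

```python
from math import exp


def one_sided_stats(x, y):
    """Return (D+, D-): the largest ECDF gaps F_x - F_y and F_y - F_x."""
    n1, n2 = len(x), len(y)
    d_plus = d_minus = 0.0
    for t in sorted(set(x) | set(y)):
        f1 = sum(v <= t for v in x) / n1
        f2 = sum(v <= t for v in y) / n2
        d_plus = max(d_plus, f1 - f2)
        d_minus = max(d_minus, f2 - f1)
    return d_plus, d_minus


def one_sided_asymptotic_pvalue(d, n1, n2):
    """Large-sample one-sided p-value (an approximation, not exact)."""
    return exp(-2.0 * n1 * n2 / (n1 + n2) * d * d)


x = [1, 2, 3, 4, 5]
y = [11, 12, 13, 14, 15]
d_plus, d_minus = one_sided_stats(x, y)
p_greater = one_sided_asymptotic_pvalue(d_plus, len(x), len(y))
p_less = one_sided_asymptotic_pvalue(d_minus, len(x), len(y))
print(d_plus, p_greater)
print(d_minus, p_less)
```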

(I don't remember if there is a minimum size recommendation, but the
examples I usually checked were larger.)

Since it's a community project: pull requests are welcome.

Josef
