[SciPy-user] new Kolmogorov-Smirnov test
Roban Hultman Kramer
roban at astro.columbia.edu
Wed Dec 3 12:35:13 EST 2008
Is there a correct two-sample k-s test in scipy?
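(For context, the two-sample statistic is just the largest vertical distance between the two empirical CDFs. A minimal sketch, assuming NumPy and scipy.stats.ks_2samp, comparing a hand-rolled D against scipy's:)

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
x = rng.normal(size=100)
y = rng.normal(size=120)

# Two-sample KS statistic: the largest vertical distance between the
# two empirical CDFs, evaluated at every observed data point.
pooled = np.sort(np.concatenate([x, y]))
cdf_x = np.searchsorted(np.sort(x), pooled, side='right') / len(x)
cdf_y = np.searchsorted(np.sort(y), pooled, side='right') / len(y)
D = np.abs(cdf_x - cdf_y).max()

res = stats.ks_2samp(x, y)  # scipy's two-sample test for comparison
print(D, res.statistic, res.pvalue)
```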
On Sat, Nov 29, 2008 at 4:46 PM, <josef.pktd at gmail.com> wrote:
> Since the old scipy.stats.kstest wasn't correct, I spent quite some
> time fixing and testing it. Now I know more about the
> Kolmogorov-Smirnov test than I ever wanted to.
>
> The kstest now resembles the ones in R and in matlab, offering both
> two-sided and one-sided tests. The names of the keyword options are a
> mixture of matlab's and R's, whichever I liked best.
>
> Since the exact distribution of the two-sided test statistic is not
> available in scipy, I use an approximation that seems to work very
> well. In several Monte Carlo studies against R, I get very close
> results, especially for small p-values. (For those interested: for
> small p-values I use ksone.sf(D, n)*2; for large p-values or large n
> I use the asymptotic distribution kstwobign.)
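The scheme just described can be sketched as follows; the exact switchover rule between the two branches is an illustrative guess here, not necessarily the one in the patch:

```python
import numpy as np
from scipy.stats import ksone, kstwobign

def kstest_pval_twosided(D, n):
    """Approximate two-sided KS p-value in the spirit described above:
    asymptotic kstwobign for large n or large p-values, and twice the
    exact one-sided tail (ksone) for small p-values.
    (The 0.80 / 2666 thresholds are illustrative guesses.)"""
    p_asymp = kstwobign.sf(D * np.sqrt(n))  # asymptotic two-sided p-value
    if n > 2666 or p_asymp > 0.80:          # hypothetical switchover rule
        return float(p_asymp)
    # exact one-sided survival function, doubled (clamped at 1)
    return min(1.0, 2.0 * float(ksone.sf(D, n)))
```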
>
> example signature and options:
> kstest(x, testdistfn.name, alternative='unequal', mode='approx')
> kstest(x, testdistfn.name, alternative='unequal', mode='asymp')
> kstest(x, testdistfn.name, alternative='larger')
> kstest(x, testdistfn.name, alternative='smaller')
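A runnable sketch of calling the one-sample test; note this assumes a present-day scipy, where the alternative keyword values are 'two-sided'/'less'/'greater' rather than the 'unequal'/'larger'/'smaller' names discussed here:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(42)
x = rng.normal(size=100)

# Two-sided test of x against the standard normal CDF.
res = stats.kstest(x, 'norm', alternative='two-sided')

# One-sided variants ('less'/'greater' in current scipy).
res_less = stats.kstest(x, 'norm', alternative='less')
res_greater = stats.kstest(x, 'norm', alternative='greater')
print(res.statistic, res.pvalue)
```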
>
> Below is the Monte Carlo for the case in which the random variable
> and the hypothesized distribution are both standard normal (with
> sample size 100 and 1000 replications). Rejection rates are very
> close to the alpha levels. It also contains the mean absolute error
> (MAE) of the old kstest. I also checked mean-shifted normal random
> variables; in all cases that I tried, I get exactly the same
> rejection rates as in R.
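The Monte Carlo just described can be reproduced roughly like this (standard normal data tested against the standard normal null; the call assumes a current scipy with its default two-sided test):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
n, n_repl = 100, 1000

# Under the null (data really are standard normal), p-values should be
# roughly uniform, so rejection rates should track the alpha levels.
pvals = np.array([stats.kstest(rng.normal(size=n), 'norm').pvalue
                  for _ in range(n_repl)])
rates = {alpha: float(np.mean(pvals < alpha)) for alpha in (0.01, 0.05, 0.10)}
print(rates)
```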
>
> For details see doc string or source.
> I attach file to a separate email, to get around attachment size limit.
>
> I intend to put this in trunk tomorrow, review and comments are welcome.
>
> Josef
>
>
> data generation distribution is norm, hypothesis is norm
> ==================================================
> n = 100, loc = 0.000000, scale = 1.000000, n_repl = 1000
> columns: D, pval
> rows are
> kstest(x, testdistfn.name, alternative='unequal', mode='approx')
> kstest(x, testdistfn.name, alternative='unequal', mode='asymp')
> kstest(x, testdistfn.name, alternative='larger')
> kstest(x, testdistfn.name, alternative='smaller')
>
> Results for comparison with R:
>
> MAE old kstest
> [[ 0.00453195 0.19152727]
> [ 0.00453195 0.2101139 ]
> [ 0.02002774 0.19145982]
> [ 0.02880553 0.26650226]]
> MAE new kstest
> [[ 1.87488913e-17 1.07738517e-02]
> [ 1.87488913e-17 1.91763848e-06]
> [ 2.38576520e-17 8.90287843e-16]
> [ 1.41312743e-17 9.92428362e-16]]
> percent count absdev > 0.005
> [[ 0. 53.9]
> [ 0. 0. ]
> [ 0. 0. ]
> [ 0. 0. ]]
> percent count absdev > 0.01
> [[ 0. 24.3]
> [ 0. 0. ]
> [ 0. 0. ]
> [ 0. 0. ]]
> percent count abs percent dev > 1%
> [[ 0. 51.8]
> [ 0. 0. ]
> [ 0. 0. ]
> [ 0. 0. ]]
> percent count abs percent dev > 10%
> [[ 0. 0.]
> [ 0. 0.]
> [ 0. 0.]
> [ 0. 0.]]
> new: proportion of rejection at 1% significance
> [ 0.01 0.008 0.009 0.014]
> R: proportion of rejection at 1% significance
> [ 0.01 0.008 0.009 0.014]
> new: proportion of rejection at 5% significance
> [ 0.054 0.048 0.048 0.06 ]
> R: proportion of rejection at 5% significance
> [ 0.054 0.048 0.048 0.06 ]
> new: proportion of rejection at 10% significance
> [ 0.108 0.096 0.095 0.109]
> R: proportion of rejection at 10% significance
> [ 0.108 0.096 0.095 0.109]
> _______________________________________________
> SciPy-user mailing list
> SciPy-user at scipy.org
> http://projects.scipy.org/mailman/listinfo/scipy-user
>