[SciPy-user] new Kolmogorov-Smirnov test
Roban Hultman Kramer
roban at astro.columbia.edu
Wed Dec 3 12:35:13 EST 2008
Is there a correct two-sample k-s test in scipy?
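(For context, the two-sample statistic is just the largest vertical distance between the two empirical CDFs. A minimal sketch, assuming NumPy and scipy.stats.ks_2samp, comparing a hand-rolled D against scipy's:)

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
x = rng.normal(size=100)
y = rng.normal(size=120)

# Two-sample KS statistic: the largest vertical distance between the
# two empirical CDFs, evaluated at every observed data point.
pooled = np.sort(np.concatenate([x, y]))
cdf_x = np.searchsorted(np.sort(x), pooled, side='right') / len(x)
cdf_y = np.searchsorted(np.sort(y), pooled, side='right') / len(y)
D = np.abs(cdf_x - cdf_y).max()

res = stats.ks_2samp(x, y)  # scipy's two-sample test for comparison
print(D, res.statistic, res.pvalue)
```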
On Sat, Nov 29, 2008 at 4:46 PM, <josef.pktd at gmail.com> wrote:
> Since the old scipy.stats.kstest wasn't correct, I spent quite some
> time fixing and testing it. Now I know more about the
> Kolmogorov-Smirnov test than I ever wanted to.
>
> The kstest now resembles the ones in R and in matlab, offering both
> two-sided and one-sided tests. The names of the keyword options are a
> mixture of matlab's and R's, whichever I liked best.
>
> Since the exact distribution of the two-sided test statistic is not
> available in scipy, I use an approximation that seems to work very
> well. In several Monte Carlo studies against R, I get very close
> results, especially for small p-values. (For those interested: for
> small p-values I use ksone.sf(D, n)*2; for large p-values or large n
> I use the asymptotic distribution kstwobign.)
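The scheme just described can be sketched as follows; the exact switchover rule between the two branches is an illustrative guess here, not necessarily the one in the patch:

```python
import numpy as np
from scipy.stats import ksone, kstwobign

def kstest_pval_twosided(D, n):
    """Approximate two-sided KS p-value in the spirit described above:
    asymptotic kstwobign for large n or large p-values, and twice the
    exact one-sided tail (ksone) for small p-values.
    (The 0.80 / 2666 thresholds are illustrative guesses.)"""
    p_asymp = kstwobign.sf(D * np.sqrt(n))  # asymptotic two-sided p-value
    if n > 2666 or p_asymp > 0.80:          # hypothetical switchover rule
        return float(p_asymp)
    # exact one-sided survival function, doubled (clamped at 1)
    return min(1.0, 2.0 * float(ksone.sf(D, n)))
```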
>
> example signature and options:
> kstest(x, testdistfn.name, alternative='unequal', mode='approx')
> kstest(x, testdistfn.name, alternative='unequal', mode='asymp')
> kstest(x, testdistfn.name, alternative='larger')
> kstest(x, testdistfn.name, alternative='smaller')
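A runnable sketch of calling the one-sample test; note this assumes a present-day scipy, where the alternative keyword values are 'two-sided'/'less'/'greater' rather than the 'unequal'/'larger'/'smaller' names discussed here:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(42)
x = rng.normal(size=100)

# Two-sided test of x against the standard normal CDF.
res = stats.kstest(x, 'norm', alternative='two-sided')

# One-sided variants ('less'/'greater' in current scipy).
res_less = stats.kstest(x, 'norm', alternative='less')
res_greater = stats.kstest(x, 'norm', alternative='greater')
print(res.statistic, res.pvalue)
```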
>
> Below is the Monte Carlo for the case in which the random variable
> and the hypothesized distribution are both standard normal (with
> sample size 100 and 1000 replications). Rejection rates are very
> close to the alpha levels. It also contains the mean absolute error
> (MAE) of the old kstest. I also checked mean-shifted normal random
> variables; in all cases that I tried, I get exactly the same
> rejection rates as in R.
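The Monte Carlo just described can be reproduced roughly like this (standard normal data tested against the standard normal null; the call assumes a current scipy with its default two-sided test):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
n, n_repl = 100, 1000

# Under the null (data really are standard normal), p-values should be
# roughly uniform, so rejection rates should track the alpha levels.
pvals = np.array([stats.kstest(rng.normal(size=n), 'norm').pvalue
                  for _ in range(n_repl)])
rates = {alpha: float(np.mean(pvals < alpha)) for alpha in (0.01, 0.05, 0.10)}
print(rates)
```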
>
> For details see doc string or source.
> I attach file to a separate email, to get around attachment size limit.
>
> I intend to put this in trunk tomorrow, review and comments are welcome.
>
> Josef
>
>
> data generation distribution is norm, hypothesis is norm
> ==================================================
> n = 100, loc = 0.000000, scale = 1.000000, n_repl = 1000
> columns: D, pval
> rows are
> kstest(x, testdistfn.name, alternative='unequal', mode='approx')
> kstest(x, testdistfn.name, alternative='unequal', mode='asymp')
> kstest(x, testdistfn.name, alternative='larger')
> kstest(x, testdistfn.name, alternative='smaller')
>
> Results for comparison with R:
>
> MAE old kstest
> [[ 0.00453195 0.19152727]
> [ 0.00453195 0.2101139 ]
> [ 0.02002774 0.19145982]
> [ 0.02880553 0.26650226]]
> MAE new kstest
> [[ 1.87488913e-17 1.07738517e-02]
> [ 1.87488913e-17 1.91763848e-06]
> [ 2.38576520e-17 8.90287843e-16]
> [ 1.41312743e-17 9.92428362e-16]]
> percent count absdev > 0.005
> [[ 0. 53.9]
> [ 0. 0. ]
> [ 0. 0. ]
> [ 0. 0. ]]
> percent count absdev > 0.01
> [[ 0. 24.3]
> [ 0. 0. ]
> [ 0. 0. ]
> [ 0. 0. ]]
> percent count abs percent dev > 1%
> [[ 0. 51.8]
> [ 0. 0. ]
> [ 0. 0. ]
> [ 0. 0. ]]
> percent count abs percent dev > 10%
> [[ 0. 0.]
> [ 0. 0.]
> [ 0. 0.]
> [ 0. 0.]]
> new: proportion of rejection at 1% significance
> [ 0.01 0.008 0.009 0.014]
> R: proportion of rejection at 1% significance
> [ 0.01 0.008 0.009 0.014]
> new: proportion of rejection at 5% significance
> [ 0.054 0.048 0.048 0.06 ]
> R: proportion of rejection at 5% significance
> [ 0.054 0.048 0.048 0.06 ]
> new: proportion of rejection at 10% significance
> [ 0.108 0.096 0.095 0.109]
> R: proportion of rejection at 10% significance
> [ 0.108 0.096 0.095 0.109]
> _______________________________________________
> SciPy-user mailing list
> SciPy-user at scipy.org
> http://projects.scipy.org/mailman/listinfo/scipy-user
>