[SciPy-user] new Kolmogorov-Smirnov test

Sat Nov 29 16:46:14 EST 2008

Since the old scipy.stats.kstest wasn't correct, I spent quite some
time fixing and testing it. Now, I know more about the
Kolmogorov-Smirnov test, than I wanted to.

The kstest now resembles the one in R and in matlab, giving the option
for two-sided or one-sided tests. The names of the keyword options are
a mixture of matlab and R, which I liked best.

Since the exact distribution of the two-sided test is not available in
scipy, I use an approximation, that seems to work very well. In
several Monte Carlo studies against R, I get very close results,
especially for small p-values. (For those interested, for small
p-values, I use ksone.sf(D,n)*2; for large p-values or large n, I use
the asymptotic distribution kstwobign)

example signature and options:
        kstest(x,testdistfn.name, alternative = 'unequal', mode='approx'))
        kstest(x,testdistfn.name, alternative = 'unequal', mode='asymp'))
        kstest(x,testdistfn.name, alternative = 'larger'))
        kstest(x,testdistfn.name, alternative = 'smaller'))

Below is the Monte Carlo for the case, when the random variable and
the hypothesized distribution both are standard normal (with sample
size 100 and 1000 replications. Rejection rates are very close to
alpha levels. It also contains the mean absolute error MAE for the old
kstest. I also checked for mean shifted normal random variables. In
all cases that I tried, I get exactly the same rejection rates as in
R.

For details see doc string or source.
I attach file to a separate email, to get around attachment size limit.

I intend to put this in trunk tomorrow, review and comments are welcome.

Josef

data generation distribution is norm, hypothesis is norm
==================================================
n = 100, loc = 0.000000 scale = 1.000000, n_repl = 1000
columns: D, pval
        rows are
        kstest(x,testdistfn.name, alternative = 'unequal', mode='approx'))
        kstest(x,testdistfn.name, alternative = 'unequal', mode='asymp'))
        kstest(x,testdistfn.name, alternative = 'larger'))
        kstest(x,testdistfn.name, alternative = 'smaller'))

Results for comparison with R:

MAE old kstest
[[ 0.00453195  0.19152727]
 [ 0.00453195  0.2101139 ]
 [ 0.02002774  0.19145982]
 [ 0.02880553  0.26650226]]
MAE new kstest
[[  1.87488913e-17   1.07738517e-02]
 [  1.87488913e-17   1.91763848e-06]
 [  2.38576520e-17   8.90287843e-16]
 [  1.41312743e-17   9.92428362e-16]]
percent count absdev > 0.005
[[  0.   53.9]
 [  0.    0. ]
 [  0.    0. ]
 [  0.    0. ]]
percent count absdev > 0.01
[[  0.   24.3]
 [  0.    0. ]
 [  0.    0. ]
 [  0.    0. ]]
percent count abs percent dev > 1%
[[  0.   51.8]
 [  0.    0. ]
 [  0.    0. ]
 [  0.    0. ]]
percent count abs percent dev > 10%
[[ 0.  0.]
 [ 0.  0.]
 [ 0.  0.]
 [ 0.  0.]]
new: count rejection at 1% significance
[ 0.01   0.008  0.009  0.014]
R: proportion of rejection at 1% significance
[ 0.01   0.008  0.009  0.014]
new: proportion of rejection at 5% significance
[ 0.054  0.048  0.048  0.06 ]
R: proportion of rejection at 5% significance
[ 0.054  0.048  0.048  0.06 ]
new: proportion of rejection at 10% significance
[ 0.108  0.096  0.095  0.109]
R: proportion of rejection at 10% significance
[ 0.108  0.096  0.095  0.109]