[SciPy-user] Usage of scipy KS test

Wed Jan 2 12:08:01 EST 2008

On 02/01/2008, Alexander Dietz <Alexander.Dietz at astro.cf.ac.uk> wrote:

> I am trying to use the KS test implemented to scipy.stats, but nowhere I
> could find an example on how to use this function, for my purposes.

It is indeed unfortunate that the docstring doesn't have an example.
Here is one (in doctest format, I think, for easy inclusion into
scipy):

>>> import numpy
>>> import scipy.stats
>>> a = numpy.array([0.56957006, 0.81129082, 0.58896055,
...                  0.63162055, 0.39305061, 0.92327368,
...                  0.72176744, 0.69589162, 0.12716994,
...                  0.80996302])
>>> scipy.stats.kstest(a, lambda x: x)
(0.26957006, array(0.19655500176460927))
>>> scipy.stats.kstest(a**4, lambda x: x)
(0.46678224511522154, array(0.0080924628974947677))

Let me explain: a was generated using numpy.random.uniform(size=10),
so its values are uniformly distributed on [0, 1). Each time
scipy.stats.kstest is run, it returns two values: the KS D value
(which is not very meaningful on its own) and the p-value, that is,
the probability that a sample really drawn from the distribution
whose CDF is given by the second argument would produce a D at least
this large. You can see that a is reasonably consistent with a uniform
distribution (p is about 0.20), but a**4 is not (p is about 0.008).
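
To make the D value a little less opaque, here is a minimal sketch of
how it can be computed by hand and compared with what
scipy.stats.kstest reports. The seed, the sample size, and the helper
names uniform_cdf and ks_d_by_hand are illustrative assumptions, not
part of scipy.

```python
import numpy
import scipy.stats

# Illustrative sample; the seed and size are arbitrary choices.
rng = numpy.random.default_rng(0)
sample = rng.uniform(size=50)


def uniform_cdf(x):
    """CDF of the uniform distribution on [0, 1]."""
    return numpy.clip(x, 0.0, 1.0)


def ks_d_by_hand(values, cdf):
    """Largest gap between the empirical CDF of values and cdf."""
    s = numpy.sort(values)
    n = len(s)
    f = cdf(s)
    # The empirical CDF jumps from (i-1)/n to i/n at the i-th sorted
    # point, so the gap has to be checked on both sides of each jump.
    d_plus = numpy.max(numpy.arange(1, n + 1) / n - f)
    d_minus = numpy.max(f - numpy.arange(0, n) / n)
    return max(d_plus, d_minus)


d_manual = ks_d_by_hand(sample, uniform_cdf)
d_scipy, p_value = scipy.stats.kstest(sample, uniform_cdf)
print(d_manual, d_scipy, p_value)
```

The two D values should agree; the p-value is then obtained from the
distribution of D under the hypothesis that the sample really follows
the given CDF.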

> Therefore let me describe what I have and what I want to do. I have three
> lists:
> x - vector of points on the x-axis
> y - vector of measured values for each of the x-points (cumulative
> distribution, first value:0.0, last value:1.0)
> m - vector containing values calculated from a model (cumulative
> distribution, first value: 0.0, last value:1.0)
>
> Each list has the same length. Now I want to test the hypothesis, that both
> vectors y and m are from the same distribution ( or not from the same
> distribution).
>
> I would very appreciate if someone could send me a concrete example using
> the vectors y and m.

This format is more complicated than what we need. scipy.stats.kstest
wants the list of (not necessarily sorted) x values, and a function
that evaluates the CDF. The simplest thing to do is provide it your
function that evaluates the CDF rather than computing m. If, however,
you have already computed m, you can cheat: scipy.stats.kstest only
needs to evaluate the function at the points in x, so you can create a
function based on dictionary lookup:

scipy.stats.kstest(x,dict(zip(x,m)).get)

This should return a tuple containing the KS D value and the p-value:
the probability that a data set really drawn from a distribution with
your CDF would give a D at least this large.
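
As far as I can tell, recent SciPy releases hand the whole sorted
sample to the CDF callable in a single call, which a plain dictionary
lookup may not accept. A minimal sketch of an array-friendly
alternative is to interpolate the tabulated model values with
numpy.interp. Everything named here (the stand-in x, m and sample
arrays, and the helper tabulated_cdf) is hypothetical, and the sketch
assumes x is sorted in increasing order.

```python
import numpy
import scipy.stats

# Hypothetical stand-ins for the data in the question:
# x is a grid sorted in increasing order, m the model CDF on that grid.
x = numpy.linspace(0.0, 1.0, 11)
m = x ** 2  # runs from 0.0 to 1.0, like the m described above


def tabulated_cdf(values):
    """Linearly interpolate the tabulated CDF; accepts whole arrays."""
    return numpy.interp(values, x, m)


# Stand-in observations, drawn here by inverse-CDF sampling so that
# they really follow the model CDF m(x) = x**2.
rng = numpy.random.default_rng(1)
sample = numpy.sqrt(rng.uniform(size=200))

d, p = scipy.stats.kstest(sample, tabulated_cdf)
print(d, p)
```

Note that the first argument is the set of observed values, not a
cumulative curve; the interpolation only stands in for the model CDF
between the tabulated points.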

I should say, there's another mode scipy.stats.kstest can be used in:
you can give it a random number generator and the CDF of the
distribution it is supposed to generate, and it will see if the random
number generator is (with a reasonable probability) functioning
properly.
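
The same check can be sketched by drawing the sample explicitly and
passing it to kstest together with the CDF, which sidesteps the exact
calling convention kstest expects of a generator argument. The helper
check_generator, the threshold, the seed, and the sample size are all
assumptions made for illustration.

```python
import numpy
import scipy.stats

rng = numpy.random.default_rng(12345)  # arbitrary seed


def check_generator(draw, cdf, n=1000, threshold=0.001):
    """Draw n values with draw(n) and KS-test them against cdf.

    Returns (looks_ok, D, p), where looks_ok is True when the p-value
    is above the (arbitrarily chosen) threshold.
    """
    sample = draw(n)
    d, p = scipy.stats.kstest(sample, cdf)
    return bool(p > threshold), d, p


target_cdf = scipy.stats.expon(scale=2.0).cdf

# A generator that is supposed to match the target distribution.
good = check_generator(lambda n: rng.exponential(scale=2.0, size=n),
                       target_cdf)

# A deliberately wrong generator: uniform values tested against the
# exponential CDF.
bad = check_generator(lambda n: rng.uniform(size=n), target_cdf)

print(good)
print(bad)
```

A correct generator will still fall below the threshold occasionally,
so a single failure is evidence rather than proof of a bug.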

Is nose testing extensible enough to be able to mark (with a decorator
perhaps?) some tests as probabilistic, that is, a test which even a
correct function has a small chance of failing? The standard idiom for
such a test is to run it once, and if it fails run it again before
reporting failure.
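
A minimal sketch of that idiom as a plain Python decorator follows.
The name probabilistic and its retries parameter are hypothetical;
this is not an existing nose feature, just one way such a marker
could behave.

```python
import functools

import numpy
import scipy.stats


def probabilistic(retries=1):
    """Hypothetical marker: rerun a failing test before reporting it.

    The wrapped test is retried up to `retries` extra times when it
    raises AssertionError; only the last failure is propagated.
    """
    def decorate(test_func):
        @functools.wraps(test_func)
        def wrapper(*args, **kwargs):
            for _ in range(retries):
                try:
                    return test_func(*args, **kwargs)
                except AssertionError:
                    pass  # a chance failure is tolerated; try again
            # Final attempt: a failure here is reported as usual.
            return test_func(*args, **kwargs)
        return wrapper
    return decorate


@probabilistic(retries=1)
def test_uniform_generator():
    sample = numpy.random.uniform(size=1000)
    d, p = scipy.stats.kstest(sample, "uniform")
    # Fails by chance roughly 1% of the time even when all is well.
    assert p > 0.01
```

With one retry, a test that fails by chance with probability q only
reports failure with probability about q**2, while a genuinely broken
function still fails on every attempt.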

Good luck,
Anne