[SciPy-user] kstest and scipy.stats

Thu Nov 20 11:56:45 EST 2008

Hi,

I am having trouble using kstest from the scipy.stats package, which I
suspect is due to a misunderstanding on my part.

Basically I'm confused by the below:
O is an array of observed (integer) values:
In [344]: O.shape
Out[344]: (1400,)
In [345]: O.max()
Out[345]: 21
In [346]: O.min()
Out[346]: 0

Now I am trying to use kstest to determine how closely some candidate
distributions describe this vector of data. But I was getting very low
p-values from kstest (always a p of zero, even when plotting the
distributions shows that by eye they are a very good fit).

But the thing that really confuses me is this:
In [337]: kstest(O,
stats.rv_discrete(name='test',values=(r_[0:25],prob(O,25))).cdf)
Out[337]: (0.31071428571428572, 0.0)

prob is a small function of mine that returns a probability vector
from a vector of integers (shown below; I have been using it for ages
and I'm sure there is no mistake there). rv_discrete seems to
construct the right distribution (mean and so on match), so how come
the p value is 0 when I am comparing against the distribution
estimated directly from the data?
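
A self-contained sketch of the comparison above, for anyone who wants to reproduce it. The array O here is synthetic (an assumption, since the real data is not shown), and np.bincount with minlength stands in for prob(O, 25). The last line prints the largest single probability mass next to the KS statistic. Since the KS test is derived for continuous distributions, this may help in checking whether ties in the integer data, rather than a poor fit, are driving the result.

```python
import numpy as np
from scipy import stats

# Synthetic stand-in for O (assumption): 1400 integer draws in the range 0..21.
rng = np.random.default_rng(0)
O = rng.binomial(21, 0.1, size=1400)

# Empirical probability vector over 0..24, equivalent in intent to prob(O, 25).
P = np.bincount(O, minlength=25) / O.size

# Discrete distribution built directly from the sample, as in the call above.
empirical = stats.rv_discrete(name="test", values=(np.arange(25), P))

# One-sample KS test of the data against its own empirical CDF.
stat, p = stats.kstest(O, empirical.cdf)
print("KS statistic:", stat, "p-value:", p)
print("largest single probability mass:", P.max())
```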

Any help gratefully appreciated,

Robin

----
Source:
import numpy as np


def prob(x, r):
    """Sample probability of integer sequence using bincount

    Inputs:
    x - integer sequence
    r - number of possible responses (max(x)<r)

    Returns full probability vector (float)

    """
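    # np.integer is the abstract integer type; np.int and np.float are
    # no longer available in current NumPy releases
    if not np.issubdtype(x.dtype, np.integer):
        raise ValueError("Input must be of integer type")
    P = np.bincount(x).astype(float)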
    n = P.size
    if n < r:   # resize if any responses missed
        P.resize((r,))
        P[n:]=0
    P /= x.size
    return P
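
A shorter variant of the same idea, as a sketch only: np.bincount accepts a minlength argument that pads the counts with zeros, which would replace the manual resize step. The name prob_minlength is hypothetical and not part of the original code.

```python
import numpy as np


def prob_minlength(x, r):
    """Hypothetical variant of prob: probability vector of length at least r."""
    x = np.asarray(x)
    if not np.issubdtype(x.dtype, np.integer):
        raise ValueError("Input must be of integer type")
    # minlength pads the counts with zeros for responses that never occur
    return np.bincount(x, minlength=r) / x.size


# Small usage example: four observations, five possible responses.
example = prob_minlength(np.array([0, 1, 1, 3]), 5)
print(example)
```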