[SciPy-Dev] chi-square test for a contingency (R x C) table

Sat Jun 19 09:58:04 EDT 2010

On Sat, Jun 19, 2010 at 9:26 AM, Warren Weckesser
<warren.weckesser at enthought.com> wrote:
> josef.pktd at gmail.com wrote:
>> <snip>
>>
>> Forget any merging of the functions.
>>
>> Statistical functions should also be defined by their purpose; we are
>> not creating universal f_tests and t_tests. Unless someone is
>> proposing to merge and unify the various t_tests, ... ?
>> misquoting: "The user's hypothesis is totally irrelevant ..." ???
>>
>> Testing for goodness-of-fit is a completely different use case, with
>> different extensions, e.g. power discrepancy. What if I have a 2d
>> array and want to check goodness-of-fit along each axis, which might
>> be useful once group-by extensions to bincount handle more than 1d
>> weights.
>
>
> So you are anticipating something like this (where `table` is, say, 2D):
>
>  >>> chisquare_fit(table, axis=-1)
>
> Then the result would also be 2D, with the last axis having length 2 and
> holding the (chi2, p) values?

I haven't looked at this closely yet, but I would expect a standard
reduction over one axis. Usually we would return one array for the
test statistic and one array for the p-values, each with one dimension
fewer than the original.

chisquare_fit(table, axis=-1) would be equivalent to
[chisquare(table[k]) for k in range(table.shape[0])] for 2d,
and to apply_along_axis for nd.

This would be easy to extend but I don't know how much the need is for
this currently.
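
A rough sketch of the reduce-by-one-axis behaviour described above, not a
proposal for the actual API. The name `chisquare_along_axis` is made up for
illustration, and it loops over 1d slices with the existing
scipy.stats.chisquare (uniform expected frequencies) instead of using
apply_along_axis, so that the two outputs can be kept as separate arrays:

```python
import numpy as np
from scipy import stats


def chisquare_along_axis(table, axis=-1):
    """Hypothetical helper: one-way chisquare test on every 1d slice along `axis`.

    Returns two arrays (statistic, p-value), each with one dimension
    fewer than `table`.
    """
    table = np.asarray(table, dtype=float)
    # Put the tested axis last, then flatten all remaining axes into one.
    moved = np.moveaxis(table, axis, -1)
    out_shape = moved.shape[:-1]
    slices = moved.reshape(-1, moved.shape[-1])

    chi2 = np.empty(len(slices))
    pval = np.empty(len(slices))
    for k, counts in enumerate(slices):
        # Default expected frequencies: uniform over the categories.
        chi2[k], pval[k] = stats.chisquare(counts)

    return chi2.reshape(out_shape), pval.reshape(out_shape)


table = np.array([[10, 10, 10, 10],
                  [20, 5, 5, 10]])
chi2, pval = chisquare_along_axis(table, axis=-1)
# chi2 has shape (2,): 0.0 for the first row, 15.0 for the second
```

For a 2d table this gives the same numbers as the list comprehension over
rows; for nd input the leading axes are simply kept in the output.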

e.g. if we have a sample by geographic region or group, we might want
to test whether the distribution is uniform or normal in each group.
(continuous distributions would require binning first)
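
To make the per-group case concrete, here is a small sketch under made-up
assumptions: three groups of simulated continuous observations on [0, 1),
five equal-width bins, and a uniform null hypothesis in each group. The data,
the group count and the bin count are all invented for illustration:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
n_groups, n_per_group, n_bins = 3, 200, 5

# Hypothetical data: one row of continuous observations per group.
samples = rng.uniform(0.0, 1.0, size=(n_groups, n_per_group))

# Binning step: turn each group's continuous sample into category counts.
edges = np.linspace(0.0, 1.0, n_bins + 1)
counts = np.array([np.histogram(row, bins=edges)[0] for row in samples])

# One goodness-of-fit test per group against equal expected counts.
chi2_vals = np.empty(n_groups)
p_vals = np.empty(n_groups)
for k in range(n_groups):
    chi2_vals[k], p_vals[k] = stats.chisquare(counts[k])
```

Testing for a normal distribution instead would need expected counts computed
from the normal cdf at the bin edges, passed as `f_exp`, plus a decision about
estimated parameters, which is a separate issue.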

>
>>  Or if we extend it to multivariate distributions, then the
>> default might be uniform for each column (and not independence.)
>> This is a standard test for distributions, and should not be mixed
>> with contingency tables
>>
>>
>
> Could you elaborate on this use case?  I don't know enough about it to
> be able to decide if this is something that could be implemented right
> away, or if it is something that might not happen for years, if ever.

During this thread, I started to think of contingency tables just as
an nd discrete distribution, where we can have functions for the
multivariate distribution, marginal pdf, conditional pdf, ... and
some tests on it.
Independence in this case would be just one hypothesis.
Also, the chisquare independence test conditions on the margin totals.
This might be the most common case, but it is not necessarily the only
chisquare hypothesis we might test. (I'm not too clear on all the
contingency table stuff.)
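
A minimal sketch of this view for a 2d table, with made-up counts: the table
is normalized to a joint distribution, the marginals and one conditional
distribution are read off it, and independence is then tested as one
particular hypothesis with expected counts built from the margins. None of
these variable names correspond to existing scipy functions:

```python
import numpy as np
from scipy import stats

# Hypothetical 2 x 3 table of observed counts.
table = np.array([[12.0, 5.0, 7.0],
                  [8.0, 15.0, 3.0]])
n = table.sum()

# The table as a joint discrete distribution.
joint = table / n
row_marginal = joint.sum(axis=1)      # distribution of the row variable
col_marginal = joint.sum(axis=0)      # distribution of the column variable
# Conditional distribution of the column variable given each row.
col_given_row = joint / row_marginal[:, np.newaxis]

# Independence hypothesis: joint equals the product of the marginals,
# so the expected counts are fixed by the margin totals.
expected = n * np.outer(row_marginal, col_marginal)
chi2 = ((table - expected) ** 2 / expected).sum()
dof = (table.shape[0] - 1) * (table.shape[1] - 1)
p_value = stats.chi2.sf(chi2, dof)
```

A different null hypothesis (for example a uniform distribution in each
column) would only change how `expected` and `dof` are computed.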

multivariate distributions are only on my wish list, and it will
require some work to go beyond pdf, loglike and rvs.
multivariate discrete (contingency tables without the statistics) and
multivariate normal and some others would be the first candidates.
(copulas would be another multivariate distribution wish)

I don't know what would be the ETA (expected time of arrival) for these.

I like your current implementation, because it's right to the point
and easy to explain and use. And it looks forward compatible with any
extended functionality that we might think of.

Josef

>
>
>> contingency tables are a different case, which I never use, and where
>> I would go with whatever statisticians prefer. But I think, going by
>> null hypothesis makes functions for statistical tests much cleaner
>> (easier to categorize, explain, find) than one-stop statistics (at
>> least for functions and not methods in classes) as is the current
>> tradition of scipy.stats.
>>
>> "fit" in your function name chisquare_fit is very misleading, because
>> your function doesn't do any fitting. If a rename is desired, I would
>> call it chisquare_gof, but I use a similar name for the actual gof
>> test based on the sample data, with automatic binning.
>> Fitting the distribution parameters raises other issues which I don't
>> think should be mixed with the basic chisquare-test.
>>
>>
>
> Yes, I agree.  I only used "fit" to distinguish it from "ind".  I didn't
> want to use "oneway" and "nway", because those names might lead one to
> think that "oneway" is the n=1 case of "nway", but it is not.
>
>
> Warren