[SciPy-Dev] chi-square test for a contingency (R x C) table

Mon Jul 12 17:31:34 EDT 2010

On Sat, Jun 19, 2010 at 9:58 AM,  <josef.pktd at gmail.com> wrote:
> On Sat, Jun 19, 2010 at 9:26 AM, Warren Weckesser
> <warren.weckesser at enthought.com> wrote:
>> josef.pktd at gmail.com wrote:
>>> <snip>
>>>
>>> Forget any merging of the functions.
>>>
>>> Statistical functions should also be defined by their purpose; we are
>>> not creating universal f_tests and t_tests. Unless someone is
>>> proposing to merge and unify the various t_tests, ... ?
>>> Misquoting: "The user's hypothesis is totally irrelevant ..." ???
>>>
>>> Testing for goodness-of-fit is a completely different use case, with
>>> different extensions, e.g. power discrepancy. What if I have a 2d
>>> array and want to check goodness-of-fit along each axis? That might
>>> be useful once group-by extensions to bincount handle more than 1d
>>> weights.
>>
>>
>> So you are anticipating something like this (where `table` is, say, 2D):
>>
>>  >>> chisquare_fit(table, axis=-1)
>>
>> Then the result would also be 2D, with the last axis having length 2 and
>> holding the (chi2, p) values?
>
> I haven't looked at this closely yet, but I would think it would be a
> standard reduce along one axis. Usually we would return one array for
> the test statistic and one array for the p-values (both with one
> dimension less than the original).
>
> chisquare_fit(table, axis=-1) would be equivalent to [chisquare(table[k])
> for k in range(table.shape[0])] for 2d,
> and to apply_along_axis for nd.
>
> This would be easy to extend, but I don't know how much need there is
> for this currently.
>
> E.g., if we have a sample by geographic region or group, we might want
> to test whether the distribution is uniform or normal in each group.
> (Continuous distributions would require binning first.)
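
A minimal sketch of the per-row reduction described above, written in plain
Python so it does not depend on any particular SciPy version. The names
`chi2_sf` and `chisquare_rows` are hypothetical and only for illustration;
they are not SciPy functions. The null hypothesis is assumed to be a uniform
distribution within each row, and the result is one list of statistics and
one list of p-values, each with one entry per row.

```python
import math


def chi2_sf(x, df):
    """Upper tail probability of a chi-square variable with integer df >= 1."""
    if x <= 0:
        return 1.0
    h = x / 2.0
    # Start from the closed form for df = 1 or df = 2, then raise df in
    # steps of 2 with Q(a + 1, h) = Q(a, h) + h**a * exp(-h) / Gamma(a + 1).
    if df % 2:
        a, q = 0.5, math.erfc(math.sqrt(h))
    else:
        a, q = 1.0, math.exp(-h)
    while a < df / 2.0:
        q += h ** a * math.exp(-h) / math.gamma(a + 1.0)
        a += 1.0
    return q


def chisquare_rows(table):
    """Goodness-of-fit test against a uniform null, one test per row.

    Returns (statistics, pvalues), each a list with one entry per row.
    """
    stats, pvalues = [], []
    for row in table:
        expected = sum(row) / len(row)  # uniform null: equal expected counts
        stat = sum((obs - expected) ** 2 / expected for obs in row)
        stats.append(stat)
        pvalues.append(chi2_sf(stat, len(row) - 1))
    return stats, pvalues


# Two groups (rows), four categories each.
stats, pvalues = chisquare_rows([[10, 10, 10, 10],
                                 [20, 10, 5, 5]])
```

The first row matches the uniform null exactly, so its statistic is 0; the
second row gives a statistic of 15 on 3 degrees of freedom.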
>
>>
>>>  Or if we extend it to multivariate distributions, then the
>>> default might be uniform for each column (and not independence).
>>> This is a standard test for distributions, and should not be mixed
>>> with contingency tables.
>>>
>>>
>>
>> Could you elaborate on this use case?  I don't know enough about it to
>> be able to decide if this is something that could be implemented right
>> away, or if it is something that might not happen for years, if ever.
>
> During this thread, I started to think of contingency tables as just an
> nd discrete distribution, where we can have functions for the
> multivariate distribution, marginal pdf, conditional pdf, ... and
> some tests on it.
> Independence in this case would be just one hypothesis.
> Also, the chisquare independence test conditions on the margin totals.
> This might be the most common case, but it is not necessarily the only
> chisquare hypothesis we might test. (I'm not too clear on all the
> contingency table stuff.)
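
To make "conditions on the margin totals" concrete, here is a rough sketch in
plain Python of the independence hypothesis for an R x C table: the expected
counts are built only from the row and column totals. The function names are
hypothetical and only for illustration; this is not the implementation
proposed in this thread.

```python
def expected_under_independence(table):
    """Expected counts for an R x C table, given its row and column totals."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    n = sum(row_totals)
    # Under independence, each cell is (row total * column total) / n.
    return [[r * c / n for c in col_totals] for r in row_totals]


def chisquare_independence(table):
    """Return (statistic, degrees of freedom, expected counts)."""
    expected = expected_under_independence(table)
    stat = sum((o - e) ** 2 / e
               for obs_row, exp_row in zip(table, expected)
               for o, e in zip(obs_row, exp_row))
    dof = (len(table) - 1) * (len(table[0]) - 1)
    return stat, dof, expected


# 2 x 2 example: all margins equal 30, so every expected count is 15.
stat, dof, expected = chisquare_independence([[10, 20],
                                              [20, 10]])
```

A different hypothesis on the same table (for example, uniform within each
column) would only change how `expected` is computed; the statistic itself
keeps the same form.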
>
> multivariate distributions are only on my wish list, and it will
> require some work to go beyond pdf, loglike and rvs.
> multivariate discrete (contingency tables without the statistics) and
> multivariate normal and some others would be the first candidates.
> (copulas would be another multivariate distribution wish)
>
> I don't know what would be the ETA (expected time of arrival) for these.
>
>
> I like your current implementation, because it's right to the point
> and easy to explain and use. And it looks forward compatible to
> extended functionality that we might think of.
>
> Josef
>
>>
>>
>>> contingency tables are a different case, which I never use, and where
>>> I would go with whatever statisticians prefer. But I think, going by
>>> null hypothesis makes functions for statistical tests much cleaner
>>> (easier to categorize, explain, find) than one-stop statistics (at
>>> least for functions and not methods in classes) as is the current
>>> tradition of scipy.stats.
>>>
>>> "fit" in your function name chisquare_fit is very misleading, because
>>> your function doesn't do any fitting. If a rename is desired, I would
>>> call it chisquare_gof, but I use a similar name for the actual gof
>>> test based on the sample data, with automatic binning.
>>> Fitting the distribution parameters raises other issues which I don't
>>> think should be mixed with the basic chisquare test.
>>>
>>>
>>
>> Yes, I agree.  I only used "fit" to distinguish it from "ind".  I didn't
>> want to use "oneway" and "nway", because those names might lead one to
>> think that "oneway" is the n=1 case of "nway", but it is not.
>>
>>
>> Warren
>>
>>
>

Another reference:
http://www.mathworks.com/access/helpdesk/help/toolbox/stats/crosstab.html

I found it while looking for something different; I have never used it.

Josef