[SciPy-Dev] chi-square test for a contingency (R x C) table

Thu Jun 3 00:47:50 EDT 2010

On 2010-06-02 15:03 , Bruce Southey wrote:
> On 06/02/2010 01:41 PM, josef.pktd at gmail.com wrote:
>> On Wed, Jun 2, 2010 at 2:18 PM, Neil Martinsen-Burrell <nmb at wartburg.edu> wrote:
>>
>>> On 2010-06-02 13:10 , Bruce Southey wrote:

[...]

>> I agree with Neil that this is a very useful convenience function.
>>
> My problem with the chisquare_twoway is that it should not call another
> function to finish two lines of code. It is just an excessive waste of
> resources.

Do you mean that you would rather see the equivalent of

chisq = ((table - expected)**2 / expected).sum()
return chisq, chisqprob(chisq, dof)

at the bottom of chisquare_contingency than the current call to 
chisquare?  I'm certainly okay with that.
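
A minimal sketch of what that inlined ending could look like for the
two-way case. The function name chisquare_twoway_sketch and its return
values are hypothetical and are not the code under review; chi2.sf from
scipy.stats stands in for chisqprob as the upper-tail probability.

```python
import numpy as np
from scipy.stats import chi2


def chisquare_twoway_sketch(table):
    """Pearson chi-square test of independence for a 2-D table of counts.

    Hypothetical helper for illustration only.
    """
    table = np.asarray(table, dtype=float)
    if table.ndim != 2:
        raise ValueError("table must be two-dimensional")

    row_totals = table.sum(axis=1, keepdims=True)  # N_i.
    col_totals = table.sum(axis=0, keepdims=True)  # N_.j
    total = table.sum()                            # N_..

    # Expected counts under independence: N_i. * N_.j / N_..
    expected = row_totals * col_totals / total

    # Sum the per-cell contributions to get a single statistic.
    chisq = ((table - expected) ** 2 / expected).sum()
    dof = (table.shape[0] - 1) * (table.shape[1] - 1)

    # Upper-tail probability of the chi-square distribution.
    p_value = chi2.sf(chisq, dof)
    return chisq, p_value, dof, expected
```

For the table [[10, 20], [20, 10]] every expected count is 15, so the
statistic is 4 * 25/15 = 20/3 with one degree of freedom.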

>> I never heard of a one-way contingency table; my question was whether
>> the function should also handle 3-way or 4-way tables, in addition to
>> two-way.
>>
> Correct to both of these as I just consider these as n-way tables. I
> think that contingency tables by definition only apply to the 2-d
> case. Pivot tables are essentially the same thing. I would have to
> look up how to get the expected number of outcomes, but it is probably of
> the form Ni.. * N.j. * N..k / N...^2 for the 3-way (the 2-way table is of
> the form Ni. * N.j / N..) for i=rows, j=columns, k=3rd axis, where '.'
> means sum over that axis.

That is the correct (tensor) formula for higher-dimensional tables.
Pragmatically, since the number of cells climbs so rapidly with 
increasing dimension, there are more problems with small expected 
counts.  If we thought people would be interested in using it, we could 
certainly define a chisquare_nway function as well.
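
A rough sketch of the expected counts for a d-way table under mutual
independence, assuming the table is a NumPy array of counts. The name
expected_nway_sketch is hypothetical. The product of the d one-dimensional
marginal totals is divided by N**(d - 1) (N for a 2-way table, N**2 for a
3-way table), which keeps the expected counts summing to N.

```python
import numpy as np


def expected_nway_sketch(table):
    """Expected counts under mutual independence for a d-way table.

    Hypothetical helper for illustration only.
    """
    table = np.asarray(table, dtype=float)
    d = table.ndim
    total = table.sum()

    expected = np.ones_like(table)
    for axis in range(d):
        # Marginal totals for this axis: sum over every other axis,
        # keeping dimensions so the result broadcasts against the table.
        other_axes = tuple(ax for ax in range(d) if ax != axis)
        expected = expected * table.sum(axis=other_axes, keepdims=True)

    # d marginals multiplied together scale like N**d, so divide by
    # N**(d - 1) to return to the scale of counts.
    return expected / total ** (d - 1)
```

Counting the cells with a small expected value, for example
(expected_nway_sketch(table) < 5).sum(), gives a quick look at how fast
the small-count problem grows with dimension; the threshold of 5 is only
the usual rule of thumb.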

>> When writing my initial response I thought about how the input should
>> be specified; the alternative would be to use the original data or
>> a "long" format instead of a table. But I thought that, for a
>> convenience function, the table format will be the most common
>> use.
>> I have written functions in the past that calculate the contingency
>> table, and it would be very useful to have more complete coverage of
>> tools to work with contingency tables in scipy.stats (or temporarily
>> in statsmodels, where we are also working on the anova type of
>> analysis)
>>
> It depends on what tasks are needed. Really there are two steps:
> 1) Cross-tabulation that summarizes the data from whatever input
> (groupby would help here).
> 2) Statistical tests - series of functions that accept summarized data only.
>
> If you have separate functions then the burden is on the user to find
> and call all the desired functions. You can also provide a single helper
> function to do all that because you don't want to repeat unnecessary calls.

The facilities for handling raw, frame-style data in scipy.stats are not 
too strong.  A tabulation function that we could stick together with the 
chisquare* functions to make a single helper would certainly be convenient.
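
One possible shape for such a tabulation step, sketched for two
equal-length sequences of labels in "long" format. The name
crosstab_sketch and its return values are hypothetical; nothing like it
is assumed to exist in scipy.stats.

```python
import numpy as np


def crosstab_sketch(rows, cols):
    """Cross-tabulate two equal-length 1-D sequences of labels.

    Hypothetical helper for illustration only. Returns the sorted unique
    row labels, the sorted unique column labels, and the table of counts.
    """
    rows = np.asarray(rows)
    cols = np.asarray(cols)
    if rows.ndim != 1 or rows.shape != cols.shape:
        raise ValueError("rows and cols must be 1-D and of equal length")

    # Map each label to its position among the sorted unique labels.
    row_labels, row_idx = np.unique(rows, return_inverse=True)
    col_labels, col_idx = np.unique(cols, return_inverse=True)

    table = np.zeros((row_labels.size, col_labels.size), dtype=int)
    # Unbuffered add, so repeated (row, column) pairs are all counted.
    np.add.at(table, (row_idx, col_idx), 1)
    return row_labels, col_labels, table
```

The resulting table of counts is in the form that the table-based
chi-square functions discussed above take as input, so a single helper
could call the tabulation step first and the test second.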

-Neil