[SciPy-Dev] chi-square test for a contingency (R x C) table

Thu Jun 3 09:27:29 EDT 2010

Just letting you know that I'm not ignoring all the great comments from 
josef, Neil and Bruce about my suggestion for chisquare_contingency. 
Unfortunately, I won't have time to think about all the deeper 
suggestions for another week or so.   For now, I'll just say that I 
agree with josef's and Neil's suggestions for the docstring, and that 
Neil's summary of the function as simply a convenience function that 
calls stats.chisquare with appropriate arguments to perform a test of 
independence on a contingency table is exactly what I had in mind.
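
For concreteness, here is a minimal sketch of that convenience-function idea. It is not the code proposed in this thread: the name chisquare_contingency_sketch and its return values are hypothetical, and only stats.chisquare (with its f_exp and ddof arguments) is an existing call.

```python
import numpy as np
from scipy import stats


def chisquare_contingency_sketch(table):
    """Hypothetical helper: chi-square test of independence on an R x C table."""
    observed = np.asarray(table, dtype=float)

    # Expected counts under independence: (row total * column total) / grand total.
    row_totals = observed.sum(axis=1, keepdims=True)
    col_totals = observed.sum(axis=0, keepdims=True)
    expected = row_totals * col_totals / observed.sum()

    # Degrees of freedom for the test of independence.
    n_rows, n_cols = observed.shape
    dof = (n_rows - 1) * (n_cols - 1)

    # stats.chisquare uses k - 1 - ddof degrees of freedom, where k is the
    # number of cells, so choose ddof to land on (R - 1) * (C - 1).
    ddof = observed.size - 1 - dof

    chi2, p = stats.chisquare(observed.ravel(), f_exp=expected.ravel(), ddof=ddof)
    return chi2, p, dof, expected


chi2, p, dof, expected = chisquare_contingency_sketch([[10, 20], [30, 40]])
```

The table is flattened because stats.chisquare only needs matching observed and expected counts; the two-way structure enters through the expected counts and the ddof correction.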

Warren

josef.pktd at gmail.com wrote:
> On Wed, Jun 2, 2010 at 4:03 PM, Bruce Southey <bsouthey at gmail.com> wrote:
>   
>> On 06/02/2010 01:41 PM, josef.pktd at gmail.com wrote:
>>
>> On Wed, Jun 2, 2010 at 2:18 PM, Neil Martinsen-Burrell <nmb at wartburg.edu>
>> wrote:
>>
>>
>> On 2010-06-02 13:10 , Bruce Southey wrote:
>> [...]
>>
>>
>>
>> However, this code is only the chi-squared test part, as SAS will compute
>> the actual cell numbers. It is also an extension of scipy.stats.chisquare(),
>> so we cannot have both functions.
>>
>>
>> Again, I don't understand what you mean that we can't have both
>> functions? I believe (from a statistics teacher's point of view) that
>> the Chi-Squared goodness of fit test (which is stats.chisquare) is a
>> different beast from the Chi-Square test for independence (which is
>> stats.chisquare_contingency). The fact that the distribution of the
>> test statistic is the same should not tempt us to put them into the
>> same function.
>>
>>
>> Please read scipy.stats.chisquare() because scipy.stats.chisquare() is
>> the 1-d case of yours.
>> Quote from the docstring:
>> " The chi square test tests the null hypothesis that the categorical data
>> has the given frequencies."
>> Also go to the web site provided in the docstring.
>>
>> By default you get the expected frequencies but you can also put in your
>> own using the f_exp variable. You could do the same in your code.
>>
>>
>> In fact, Warren correctly used stats.chisquare with the expected
>> frequencies calculated from the null hypothesis and the corrected
>> degrees of freedom.  chisquare_contingency is in some sense a
>> convenience method for taking care of these pre-calculations before
>> calling stats.chisquare.  Can you explain more clearly to me why we
>> should not include such a convenience function?
>>
>>
>> Just a clarification, before I find time to work my way through the
>> other comments
>>
>> stats.chisquare is a generic test for goodness-of-fit for discrete or
>> binned distributions.
>> and from the docstring of it
>> "If no expected frequencies are given, the total
>>     N is assumed to be equally distributed across all groups."
>>
>> default is uniform distribution
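
A small check of that default, as a sketch only: the counts below are made-up die rolls, and the comparison assumes nothing beyond stats.chisquare and its f_exp argument.

```python
import numpy as np
from scipy import stats

# Made-up counts for the six faces of a die (illustrative data only).
rolls = np.array([16, 18, 16, 14, 12, 12])

# With no f_exp, the total count is spread equally over all groups.
default_result = stats.chisquare(rolls)

# Passing the uniform expected counts explicitly should give the same test.
uniform_expected = np.full(rolls.size, rolls.sum() / rolls.size)
explicit_result = stats.chisquare(rolls, f_exp=uniform_expected)
```

Both calls test the same null hypothesis, that every face is equally likely.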
>>
>>
>>
>> Try:
>> http://en.wikipedia.org/wiki/Pearson's_chi-square_test
>>
>> The use of the uniform distribution is rather misleading and technically
>> wrong as it does not help address the expected number of outcomes in a cell:
>>     
>
> quote from the wikipedia page:
> "A simple example is the hypothesis that an ordinary six-sided dice is
> "fair", i.e., all six outcomes are equally likely to occur."
>
> I don't see anything misleading or technically wrong with the uniform
> distributions,
> or if they come from a Poisson, Hypergeometric, binned Normal or any
> number of other distributions.
>
>
>   
>> http://en.wikipedia.org/wiki/Discrete_uniform_distribution
>>
>>
>> chisquare_twoway is a special case that additionally calculates the
>> correct expected frequencies for the test of independence based on the
>> margin totals. The resulting distribution is not uniform.
>>
>>
>> Actually the null hypothesis is rather different between 1-way and 2-way
>> tables so you can not say that chisquare_twoway is a special case of
>> chisquare.
>>     
>
> What is the Null hypothesis in a one-way table?
>
> Josef
>
>   
>> I am not sure what you mean by the 'resulting distribution is not uniform'.
>> The distribution of the cell values has nothing to do with the uniform
>> distribution in either case because it is not used in the data nor in the
>> formulation of the test. (And, yes, I have had to do the proof that the test
>> statistic is Chi-squared - which is why there is the warning about small
>> cells...).
>>
>> I agree with Neil that this is a very useful convenience function.
>>
>>
>> My problem with the chisquare_twoway is that it should not call another
>> function to finish two lines of code. It is just an excessive waste of
>> resources.
>>
>> I never heard of a one-way contingency table, my question was whether
>> the function should also handle 3-way or 4-way tables, in addition to
>> two-way.
>>
>>
>> Correct to both of these as I just consider these as n-way tables. I think
>> that contingency tables by definition only apply to the 2-d case. Pivot
>> tables are essentially the same thing. I would have to look up how to get
>> the expected number of outcomes but probably of the form Ni.. * N.j. *
>> N..k / N...^2 for the 3-way (the 2-way table is of the form Ni. * N.j / N..)
>> for i=rows, j=columns, k=3rd axis and '.' means sum for that axis.
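
A sketch of the n-way expected counts, assuming the null hypothesis is mutual independence of all axes. In that case each expected count is the product of the one-way marginal totals divided by N^(d-1) for a d-way table, so the denominator is N for the 2-way case and N^2 for the 3-way case. The function name is hypothetical.

```python
from functools import reduce

import numpy as np


def expected_under_independence(table):
    """Hypothetical helper: expected counts for a d-way table (d >= 2)."""
    observed = np.asarray(table, dtype=float)
    d = observed.ndim
    total = observed.sum()

    # One marginal total per axis, kept broadcastable against the full table.
    margins = []
    for axis in range(d):
        other_axes = tuple(a for a in range(d) if a != axis)
        margins.append(observed.sum(axis=other_axes, keepdims=True))

    # Product of the margins, divided by N ** (d - 1).
    return reduce(np.multiply, margins) / total ** (d - 1)


two_way = expected_under_independence([[10, 20], [30, 40]])
three_way_table = np.arange(1, 25).reshape(2, 3, 4)
three_way = expected_under_independence(three_way_table)
```

The expected table keeps the same one-way marginal totals as the observed table, which is a quick sanity check on the formula.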
>>
>> I thought about the question how the input should be specified for my
>> initial response, the alternative would be to use the original data or
>> a "long" format instead of a table. But I thought that as a
>> convenience function using the table format will be the most common
>> use.
>>
>> In the past I have written functions that calculate the contingency
>> table, and it would be very useful to have a more complete coverage of
>> tools to work with contingency tables in scipy.stats (or temporarily
>> in statsmodels, where we are working also on the anova type of
>> analysis)
>>
>>
>> It depends on what tasks are needed.  Really there are two steps:
>> 1) Cross-tabulation that summarizes the data from whatever input (groupby
>> would help here).
>> 2) Statistical tests - series of functions that accept summarized data only.
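
Step 1 can be sketched with plain NumPy for data in "long" format (one row label and one column label per observation). The function name and the sample labels below are hypothetical; only np.unique and np.add.at are existing calls.

```python
import numpy as np


def crosstab_sketch(row_labels, col_labels):
    """Hypothetical helper: count co-occurrences of two label sequences."""
    # np.unique returns the sorted levels plus, for each observation,
    # the index of its level.
    row_levels, row_idx = np.unique(np.asarray(row_labels).ravel(),
                                    return_inverse=True)
    col_levels, col_idx = np.unique(np.asarray(col_labels).ravel(),
                                    return_inverse=True)

    table = np.zeros((row_levels.size, col_levels.size), dtype=int)
    # np.add.at accumulates correctly when the same cell occurs repeatedly.
    np.add.at(table, (row_idx, col_idx), 1)
    return row_levels, col_levels, table


row_levels, col_levels, table = crosstab_sketch(
    ["a", "a", "b", "b", "b"],
    ["x", "y", "x", "x", "y"],
)
```

The resulting table is the summarized input that the step 2 functions would accept.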
>>
>> If you have separate functions then the burden is on the user to find and
>> call all the desired functions. You can also provide a single helper
>> function to do all that because you don't want to repeat unnecessary calls.
>>
>> So, I think the way it is, it is a nice function and we don't have to
>> put all contingency table analysis into this function.
>>
>> Josef
>>
>>
>> Bruce
>>
>>
>>
>>
>>
>> Really this should be combined with fisher.py in ticket 956:
>> http://projects.scipy.org/scipy/ticket/956
>>
>>
>> Wow, apparently I have lots of disagreements today, but I don't think
>> that this should be combined with Fisher's Exact test. (I would like
>> to see that ticket mature to the point where it can be added to
>> scipy.stats.) I like the functions in scipy.stats to correspond in a
>> one-to-one manner with the statistical tests. I think that the docs
>> should "See Also" the appropriate exact (and non-parametric) tests,
>> but I think that one function/one test is a good rule. This is
>> particularly true for people (like me) who would like to someday be
>> able to use scipy.stats in a pedagogical context.
>>
>> -Neil
>>
>>
>> I don't see any 'disagreements' rather just different ways to do things
>> and identifying areas that need to be addressed for more general use.
>>
>>
>> Agreed. :)
>>
>> [...]
>>
>> -Neil
>> _______________________________________________
>> SciPy-Dev mailing list
>> SciPy-Dev at scipy.org
>> http://mail.scipy.org/mailman/listinfo/scipy-dev