[SciPy-Dev] chi-square test for a contingency (R x C) table

Wed Jun 2 14:39:17 EDT 2010

On 06/02/2010 01:18 PM, Neil Martinsen-Burrell wrote:
> On 2010-06-02 13:10 , Bruce Southey wrote:
> [...]
>
>>>> However, this code is the chi-squared test part as SAS will compute
>>>> the actual cell numbers. Also an extension to scipy.stats.chisquare()
>>>> so we cannot have both functions.
>>>
>>> Again, I don't understand what you mean that we can't have both
>>> functions? I believe (from a statistics teacher's point of view) that
>>> the Chi-Squared goodness of fit test (which is stats.chisquare) is a
>>> different beast from the Chi-Square test for independence (which is
>>> stats.chisquare_contingency). The fact that the distribution of the
>>> test statistic is the same should not tempt us to put them into the
>>> same function.
>> Please read scipy.stats.chisquare() because scipy.stats.chisquare() is
>> the 1-d case of yours.
>> Quote from the docstring:
>> "The chi square test tests the null hypothesis that the categorical
>> data has the given frequencies."
>> Also go to the web site provided in the docstring.
>>
>> By default you get the expected frequencies but you can also put in your
>> own using the f_exp variable. You could do the same in your code.
>
> In fact, Warren correctly used stats.chisquare with the expected 
> frequencies calculated from the null hypothesis and the corrected 
> degrees of freedom.  chisquare_contingency is in some sense a 
> convenience method for taking care of these pre-calculations before 
> calling stats.chisquare.  Can you explain more clearly to me why we 
> should not include such a convenience function?
I do not understand you here.

Clearly you have not read scipy.stats.chisquare() to know what it is
doing. You should also read the cited URL, including the second part:
http://faculty.vassar.edu/lowry/ch8pt2.html

I don't see any 'pre-calculations' in the code. You have to compute the
expected value for each cell under the overall null hypothesis. Then you
sum (observed-expected)*(observed-expected)/expected across all cells to
get the test statistic. That is trivial to do within the code, and it is a
waste of CPU time and memory to send it to another function to do that.
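
The calculation described above can be sketched in a few lines of plain
Python. This is only an illustration, not the code under review: the
function name chi2_independence and the example table are made up here,
and the expected counts assume the usual independence null hypothesis
(row total times column total, divided by the grand total).

```python
def chi2_independence(observed):
    """Pearson chi-square statistic for an R x C table (illustrative only)."""
    row_totals = [sum(row) for row in observed]
    col_totals = [sum(col) for col in zip(*observed)]
    total = sum(row_totals)

    # Expected count for each cell under the null hypothesis of independence.
    expected = [[r * c / total for c in col_totals] for r in row_totals]

    # Sum (observed - expected)^2 / expected across all cells.
    stat = sum(
        (o - e) * (o - e) / e
        for obs_row, exp_row in zip(observed, expected)
        for o, e in zip(obs_row, exp_row)
    )

    # Degrees of freedom for an R x C table: (R - 1) * (C - 1).
    dof = (len(row_totals) - 1) * (len(col_totals) - 1)
    return stat, dof, expected


# Made-up 2 x 2 table, purely for illustration.
table = [[10, 20], [20, 10]]
stat, dof, expected = chi2_independence(table)
```

The same statistic should also be obtainable by passing the flattened
observed and expected counts to scipy.stats.chisquare() through f_exp,
with ddof set to R + C - 2 so that the degrees of freedom come out as
(R - 1) * (C - 1).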

Bruce

>
>>>> Really this should be combined with fisher.py in ticket 956:
>>>> http://projects.scipy.org/scipy/ticket/956
>>>
>>> Wow, apparently I have lots of disagreements today, but I don't think
>>> that this should be combined with Fisher's Exact test. (I would like
>>> to see that ticket mature to the point where it can be added to
>>> scipy.stats.) I like the functions in scipy.stats to correspond in a
>>> one-to-one manner with the statistical tests. I think that the docs
>>> should "See Also" the appropriate exact (and non-parametric) tests,
>>> but I think that one function/one test is a good rule. This is
>>> particularly true for people (like me) who would like to someday be
>>> able to use scipy.stats in a pedagogical context.
>>>
>>> -Neil
>> I don't see any 'disagreements' rather just different ways to do things
>> and identifying areas that need to be addressed for more general use.
>
> Agreed. :)
>
> [...]
>
> -Neil
