[SciPy-Dev] chi-square test for a contingency (R x C) table

Wed Jun 2 12:02:02 EDT 2010

On 06/02/2010 09:37 AM, josef.pktd at gmail.com wrote:
> On Wed, Jun 2, 2010 at 8:24 AM, Neil Martinsen-Burrell<nmb at wartburg.edu>  wrote:
>    
>> On 2010-06-01 23:28 , Warren Weckesser wrote:
>>      
>>> I've been digging into some basic statistics recently, and developed the
>>> following function for applying the chi-square test to a contingency
>>> table.  Does something like this already exist in scipy.stats? If not,
>>> any objects to adding it?  (Tests are already written :)
>>>        
>> Something like this would be great in scipy.stats since I end up doing
>> the exact same thing by hand whenever I grade introductory statistics
>> exams.  Thanks for writing this!
>>      
You might find SAS helpful:
http://support.sas.com/documentation/cdl/en/procstat/63104/HTML/default/viewer.htm#/documentation/cdl/en/procstat/63104/HTML/default/freq_toc.htm

However, this code is the chi-squared test part as SAS will compute the 
actual cell numbers. Also an extension to scipy.stats.chisquare() so we 
can not have both functions.

Really this should be combined with fisher.py in ticket 956:
http://projects.scipy.org/scipy/ticket/956

>> I've got some code review comments that I'll include below.
>>
>>      
>>> def chisquare_contingency(table):
>>>        
>> I think that chiquare_twoway fits the common name for this test better,
>> but as Joseph mentions, this neglects the possibility of expanding this
>> to n-dimensions.
>>
>>      
>>>       """Chi-square calculation for a contingency (R x C) table.
>>>        
>> The docstring should emphasize that this is a hypothesis test.  See for
>> example http://docs.scipy.org/scipy/docs/scipy.stats.stats.ttest_rel/.
>> I'm not familiar with the R x C notation, but it does work to make clear
>> which chi square test this is.
>>
>>      
>>>       This function computes the chi-square statistic and p-value of the
>>>       data in the table.  The expected frequencies are computed based on
>>>       the relative frequencies in the table.
>>>        
>> I try to explain what the null and alternative hypotheses are for the
>> tests in scipy.stats.
>>      
It is also an asymptotic test so cell size should be mentioned.

>>      
>>>       Parameters
>>>       ----------
>>>       table : array_like, 2D
>>>           The contingency table, also known as the R x C table.
>>>        
>> This could also say something like "The table contains the observed
>> frequencies of each category."
>>
>>      
>>>       Returns
>>>       -------
>>>       chisquare statistic : float
>>>           The chisquare test statistic
>>>       p : float
>>>           The p-value of the test.
>>>        
>> A function like this could really use an example, perhaps straight from
>> one of the tests.
>>      
It needs at least to support both the 1-d and 2-d cases (preferably 
where R and C > 2)
>>>       """
>>>       table = np.asarray(table)
>>>       if table.ndim != 2:
>>>           raise ValueError("table must be a 2D array.")
>>>        
This should not be restricted to 2-d array's. At the very least it 
should handle 1-d and 2-d array_like inputs. There also should have 
correct handling of masked arrays because np.asarray ignores the mask - 
I do not recall what happens with Matrix class. Obviously one needs to 
address how masked values are handled such as replacing the values with 
zero.
>>>       # Create the table of expected frequencies.
>>>       total = table.sum()
>>>        
total=table.sum(dtype=float) # dtype will not be needed if integer 
division is not used (ie Python3)
>>>       row_sum = table.sum(axis=1).reshape(-1,1)
>>>       col_sum = table.sum(axis=0)
>>>       expected = row_sum * col_sum / float(total)
>>>        

expected = row_sum * col_sum /total

>> I think that np.outer(row_sum, col_sum) is clearer than reshaping one to
>> be a column vector.
>>      
Make it one liner:
expected = np.outer( table.sum(axis=1),  table.sum(axis=0))/total

>>>       # Since we are passing in 1D arrays of length table.size, the default
>>>       # number of degrees of freedom is table.size-1.
>>>       # For a contingency table, the actual number degrees of freedom is
>>>       # (nr - 1)*(nc-1).  We use the ddof argument
>>>       # of the chisquare function to adjust the default.
>>>       nr, nc = table.shape
>>>       dof = (nr - 1) * (nc - 1)
>>>       dof_adjust = (table.size - 1) - dof
>>>
>>>       chi2, p = chisquare(np.ravel(table), np.ravel(expected),
>>> ddof=dof_adjust)
>>>        
Where is your chisquare function - this is meant to be a standard alone 
function?

Why not do say:

import special

chi2_value=(((table-expected)**2)/expected).sum()
chi2_prob=special.chdtrc(dof,chi2_value)

>>>       return chi2, p
>>>        
>
> Just a thought:
> I think it would be useful to have this kind of proposals on the
> scipy-user list (even though it is a dev issue), just to be able to
> get more feedback from potential users.
>
> And again,
> Thanks Neil, it's very nice to have the statistics in the docstrings
> instead of having to run to Wikipedia
>
> Josef
>
>    
>> _______________________________________________
>> SciPy-Dev mailing list
>> SciPy-Dev at scipy.org
>> http://mail.scipy.org/mailman/listinfo/scipy-dev
>>
>>      
> _______________________________________________
> SciPy-Dev mailing list
> SciPy-Dev at scipy.org
> http://mail.scipy.org/mailman/listinfo/scipy-dev
>    
Bruce
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://mail.python.org/pipermail/scipy-dev/attachments/20100602/fcd8b7db/attachment.html>