[SciPy-Dev] chi-square test for a contingency (R x C) table

Wed Jun 2 08:24:25 EDT 2010

On 2010-06-01 23:28 , Warren Weckesser wrote:
> I've been digging into some basic statistics recently, and developed the
> following function for applying the chi-square test to a contingency
> table.  Does something like this already exist in scipy.stats? If not,
> any objections to adding it?  (Tests are already written :)

Something like this would be great in scipy.stats since I end up doing 
the exact same thing by hand whenever I grade introductory statistics 
exams.  Thanks for writing this!

I've got some code review comments that I'll include below.

> def chisquare_contingency(table):

I think that chisquare_twoway fits the common name for this test better, 
but as Joseph mentions, this neglects the possibility of expanding this 
to n dimensions.

>      """Chi-square calculation for a contingency (R x C) table.

The docstring should emphasize that this is a hypothesis test.  See for 
example http://docs.scipy.org/scipy/docs/scipy.stats.stats.ttest_rel/. 
I'm not familiar with the R x C notation, but it does make clear 
which chi-square test this is.

>
>      This function computes the chi-square statistic and p-value of the
>      data in the table.  The expected frequencies are computed based on
>      the relative frequencies in the table.

I try to explain what the null and alternative hypotheses are for the 
tests in scipy.stats, and it would be good to do the same here.

>
>      Parameters
>      ----------
>      table : array_like, 2D
>          The contingency table, also known as the R x C table.

This could also say something like "The table contains the observed 
frequencies of each category."

>
>      Returns
>      -------
>      chisquare statistic : float
>          The chisquare test statistic
>      p : float
>          The p-value of the test.

A function like this could really use an example, perhaps straight from 
one of the tests.

>      """
>      table = np.asarray(table)
>      if table.ndim != 2:
>          raise ValueError("table must be a 2D array.")
>
>      # Create the table of expected frequencies.
>      total = table.sum()
>      row_sum = table.sum(axis=1).reshape(-1,1)
>      col_sum = table.sum(axis=0)
>      expected = row_sum * col_sum / float(total)

I think that np.outer(row_sum, col_sum) is clearer than reshaping one to 
be a column vector.
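
A minimal sketch of the two approaches, to show that they give the same 
table of expected frequencies.  The 2 x 3 table of counts is made up for 
illustration and is not taken from the existing tests.

```python
import numpy as np

# Hypothetical 2 x 3 table of observed counts, made up for illustration.
table = np.array([[10, 20, 30],
                  [20, 20, 20]])

total = table.sum()
row_sum = table.sum(axis=1)   # shape (2,)
col_sum = table.sum(axis=0)   # shape (3,)

# Approach in the patch: reshape the row sums into a column and broadcast.
expected_broadcast = row_sum.reshape(-1, 1) * col_sum / float(total)

# Suggested approach: outer product of the two marginal sums.
expected_outer = np.outer(row_sum, col_sum) / float(total)

# Both give the same (2, 3) array of expected frequencies.
assert np.allclose(expected_broadcast, expected_outer)
```

Note that np.outer flattens its inputs, so it works whether or not 
row_sum has already been reshaped.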

>
>      # Since we are passing in 1D arrays of length table.size, the default
>      # number of degrees of freedom is table.size-1.
>      # For a contingency table, the actual number of degrees of freedom
>      # is (nr - 1)*(nc - 1).  We use the ddof argument
>      # of the chisquare function to adjust the default.
>      nr, nc = table.shape
>      dof = (nr - 1) * (nc - 1)
>      dof_adjust = (table.size - 1) - dof
>
>      chi2, p = chisquare(np.ravel(table), np.ravel(expected),
>                          ddof=dof_adjust)
>      return chi2, p
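
The ddof arithmetic can be checked with a short sketch.  It assumes 
scipy.stats.chisquare uses k - 1 - ddof degrees of freedom for k 
observed frequencies, as its documentation states, and compares the 
result with the chi-square survival function evaluated directly at 
(nr - 1)*(nc - 1) degrees of freedom.  The table is made up for 
illustration.

```python
import numpy as np
from scipy.stats import chisquare, chi2

# Hypothetical 2 x 3 table of observed counts, made up for illustration.
table = np.array([[10, 20, 30],
                  [20, 20, 20]])

# Expected frequencies under independence of rows and columns.
expected = np.outer(table.sum(axis=1), table.sum(axis=0)) / float(table.sum())

nr, nc = table.shape
dof = (nr - 1) * (nc - 1)             # degrees of freedom for an R x C table
dof_adjust = (table.size - 1) - dof   # correction passed to chisquare as ddof

stat, p = chisquare(np.ravel(table), np.ravel(expected), ddof=dof_adjust)

# Same p-value computed directly from the chi-square distribution.
p_direct = chi2.sf(stat, dof)
assert abs(p - p_direct) < 1e-12
```

Here table.size is 6, so the default would be 5 degrees of freedom; 
dof_adjust is 3, which brings it down to the 2 that the test requires.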