[SciPy-Dev] chi-square test for a contingency (R x C) table

Thu Jun 3 02:39:01 EDT 2010

On Thu, Jun 3, 2010 at 2:09 AM,  <josef.pktd at gmail.com> wrote:
> On Thu, Jun 3, 2010 at 12:47 AM, Neil Martinsen-Burrell
> <nmb at wartburg.edu> wrote:
>> On 2010-06-02 15:03 , Bruce Southey wrote:
>>> On 06/02/2010 01:41 PM, josef.pktd at gmail.com wrote:
>>>> On Wed, Jun 2, 2010 at 2:18 PM, Neil Martinsen-Burrell<nmb at wartburg.edu>  wrote:
>>>>
>>>>> On 2010-06-02 13:10 , Bruce Southey wrote:
>>
>> [...]
>>
>>>> I agree with Neil that this is a very useful convenience function.
>>>>
>>> My problem with the chisquare_twoway is that it should not call another
>>> function to finish two lines of code. It is just an excessive waste of
>>> resources.
>>
>> Do you mean that you would rather see the equivalent of
>>
>> chisq = (table - expected)**2 / expected
>> return chisq, chisqprob(chisq, dof)
>>
>> at the bottom of chisquare_contingency than the current call to
>> chisquare?  I'm certainly okay with that.
>
> But don't forget to ravel or you get cell-wise chisquare :)
> For parts that are not performance-sensitive, as in this case, I
> usually go by how easy the function is to understand and to test.
> For example, I prefer distributions.chi2.sf(chisq, dof) to
> chisqprob(chisq, dof) (I haven't checked whether it is correct),
> because I immediately see that it is a one-sided p-value.
>
> Inlining in this case might be nicer because of dof (when inlining)
> versus ddof (when calling chisquare); I found the ddof confusing to
> read.
>
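A minimal sketch of the inlined computation discussed above, assuming a plain two-way table of counts. The function name `chisquare_twoway_inline` and the example table are illustrative only, not the code under review. It sums over all cells (so the statistic is a scalar rather than cell-wise), uses `stats.chi2.sf` for the one-sided p-value, and then shows which `ddof` a call to `stats.chisquare` would need to give the same answer.

```python
import numpy as np
from scipy import stats


def chisquare_twoway_inline(table):
    """Chi-square test of independence for an R x C table (illustrative sketch)."""
    table = np.asarray(table, dtype=float)
    row_sums = table.sum(axis=1)
    col_sums = table.sum(axis=0)
    total = table.sum()
    # Expected counts under independence: N_i. * N_.j / N_..
    expected = np.outer(row_sums, col_sums) / total
    # Sum over all cells; without .sum() this would be a cell-wise array.
    chisq = ((table - expected) ** 2 / expected).sum()
    dof = (table.shape[0] - 1) * (table.shape[1] - 1)
    # Survival function = upper-tail (one-sided) p-value.
    pvalue = stats.chi2.sf(chisq, dof)
    return chisq, pvalue, dof, expected


table = [[10, 20, 30], [20, 20, 20]]
chisq, pvalue, dof, expected = chisquare_twoway_inline(table)

# The same result through stats.chisquare on the raveled cells.
# chisquare uses k - 1 - ddof degrees of freedom for k cells,
# so ddof must be chosen to leave (R - 1) * (C - 1).
ddof = expected.size - 1 - dof
chisq2, pvalue2 = stats.chisquare(np.ravel(table), np.ravel(expected), ddof=ddof)
```

For an R x C table this works out to ddof = R + C - 2, which is arguably harder to read at the call site than a plain dof.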
> related: while I was skimming Bruce's reference
> http://faculty.vassar.edu/lowry/ch8pt2.html
> I saw that they recommend continuity correction for the 2by2 case.
> Do you know what the common position on continuity correction is in this case?
>
> (In something vaguely related to this, I read recently that some
> continuity corrections make the test too conservative and are not
> recommended. But I don't remember for which test I read this.)

It actually was for the chi-square test:
http://en.wikipedia.org/wiki/Yates%27_correction_for_continuity

Josef

>
> If there is test specific continuity correction, then chisquare will
> have to be inlined.
>
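A rough sketch of what an inlined 2x2 statistic with a continuity correction could look like. The function name `chisquare_2x2` and the `correction` keyword are hypothetical, and clipping the corrected difference at zero is an assumption of this sketch, not part of the textbook formula. The correction subtracts 0.5 from each |observed - expected| before squaring, which is why it cannot be expressed through a plain call to chisquare.

```python
import numpy as np


def chisquare_2x2(table, correction=True):
    """Chi-square statistic for a 2x2 table, optionally Yates-corrected (sketch)."""
    table = np.asarray(table, dtype=float)
    if table.shape != (2, 2):
        raise ValueError("this sketch only handles 2x2 tables")
    expected = np.outer(table.sum(axis=1), table.sum(axis=0)) / table.sum()
    diff = np.abs(table - expected)
    if correction:
        # Yates: shrink each |O - E| by 0.5.
        # Clipping at zero is an assumption of this sketch.
        diff = np.maximum(diff - 0.5, 0.0)
    return (diff ** 2 / expected).sum()


example = [[12, 5], [6, 9]]
uncorrected = chisquare_2x2(example, correction=False)
corrected = chisquare_2x2(example, correction=True)
```

The corrected statistic is never larger than the uncorrected one, which is the sense in which the correction makes the test more conservative. The p-value for either would use 1 degree of freedom.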
>>
>>>> I never heard of a one-way contingency table; my question was whether
>>>> the function should also handle 3-way or 4-way tables, in addition to
>>>> two-way.
>>>>
>>> Correct to both of these as I just consider these as n-way tables. I
>>> think that contingency tables by definition only apply to the 2-d
>>> case. Pivot tables are essentially the same thing. I would have to
>>> look up how to get the expected number of outcomes, but it is probably
>>> of the form Ni.. * N.j. * N..k / N...^2 for the 3-way (the 2-way table
>>> is of the form Ni. * N.j / N..) for i=rows, j=columns, k=3rd axis,
>>> where '.' means sum over that axis.
>>
>> That is the correct (tensor) formula for higher dimensional tables.
>> Pragmatically, since the number of cells climbs so rapidly with
>> increasing dimension, there are more problems with small expected
>> counts.  If we thought people would be interested in using it, we could
>> certainly define a chisquare_nway function as well.
>
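A small sketch of the expected counts for an n-way table under mutual independence of all factors. The helper names `expected_nway` and `dof_nway` are hypothetical. For a d-way table the product of the d one-way marginal totals is divided by N^(d-1), so the 2-way case divides by N and the 3-way case by N^2.

```python
import numpy as np


def expected_nway(table):
    """Expected counts under mutual independence for a d-way table (sketch)."""
    table = np.asarray(table, dtype=float)
    d = table.ndim
    total = table.sum()
    expected = np.ones_like(table)
    for axis in range(d):
        other = tuple(a for a in range(d) if a != axis)
        # One-way marginal for this axis, kept broadcastable to the full shape.
        expected = expected * table.sum(axis=other, keepdims=True)
    return expected / total ** (d - 1)


def dof_nway(shape):
    """Degrees of freedom: cells - 1 - sum of (levels - 1) over the axes."""
    cells = int(np.prod(shape))
    return cells - 1 - sum(n - 1 for n in shape)


# A table that is exactly a product of its marginals is its own expectation.
product_table = np.einsum("i,j,k->ijk", [1.0, 2.0], [3.0, 4.0, 5.0], [6.0, 7.0])
expected_3way = expected_nway(product_table)
```

For a two-way table `dof_nway` reduces to the familiar (R - 1) * (C - 1).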
> I'm not too happy about having a large number of small functions
> especially if they have code duplication and need to be separately
> maintained.
> When there is a demand for a convenient special case, then it could
> just call the more general function.
>
> For testing distributions, the common approach when there are too few
> expected counts in some cells is to combine several cells together
> into one bin.
> I guess something like this might also be feasible for n-way tables,
> i.e. coarsen the grid, or not?
>
>>
>>>> For my initial response I thought about how the input should be
>>>> specified; the alternative would be to use the original data or
>>>> a "long" format instead of a table. But I thought that, for a
>>>> convenience function, the table format will be the most common
>>>> use.
>>>> I have written functions in the past that calculate the contingency
>>>> table, and it would be very useful to have more complete coverage of
>>>> tools to work with contingency tables in scipy.stats (or temporarily
>>>> in statsmodels, where we are also working on the anova type of
>>>> analysis).
>>>>
>>> It depends on what tasks are needed. Really there are two steps:
>>> 1) Cross-tabulation that summarizes the data from whatever input
>>> (groupby would help here).
>>> 2) Statistical tests - series of functions that accept summarized data only.
>>>
>>> If you have separate functions then the burden is on the user to find
>>> and call all the desired functions. You can also provide a single helper
>>> function to do all that because you don't want to repeat unnecessary calls.
>>
>> The facilities for handling raw, frame-style data in scipy.stats are not
>> too strong.  A tabulation function that we could stick together with the
>> chisquare* functions to make a single helper would certainly be convenient.
>
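One possible shape for the tabulation step, sketched with NumPy only. The function name `tabulate_pairs` is hypothetical, and it assumes two equal-length sequences of labels with no missing values. The resulting table of counts could then be handed to whichever test function accepts summarized data.

```python
import numpy as np


def tabulate_pairs(rows, cols):
    """Cross-tabulate two label sequences into a table of counts (sketch)."""
    rows = np.asarray(rows).ravel()
    cols = np.asarray(cols).ravel()
    if rows.shape != cols.shape:
        raise ValueError("rows and cols must have the same length")
    # Map each label to an integer code 0..n_levels-1.
    row_labels, row_codes = np.unique(rows, return_inverse=True)
    col_labels, col_codes = np.unique(cols, return_inverse=True)
    n_rows, n_cols = len(row_labels), len(col_labels)
    # Encode each (row, col) pair as one flat cell index and count them.
    flat = row_codes.ravel() * n_cols + col_codes.ravel()
    counts = np.bincount(flat, minlength=n_rows * n_cols)
    return row_labels, col_labels, counts.reshape(n_rows, n_cols)


row_labels, col_labels, counts = tabulate_pairs(
    ["a", "b", "a", "a", "b"],
    ["x", "x", "y", "x", "y"],
)
```

Using `minlength` keeps cells with zero observations in the table, so the shape always matches the label arrays.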
> Broader coverage of contingency tables, with all the data
> handling, bincount and table conversions, would be a much larger set
> of functions.
>
> I think our still-evolving design for statistics (including tests) in
> statsmodels is to move to a more object-oriented design, to keep
> things together, and to take advantage of reusing previous
> calculations.
>
> In this case it could be a ContingencyTable class that could combine
> creating the countdata from raw data (with or without missing values),
> marginalization if it's 3-way or higher, attach several tests, create
> a nice string that can be printed, and so on. With lazy evaluation and
> reuse of previous calculations, we think this would be a better design
> than only having standalone functions.
>
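A very rough sketch of the kind of class described above, to make the lazy evaluation and reuse concrete. The class name follows the paragraph, but every method and attribute here is hypothetical and is not an existing scipy or statsmodels API. It assumes Python 3.8+ for `functools.cached_property` and only covers the chi-square statistic under mutual independence.

```python
from functools import cached_property

import numpy as np


class ContingencyTable:
    """Hypothetical sketch; not an existing scipy or statsmodels class."""

    def __init__(self, counts):
        self.counts = np.asarray(counts, dtype=float)

    def marginalize(self, keep_axes):
        """Return a new table with every axis not in keep_axes summed out."""
        drop = tuple(a for a in range(self.counts.ndim) if a not in keep_axes)
        return ContingencyTable(self.counts.sum(axis=drop))

    @cached_property
    def expected(self):
        # Computed on first access, then stored on the instance and reused.
        d = self.counts.ndim
        result = np.ones_like(self.counts)
        for axis in range(d):
            other = tuple(a for a in range(d) if a != axis)
            result = result * self.counts.sum(axis=other, keepdims=True)
        return result / self.counts.sum() ** (d - 1)

    @cached_property
    def statistic(self):
        return float(((self.counts - self.expected) ** 2 / self.expected).sum())

    @cached_property
    def dof(self):
        return self.counts.size - 1 - sum(n - 1 for n in self.counts.shape)

    def __str__(self):
        return (f"ContingencyTable{self.counts.shape}: "
                f"chisquare={self.statistic:.4f}, dof={self.dof}")


t = ContingencyTable([[10, 20, 30], [20, 20, 20]])
summary = str(t)  # triggers statistic, which triggers and caches expected
```

Further tests could be attached as additional cached properties or methods that reuse `expected`, which is the main advantage over standalone functions that each recompute it.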
> grouping functions together:
> While statisticians might have a good overview of all the different
> tests, I found the "laundry list" of functions in scipy.stats pretty
> confusing for a long time.
> Instead of having a group of functions fisherexact, chisquare_twoway,
> chisquare_nway, and several other possible candidates for independence
> tests in contingency tables, we are starting to combine them together,
> e.g. independence_tests, mean_tests, variance_tests and
> correlation_test.
>
> We were discussing this in statsmodels in a different context, mainly
> diagnostic tests for regression, e.g. heteroscedasticity,
> autocorrelation tests or more recently post-hoc tests.
>
> In the current case, I also thought that combining with a fisherexact
> or other tests would potentially be useful, with a keyword argument
> that selects "chisquare", "exact", "..."
> Which is in this case not yet relevant because fisherexact, even when
> it works, is only for 2by2, and I don't think mixing them together is
> very useful.
>
> Josef
>
>
>
>> -Neil