[SciPy-Dev] chi-square test for a contingency (R x C) table
Neil Martinsen-Burrell
nmb at wartburg.edu
Thu Jun 17 20:43:18 EDT 2010
On 2010-06-17 11:59, Bruce Southey wrote:
> On 06/17/2010 10:45 AM, josef.pktd at gmail.com wrote:
>> On Thu, Jun 17, 2010 at 11:31 AM, Bruce Southey <bsouthey at gmail.com> wrote:
>>
>>> On 06/17/2010 09:50 AM, josef.pktd at gmail.com wrote:
>>>
>>>> On Thu, Jun 17, 2010 at 10:41 AM, Warren Weckesser
>>>> <warren.weckesser at enthought.com> wrote:
>>>>
>>>>
>>>>> Bruce Southey wrote:
>>>>>
>>>>>
>>>>>> On 06/16/2010 11:58 PM, Warren Weckesser wrote:
[...]
>>>>>> The handling for a one way table is wrong:
>>>>>> >>> print 'One way', chisquare_nway([6, 2])
>>>>>> (0.0, 1.0, 0, array([ 6., 2.]))
>>>>>>
>>>>>> It should also do the marginal independence tests.
>>>>>>
>>>>> As I explained in the description of the ticket and in the docstring,
>>>>> this function is not intended for doing the 'one-way' goodness of fit.
>>>>> stats.chisquare should be used for that. Calling chisquare_nway with a
>>>>> 1D array amounts to doing a test of independence between groupings but
>>>>> only giving a single grouping, hence the trivial result. This is
>>>>> intentional.
>>>
>>> In expected-nway, you say that "While this function can handle a 1D
>>> array," but clearly it does not handle it correctly.
>>> If it was your intention not to do one way tables, then you *must* check
>>> the input and reject one way tables!
>>>
>>>>> I guess the question is: should there be a "clever" chi-square function
>>>>> that figures out what the user probably wants to do?
>>>>>
>>>>>
>>> My issue is that the chi-squared test statistic is still calculated in
>>> exactly the same way for n-way tables where n>0. So it is pure
>>> unnecessary duplication of functionality if you require a second
>>> function for the one way table. I also prefer the one-stop shopping approach
>>>
>> just because it's chisquare doesn't mean it's the same kind of tests.
>> This is a test for independence or association that only makes sense
>> if there are at least two random variables.
>
> Wrong!
> See for example:
> http://en.wikipedia.org/wiki/Pearson's_chi-square_test
> "Pearson's chi-square is used to assess two types of comparison: tests
> of goodness of fit and tests of independence."
>
> The exact same test statistic is being calculated just that the
> hypothesis is different (which is the user's problem not the function's
> problem). So please separate the hypothesis from the test statistic.
It is only the exact same test statistic if we know the expected cell
counts. How these expected cell counts are determined depends
completely on the type of test that is being carried out. In a
goodness-of-fit test (chisquare_oneway) the proportions of each cell
must be specified in the null hypothesis. For an independence test
(chisquare_nway), the expected cell counts are computed from the given
data and the null hypothesis of independence. The fact that the formula
involving observed and expected numbers is the same should not obscure
the fact that the expected numbers come from two completely different
assumptions in the n=1 and n>1 cases. Can you explain how the expected
cell counts should be determined in the 1D case without the function
making assumptions about the user's null hypothesis?
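
To make the distinction concrete, here is a minimal sketch in plain Python. The helper names (expected_goodness_of_fit, expected_independence, pearson_statistic) are made up for illustration and are not the functions from the ticket. The point is only that the same (O - E)^2 / E formula is fed expected counts from two different sources.

```python
# Hypothetical helpers for illustration; not the code attached to the ticket.

def expected_goodness_of_fit(observed, proportions):
    """Expected counts when the null hypothesis states the cell proportions."""
    total = sum(observed)
    return [total * p for p in proportions]


def expected_independence(table):
    """Expected counts for a 2-D table, computed from its own margins."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    total = sum(row_totals)
    return [[r * c / total for c in col_totals] for r in row_totals]


def pearson_statistic(observed, expected):
    """Sum of (O - E)^2 / E over two flat sequences of equal length."""
    return sum((o - e) ** 2 / e for o, e in zip(observed, expected))


# Goodness of fit: the user has to supply the proportions (assumed equal here).
obs_1d = [6, 2]
exp_1d = expected_goodness_of_fit(obs_1d, [0.5, 0.5])
stat_1d = pearson_statistic(obs_1d, exp_1d)

# Independence: nothing is supplied; the margins of the data determine E.
obs_2d = [[10, 20], [30, 40]]
exp_2d = expected_independence(obs_2d)
flat_obs = [x for row in obs_2d for x in row]
flat_exp = [x for row in exp_2d for x in row]
stat_2d = pearson_statistic(flat_obs, flat_exp)
```

With only one grouping, the sole margin is the data itself, so an independence-style computation gives expected == observed and a statistic of 0, which is the trivial result quoted above for chisquare_nway([6, 2]).
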
I believe that we CANNOT separate the test statistic from the user's
null hypothesis and that is the reason that chisquare_oneway and
chisquare_nway should be separate functions. The information required
to properly do a goodness-of-fit test is qualitatively different from
that required to do an independence test. I support your suggestion to
reject 1D arrays as input for chisquare_nway. (With appropriate checking
for arrays such as np.array([[[1, 2, 3, 4]]]).)
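
One possible way to do that check, as a rough sketch: count the axes with more than one level rather than looking at ndim alone, so that an array of shape (1, 1, 4) is treated as the one-way table it really is. The function name check_nway_input and the error message are hypothetical, not anything in the patch.

```python
import numpy as np


def check_nway_input(observed):
    """Return the input as an array, or raise ValueError if it is
    effectively a one-way table.

    Hypothetical validation helper; the name and message are assumptions.
    """
    arr = np.asarray(observed)
    # Axes of length 1 contribute no grouping, so only count longer ones.
    nontrivial_axes = sum(1 for n in arr.shape if n > 1)
    if nontrivial_axes < 2:
        raise ValueError(
            "need at least two groupings with more than one level each; "
            "use a goodness-of-fit test for a one-way table"
        )
    return arr
```

Under this sketch both [6, 2] and np.array([[[1, 2, 3, 4]]]) are rejected, while an ordinary 2 x 2 table passes through unchanged.
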
>> I don't like mixing shoes and apples.
>>
> Then please don't.
Great. I'm glad to see that we all agree that chisquare_oneway and
chisquare_nway should remain separate functions. :)
-Neil