[SciPy-Dev] chi-square test for a contingency (R x C) table

josef.pktd at gmail.com
Thu Jun 17 11:45:54 EDT 2010


On Thu, Jun 17, 2010 at 11:31 AM, Bruce Southey <bsouthey at gmail.com> wrote:
> On 06/17/2010 09:50 AM, josef.pktd at gmail.com wrote:
>> On Thu, Jun 17, 2010 at 10:41 AM, Warren Weckesser
>> <warren.weckesser at enthought.com>  wrote:
>>
>>> Bruce Southey wrote:
>>>
>>>> On 06/16/2010 11:58 PM, Warren Weckesser wrote:
>>>>
>>>>
>>>>> The feedback in this thread inspired me to generalize my original code
>>>>> to the n-way test of independence.  I have attached the revised code to
>>>>> a new ticket:
>>>>>
>>>>>       http://projects.scipy.org/scipy/ticket/1203
>>>>>
>>>>> More feedback would be great!
>>>>>
>>>>> Warren
>>>>>
>>>> The handling for a one-way table is wrong:
>>>>   >>> print 'One way', chisquare_nway([6, 2])
>>>> One way (0.0, 1.0, 0, array([ 6.,  2.]))
>>>>
>>>> It should also do the marginal independence tests.
>>>>
>>>>
>>> As I explained in the description of the ticket and in the docstring,
>>> this function is not intended for doing the 'one-way' goodness of fit.
>>> stats.chisquare should be used for that.  Calling chisquare_nway with a
>>> 1D array amounts to doing a test of independence between groupings but
>>> only giving a single grouping, hence the trivial result.  This is
>>> intentional.
>>>
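For reference, the one-way goodness-of-fit case Warren points to would look
like this with the existing stats.chisquare (a minimal sketch; by default the
expected frequencies are uniform):

    import numpy as np
    from scipy import stats

    # One-way goodness of fit on the counts from Bruce's example.
    # With the default uniform expected frequencies [4., 4.]:
    # chi2 = (6-4)**2/4 + (2-4)**2/4 = 2.0, p is about 0.157 (1 df).
    chi2, p = stats.chisquare(np.array([6, 2]))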
>
>
> In expected_nway, you say that "While this function can handle a 1D
> array," but clearly it does not handle it correctly.
> If it was your intention not to support one-way tables, then you *must*
> check the input and reject them!
>>> I guess the question is: should there be a "clever" chi-square function
>>> that figures out what the user probably wants to do?
>>>
> My issue is that the chi-squared test statistic is calculated in exactly
> the same way for any n-way table with n > 0, so requiring a second
> function for the one-way table is pure, unnecessary duplication of
> functionality. I also prefer the one-stop-shopping approach.

Just because it's a chi-square statistic doesn't mean it's the same kind
of test. This is a test for independence or association, which only makes
sense if there are at least two random variables.

>>> np.corrcoef(np.arange(5))
1.0

The equivalent function in R is summary(xtabs(...)), not chisquare.

I don't like mixing shoes and apples.
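To make the distinction concrete, here is the two-way independence statistic
computed by hand for the 2x2 table Warren uses below (a minimal sketch in
plain NumPy; the expected counts come from the outer product of the margins):

    import numpy as np

    table = np.asarray([[10, 10], [5, 25]], dtype=np.float64)
    total = table.sum()
    # Expected counts under independence: outer product of the row and
    # column margins, divided by the grand total.
    expected = np.outer(table.sum(axis=1), table.sum(axis=0)) / total
    chi2 = ((table - expected) ** 2 / expected).sum()
    # expected is [[6., 14.], [9., 21.]] and chi2 is 6.3492...,
    # matching the chisquare_nway output quoted below.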

>
>>>
>>>
>>>> I would have expected the conversion of the input into an array in the
>>>> chisquare_nway function.  If the input is not an array, then there is
>>>> a potential bug waiting to happen, because you expect numpy to correctly
>>>> compute the observed minus expected. For example, if the input is a list,
>>>> then it relies on numpy doing a list minus an ndarray.  It is also
>>>> inefficient in the sense that you have to convert the input twice (once
>>>> for the expected values and once for the observed-minus-expected
>>>> calculation).
>>>>
>>>
>>> I was going to put in something like table = np.asarray(table), but then
>>> I noticed that, since `expected` had already been converted to an array,
>>> the calculation worked even if `table` was a list.  E.g.
>>>
>>> In [4]: chisquare_nway([[10,10],[5,25]])
>>> Out[4]:
>>> (6.3492063492063489,
>>>   0.011743382301172606,
>>>   1,
>>>   array([[  6.,  14.],
>>>          [  9.,  21.]]))
>>>
>>> But I will put in the conversion--that will make it easier to do a few
>>> other sanity checks on the input before trying to do any calculations.
>>>
>>>
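If one-way input were rejected as Bruce suggests, those sanity checks might
look something like this (a sketch only, with a hypothetical helper name;
not Warren's actual patch):

    import numpy as np

    def _validate_table(table):
        # Convert once, up front, so every later step sees an ndarray.
        # Note: numeric strings such as '6' are parsed as floats here
        # rather than raising, so an explicit dtype check may be wanted.
        table = np.asarray(table, dtype=np.float64)
        if table.ndim < 2:
            raise ValueError("table must be at least 2-D; use "
                             "stats.chisquare for one-way goodness of fit")
        if np.any(table < 0):
            raise ValueError("all cell counts must be non-negative")
        return table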
>>>>   You can also get interesting errors with a string input
>>>> where the reason may not be obvious:
>>>>
>>>>   >>> print 'twoway', chisquare_nway([['6', '2'], ['4', '11']])
>>>>     File "chisquare_nway.py", line 132, in chisquare_nway
>>>>       chi2 = ((table - expected)**2 / expected).sum()
>>>> TypeError: unsupported operand type(s) for -: 'list' and 'numpy.ndarray'
>>>>
>>>>
>>>> I don't recall how np.asarray handles very large numbers, but I would
>>>> also suggest an optional dtype argument instead of forcing the float64
>>>> dtype: "table = np.asarray(table, dtype=np.float64)"
>>>>
>>>>
>>>>
>>> Sure, I can add that.
>>>
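A sketch of what Bruce's optional dtype argument might look like (the
function body is elided; this is not the code attached to the ticket):

    import numpy as np

    def chisquare_nway(table, dtype=np.float64):
        # The caller may override the default float64, e.g. with
        # np.longdouble, if the default precision is a concern.
        table = np.asarray(table, dtype=dtype)
        # ... the rest of the calculation proceeds on `table` unchanged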
>> The table values are integers, and I don't think there can be a problem
>> with float64.
>>
>> If we start to add dtype arguments to stats functions, then we might need
>> more checking of where and whether they're really relevant.
>>
>> Josef
>>
> Any time an operation uses summation, there is the potential for
> overflow, which can be very serious for certain numerical types such as
> integers. Consequently, numpy provides an optional dtype in
> accumulation-related functions like sum and mean. This saves the user
> from having to convert the input from a lower precision to a higher
> one, thus mitigating the overflow problem. So if a function uses, say,
> numpy's sum or variance, adding the dtype option is free protection.
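Bruce's general point is easy to demonstrate with plain integer arithmetic
(a minimal sketch, not specific to contingency tables):

    import numpy as np

    a = np.array([2**62, 2**62])    # int64 values near the type's limit
    print a.sum()                   # 2**63 wraps around: -9223372036854775808
    print a.sum(dtype=np.float64)   # 9.22337203685e+18, no wraparound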

That's true in general, but is there a contingency table where overflow
would ever occur at float64, or where there would be a significant loss
of precision?  float64 represents integers exactly up to 2**53 (about
9e15), far beyond any realistic cell count.
My suspicion is that, in this case, a dtype argument just creates noise.

Josef

>
>
>
>>
>>
>>>
>>>> In expected_nway(), you could precompute 'range(d)' in a variable,
>>>> although the saving is small for small tables.
>>>> I would also like to remove the usage of set() in the loop.
>>>> If k = 2 (and d = 5):
>>>>
>>>>   >>> list(set(range(d)) - set([k]))
>>>> [0, 1, 3, 4]
>>>>   >>> rd = range(5)  # which would be outside the loop
>>>>   >>> [elem for elem in rd if elem != k]
>>>> [0, 1, 3, 4]
>>>>
>>> Looks good--I'll make that change.
>>>
>>>
> Bruce
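For reference, an expected_nway-style computation incorporating Bruce's two
suggestions might look like this (a sketch of the general n-way
expected-count calculation, not the code attached to the ticket; it assumes
a NumPy recent enough to accept a tuple of axes in sum):

    import numpy as np

    def expected_nway(table):
        # Expected counts under mutual independence of all d groupings:
        # the product of the d one-dimensional margins, divided by the
        # grand total raised to the power d - 1.
        table = np.asarray(table, dtype=np.float64)
        d = table.ndim
        total = table.sum()
        rd = range(d)                  # computed once, outside the loop
        expected = np.ones(table.shape) / total ** (d - 1)
        for k in rd:
            # Sum out every axis except k to get the k-th margin,
            # using a list comprehension instead of set arithmetic.
            others = [elem for elem in rd if elem != k]
            margin = table.sum(axis=tuple(others))
            # Reshape so the margin broadcasts along axis k only.
            shape = [1] * d
            shape[k] = table.shape[k]
            expected = expected * margin.reshape(shape)
        return expected

    # expected_nway([[10, 10], [5, 25]]) gives [[6., 14.], [9., 21.]]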


