[SciPy-Dev] chi-square test for a contingency (R x C) table

josef.pktd at gmail.com josef.pktd at gmail.com
Thu Jun 17 10:50:13 EDT 2010


On Thu, Jun 17, 2010 at 10:41 AM, Warren Weckesser
<warren.weckesser at enthought.com> wrote:
> Bruce Southey wrote:
>> On 06/16/2010 11:58 PM, Warren Weckesser wrote:
>>
>>> The feedback in this thread inspired me to generalize my original code
>>> to the n-way test of independence.  I have attached the revised code to
>>> a new ticket:
>>>
>>>      http://projects.scipy.org/scipy/ticket/1203
>>>
>>> More feedback would be great!
>>>
>>> Warren
>>>
>>>
>>>
>>>
>> The handling for a one way table is wrong:
>>  >>>print 'One way', chisquare_nway([6, 2])
>> (0.0, 1.0, 0, array([ 6.,  2.]))
>>
>> It should also do the marginal independence tests.
>>
>
> As I explained in the description of the ticket and in the docstring,
> this function is not intended for doing the 'one-way' goodness of fit.
> stats.chisquare should be used for that.  Calling chisquare_nway with a
> 1D array amounts to doing a test of independence between groupings but
> only giving a single grouping, hence the trivial result.  This is
> intentional.
>
> I guess the question is: should there be a "clever" chi-square function
> that figures out what the user probably wants to do?
>
>
>> I would have expected the conversion of the input into an array in the
>> chisquare_nway function.  If the input is is not an array, then there is
>> a potential bug waiting to happen because you expect numpy to correctly
>> compute the observed minus expected. For example, if the input is a list
>> then it relies on numpy doing a list minus a ndarray.  It is also
>> inefficient in the sense that you have to convert the input twice (once
>> for the expected values and once for the observed minus expected
>> calculation.
>
>
> I was going to put in something like table = np.asarray(table), but then
> I noticed that, since `expected` had already been converted to an array,
> the calculation worked even if `table` was a list.  E.g.
>
> In [4]: chisquare_nway([[10,10],[5,25]])
> Out[4]:
> (6.3492063492063489,
>  0.011743382301172606,
>  1,
>  array([[  6.,  14.],
>       [  9.,  21.]]))
>
> But I will put in the conversion--that will make it easier to do a few
> other sanity checks on the input before trying to do any calculations.
>
>>  You can also get interesting errors with a string input
>> where the reason may not be obvious:
>>
>>  >>>print 'twoway', chisquare_nway([['6', '2'], ['4', '11']])
>>    File "chisquare_nway.py", line 132, in chisquare_nway
>>      chi2 = ((table - expected)**2 / expected).sum()
>> TypeError: unsupported operand type(s) for -: 'list' and 'numpy.ndarray'
>>
>>
>> I don't recall how np.asarray handles very large numbers but I would
>> also suggest an optional dtype argument instead of forcing float64 dtype:
>> "table = np.asarray(table, dtype=np.float64)"
>>
>>
>
> Sure, I can add that.

the table values are integers and I don't think there can be a problem
with float64.

If we start to add dtype arguments in stats function, then we might
need more checking where and whether it's really relevant.

Josef


>
>> In expected_nway(), you could prestore a variable with the  'range(d)'
>> although the saving is little for small tables.
>> Also, I would like to remove the usage of set() in the loop.
>> If k=2:
>>
>>  >>> list(set(range(d))-set([k]))
>> [0, 1, 3, 4]
>>  >>> rd=range(5) #which would be outside the loop
>>  >>> [ elem for elem in rd if elem != k ]
>> [0, 1, 3, 4]
>>
>>
>
> Looks good--I'll make that change.
>
>
>> Bruce
>>
>>
>>
>>
>>
>> _______________________________________________
>> SciPy-Dev mailing list
>> SciPy-Dev at scipy.org
>> http://mail.scipy.org/mailman/listinfo/scipy-dev
>>
>
> _______________________________________________
> SciPy-Dev mailing list
> SciPy-Dev at scipy.org
> http://mail.scipy.org/mailman/listinfo/scipy-dev
>



More information about the SciPy-Dev mailing list