[SciPy-Dev] chi-square test for a contingency (R x C) table

Bruce Southey bsouthey at gmail.com
Thu Jun 17 12:59:34 EDT 2010


On 06/17/2010 10:45 AM, josef.pktd at gmail.com wrote:
> On Thu, Jun 17, 2010 at 11:31 AM, Bruce Southey<bsouthey at gmail.com>  wrote:
>> On 06/17/2010 09:50 AM, josef.pktd at gmail.com wrote:
>>> On Thu, Jun 17, 2010 at 10:41 AM, Warren Weckesser
>>> <warren.weckesser at enthought.com>    wrote:
>>>> Bruce Southey wrote:
>>>>> On 06/16/2010 11:58 PM, Warren Weckesser wrote:
>>>>>> The feedback in this thread inspired me to generalize my original code
>>>>>> to the n-way test of independence.  I have attached the revised code to
>>>>>> a new ticket:
>>>>>>
>>>>>>        http://projects.scipy.org/scipy/ticket/1203
>>>>>>
>>>>>> More feedback would be great!
>>>>>>
>>>>>> Warren
>>>>> The handling for a one-way table is wrong:
>>>>>
>>>>>     >>> print 'One way', chisquare_nway([6, 2])
>>>>>     One way (0.0, 1.0, 0, array([ 6.,  2.]))
>>>>>
>>>>> It should also do the marginal independence tests.
>>>> As I explained in the description of the ticket and in the docstring,
>>>> this function is not intended for doing the 'one-way' goodness of fit.
>>>> stats.chisquare should be used for that.  Calling chisquare_nway with a
>>>> 1D array amounts to doing a test of independence between groupings but
>>>> only giving a single grouping, hence the trivial result.  This is
>>>> intentional.
>>
>> In expected_nway, you say that "While this function can handle a 1D
>> array," but clearly it does not handle it correctly.
>> If it was your intention not to handle one-way tables, then you *must*
>> check the input and reject them!
>>>> I guess the question is: should there be a "clever" chi-square function
>>>> that figures out what the user probably wants to do?
>> My issue is that the chi-squared test statistic is calculated in exactly
>> the same way for n-way tables for any n >= 1, so requiring a second
>> function for the one-way table is pure, unnecessary duplication of
>> functionality. I also prefer the one-stop-shopping approach.
> just because it's chisquare doesn't mean it's the same kind of test.
> This is a test for independence or association that only makes sense
> if there are at least two random variables.

Wrong!
See for example:
http://en.wikipedia.org/wiki/Pearson's_chi-square_test
"Pearson's chi-square is used to assess two types of comparison: tests 
of goodness of fit and tests of independence."

The exact same test statistic is being calculated; only the hypothesis is
different (which is the user's problem, not the function's). So please
separate the hypothesis from the test statistic.


> >>> np.corrcoef(np.arange(5))
> 1
>
> the equivalent function in R is summary(xtab(...))  not chisquare
>
I think you have the wrong function: corrcoef "Return[s] correlation 
coefficients", which corresponds to R's cor function.
What is xtab? That is not an R function (perhaps you mean xtabs?); it 
may be from someone's R package.

> I don't like mixing shoes and apples.
Then please don't.
>>>>> I would have expected the conversion of the input into an array in the
>>>>> chisquare_nway function. If the input is not an array, then there is
>>>>> a potential bug waiting to happen, because you expect numpy to correctly
>>>>> compute the observed minus expected. For example, if the input is a list,
>>>>> then it relies on numpy doing a list minus an ndarray. It is also
>>>>> inefficient in the sense that you have to convert the input twice (once
>>>>> for the expected values and once for the observed minus expected
>>>>> calculation).
>>>> I was going to put in something like table = np.asarray(table), but then
>>>> I noticed that, since `expected` had already been converted to an array,
>>>> the calculation worked even if `table` was a list.  E.g.
>>>>
>>>> In [4]: chisquare_nway([[10,10],[5,25]])
>>>> Out[4]:
>>>> (6.3492063492063489,
>>>>    0.011743382301172606,
>>>>    1,
>>>>    array([[  6.,  14.],
>>>>         [  9.,  21.]]))
>>>>
>>>> But I will put in the conversion--that will make it easier to do a few
>>>> other sanity checks on the input before trying to do any calculations.
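
Something like the following would cover both the conversion and the
one-way rejection I asked for (a sketch only; the helper name is
hypothetical, and the exact checks are of course yours to choose):

import numpy as np

# sketch: hypothetical helper for the up-front conversion and sanity checks
def _check_table(table):
    table = np.asarray(table, dtype=np.float64)   # convert exactly once
    if table.ndim < 2:
        raise ValueError("chisquare_nway is for tables with at least two "
                         "dimensions; use stats.chisquare for one-way data")
    if not np.isfinite(table).all() or np.any(table < 0):
        raise ValueError("all table entries must be finite and nonnegative")
    return table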
>>>>> You can also get interesting errors with a string input,
>>>>> where the reason may not be obvious:
>>>>>
>>>>>     >>> print 'twoway', chisquare_nway([['6', '2'], ['4', '11']])
>>>>>       File "chisquare_nway.py", line 132, in chisquare_nway
>>>>>         chi2 = ((table - expected)**2 / expected).sum()
>>>>>     TypeError: unsupported operand type(s) for -: 'list' and 'numpy.ndarray'
>>>>>
>>>>> I don't recall how np.asarray handles very large numbers, but I would
>>>>> also suggest an optional dtype argument instead of forcing the float64
>>>>> dtype: "table = np.asarray(table, dtype=np.float64)"
>>>> Sure, I can add that.
>>> The table values are integers, and I don't think there can be a problem
>>> with float64.
>>>
>>> If we start to add dtype arguments in stats functions, then we might
>>> need more checking of where and whether it's really relevant.
>>>
>>> Josef
>> Any time an operation uses summation, there is the potential for
>> overflow, which can be very serious for certain numerical types such as
>> integers. Consequently, numpy provides an optional dtype in
>> accumulation-related functions like sum and mean. This avoids the user
>> having to change the input from a lower precision to a higher precision,
>> thus mitigating the overflow problem. So if a function uses, say, numpy's
>> sum or variance, adding the dtype option is free protection.
> That's true in general, but is there a contingency table where
> overflow would ever occur at float64 or where there would be a
> significant loss of precision?
> My suspicion for this case is that a dtype argument just creates noise.
>
> Josef
Please do not mix 'precision' with 'overflow'; these are not the same issue:
http://steve.hollasch.net/cgindex/coding/ieeefloat.html
"Overflow means that values have grown too large for the representation."

The expected values are computed via summation, so overflow can occur for
tables with many cells containing large counts:

 >>> np.sum([10000000000000]*1000000)
-8446744073709551616

This is obviously avoidable if your platform supports a higher precision
(though the option is indeed noise on systems that do not):

 >>> np.sum([10000000000000]*1000000, dtype=np.float128)
10000000000000000000.0
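
A minimal sketch of the dtype pass-through I am suggesting (the function
name and signature are hypothetical, and only the 2-way case is shown,
not the ticket's n-way code):

import numpy as np

# Hypothetical illustration: same computation, but the accumulation dtype
# is the caller's choice instead of a hard-coded float64.
def chisquare_2way(table, dtype=np.float64):
    table = np.asarray(table, dtype=dtype)    # convert exactly once
    total = table.sum()
    # expected counts under independence, accumulated in `dtype`
    expected = np.outer(table.sum(axis=1), table.sum(axis=0)) / total
    chi2 = ((table - expected)**2 / expected).sum()
    return chi2, expected

On platforms that have it, passing dtype=np.float128 then protects the
margin sums at no cost to the common case.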


Bruce


>>>>> In expected_nway(), you could prestore a variable with 'range(d)',
>>>>> although the saving is small for small tables.
>>>>> Also, I would like to remove the usage of set() in the loop.
>>>>> For d=5 and k=2:
>>>>>
>>>>>     >>> list(set(range(d)) - set([k]))
>>>>>     [0, 1, 3, 4]
>>>>>     >>> rd = range(d)   # this would be outside the loop
>>>>>     >>> [elem for elem in rd if elem != k]
>>>>>     [0, 1, 3, 4]
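
In context, the loop I have in mind looks something like this (a sketch;
the actual structure of expected_nway on the ticket may differ, and
np.apply_over_axes here just sums out the other axes):

import numpy as np

table = np.arange(24).reshape(2, 3, 4)   # stand-in 3-way table
d = table.ndim
rd = range(d)                            # prestored once, outside the loop
margins = []
for k in rd:
    other_axes = [i for i in rd if i != k]     # instead of set arithmetic
    # margin over axis k: other axes summed out (kept as size-1 dims)
    margins.append(np.apply_over_axes(np.sum, table, other_axes))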
>>>> Looks good--I'll make that change.
>> Bruce


