[SciPy-Dev] chi-square test for a contingency (R x C) table

josef.pktd at gmail.com josef.pktd at gmail.com
Thu Jun 17 13:24:40 EDT 2010


On Thu, Jun 17, 2010 at 12:59 PM, Bruce Southey <bsouthey at gmail.com> wrote:
> On 06/17/2010 10:45 AM, josef.pktd at gmail.com wrote:
>
> On Thu, Jun 17, 2010 at 11:31 AM, Bruce Southey <bsouthey at gmail.com> wrote:
>
>
> On 06/17/2010 09:50 AM, josef.pktd at gmail.com wrote:
>
>
> On Thu, Jun 17, 2010 at 10:41 AM, Warren Weckesser
> <warren.weckesser at enthought.com>  wrote:
>
>
>
> Bruce Southey wrote:
>
>
>
> On 06/16/2010 11:58 PM, Warren Weckesser wrote:
>
>
>
>
> The feedback in this thread inspired me to generalize my original code
> to the n-way test of independence.  I have attached the revised code to
> a new ticket:
>
>       http://projects.scipy.org/scipy/ticket/1203
>
> More feedback would be great!
>
> Warren
>
>
>
> The handling for a one-way table is wrong:
>   >>> print 'One way', chisquare_nway([6, 2])
> (0.0, 1.0, 0, array([ 6.,  2.]))
>
> It should also do the marginal independence tests.
>
> As I explained in the description of the ticket and in the docstring,
> this function is not intended for doing the 'one-way' goodness of fit.
> stats.chisquare should be used for that.  Calling chisquare_nway with a
> 1D array amounts to doing a test of independence between groupings but
> only giving a single grouping, hence the trivial result.  This is
> intentional.
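(For reference, the one-way call Warren points to is just scipy.stats.chisquare
with its default null of equal expected frequencies; a minimal illustration,
with the value shown approximately -- newer SciPy wraps the result in a named
result object rather than a plain tuple:)

    >>> from scipy import stats
    >>> stats.chisquare([6, 2])
    (2.0, 0.15729920705028513)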
>
>
>
> In expected_nway, you say that "While this function can handle a 1D
> array," but clearly it does not handle it correctly.
> If it is your intention not to support one-way tables, then you *must* check
> the input and reject one-way tables!
>
>
> I guess the question is: should there be a "clever" chi-square function
> that figures out what the user probably wants to do?
>
>
>
> My issue is that the chi-squared test statistic is still calculated in
> exactly the same way for n-way tables where n > 0. So it is pure,
> unnecessary duplication of functionality if you require a second
> function for the one-way table. I also prefer the one-stop-shopping approach.
>
>
> Just because it's called chisquare doesn't mean it's the same kind of test.
> This is a test for independence or association that only makes sense
> if there are at least two random variables.
>
>
> Wrong!
> See for example:
> http://en.wikipedia.org/wiki/Pearson's_chi-square_test
> "Pearson's chi-square is used to assess two types of comparison: tests of
> goodness of fit and tests of independence."
>
> The exact same test statistic is being calculated; only the hypothesis
> is different (which is the user's problem, not the function's). So
> please separate the hypothesis from the test statistic.
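A minimal sketch of the point being argued here, not code from the ticket:
the Pearson statistic is the same arithmetic either way, and only the
expected table (and with it the hypothesis and the degrees of freedom)
changes.

    import numpy as np

    def pearson_chi2(observed, expected):
        observed = np.asarray(observed, dtype=np.float64)
        expected = np.asarray(expected, dtype=np.float64)
        return ((observed - expected) ** 2 / expected).sum()

    # one-way goodness of fit: expected frequencies from a uniform null
    obs1 = np.array([6., 2.])
    exp1 = np.ones_like(obs1) * obs1.sum() / obs1.size
    print(pearson_chi2(obs1, exp1))    # 2.0, df = 1

    # two-way independence: expected frequencies from the product of the margins
    obs2 = np.array([[10., 10.], [5., 25.]])
    exp2 = np.outer(obs2.sum(axis=1), obs2.sum(axis=0)) / obs2.sum()
    print(pearson_chi2(obs2, exp2))    # ~6.3492, df = 1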
>
>
> >>> np.corrcoef(np.arange(5))
> 1.0
>
> The equivalent function in R is summary(xtab(...)), not chisquare.
>
>
> I think you have the wrong function, because corrcoef "Return[s] correlation
> coefficients", which is R's cor function.
> What is xtab? It is not an R function; perhaps it is from someone's R package.
>
> I don't like mixing shoes and apples.
>
>
> Then please don't.
>
>
> I would have expected the conversion of the input into an array in the
> chisquare_nway function.  If the input is not an array, then there is
> a potential bug waiting to happen, because you expect numpy to correctly
> compute the observed minus expected.  For example, if the input is a list,
> then it relies on numpy doing a list minus an ndarray.  It is also
> inefficient in the sense that you have to convert the input twice (once
> for the expected values and once for the observed minus expected
> calculation).
>
>
>
> I was going to put in something like table = np.asarray(table), but then
> I noticed that, since `expected` had already been converted to an array,
> the calculation worked even if `table` was a list.  E.g.
>
> In [4]: chisquare_nway([[10,10],[5,25]])
> Out[4]:
> (6.3492063492063489,
>   0.011743382301172606,
>   1,
>   array([[  6.,  14.],
>        [  9.,  21.]]))
>
> But I will put in the conversion--that will make it easier to do a few
> other sanity checks on the input before trying to do any calculations.
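One possible shape for that conversion plus the sanity checks (only a sketch
of what is being discussed, not the actual patch; whether 1-D input should be
rejected is exactly the point of contention above):

    import numpy as np

    def _as_table(table):
        # convert once, up front: lists, tuples, and numeric strings all
        # become a float64 array; non-numeric input fails here with a
        # clear ValueError instead of a confusing TypeError later on
        table = np.asarray(table, dtype=np.float64)
        if table.ndim < 2:
            raise ValueError("expected an n-way table with n >= 2; "
                             "use stats.chisquare for one-way goodness of fit")
        if not np.all(np.isfinite(table)) or np.any(table < 0):
            raise ValueError("table entries must be finite and non-negative")
        return table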
>
>   You can also get interesting errors with a string input
> where the reason may not be obvious:
>
>   >>> print 'twoway', chisquare_nway([['6', '2'], ['4', '11']])
>     File "chisquare_nway.py", line 132, in chisquare_nway
>       chi2 = ((table - expected)**2 / expected).sum()
> TypeError: unsupported operand type(s) for -: 'list' and 'numpy.ndarray'
>
>
> I don't recall how np.asarray handles very large numbers, but I would
> also suggest an optional dtype argument instead of forcing the float64 dtype:
> "table = np.asarray(table, dtype=np.float64)"
>
>
> Sure, I can add that.
>
>
>
> The table values are integers, and I don't think there can be a problem
> with float64.
>
> If we start to add dtype arguments to stats functions, then we might
> need more checking of where and whether they are really relevant.
>
> Josef
>
>
>
> Any time an operation uses summation, there is the potential for overflow,
> which can be very serious for certain numerical types such as integers.
> Consequently, numpy provides an optional dtype in accumulation-related
> functions like sum and mean.  This saves a user from having to change the
> input from a lower precision to a higher precision, thus mitigating the
> overflow problem.  Thus, if a function uses, say, numpy's sum or variance,
> adding the dtype option is free protection.
>
>
> That's true in general, but is there a contingency table where
> overflow would ever occur at float64 or where there would be a
> significant loss of precision?
> My suspicion for this case is that a dtype argument just creates noise.
>
> Josef
>
>
> Please do not mix 'precision' with 'overflow', as these are not the same issue:
> http://steve.hollasch.net/cgindex/coding/ieeefloat.html
> "Overflow means that values have grown too large for the representation."
>
> The way the expected values are computed involves summation, so overflow can
> occur for large dimensions with large values:
>>>> np.sum([10000000000000]*1000000)
> -8446744073709551616
>
> Obviously this is avoidable if your OS supports higher precision (but that is
> indeed noise when the system does not support higher precision).
>>>> np.sum([10000000000000]*1000000, dtype=np.float128)
> 10000000000000000000.0

Since Warren converts to float,

    table = np.asarray(table, dtype=np.float64)

there is never an integer sum, so integer overflow is turned into a
possible precision problem:

>>> np.sum([10000000000000]*1000000, dtype=np.float64)
1e+019

If there were gigantic contingency tables,

    reduce(np.multiply, margins)

could be replaced by exp(sum(log(...))).
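A sketch of that alternative, assuming `margins` is the list of broadcastable
marginal-sum arrays built inside expected_nway and `total` is table.sum()
(both names are assumed here for illustration, and the margins are assumed
strictly positive; this is not the ticket's code).  The expected cell counts
under independence are prod(margins) / total**(d-1), and the same quantity
can be formed in log space so the intermediate product never gets huge:

    import numpy as np
    from functools import reduce

    def expected_direct(margins, total):
        # direct product of the margins; can overflow or lose precision
        d = len(margins)
        return reduce(np.multiply, margins) / total ** (d - 1)

    def expected_via_logs(margins, total):
        # the same value via exp(sum(log(...)))
        d = len(margins)
        log_exp = sum(np.log(m) for m in margins) - (d - 1) * np.log(total)
        return np.exp(log_exp)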

Josef


>
> Bruce
>
> In expected_nway(), you could prestore a variable holding range(d),
> although the saving is small for small tables.
> Also, I would like to remove the usage of set() in the loop.
> If k = 2:
>
>   >>> list(set(range(d)) - set([k]))
>   [0, 1, 3, 4]
>   >>> rd = range(5)  # which would be outside the loop
>   >>> [elem for elem in rd if elem != k]
>   [0, 1, 3, 4]
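For context, a sketch of the kind of loop in which that axis-selection list
would be used (only an illustration of the idea, not the ticket's code;
summing over a tuple of axes needs a reasonably recent numpy):

    import numpy as np

    def margins_of(table):
        table = np.asarray(table, dtype=np.float64)
        d = table.ndim
        axes = range(d)                        # computed once, outside the loop
        margs = []
        for k in axes:
            others = tuple(ax for ax in axes if ax != k)
            m = table.sum(axis=others)         # 1-D margin along axis k
            shape = [1] * d
            shape[k] = table.shape[k]
            margs.append(m.reshape(shape))     # reshape so the margins broadcast
        return margs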
>
> Looks good--I'll make that change.
>
> Bruce


