[SciPy-Dev] chi-square test for a contingency (R x C) table

Sat Jun 19 08:39:03 EDT 2010

On Fri, Jun 18, 2010 at 10:02 PM, Warren Weckesser
<warren.weckesser at enthought.com> wrote:
> Bruce Southey wrote:
>> On 06/17/2010 07:43 PM, Neil Martinsen-Burrell wrote:
>>
>>> On 2010-06-17 11:59, Bruce Southey wrote:
>>>
>>>> On 06/17/2010 10:45 AM, josef.pktd at gmail.com wrote:
>>>>
>>>>> On Thu, Jun 17, 2010 at 11:31 AM, Bruce Southey<bsouthey at gmail.com>
>>>>> wrote:
>>>>>
>>>>>
>>>>>> On 06/17/2010 09:50 AM,josef.pktd at gmail.com  wrote:
>>>>>>
>>>>>>
>>>>>>> On Thu, Jun 17, 2010 at 10:41 AM, Warren Weckesser
>>>>>>> <warren.weckesser at enthought.com>    wrote:
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>>> Bruce Southey wrote:
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>>> On 06/16/2010 11:58 PM, Warren Weckesser wrote:
>>>>>>>>>
>>> [...]
>>>
>>>
>>>>>>>>> The handling for a one way table is wrong:
>>>>>>>>>
>>>>>>>>>>>> print 'One way', chisquare_nway([6, 2])
>>>>>>>>>>>>
>>>>>>>>> (0.0, 1.0, 0, array([ 6.,  2.]))
>>>>>>>>>
>>>>>>>>> It should also do the marginal independence tests.
>>>>>>>>>
>>>>>>>>>
>>>>>>>> As I explained in the description of the ticket and in the
>>>>>>>> docstring,
>>>>>>>> this function is not intended for doing the 'one-way' goodness of
>>>>>>>> fit.
>>>>>>>> stats.chisquare should be used for that.  Calling chisquare_nway
>>>>>>>> with a
>>>>>>>> 1D array amounts to doing a test of independence between
>>>>>>>> groupings but
>>>>>>>> only giving a single grouping, hence the trivial result.  This is
>>>>>>>> intentional.
>>>>>>>>
>>>>>> In expected-nway, you say that "While this function can handle a 1D
>>>>>> array," but clearly it does not handle it correctly.
>>>>>> If it was your intention not to do one way tables, then you *must*
>>>>>> check
>>>>>> the input and reject one way tables!
>>>>>>
>>>>>>
>>>>>>>> I guess the question is: should there be a "clever" chi-square
>>>>>>>> function
>>>>>>>> that figures out what the user probably wants to do?
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>> My issue is that the chi-squared test statistic is still calculated in
>>>>>> exactly the same way for n-way tables where n>0. So it is pure
>>>>>> unnecessary duplication of functionality if you require a second
>>>>>> function for the one way table. I also prefer the one-stop shopping
>>>>>> approach
>>>>>>
>>>>>>
>>>>> just because it's chisquare doesn't mean it's the same kind of tests.
>>>>> This is a test for independence or association that only makes sense
>>>>> if there are at least two random variables.
>>>>>
>>>> Wrong!
>>>> See for example:
>>>> http://en.wikipedia.org/wiki/Pearson's_chi-square_test
>>>> "Pearson's chi-square is used to assess two types of comparison: tests
>>>> of goodness of fit and tests of independence."
>>>>
>>>> The exact same test statistic is being calculated just that the
>>>> hypothesis is different (which is the user's problem not the function's
>>>> problem). So please separate the hypothesis from the test statistic.
>>>>
>>> It is only the exact same test statistic if we know the expected cell
>>> counts.  How these expected cell counts are determined depends
>>> completely on the type of test that is being carried out.  In a
>>> goodness-of-fit test (chisquare_oneway) the proportions of each cell
>>> must be specified in the null hypothesis.  For an independence test
>>> (chisquare_nway), the expected cell counts are computed from the given
>>> data and the null hypothesis of independence.  The fact that the
>>> formula involving observed and expected numbers is the same should not
>>> obscure the fact that the expected numbers come from two completely
>>> different assumptions in the n=1 and n>1 cases.  Can you explain how
>>> the expected cell counts should be determined in the 1D case without
>>> the function making assumptions about the user's null hypothesis?
>>>
>> The user's hypothesis is totally irrelevant here because you are
>> computing the TEST STATISTIC (sum of observed minus expected squared
>> divided by degree of freedom), that is available in multiple statistical
>> books and a not so good link:
>> http://en.wikipedia.org/wiki/Pearson's_chi-square_test
>>
>> This is one reason that the current stats.chisquare function accepts
>> observed and expected values as input.
>>
>> So given a set of observed values and the corresponding set of expected
>> values, you calculate the test statistic and consequently a p-value (you
>> can assume that the test statistic has a chi-squared distribution but
>> nothing is stopping you from some other approach such as bootstrapping).
>> Yet, nothing in that calculation of the actual test statistic nor the
>> p-value makes any reference to the user's assumptions.
>>
>> Where the user's assumption enter is how they compute the expected
>> values as in your expected_nway function, which is NOT your
>> chisquare_nway function. How these expected values defined do depend on
>> the user's hypothesis and if the user does not define these then the
>> function has to make a guess of what the user wants. That guess depends
>> partly on the input (obviously independence is irrelevant for a 1way
>> table) and what the developer wants such as (marginal) independence for
>> higher order tables.
>>
>> Thus it is up to the user to know what assumptions they made and what
>> hypothesis is being tested when they interpret the p-value because scipy
>> is not an expert system that somehow knows what the user wants to get in
>> all cases.
>>
>>
>>> I believe that we CANNOT separate the test statistic from the user's
>>> null hypothesis and that is the reason that chisquare_oneway and
>>> chisquare_nway should be separate functions.  The information required
>>> to properly do a goodness-of-fit test is qualitatively different than
>>> that required to do an independence test.  I support your suggestion
>>> to reject 1D arrays as input for chisquare_nway. (With appropriate
>>> checking for arrays such as np.array([[[1, 2, 3, 4]]].)
>>>
>>
>> All you need to calculate the Pearson chi-squared is the observed and
>> expected values. I don't care how you obtain these values, but once you
>> do then everything afterwards is the same until the user decides on how
>> to interpret the p-value.
>>
>> Under your argument there needs to be a new function for every different
>> hypothesis that a user wants. So a 'goodness-of-fit test' must have a
>> separate function (actually probably a good idea) than a one-way test
>> function etc. But that is pure code bloat when the test statistic is
>> identical.
>>
>>
>>>>> I don't like mixing shoes and apples.
>>>>>
>>>>>
>>>> Then please don't.
>>>>
>>> Great.  I'm glad to see that we all agree that chisquare_oneway and
>>> chisquare_nway should remain separate functions. :)
>>>
>>> -Neil
>>>
>> Sure as we need more code duplication to increase the size of scipy and
>> confuse any user!
>>
>> Bruce
>>
>
> [Grab a cup of coffee--a long post follows!]
>
> Bruce is correct: there is really just one chi-square statistic
> calculation. To do it, you need the observed frequencies, the expected
> frequencies, and the number of degrees of freedom.  A reason for having
> more than one function is that the default values for the expected
> frequencies, and the formulas for the degrees of freedom, are different
> depending on the hypothesis to be tested.
>
> As an exercise, I wrote the function signature and docstring for the
> case of one chi-square function that handles both goodness of fit and
> tests for independence, and then did the same for the API with two
> functions.  I had a small set of use cases in mind.  My candidate APIs
> follow the use cases.
>
> (A fixed font is recommended from here on.)
>
> ======================================
> Chi-square Test - Elementary Use Cases
> ======================================
>
> 1. (Goodnes of fit, with uniform probabilities for the expected frequencies)
> A six-sided die is rolled 60 times and the occurrences of each side are
> counted.  The following table shows the result:
>
> Side   1   2   3   4   5   6
> ----------------------------
> Freq.  3  17   8  13   7  12
>
> Is the die fair?
>
> In this case, the expected frequencies are uniform: [10, 10, 10, 10, 10,
> 10].
>
> The number of degrees of freedom is 5.
>
>
> 2. (Goodness of fit, with given probabilities for the expected frequencies)
> An archery target contains three levels: the bull's eye, the inner ring
> and the outer ring.  By area, these make up 5%, 25% and 70% of the target,
> respectively.
>
> An archer hits the target with 100 arrows, and the number of hits in each
> level are recorded:
>
> Level   Bull's eye    Inner   Outer
> -----------------------------------
> Freq.      16          29      55
>
> Are these frequencies any better than random?
>
>
> In this case, the user provides the expected frequencies, as [5, 25, 70]
> (or perhaps as a probability distribution [0.05, 0.25, 0.70]).
>
> The number of degrees of freedom is 2.
>
>
> 3. (2x2 test of independence)
> A group of 54 men and 46 women are asked whether they prefer vanilla or
> chocolate ice cream.  The results are:
>
>       Vanilla   Chocolate
> Men      23        31
> Women    19        27
>
> Is the flavor preference independent of sex?
>
> In a test of independence like this, one typically used the marginal
> sums and the assumption of independence to create the expected frequencies.
>
> The number of degrees of freedom is 1.
>
> Yates' correction should be applied.
>
>
> 4. (3x3x2 test of independence)
>
> No story, just some numbers:
>
> Group 1
>    X   Y   Z
> A  18  45 113
> B  32  60 321
> C  74  33  67
>
> Group 2
>    X   Y   Z
> A  11  33  87
> B  19  45 267
> C  40  19  80
>
> Are the groupings independent?
>
> The expected frequencies are to be computed from the marginal sums,
> assuming that the factors are independent.  The number of degrees of
> freedom is 12.
>
>
> ====================================
> API: One or two (or more) functions?
> ====================================
>
> Bruce has suggested implementing a "one-stop shopping" function.
> Here's my take on what such a function would look like:
>
> Single Function
> ---------------
>
> def chisquare(obs, expected=None, correction=False):
>    """Chi-square all-in-one function.
>
>    Parameters
>    ----------
>    obs : array_like
>        The observed frequencies
>
>    expected : array_like, with the same shape as obs (optional)
>
>        The expected frequencies.  If expected is given, then no matter
>        what the dimension of `expected`, the number of degrees of freedom
>        is equal to one less than the number of elements in `expected`.
>        (This amounts to a "goodness of fit" test.)
>
>        If `expected` is not given, and `obs` is 1D, the function
>        assumes that expected is uniformly distributed.  The number of
> degrees
>        of freedom is one less than the number of elements in `expected`.
>        (This is a test of "goodness of fit" to a uniform distribution.)
>
>        If `expected` is not given and `obs` is 2D or higher dimensional,
>        the function will compute the expected frequencies based on the
>        marginal sums of `obs` and assuming independence of the factors.
>        The number degrees of freedom is the number of elements in `obs`,
>        minus the sum of the dimensions, plus the number of dimensions
> minus 1.
>        (This is the "test of independence".)
>
>    correction : bool (optional)
>        If True and the number of degrees of freedom is one, Yates'
> correction
>        for continuity is applied.
>
>    Returns
>    -------
>    chi2 : float
>        The Chi-square statistic
>    p : float
>        The p-value
>    dof : int
>        The degrees of freedom
>    expected : ndarray (same shape as `obs`)
>        The expected frequencies
>    """
>
> Note that this function handles the case of a given `expected` array
> that is 2D or higher dimensional.  It basically ignores the dimension
> in this case and treats the problem the same as the 1D case.
>
> I haven't included a `ddof` argument. I am following the YAGNI
> principle:  all the use cases that I need are covered without `ddof`.
> But that may reflect my inexperience with chi-square tests.
>
> I prefer to not have the return type depend on the arguments, so it
> always returns `expected`, even if `expected` was given as an
> argument.
>
>
> Two Functions
> -------------
>
> An alternative is to provide two functions.  One function is intended
> for goodness-of-fit tests, which really just means that (a) the number
> of degrees of freedom is one less than the number of observed
> frequencies, and (b) if the expected frequencies are not given, they
> are assumed to be uniform.  The other function is for testing
> independence of factors.  The test only makes sense for arrays of
> dimension 2 or higher. The 1D case can be handled  consistently,
> but it is trivial.  This function is what I called chisquare_nway
> in ticket #1203; here it is called chisquare_ind, to better reflect
> its purpose.
>
> Note that in the test for independence, an `expected` array is not given;
> it is always computed by the function.
>
>
> def chisquare_fit(obs, expected=None, correction=False):
>    """Chi-square calculation for goodness-of-fit test.
>
>    The number of degrees of freedom is one less than the number of
>    elements in `obs`.
>
>    Parameters
>    ----------
>    obs : array_like
>        The observed frequencies
>
>    expected : array_like, with the same shape as obs (optional)
>        The expected frequencies. If `expected` is not given, the function
>        assumes that the expected frequencies are uniformly distributed.
>        That is,
>            expected = float(obs.sum()) * ones_like(obs) / obs.size
>
>    correction : bool (optional)
>        If True and the number of degrees of freedom is one, Yates'
> correction
>        for continuity is applied.
>
>    Returns
>    -------
>    chi2 : float
>        The Chi-square statistic
>    p : float
>        The p-value
>    dof : int
>        The degrees of freedom; this is obs.size - 1
>    """
>
>
> def chisquare_ind(obs, correction=False):
>    """Chi-square calculation for test of independence.
>
>    Each dimension of `obs` corresponds to a factor.
>    The function will compute the expected frequencies based on the
>    marginal sums of `obs` and assuming independence of the factors.
>    The number degrees of freedom is the number of elements in `obs`,
>    minus the sum of the dimensions, plus the number of dimensions minus 1.
>
>    Parameters
>    ----------
>    obs : array_like
>        The observed frequencies.  Each dimension of `obs` corresponds to
>        a factor.  Note that if `obs` is 1D, this test is trivial: the
>        chi-square statistic will be 0, the p-value will be one, and the
>        expected values will be the same as `obs`.
>
>    correction : bool (optional)
>        If True and the number of degrees of freedom is one, Yates'
> correction
>        for continuity is applied.
>
>    Returns
>    -------
>    chi2 : float
>        The Chi-square statistic
>    p : float
>        The p-value
>    dof : int
>        The degrees of freedom
>    expected : ndarray (same shape as `obs`)
>        The expected frequencies
>    """
>
>
> Each separate function has a somewhat simpler API compared to the
> single function.  I felt I had to write more about `expected` in the
> single function, because its role is conditional--it depends on its
> shape.  So, following the principle that "Simple is better than complex",
> I have a slight preference for the two separate functions.  However,
> now that I've written the docstring, I don't think the single function
> is terrible.
>
> When applied to the use cases, there is only a trivial difference in
> the APIs.  For example, using the single function:
>
> 1.
>  >>> obs = [3,  17,   8,  13,   7,  12]
>  >>> chisquare(obs)
>
> 2.
>  >>> obs = [16, 29, 55]
>  >>> expected  = [5, 25, 70]
>  >>> chisquare(obs, expected)
>
> 3.
>  >>> obs = [[23, 31], [19, 27]]
>  >>> chisquare(obs, correction=True)
>
> 4.
>  >>> obs = [...]
>  >>> chisquare(obs)
>
>
> To use the two functions instead, just use chisquare_fit in the first
> two cases and chisquare_ind in the second two, with exactly the same
> arguments.
>
>
> Anyone still here?  Still awake?  Thoughts?  Preferences?

Forget any merging of the functions.

Statistical functions should also be defined by their purpose, we are
not creating universal f_tests and t_tests. Unless someone is
proposing the merge and unify various t_tests, ... ?
misquoting: "The user's hypothesis is totally irrelevant ..." ???

Testing for goodness-of-fit is a completely different use case, with
different extensions, e.g. power discrepancy. What if I have a 2d
array and want to check goodness-of-fit along each axis, which might
be useful once group-by extensions to bincount handle more than 1d
weights. Or if we extend it to multivariate distributions, then the
default might be uniform for each column (and not independence.)
This is a standard test for distributions, and should not be mixed
with contingency tables

contingency tables are a different case, which I never use, and where
I would go with whatever statisticians prefer. But I think, going by
null hypothesis makes functions for statistical tests much cleaner
(easier to categorize, explain, find) than one-stop statistics (at
least for functions and not methods in classes) as is the current
tradition of scipy.stats.

"fit" in your function name is very misleading chisquare_fit, because
your function doesn't do any fitting. If a rename is desired, I would
call it chisquare_gof, but I use a similar name for the actual gof
test based on the sample data, with automatic binning.
Fitting the distribution parameters raises other issues which I don't
think should be mixed with the basic chisquare-test

Josef
(sent from the Road without Apple product)

>
>
> Warren
>
> _______________________________________________
> SciPy-Dev mailing list
> SciPy-Dev at scipy.org
> http://mail.scipy.org/mailman/listinfo/scipy-dev
>