[SciPy-Dev] chi-square test for a contingency (R x C) table

Thu Jun 3 11:22:42 EDT 2010

On Thu, Jun 3, 2010 at 11:05 AM, Bruce Southey <bsouthey at gmail.com> wrote:
> On 06/03/2010 01:48 AM, josef.pktd at gmail.com wrote:
>
> On Wed, Jun 2, 2010 at 4:03 PM, Bruce Southey <bsouthey at gmail.com> wrote:
>
>
> On 06/02/2010 01:41 PM, josef.pktd at gmail.com wrote:
>
> On Wed, Jun 2, 2010 at 2:18 PM, Neil Martinsen-Burrell <nmb at wartburg.edu>
> wrote:
>
>
> On 2010-06-02 13:10 , Bruce Southey wrote:
> [...]
>
>
>
> However, this code is the chi-squared test part, as SAS will compute the
> actual cell numbers. It is also an extension to scipy.stats.chisquare(), so
> we cannot have both functions.
>
>
> Again, I don't understand what you mean when you say that we can't have
> both functions. I believe (from a statistics teacher's point of view) that
> the Chi-Squared goodness of fit test (which is stats.chisquare) is a
> different beast from the Chi-Square test for independence (which is
> stats.chisquare_contingency). The fact that the distribution of the
> test statistic is the same should not tempt us to put them into the
> same function.
>
>
> Please read scipy.stats.chisquare() because scipy.stats.chisquare() is
> the 1-d case of yours.
> Quote from the docstring:
> " The chi square test tests the null hypothesis that the categorical data
> has the given frequencies."
> Also go to the web site provided in the docstring.
>
> By default you get the expected frequencies but you can also put in your
> own using the f_exp variable. You could do the same in your code.
>
>
> In fact, Warren correctly used stats.chisquare with the expected
> frequencies calculated from the null hypothesis and the corrected
> degrees of freedom.  chisquare_contingency is in some sense a
> convenience method for taking care of these pre-calculations before
> calling stats.chisquare.  Can you explain more clearly to me why we
> should not include such a convenience function?
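
The pre-calculations mentioned above can be sketched roughly as follows. This is a minimal pure-Python illustration, not the proposed chisquare_contingency code; the helper names and the example counts are made up. With SciPy, the flattened observed and expected counts could then be passed to stats.chisquare with ddof = r + c - 2, so that the p-value is based on (r-1)*(c-1) degrees of freedom.

```python
def expected_from_margins(table):
    """Expected counts under independence: row_total * col_total / grand_total."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    grand_total = sum(row_totals)
    return [[r * c / grand_total for c in col_totals] for r in row_totals]


def pearson_statistic(observed, expected):
    """Pearson's chi-squared statistic: sum of (O - E)^2 / E over all cells."""
    return sum(
        (o - e) ** 2 / e
        for obs_row, exp_row in zip(observed, expected)
        for o, e in zip(obs_row, exp_row)
    )


# Hypothetical 2 x 3 table of counts (illustrative data only).
observed = [[10, 20, 30],
            [20, 20, 20]]

expected = expected_from_margins(observed)
statistic = pearson_statistic(observed, expected)

n_rows, n_cols = len(observed), len(observed[0])
dof = (n_rows - 1) * (n_cols - 1)   # degrees of freedom for the independence test
ddof = n_rows * n_cols - 1 - dof    # correction relative to the default k - 1
```

The last line shows why a correction is needed: a goodness-of-fit test on the six flattened cells would default to 5 degrees of freedom, while the independence test on a 2 x 3 table has 2.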
>
>
> Just a clarification, before I find time to work my way through the
> other comments
>
> stats.chisquare is a generic test for goodness-of-fit for discrete or
> binned distributions.
> and from the docstring of it
> "If no expected frequencies are given, the total
>     N is assumed to be equally distributed across all groups."
>
> default is uniform distribution
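
To make the default concrete: when no expected frequencies are given, the total N is split equally across the groups, which is the fair-die setting. The following is a minimal pure-Python sketch with made-up counts; chisquare_equal is a hypothetical helper for illustration, not the scipy.stats.chisquare implementation.

```python
def chisquare_equal(f_obs, f_exp=None):
    """Pearson statistic; if f_exp is None, spread the total equally over groups."""
    if f_exp is None:
        total = sum(f_obs)
        f_exp = [total / len(f_obs)] * len(f_obs)
    return sum((o - e) ** 2 / e for o, e in zip(f_obs, f_exp))


# Made-up counts from 60 rolls of a six-sided die.
rolls = [8, 12, 9, 11, 10, 10]
stat_default = chisquare_equal(rolls)              # expected 10 per face
stat_explicit = chisquare_equal(rolls, [10] * 6)   # same null, stated explicitly
```

Passing any other list as f_exp gives the same statistic against a different null, which is how binned Poisson or Normal frequencies would enter.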
>
>
>
> Try:
> http://en.wikipedia.org/wiki/Pearson's_chi-square_test
>
> The use of the uniform distribution is rather misleading and technically
> wrong as it does not help address the expected number of outcomes in a cell:
>
>
> quote from the wikipedia page:
> "A simple example is the hypothesis that an ordinary six-sided dice is
> "fair", i.e., all six outcomes are equally likely to occur."
>
> I don't see anything misleading or technically wrong with the uniform
> distribution, or with expected frequencies that come from a Poisson,
> Hypergeometric, binned Normal or any number of other distributions.
>
>
> Okay this must be only for the 1-way table as it does not apply to the 2-way
> or higher tables where the test is for independence between variables.

I'm talking about a completely different strand of literature, e.g. a
commercial program specialized in this
http://www.mathwave.com/articles/goodness_of_fit.html#cs

And I never think of tables when I look at goodness-of-fit tests. I
haven't yet seen a case where the asymptotic results for the chisquare
test don't apply.

>
> There are valid technical reasons why it is misleading, because saying that a
> random variable comes from some distribution has an immutable meaning.
> Obviously if a random variable comes from the discrete uniform distribution
> on 1..N then that random variable must also have mean (N+1)/2, variance
> (N+1)*(N-1)/12 etc. Nothing is specified about the moments of the
> random variable under the null hypothesis, so you cannot say which
> distribution the random variable is from. For example, the random
> variable could be from a beta-binomial distribution (as when alpha=beta=1
> this is the discrete uniform) or binomial/multinomial with equal
> probabilities such that the statement 'all [the] outcomes are equally likely
> to occur' remains true.
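
The quoted moments can be checked directly. This is a small sketch that assumes the support is 1, 2, ..., N; the function name is made up for illustration.

```python
from fractions import Fraction


def discrete_uniform_moments(n):
    """Exact mean and variance of a uniform distribution on 1, 2, ..., n."""
    values = range(1, n + 1)
    mean = Fraction(sum(values), n)
    variance = sum((v - mean) ** 2 for v in values) / n
    return mean, variance


# Six equally likely outcomes, as for a fair die.
mean6, var6 = discrete_uniform_moments(6)
```

For N = 6 this agrees with (N+1)/2 and (N+1)*(N-1)/12.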
>
> If you assume that your random variables are discrete uniform or any other
> distribution (except normal) then in general you can not assume that the
> Pearson's chi-squared test statistic has a specific distribution. However,
> in this case the Pearson's chi-squared test statistic is asymptotically
> chi-squared because of the normality assumption. So provided the central
> limit theorem applies (it need not for all distributions or for
> 'small' sample sizes), this test will be asymptotically valid regardless
> of the distribution assumed for the random variables in this case.
>
> http://en.wikipedia.org/wiki/Discrete_uniform_distribution
>
>
> chisquare_twoway is a special case that additionally calculates the
> correct expected frequencies for the test of independence based on the
> margin totals. The resulting distribution is not uniform.
>
>
> Actually the null hypothesis is rather different between 1-way and 2-way
> tables so you can not say that chisquare_twoway is a special case of
> chisquare.
>
>
> What is the Null hypothesis in a one-way table?
>
> Josef
>
>
>
> SAS definition for 1-way table: "the null hypothesis specifies equal
> proportions of the total sample size for each class". This is not the same
> as saying a discrete uniform distribution as you are not directly testing
> that the cells have equal probability. But the ultimate outcome is probably
> not any different.

Ok, I will have to look at this (when I have time); in my opinion this
is inconsistent with the interpretation of a test for independence in
a two-way or three-way table.

Josef

>
> Bruce
>
>
> I am not sure what you mean by the 'resulting distribution is not uniform'.
> The distribution of the cell values has nothing to do with the uniform
> distribution in either case because it is not used in the data nor in the
> formulation of the test. (And, yes, I have had to do the proof that the test
> statistic is Chi-squared - which is why there is the warning about small
> cells...).
>
> I agree with Neil that this is a very useful convenience function.
>
>
> My problem with the chisquare_twoway is that it should not call another
> function to finish two lines of code. It is just an excessive waste of
> resources.
>
> I have never heard of a one-way contingency table; my question was whether
> the function should also handle 3-way or 4-way tables, in addition to
> two-way.
>
>
> Correct on both of these, as I just consider these as n-way tables. I think
> that contingency tables by definition only apply to the 2-d case. Pivot
> tables are essentially the same thing. I would have to look up how to get
> the expected number of outcomes but probably of the form Ni.. * N.j. *
> N..k / N...^2 for the 3-way (the 2-way table is of the form Ni. * N.j / N..)
> for i=rows, j=columns, k=3rd axis and '.' means sum for that axis.
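
A rough sketch of the 3-way expected counts, assuming the null hypothesis is mutual independence of the three variables. Under that assumption the denominator is the square of the grand total, which is what makes the expected counts sum back to N. The function name and the counts are made up for illustration.

```python
def expected_3way(table):
    """Expected counts under mutual independence for a nested-list I x J x K table."""
    n_i, n_j, n_k = len(table), len(table[0]), len(table[0][0])
    # One-way margins: sum over the other two axes.
    m_i = [sum(table[i][j][k] for j in range(n_j) for k in range(n_k))
           for i in range(n_i)]
    m_j = [sum(table[i][j][k] for i in range(n_i) for k in range(n_k))
           for j in range(n_j)]
    m_k = [sum(table[i][j][k] for i in range(n_i) for j in range(n_j))
           for k in range(n_k)]
    total = sum(m_i)
    return [[[m_i[i] * m_j[j] * m_k[k] / total ** 2 for k in range(n_k)]
             for j in range(n_j)]
            for i in range(n_i)]


# Made-up 2 x 2 x 2 table of counts.
counts = [[[5, 10], [15, 20]],
          [[10, 5], [20, 15]]]
expected = expected_3way(counts)
expected_total = sum(e for plane in expected for row in plane for e in row)
```

The same pattern with two margins and a single power of the total gives the 2-way case.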
>
> I thought about the question of how the input should be specified for my
> initial response; the alternative would be to use the original data or
> a "long" format instead of a table. But I thought that, for a
> convenience function, using the table format will be the most common
> use.
>
> I have written functions in the past that calculate the contingency
> table, and it would be very useful to have a more complete coverage of
> tools to work with contingency tables in scipy.stats (or temporarily
> in statsmodels, where we are working also on the anova type of
> analysis)
>
>
> It depends on what tasks are needed.  Really there are two steps:
> 1) Cross-tabulation that summarizes the data from whatever input (groupby
> would help here).
> 2) Statistical tests - series of functions that accept summarized data only.
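
Step 1 could look something like the sketch below, which builds a two-way table of counts from "long" format data using only the standard library. The crosstab name, its return shape and the data are assumptions made for illustration; the result is the kind of summarized table that the functions in step 2 would accept.

```python
from collections import Counter


def crosstab(rows, cols):
    """Count co-occurrences of paired labels; return (row_labels, col_labels, table)."""
    counts = Counter(zip(rows, cols))
    row_labels = sorted(set(rows))
    col_labels = sorted(set(cols))
    # Counter returns 0 for label pairs that never occur together.
    table = [[counts[(r, c)] for c in col_labels] for r in row_labels]
    return row_labels, col_labels, table


# Made-up "long" format data: one entry per observation.
treatment = ["a", "a", "b", "b", "b", "a", "b"]
outcome = ["yes", "no", "yes", "yes", "no", "yes", "yes"]

row_labels, col_labels, table = crosstab(treatment, outcome)
```

Keeping this step separate means the test functions never need to know how the raw data were stored.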
>
> If you have separate functions then the burden is on the user to find and
> call all the desired functions. You can also provide a single helper
> function to do all that because you don't want to repeat unnecessary calls.
>
> So, I think that, the way it is, it is a nice function and we don't have to
> put all contingency table analysis into this function.
>
> Josef
>
>
> Bruce
>
>
>
>
>
> Really this should be combined with fisher.py in ticket 956:
> http://projects.scipy.org/scipy/ticket/956
>
>
> Wow, apparently I have lots of disagreements today, but I don't think
> that this should be combined with Fisher's Exact test. (I would like
> to see that ticket mature to the point where it can be added to
> scipy.stats.) I like the functions in scipy.stats to correspond in a
> one-to-one manner with the statistical tests. I think that the docs
> should "See Also" the appropriate exact (and non-parametric) tests,
> but I think that one function/one test is a good rule. This is
> particularly true for people (like me) who would like to someday be
> able to use scipy.stats in a pedagogical context.
>
> -Neil
>
>
> I don't see any 'disagreements' rather just different ways to do things
> and identifying areas that need to be addressed for more general use.
>
>
> Agreed. :)
>
> [...]
>
> -Neil
> _______________________________________________
> SciPy-Dev mailing list
> SciPy-Dev at scipy.org
> http://mail.scipy.org/mailman/listinfo/scipy-dev