[SciPy-Dev] Contingency Table Model

Wed Aug 11 15:37:38 EDT 2010

On Wed, Aug 11, 2010 at 3:10 PM, Anthony Scopatz <scopatz at gmail.com> wrote:
>
>
> On Mon, Aug 9, 2010 at 3:35 PM, Bruce Southey <bsouthey at gmail.com> wrote:
>>
>> On 08/09/2010 02:31 PM, Anthony Scopatz wrote:
>>
>> Hello All,
>> I have just opened a ticket
>> (http://projects.scipy.org/scipy/ticket/1258) that adds a general
>> contingency table class to the stats package.  This class includes
>> methods to slice and collapse the table as well as calculate metrics such as
>> chi-squared and entropy.
>> This implementation came out of Warren Weckesser and me working on this
>> over the SciPy 2010 statistics sprint.
>> Please take a look!  Comments and suggestions are always welcome.
>> Be Well,
>> Anthony
>>
>> _______________________________________________
>> SciPy-Dev mailing list
>> SciPy-Dev at scipy.org
>> http://mail.scipy.org/mailman/listinfo/scipy-dev
>
> Hello All,
> I have updated the ticket with new versions of contingency_table.py and
> test_contingency_table.py.  I also have a github clone of scipy now, in case
> you just want to grab the changes: http://github.com/scopatz/scipy
> Issues addressed in the new version:
>
> Expected tables may now be user-specified,
> added from_flat() and to_flat() methods,

a clarification: for from_flat I was thinking about non-rectangular
data when a simple reshape doesn't work.
something like an nd version of
http://mail.scipy.org/pipermail/scipy-dev/2009-March/011592.html

for example, the count data may be given in a structured array together
with the corresponding group labels, where zero-count entries might be
missing and the records are not necessarily sorted in the order that a
reshape needs.

but now I think this is also handled by from_columns, where the user
specifies the "distribution" as a list of (unique) values. (?)
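
To make the non-rectangular case concrete, here is a rough sketch of
building an n-d table from flat records. It is not the code attached to
the ticket: table_from_flat and its (labels, counts) layout are made-up
names for illustration, and it assumes one label sequence per dimension
plus one count per record.

```python
import numpy as np

def table_from_flat(labels, counts):
    """Build an n-d count table from flat records (hypothetical helper).

    labels : sequence of 1-d label sequences, one per table dimension
    counts : 1-d sequence with one count per record
    """
    counts = np.asarray(counts)
    levels = []
    index = []
    for col in labels:
        # np.unique sorts the distinct labels and maps each record to its level
        uniq, inv = np.unique(np.asarray(col), return_inverse=True)
        levels.append(uniq)
        index.append(inv.ravel())
    table = np.zeros([len(u) for u in levels], dtype=counts.dtype)
    # np.add.at accumulates repeated cells; cells with no record stay zero
    np.add.at(table, tuple(index), counts)
    return table, levels

# Unsorted records, with the ("f", "yes") cell missing entirely
sex = ["m", "f", "m"]
smoker = ["yes", "no", "no"]
n = [3, 10, 7]
table, levels = table_from_flat([sex, smoker], n)
```

The ("f", "yes") cell never appears in the records, so it stays zero, and
the order of the records does not matter.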

(I haven't looked at the other changes)

Josef

> Retooled the chi_square() method and removed the chisquare_nway() function.
> All table metric methods (entropy) now add the calculated value to the
> contingency table's attributes as well as returning the value.
>
> Bruce, Thank you for your concerns.  I'd like to address your points below.
>
>>
>> 1) You cannot use numpy's asarray function without checking the input
>> type. You must be aware of at least masked arrays and Matrix inputs as well
>> as new data types.
>>
>> 2) You cannot force a dtype on the user (line 54) when you could provide an
>> optional precision argument instead.
>
> These are handled by now allowing the user to specify their own expected
> table.  The expected_nway() function that these two points relate to can now
> be avoided completely, if desired.
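
For reference, the kind of input check raised in points 1) and 2) might
look roughly like the sketch below. The helper name as_observed and its
dtype argument are assumptions made for illustration, not part of the
patch on the ticket.

```python
import numpy as np

def as_observed(obs, dtype=None):
    """Return `obs` as a plain ndarray without silently changing its dtype."""
    if np.ma.isMaskedArray(obs):
        # A masked cell has no count, so refuse rather than guess a fill value.
        if np.ma.getmaskarray(obs).any():
            raise ValueError("masked entries are not handled in this sketch")
        obs = np.ma.getdata(obs)
    # np.asarray returns a base-class ndarray, so np.matrix input loses its
    # matrix semantics here; dtype=None keeps whatever dtype the user passed.
    arr = np.asarray(obs, dtype=dtype)
    if arr.dtype.kind not in "biuf":
        raise TypeError("expected numeric counts, got dtype %s" % arr.dtype)
    return arr
```

Leaving dtype as None means a float32 table stays float32 unless the
caller asks for something else.
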
>
>>
>> 3) Can you please clarify lines 112-113?
>> "  scipy.stats.chisquare -- one-way chi-square test (which is not the same
>> as the n-way test with n=1)."
>> This needs to be a little more clear because the exact same test statistic
>> is being used. In fact the function must give the correct answer with a 1-d
>> array.
>>
>> 4) Related to point 3, lines 72-74 are not correct, see
>> http://en.wikipedia.org/wiki/Pearson's_chi-square_test
>
> The chisquared_nway() function has been removed, so 3) and 4) no longer
> apply.
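
On the point that the same test statistic is used in both cases: a small
sketch, assuming the usual Pearson form sum((O - E)**2 / E). The function
name pearson_statistic is made up here; only the expected values differ
between the one-way and the two-way case, and the caller supplies them.

```python
import numpy as np

def pearson_statistic(observed, expected):
    """Pearson chi-square statistic for arrays of any dimension."""
    observed = np.asarray(observed, dtype=float)
    expected = np.asarray(expected, dtype=float)
    return ((observed - expected) ** 2 / expected).sum()

# One-way case: equal expected frequencies in every category
obs1 = np.array([10, 20, 30])
exp1 = np.full(obs1.shape, obs1.mean())
stat1 = pearson_statistic(obs1, exp1)

# Two-way case: expected counts under independence of rows and columns
obs2 = np.array([[10, 20], [30, 40]])
exp2 = np.outer(obs2.sum(axis=1), obs2.sum(axis=0)) / obs2.sum()
stat2 = pearson_statistic(obs2, exp2)
```

Passing the expected table in explicitly is also what makes user-specified
expected values straightforward.
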
>
>>
>> 5) You must allow the user to provide their own expected values
>
>
> done.
>
>>
>> 6) Users need to be able to control the output - really I don't want to
>> see the table of expected values unless requested. Also a user might just
>> want the table of expected values and nothing else.
>
> The expected table, much like the probability table or the number of degrees
> of freedom or the number of dimensions, is not really an output.  Rather it
> is more of an attribute that helps calculate outputs, like the entropy,
> mutual information, etc.  Therefore it should always be included in an
> instance of ContingencyTable.  A user could simply have an array of values
> that they call a contingency table, but this class provides a tool for
> easily calculating related metrics (outputs).
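
The attribute-versus-output distinction can be pictured with a toy class
like the one below. TableSketch and every attribute name in it are
hypothetical stand-ins, not the ContingencyTable API from the ticket; the
default expected table assumes independence of the dimensions.

```python
import numpy as np

class TableSketch:
    """Toy stand-in for a contingency table class (names are assumptions)."""

    def __init__(self, observed, expected=None):
        self.observed = np.asarray(observed, dtype=float)
        self.ndim = self.observed.ndim
        self.total = self.observed.sum()
        self.probability = self.observed / self.total
        if expected is None:
            expected = self._independent_expected()
        # The expected table is stored as an attribute, never printed.
        self.expected = np.asarray(expected, dtype=float)
        self.entropy_value = None  # filled in by entropy()

    def _independent_expected(self):
        # Product of the one-dimensional margins, scaled to the table total
        expected = np.full(self.observed.shape, self.total)
        for axis in range(self.ndim):
            other = tuple(a for a in range(self.ndim) if a != axis)
            margin = self.observed.sum(axis=other, keepdims=True)
            expected = expected * margin / self.total
        return expected

    def entropy(self):
        # Metric methods return the value and also store it on the instance.
        p = self.probability[self.probability > 0]
        self.entropy_value = float(-(p * np.log(p)).sum())
        return self.entropy_value

t = TableSketch([[10, 20], [30, 40]])
h = t.entropy()
```

A user who only wants the expected table reads t.expected and calls
nothing else.
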
>>
>> 7) You should not need the chi2 function.
>
> Now required since chisquared_nway() was removed.
>
>>
>> 8) More generally, what is the need for having a ContingencyTable object?
>
> Basically, my argument for the need is that contingency tables (or cross
> tabulations) are expected as standard in any statistics package.  R has
> them, Matlab has them, SPSS has them, Stata has them, and so on.  I know
> that when I came to scipy.stats and found that they weren't here already, I
> was disappointed.
> I hope this helps!
> Be Well,
> Anthony
>
>>
>> Bruce