[SciPy-User] Fisher exact test, anyone?

Tue Nov 16 10:45:53 EST 2010

On 11/16/2010 07:04 AM, Ralf Gommers wrote:
>
>
> On Mon, Nov 15, 2010 at 12:40 AM, Bruce Southey <bsouthey at gmail.com 
> <mailto:bsouthey at gmail.com>> wrote:
>
>     On Sat, Nov 13, 2010 at 8:50 PM, <josef.pktd at gmail.com
>     <mailto:josef.pktd at gmail.com>> wrote:
>     > http://projects.scipy.org/scipy/ticket/956 and
>     > http://pypi.python.org/pypi/fisher/ have Fisher's exact
>     > test implementations.
>     >
>     > It would be nice to get a version in for 0.9. I spent a few
>     > unsuccessful days on it earlier this year. But since there are two
>     > new or corrected versions available, it looks like it just needs
>     > testing and a performance comparison.
>     >
>     > I won't have time for this, so if anyone volunteers for this, scipy
>     > 0.9 should be able to get Fisher's exact.
>
> https://github.com/rgommers/scipy/tree/fisher-exact
> All tests pass. There's only one usable version (see below), so I
> didn't do a performance comparison. I'll leave a note on #956 as well,
> saying we're discussing this on-list.
>
>     I briefly looked at the code at pypi link but I do not think it is
>     good enough for scipy. Also, I do not like when people license code as
>     'BSD' and there is a comment in cfisher.pyx  '# some of this code is
>     originally from the internet. (thanks)'. Consequently we can not use
>     that code.
>
>
> I agree, that's not usable. The plain Python algorithm is also fast 
> enough that there's no need to bother with Cython.
>
>
>     The code with ticket 956 still needs work especially in terms of the
>     input types and probably the API (like having a function that allows
>     the user to select either 1 or 2 tailed tests).
>
>
> Can you explain what you mean by work on input types? I used 
> np.asarray and forced dtype to be int64. For the 1-tailed test, is it 
> necessary? I note that pearsonr and spearmanr also only do 2-tailed.
>
> Cheers,
> Ralf
>
I have no problem including this if we can agree on the API, because
everything else is internal and can be fixed by the release date. So I
would accept a placeholder API that enables a user in the future to
select which tail(s) are tested.
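
A minimal sketch of what such a placeholder API might look like. This is
not the code from ticket 956 or from Ralf's branch; the function name
fisher_exact_2x2 and the keyword `alternative` with the values
'two-sided', 'less' and 'greater' are assumptions made up for
illustration. It uses only the standard library (math.comb) so the tail
logic is easy to see.

```python
from math import comb


def hypergeom_pmf(k, total, row1, col1):
    """P(top-left cell == k) for a 2x2 table with fixed margins."""
    return comb(col1, k) * comb(total - col1, row1 - k) / comb(total, row1)


def fisher_exact_2x2(table, alternative="two-sided"):
    """Hypothetical API sketch: p-value of Fisher's exact test, 2x2 only.

    `alternative` is an assumed keyword; it selects which tail(s) to sum.
    """
    (a, b), (c, d) = table
    row1, col1, total = a + b, a + c, a + b + c + d

    # Range of values the top-left cell can take given the margins.
    lowest = max(0, row1 + col1 - total)
    highest = min(row1, col1)
    probs = {k: hypergeom_pmf(k, total, row1, col1)
             for k in range(lowest, highest + 1)}

    if alternative == "less":
        p = sum(v for k, v in probs.items() if k <= a)
    elif alternative == "greater":
        p = sum(v for k, v in probs.items() if k >= a)
    elif alternative == "two-sided":
        # Sum every table that is no more likely than the observed one;
        # the small tolerance guards against floating-point ties.
        cutoff = probs[a] * (1 + 1e-7)
        p = sum(v for v in probs.values() if v <= cutoff)
    else:
        raise ValueError("alternative must be 'two-sided', 'less' or 'greater'")
    return min(p, 1.0)
```

With this shape, a release that only implements the two-sided test could
still accept the keyword and raise for the other values, so adding the
one-sided tests later would not change the signature.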

1) It cannot just use np.asarray() without checking the input first.
This is particularly bad for masked arrays, because np.asarray()
silently discards the mask.

2) There is no dimension checking, which is needed because, as I
understand it, this can only handle a '2 by 2' table. I do not know
enough about general 'r by c' tables or the 1-d case either.

3) The odds-ratio should be removed because it is not part of the test. 
It is actually more general than this test.

4) Variable names such as min and max should not shadow Python functions.

5) Is there a reference to the algorithm implemented? For example, SPSS 
provides a simple 2 by 2 algorithm:
http://support.spss.com/ProductsExt/SPSS/Documentation/Statistics/algorithms/14.0/app05_sig_fisher_exact_test.pdf

6) Why exactly does the dtype need to be int64? That is, is there
something wrong with the hypergeom function? I just want to understand
why the precision change is required, because the input should enter
with sufficient precision.
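
The input checks in points 1, 2 and 6 could be sketched roughly as
follows. The helper name as_2x2_counts is hypothetical and the exact
policy (rejecting masked entries, rejecting non-integer values) is an
assumption, not something agreed on in this thread.

```python
import numpy as np


def as_2x2_counts(table):
    """Hypothetical helper: validate a 2x2 table of counts."""
    # np.asarray() would drop the mask, so check for masked entries first.
    if np.ma.isMaskedArray(table) and np.ma.getmaskarray(table).any():
        raise ValueError("masked entries are not supported")

    arr = np.asarray(table)
    if arr.shape != (2, 2):
        raise ValueError("expected a 2x2 table, got shape %r" % (arr.shape,))

    # Accept integer dtypes, or floats that hold whole numbers (no NaN).
    if not np.issubdtype(arr.dtype, np.integer):
        is_float = np.issubdtype(arr.dtype, np.floating)
        if not is_float or not np.all(arr == np.round(arr)):
            raise ValueError("table entries must be whole numbers")

    if np.any(arr < 0):
        raise ValueError("table entries must be non-negative")

    # Convert only after the checks, so nothing is silently truncated.
    return arr.astype(np.int64)
```

Doing the conversion last means the int64 cast is a convenience for the
internal arithmetic and not a way of hiding bad input.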

Bruce