[SciPy-User] Fisher exact test, anyone?

Ralf Gommers ralf.gommers at googlemail.com
Sun Nov 21 03:23:15 EST 2010


On Sat, Nov 20, 2010 at 1:35 AM, Bruce Southey <bsouthey at gmail.com> wrote:

> On Wed, Nov 17, 2010 at 7:24 AM, Ralf Gommers
> <ralf.gommers at googlemail.com> wrote:
> >
> >
> > On Wed, Nov 17, 2010 at 8:38 AM, <josef.pktd at gmail.com> wrote:
> >>
> >> On Tue, Nov 16, 2010 at 7:10 PM, Ralf Gommers
> >> <ralf.gommers at googlemail.com> wrote:
> >> >
> >> >
> >> > On Tue, Nov 16, 2010 at 11:45 PM, Bruce Southey <bsouthey at gmail.com>
> >> > wrote:
> >> >>
> >> >> I have no problem including this if we can agree on the API, because
> >> >> everything else is internal and can be fixed by the release date. So I
> >> >> would accept a placeholder API that enables a user in the future to
> >> >> select which tail(s) is performed.
> >> >
> >> > It is always possible to add a keyword "tail" later that defaults to
> >> > 2-tailed. As long as the behavior doesn't change this is perfectly
> >> > fine, and better than having a placeholder.
> >> >>
> >> >> 1) It just cannot use np.asarray() without checking the input first.
> >> >> This is particularly bad for masked arrays.
> >> >>
> >> > Don't understand this. The input array is not returned, only used
> >> > internally. And I can't think of doing anything reasonable with a 2x2
> >> > table with masked values. If that's possible at all, it should probably
> >> > just go into mstats.
> >> >
> >> >>
> >> >> 2) There is no dimension checking because, as I understand it, this
> >> >> can only handle a '2 by 2' table. I do not know enough about general
> >> >> 'r by c' tables or the 1-d case either.
> >> >>
> >> > Don't know how easy it would be to add larger tables. I can add
> >> > dimension checking with an informative error message.
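
A minimal sketch of what such a dimension check could look like, written in plain Python rather than against the actual patch. The helper name `validate_2x2` and the wording of the error messages are assumptions for illustration, not what went into scipy.

```python
def validate_2x2(table):
    """Return the cell counts (a, b, c, d) of a 2x2 table or raise ValueError.

    Hypothetical helper: accepts any nested sequence such as [[a, b], [c, d]].
    """
    try:
        rows = [list(row) for row in table]
    except TypeError:
        # A flat (1-d) input such as [1, 2, 3, 4] ends up here.
        raise ValueError(
            "expected a 2x2 table (two rows of two counts), got a non-nested input"
        ) from None

    if len(rows) != 2 or any(len(row) != 2 for row in rows):
        lengths = [len(row) for row in rows]
        raise ValueError(
            f"expected a 2x2 table, got {len(rows)} row(s) with lengths {lengths}"
        )

    (a, b), (c, d) = rows
    if min(a, b, c, d) < 0:
        raise ValueError("all counts in the table must be non-negative")
    return a, b, c, d


a, b, c, d = validate_2x2([[8, 2], [1, 5]])
```

Raising early with the offending shape in the message is the main point; whether larger tables are rejected or handled later is a separate decision.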
> >>
> >> There is some discussion in the ticket about more than 2by2,
> >> additions would be nice (and there are some examples on the matlab
> >> fileexchange), but 2by2 is the most common case and has an unambiguous
> >> definition.
> >>
> >>
> >> >
> >> >>
> >> >> 3) The odds-ratio should be removed because it is not part of the
> >> >> test. It is actually more general than this test.
> >> >>
> >> > Don't feel strongly about this either way. It comes almost for free,
> >> > and R seems to do the same.
> >>
> >> same here, it's kind of traditional to return two things, but in this
> >> case the odds ratio is not the test statistic, but I don't see that it
> >> hurts either
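
For context, the quantity that "comes almost for free" is presumably the sample (cross-product) odds ratio a*d / (b*c) of the table [[a, b], [c, d]]. A small sketch follows; the function name and the way zero cells are handled are assumptions for illustration, not the behavior of the patch.

```python
def sample_odds_ratio(a, b, c, d):
    """Cross-product odds ratio of the 2x2 table [[a, b], [c, d]].

    Assumed convention for zero cells: inf when only the denominator is
    zero, nan when numerator and denominator are both zero.
    """
    numerator = a * d
    denominator = b * c
    if denominator == 0:
        return float("nan") if numerator == 0 else float("inf")
    return numerator / denominator


ratio = sample_odds_ratio(8, 2, 1, 5)  # (8 * 5) / (2 * 1)
```

Since it needs only the four counts, returning it alongside the p-value costs nothing, which is the trade-off being discussed.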
> >>
> >> >
> >> >> 4) Variable names such as min and max should not shadow Python
> >> >> functions.
> >> >
> >> > Yes, Josef noted this already, will change.
> >> >>
> >> >> 5) Is there a reference to the algorithm implemented? For example,
> >> >> SPSS provides a simple 2 by 2 algorithm:
> >> >>
> >> >> http://support.spss.com/ProductsExt/SPSS/Documentation/Statistics/algorithms/14.0/app05_sig_fisher_exact_test.pdf
> >> >
> >> > Not supplied, will ask on the ticket and include it.
> >>
> >> I thought I saw it somewhere, but can't find the reference anymore;
> >> some kind of bisection algorithm. Having a reference would be good.
> >> Whatever the algorithm is, it's fast, even for larger values.
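
Absent a reference, the textbook definition of the two-sided 2x2 test can be sketched directly: with the margins fixed, the top-left cell follows a hypergeometric distribution, and the p-value sums the probabilities of all tables that are no more likely than the observed one. This is a plain summation in log space, not the bisection-style algorithm from the ticket; the function names and the tie tolerance are assumptions.

```python
from math import exp, lgamma


def log_comb(n, k):
    """Natural log of the binomial coefficient C(n, k)."""
    return lgamma(n + 1) - lgamma(k + 1) - lgamma(n - k + 1)


def fisher_exact_two_sided(a, b, c, d):
    """Two-sided Fisher exact p-value for the table [[a, b], [c, d]]."""
    row1, row2, col1 = a + b, c + d, a + c
    log_denominator = log_comb(row1 + row2, col1)

    def log_prob(x):
        # Hypergeometric log-probability that the top-left cell equals x.
        return log_comb(row1, x) + log_comb(row2, col1 - x) - log_denominator

    lowest = max(0, col1 - row2)   # smallest top-left count the margins allow
    highest = min(row1, col1)      # largest top-left count the margins allow
    log_observed = log_prob(a)
    tolerance = 1e-7               # assumed slack so ties survive rounding

    p_value = sum(
        exp(lp)
        for lp in (log_prob(x) for x in range(lowest, highest + 1))
        if lp <= log_observed + tolerance
    )
    return min(p_value, 1.0)


p_small = fisher_exact_two_sided(8, 2, 1, 5)
p_large = fisher_exact_two_sided(18000, 80000, 20000, 90000)
```

Working with log-gamma avoids forming huge binomial coefficients, at the cost of visiting every admissible table, so it is slower than a search that only evaluates the tails.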
> >>
> >> >>
> >> >> 6) Why exactly does the dtype need to be int64? That is, is there
> >> >> something wrong with the hypergeom function? I just want to understand
> >> >> why the precision change is required, because the input should enter
> >> >> with sufficient precision.
> >> >>
> >> > This test:
> >> > fisher_exact(np.array([[18000, 80000], [20000, 90000]]))
> >> > becomes much slower and gives an overflow warning with int32. int32 is
> >> > just not enough. This is just an implementation detail and does not in
> >> > any way limit the accepted inputs, so I don't see a problem here.
> >>
> >> for large numbers like this the chisquare test should give almost the
> >> same results, it looks pretty "asymptotic" to me. (the usual
> >> recommendation for the chisquare is more than 5 expected observations
> >> in each cell)
> >> I think the precision is required for some edge cases when
> >> probabilities get very small. The main failing case, which I was
> >> fighting with for several days last winter and didn't manage to fix,
> >> had a zero at the first position. I didn't think about increasing the
> >> precision.
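
To illustrate the chi-square comparison for the large table above, here is a sketch of the Pearson chi-square test for a 2x2 table without continuity correction. With one degree of freedom the survival function reduces to erfc(sqrt(x / 2)), so only the standard library is needed. The function names are assumptions, and all margins are assumed to be non-zero.

```python
from math import erfc, sqrt


def chi2_test_2x2(a, b, c, d):
    """Pearson chi-square statistic and p-value (1 df, no correction)."""
    n = a + b + c + d
    row1, row2, col1, col2 = a + b, c + d, a + c, b + d
    statistic = n * (a * d - b * c) ** 2 / (row1 * row2 * col1 * col2)
    p_value = erfc(sqrt(statistic / 2))  # chi-square survival function, 1 df
    return statistic, p_value


def min_expected_count(a, b, c, d):
    """Smallest expected cell count under independence."""
    n = a + b + c + d
    rows, cols = (a + b, c + d), (a + c, b + d)
    return min(r * k / n for r in rows for k in cols)


table = (18000, 80000, 20000, 90000)
statistic, p_value = chi2_test_2x2(*table)
smallest_expected = min_expected_count(*table)  # compare against the usual 5
```

The smallest expected count here is far above the usual threshold of 5, which is why the asymptotic test is expected to land close to the exact one.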
> >>
> >> >
> >> > Don't know what the behavior should be if a user passes in floats
> >> > though? Just convert to int like now, or raise a warning?
> >>
> >> I wouldn't do any type checking, and checking that floats are almost
> >> integers doesn't sound really necessary either, unless or until users
> >> complain. The standard usage should be pretty clear for contingency
> >> tables with count data.
> >>
> >> Josef
> >>
> >
> > Thanks for checking. https://github.com/rgommers/scipy/commit/b968ba17
> > should fix remaining things. Will wait for a few days to see if we get a
> > reference to the algorithm. Then will commit.
>
> Sorry but I don't agree. But I said I do not have time to address this
> and I really do not like adding the code as it is.
>

Bruce, I replied in detail to your previous email, so I'm not sure what you
want me to do here. If you don't have time for more discussion, and Josef
(as stats maintainer) is happy with the addition, I think it can go in.
Actually, it did go in right before your email, but that doesn't mean it's
too late for some changes.

Cheers,
Ralf