[SciPy-User] Off by one bug in Scipy.stats.hypergeom

Sun Aug 1 05:33:27 EDT 2010

On Sun, Aug 1, 2010 at 1:16 AM, Jacob Biesinger
<jake.biesinger at gmail.com> wrote:
> I have seen the error of my ways and apologize for the noise in the list-- I
> should have been using the probability mass function instead of the survival
> function.  Doing the summation manually shows that all is well:
> def hypergeom(x, fgTotal, bgMatching, bgTotal):
>         return scipy.comb(fgTotal,x) *
> scipy.comb(bgTotal-fgTotal,bgMatching-x) / scipy.comb(bgTotal,bgMatching)
>>>> scipy.stats.hypergeom.sf(4,50,5,10, loc=1)
> 0.0040835205497095073
>>>> sum(scipy.stats.hypergeom.pmf(x,50,5,10) for x in
>>>> range(4,min(10+1,50+1)))
> 0.0040835205497557125
>>>> sum([hypergeom(x, 10, 5, 50) for x in scipy.arange(4, min(10+1, 50+1))])
> 0.0040835205497557186
>
>>>> scipy.stats.hypergeom.sf(5,50,5,10, loc=1)
> 0.00011893749169422652
>>>> sum(scipy.stats.hypergeom.pmf(x,50,5,10) for x in
>>>> range(5,min(10+1,50+1)))
> 0.00011893749174045688
>>>> sum([hypergeom(x, 10, 5, 50) for x in scipy.arange(5, min(10+1, 50+1))])
> 0.00011893749174045679

Note that the distribution methods are vectorized. You can use an
array_like for x instead of the list comprehension.

Josef

>
> Thanks anyway!
> --
> Jake Biesinger
> Graduate Student
> Xie Lab, UC Irvine
> (949) 231-7587
>
>
> On Sat, Jul 31, 2010 at 5:48 PM, Jacob Biesinger <jake.biesinger at gmail.com>
> wrote:
>>
>> Hi!
>> Perhaps I'm using the module incorrectly, but it looks like the x
>> parameter in scipy.stats.hypergeom is off by one.  Specifically, I think
>> it's one-too-high.
>> From the wikipedia article
>> http://en.wikipedia.org/wiki/Hypergeometric_distribution#Application_and_example (I
>> know they could be wrong-- just hear me out on this),
>> scipy.stats.hypergeom?
>> Hypergeometric distribution
>>
>>        Models drawing objects from a bin.
>>        M is total number of objects, n is total number of Type I objects.
>>        RV counts number of Type I objects in N drawn without replacement
>> from
>>        population.
>> So translating wikipedia's example...
>> Pr(x=4; M=50, n=5, N=10)  = (choose(5,4) * choose(50-5, 10-4)) /
>> choose(50,10) = .003964583
>> Pr(x=5; M=50, n=5, N=10)  = (choose(5,5) * choose(50-5, 10-5)) /
>> choose(50,10) = .0001189375
>> Which you can check with the python code:
>> from scipy import comb as chse   # "combination" => choose
>>
>> float((chse(5,4, exact=1) * chse(50-5,10-4, exact=1))) /
>> chse(50,10,exact=1)  # example one
>> 0.0039645830580150654155  # okay!
>> float((chse(5,5, exact=1) * chse(50-5,10-5, exact=1))) /
>> chse(50,10,exact=1) # example two
>> 0.00011893749174045196247  # okay!
>> Try example one with scipy.stats.hypergeom:
>> # scipy.stats.hypergeom.sf(x, M, n, N)
>> scipy.stats.hypergeom.sf(4,50,5,10)
>> 0.00011893749169422652     # correct value for x=5, not x=4
>> scipy.stats.hypergeom.sf(5,50,5,10)
>> -4.6185277824406512e-14    # wrong
>> It seems that changing the loc value from =0 (default) to =1 fixes the
>> issue...
>> scipy.stats.hypergeom.sf(4,50,5,10, loc=1)
>> 0.0040835205497095073    # close enough
>> scipy.stats.hypergeom.sf(5,50,5,10, loc=1)
>> 0.00011893749169422652   # okay!
>> Am I using the package wrong?
>> --
>> Jake Biesinger
>> Graduate Student
>> Xie Lab, UC Irvine
>> (949) 231-7587
>
>
> _______________________________________________
> SciPy-User mailing list
> SciPy-User at scipy.org
> http://mail.scipy.org/mailman/listinfo/scipy-user
>
>