Hypergeometric distribution

Steven D'Aprano steve at REMOVETHIScyber.com.au
Sun Jan 1 04:33:39 EST 2006


On Sat, 31 Dec 2005 16:24:02 -0800, Raven wrote:

> Thanks to all of you guys, I could resolve my problem using the
> logarithms as proposed by Robert.  I needed to calculate the factorial
> for genomic data, more specifically for the number of genes in the
> human genome i.e. about 30.000 and that is a big number :-)
> I didn't know gmpy
> Thanks a lot, really

Are you *sure* the existing functions didn't work? Did you try them?

>>> def log2(x):
...     return math.log(x)/math.log(2)
...
>>> n = 0.0
>>> for i in range(1, 300000):  # ten times bigger than you need
...     n += log2(i)
...
>>> n
5025564.6087276665
>>> t = time.time(); x = 2L**(int(n) + 1); time.time() - t
0.26649093627929688

That's about one quarter of a second to calculate 300,000 factorial
(approximately), and it shows that the calculations are well within
Python's capabilities.

Of course, converting this 1.5 million-plus digit number to a string takes
a bit longer:

>>> t = time.time(); len(str(x)); time.time() - t
1512846
6939.3762848377228

A quarter of a second to calculate, and almost two hours to convert to a
string. Lesson one: calculations on longints are fast. Converting them to
strings is not.

As far as your original question goes, try something like this:

(formula from memory, may be wrong)

>>> def bincoeff(n,r):
...     x = 1
...     for i in range(r+1, n+1):
...             x *= i
...     for i in range(1, n-r+1):
...             x /= i
...     return x
...
>>> bincoeff(10, 0)
1
>>> bincoeff(10, 1)
10
>>> bincoeff(10, 2)
45
>>> bincoeff(10, 3)
120
>>> import time
>>> t = time.time(); L = bincoeff(30000, 7000); time.time() - t
28.317800045013428

Less than thirty seconds to calculate a rather large binomial coefficient
exactly. How many digits?

>>> len(str(L))
7076

If you are calculating hundreds of hypergeometric probabilities, 30
seconds each could be quite painful, but it certainly shows that Python is
capable of doing it without resorting to logarithms which may lose some
significant digits. Although, in fairness, the log function doesn't seem
to lose much accuracy for arguments in the range you are dealing with.


How long does it take to calculate factorials?

>>> def timefact(n):
...     # calculate n! and return the time taken in seconds
...     t = time.time()
...     L = 1
...     for i in range(1, n+1):
...             L *= i
...     return time.time() - t
...
>>> timefact(3000)
0.054913997650146484
>>> timefact(30000)  # equivalent to human genome
5.069951057434082
>>> timefact(300000)  # ten times bigger
4255.2370519638062

Keep in mind, if you are calculating the hypergeometric probabilities
using raw factorials, you are doing way too much work.


-- 
Steven.




More information about the Python-list mailing list