Sampling a population

Duncan Smith buzzard at urubu.freeserve.co.uk
Fri Jun 2 13:19:32 EDT 2006


Brian Quinlan wrote:
> This is less a Python question and more an optimization/probability
> question. Imagine that you have a list of objects and their frequency in
> a population e.g.
> 
> lst = [(a, 0.01), (b, 0.05), (c, 0.50), (d, 0.30), (e, 0.04), (f, 0.10)]
> 
> and you want to draw n items from that list (duplicates allowed), with
> that probability distribution.
> 
> The fastest algorithm that I have been able to devise for doing so is:
> O(n * log(len(lst))). Can anyone think of a solution with a better time
> complexity? If not, is there an obviously better way to do this
> (assuming n is big and the list size is small)?
> 
> Here is the code:
> 
> from random import random
> from bisect import bisect
> 
> def draw(n, lst):
>     ps = []
>     last = 0
>     for p in lst:
>         ps.append(last + p)
>         last += p
> 
>     # ps = [0.01, 0.06, 0.56, 0.86, 0.90, 1.00]
> 
>     chosen = [0] * len(lst) # track frequency
>     for i in range(n):
>         r = random()
> 
>         chosen[bisect(ps, r)] += 1 # binary search and increment
> 
>     result = [] # rescale probability based on frequency
>     for c in chosen:
>         result.append(float(c) / n)
>     return result
> 
> lst = [0.01, 0.05, 0.50, 0.30, 0.04, 0.10] # just the probabilities; draw() above expects bare floats
> print draw(10000, lst)
> 

I would do something like the following (maybe with an additional check
that the probabilities do not sum to less than 1; a sketch of such a
check follows the session below),

>>> from random import random
>>> import operator
>>> def draw(n, lst):
	# sort by descending probability so the likeliest items are checked first
	lst.sort(key=operator.itemgetter(1), reverse=True)
	cumprobs = []
	this_cp = 0
	for p in lst:
		this_cp += p[1]
		cumprobs.append(this_cp)
	# for each draw, scan the cumulative probabilities until one exceeds rnd
	for _ in xrange(n):
		rnd = random()
		for i, cumprob in enumerate(cumprobs):
			if rnd < cumprob:
				yield lst[i][0]
				break

>>> lst = [('a', 0.01), ('b', 0.05), ('c', 0.50), ('d', 0.30), ('e', 0.04), ('f', 0.10)]
>>> list(draw(8, lst))
['d', 'd', 'c', 'e', 'c', 'd', 'c', 'f']
>>>
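For completeness, here's a minimal sketch of the check mentioned above
(the name check_probs and the tolerance are my own choices).  If the
probabilities sum to less than 1, random() can land beyond every
cumulative probability and that draw silently yields nothing:

>>> def check_probs(lst, tol=1e-9):
	# total of the supplied probabilities; should not fall short of 1
	total = sum(p for _, p in lst)
	if total < 1 - tol:
		raise ValueError('probabilities sum to less than 1')

Calling this once at the top of draw would catch the problem before any
draws are lost.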

Sorting the list in descending order of probability minimises the
expected number of iterations over the cumulative probabilities (for
your density function the inner loop will be broken out of after the
first iteration 50% of the time).  (I've assumed it doesn't matter to
you that the list gets sorted in place.)  I'm not sure that this is in
any way optimal.
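
As a rough illustration of the effect of the sort: each inner-loop
iteration costs one comparison, so the expected cost per draw is the sum
of (position + 1) * probability over the items.  The helper below
(expected_comparisons, a throwaway name of mine) computes that:

>>> def expected_comparisons(probs):
	# item at index i is reached after i + 1 comparisons
	return sum((i + 1) * p for i, p in enumerate(probs))

For your distribution this comes to roughly 3.61 comparisons per draw in
the original order, and roughly 1.86 once sorted in descending order.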

Duncan


