Sampling a population

Robert Kern robert.kern at gmail.com
Fri Jun 2 16:03:16 EDT 2006


Brian Quinlan wrote:
> The fastest algorithm that I have been able to devise for doing so is: 
> O(n * log(len(lst))). Can anyone think of a solution with a better time
> complexity? If not, is there an obviously better way to do this
> (assuming n is big and the list size is small)?
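
For reference, a pure-Python version of that O(n * log(len(lst))) approach
(my guess at the shape of it, not the original code) would be a cumulative
sum plus bisect:

import bisect
import random

def draw_baseline(n, lst):
    # lst is assumed to be a sequence of (item, probability) pairs
    cumulative = []
    total = 0.0
    for item, p in lst:
        total += p
        cumulative.append(total)
    # each lookup is a binary search: O(log(len(lst))), repeated n times
    return [bisect.bisect_right(cumulative, random.random() * total)
            for _ in range(n)]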

numpy.searchsorted() can do all of the n lookups in C.

# Untested.

import numpy as np

def draw(n, lst):
    # lst is assumed to be a sequence of (item, probability) pairs.
    ps = np.cumsum([0.0] + [x[1] for x in lst])
    # watch for floating point errors here. It's likely that the last
    # item won't be quite 1.0, so renormalize to be safe.
    ps /= ps[-1]

    r = np.random.random(n)  # n pseudorandom numbers uniform on [0, 1)
    # side='right' and the trailing -1 map each r in [ps[i], ps[i+1]) to i
    indices = np.searchsorted(ps, r, side='right') - 1

    # now do what you like with indices, which is an array of indices into
    # lst.
    return indices
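
For example, assuming lst is a list of (item, probability) pairs as above
(these particular weights are made up):

lst = [("a", 0.5), ("b", 0.3), ("c", 0.2)]
indices = draw(10000, lst)
samples = [lst[i][0] for i in indices]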

Ed Schofield has an implementation of an algorithm by Marsaglia[1] which turns
sampling into a fast table lookup. If your probabilities have limited precision
(2**-30 or so rather than the full double precision 2**-52 or so), then this
might be an attractive option.

[1]  http://www.jstatsoft.org/v11/i03/v11i03.pdf

The implementation is in the scipy sandbox currently, but I don't think it
builds painlessly at the moment.

  http://svn.scipy.org/svn/scipy/trunk/Lib/sandbox/montecarlo/

Ask on one of the scipy mailing lists if you need help building it.

  http://www.scipy.org/Mailing_Lists
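
In the meantime, here is a naive single-table sketch of that lookup idea,
just to show the O(1)-per-sample indexing. This is not the condensed
multi-table scheme from the paper, nor the sandbox code, and the precision
and helper names here are made up for illustration:

import numpy as np

PRECISION_BITS = 16  # coarse precision, purely for illustration

def build_table(lst):
    # lst is assumed to be a sequence of (item, probability) pairs whose
    # probabilities sum to (approximately) 1.
    size = 1 << PRECISION_BITS
    counts = np.round(np.array([x[1] for x in lst]) * size).astype(int)
    counts[-1] = size - counts[:-1].sum()  # absorb rounding error
    # index i appears counts[i] times, so a uniform draw from the table
    # picks i with probability ~counts[i] / size
    return np.repeat(np.arange(len(lst)), counts)

def draw_from_table(n, table):
    # each sample is a single O(1) table lookup
    return table[np.random.randint(0, len(table), size=n)]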

-- 
Robert Kern

"I have come to believe that the whole world is an enigma, a harmless enigma
 that is made terrible by our own mad attempt to interpret it as though it had
 an underlying truth."
  -- Umberto Eco



