Smarter way of doing this?

Tue Feb 3 02:21:57 EST 2004

Max M <maxm at mxm.dk> 
> I have written this function, "choose_by_probability()"
> 
> It takes a list of probabilities in the range 0.00-1.00, and it must 
> then randomly select one of the probabilities and return it's index.
> 
> The idea is to have two lists, one with a value, and another with a 
> probability. The value then gets randomly selected from its probability.
> 
> values = ['item 1','item 2','item 3',]
> probabilities = [0.66, 0.33, 0.16]

Here is a simplified version of the approach you took:

def select(tab, random=random.random):
    cp = 0
    r = random()
    for i, p in enumerate(tab):
        cp += p
        if cp > r:
            return i
    raise Exception('probabilities do not add upto 1.0')

If you need to call this many times and for large table, there are
ways to speed this up.

* re-arrange the tables to make sure the largest probabilities come
first. Intentionally or not, your sample data is already arranged this
way (though the probabilities add up to more than 1.0 and should be
scaled back to 1.0).

* pre-summarize the summations with a cumulative probability table:
  [.50, .25, .15, .10]   --> [.50, .75, .90, 1.00]

def cselect(ctab, random=random.random):
    r = random()
    for i, cp in enumerate(ctab):
        if cp > r:
            return i
    raise Exception

* if the table is cumulative and large, bisection offers a faster
search:

def cselect(ctab, random=random.random, bisect=bisect.bisect):
    return bisect(ctab, random())

* if the probabilities come in integer ratios, you could reduce the
problem to an O(1) lookup:

  [.5, .3, .1, .1] --> [0,0,0,0,0,1,1,1,2,3]

def lselect(ltab, choice=random.choice):
    return choice(ltab)

IOW, if speed is what is important, then be sure to exploit any known
structure in the data (ordering, cumulativeness, integer rations, etc)
and don't compute anything twice (pre-compute the summation).

Raymond Hettinger