Probability selection algorithm

Sun Feb 2 23:11:03 EST 2003

On Sat, 01 Feb 2003 14:43:29 -0800, Noah wrote:
> [0.17,  0.79,  1.0]
>     if the random number is in the range 0.0-0.17 then select the first
>     item. if the random number is in the range 0.17-0.79 then select the
>     second item. if the random number is in the range 0.79-1.0 then
>     select the third item.
> 
> That seems like a crufty algorithm. There must be a simpler way.

If your inputs range in the size of the examples you've given, that's
probably the fastest algorithms. Clever algorithms tend to have overhead
that overwhelms the eventual savings in the small cases.

Otherwise, the best algorithm in this case strongly depends on whether
you're going to update these probabilities frequently, or if this is a
one-shot computation that will be used for a long time. If it's a
one-shot, the data structure you describe above is still correct (a list
of the cumulative probabilities), and bisect (which does a binary search?)
is right.

If you are frequently updating the probs, the best thing is to store the
probabilities (and ideally corresponding actions) in a binary tree, with
the addition that each node stores the total probability of the children
children added together on each side. Since order really doesn't matter in
this case, the binary tree need not even be sorted in any particular
order, and if you never remove any nodes, it's even easier to write. (Just
keep it reasonably balanced.) I'd recommend not storing the probabilities
directly, but just storing some weight and keeping track of the total at
the top level of the tree, so you can pick from 1 to X, instead of 0 to 1,
and never need to re-weight all the other probabilities. For extra speed
(but only if you find you need it!), a hash table can be used to index
into the nodes extra-swiftly for updates. This algorithm should scale to
anything you could need; you'll run out of memory long before this runs
slowly. (Constructing the tree in the first place is the slowest part.)