implement random selection in Python

duncan smith buzzard at urubu.freeserve.co.uk
Fri Nov 16 21:39:40 EST 2007


Bruza wrote:
> On Nov 16, 4:47 pm, Bruza <benr... at gmail.com> wrote:
>> On Nov 16, 6:58 am, duncan smith <buzz... at urubu.freeserve.co.uk>
>> wrote:
>>
>>
>>
>>> Bruza wrote:
>>>> I need to implement a "random selection" algorithm which takes a list
>>>> of [(obj, prob),...] as input. Each (obj, prob) pair represents how
>>>> likely an object, "obj", is to be selected, based on its probability,
>>>> "prob". To simplify the problem, assume the "prob" values are integers
>>>> and the sum of all "prob" equals 100. For example,
>>>>    items = [('Mary',30), ('John', 10), ('Tom', 45), ('Jane', 15)]
>>>> The algorithm will take a number "N", and a [(obj, prob),...] list as
>>>> inputs, and randomly pick "N" objects based on the probabilities of
>>>> the
>>>> objects in the list.
>>>> For N=1 this is pretty simple; the following code is sufficient to do
>>>> the job.
>>>> def foo(items):
>>>>     index = random.randint(0, 99)
>>>>     currentP = 0
>>>>     for (obj, p) in items:
>>>>        currentP += p
>>>>        if currentP > index:
>>>>           return obj
>>>> But how about the general case, for N > 1 and N < len(items)? Is there
>>>> some clever algorithm using Python standard "random" package to do
>>>> the trick?
>>> I think you need to clarify what you want to do.  The "probs" are
>>> clearly not probabilities.  Are they counts of items?  Are you then
>>> sampling without replacement?  When you say N < len(items) do you mean N
>>> <= sum of the "probs"?
>>> Duncan
>> I think I need to explain the probability part: the "prob" is a
>> relative likelihood that the object will be included in the output
>> list. So, in my example input of
>>
>>   items = [('Mary',30), ('John', 10), ('Tom', 45), ('Jane', 15)]
>>
>> So, for any size of N, 'Tom' (with prob of 45) will be more likely to
>> be included in the output list of N distinct members than 'Mary' (prob
>> of 30), and much more likely than 'John' (with prob of 10).
>>
>> I know "prob" is not exactly the "probability" in the context of
>> returning a multiple member list. But what I want is a way to "favor"
>> some member in a selection process.
>>
>> So far, Boris's solution is the closest (but not quite) to what I
>> need; it returns a list of N distinct objects from the input
>> "items". However, I tried with an input of
>>
>>    items = [('Mary',1), ('John', 1), ('Tom', 1), ('Jane', 97)]
>>
>> and have a repeated calling of
>>
>> Ben
> 
> OOPS. I pressed the Send too fast.
> 
> The problem with Boris's solution is that after repeated calls of
> randomPick(3, items), 'Jane' is not the most frequently appearing
> member across all of the output lists of 3 members...
> 

So it looks like you are sampling without replacement.  For large N you
don't want to be generating N individual observations.  Do you want
something like the following?

>>> import randvars
>>> randvars.multinomial_variate([0.3, 0.1, 0.45, 0.15], 100)
[34, 8, 40, 18]
>>> randvars.multinomial_variate([0.3, 0.1, 0.45, 0.15], 10000)
[2984, 1003, 4511, 1502]
>>> randvars.multinomial_variate([0.3, 0.1, 0.45, 0.15], 1000000)
[300068, 99682, 450573, 149677]
>>>

The algorithm I use is to generate independent Poisson variates for each
category, with Poisson parameters proportional to the probabilities.
This won't usually give you a total of N, but the sample can be adjusted
by adding / subtracting individual observations.  In fact, I choose a
constant of proportionality that aims to get the total close to, but not
greater than N.  I repeatedly generate Poisson counts until the sample
size is not greater than N, then adjust by adding individual
observations.  (This algorithm might be in Knuth, and might be due to
Ahrens and Dieter.  I can't remember the exact source.)

An efficient way to generate the individual observations is to construct
a cumulative distribution function and use binary search.  I know the
code for this has been posted before.  An alias table should generally
be faster, but my Python-coded effort is only similarly fast.
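
For concreteness, a minimal sketch of the cumulative distribution plus
binary search approach might look like the following (weighted_pick is
just an illustrative name, not previously posted code):

import bisect
import random

def weighted_pick(items):
    # items is a list of (obj, weight) pairs; pick a single obj with
    # probability proportional to its weight.
    objects = []
    cumulative = []
    total = 0
    for obj, weight in items:
        total += weight
        objects.append(obj)
        cumulative.append(total)
    # Uniform point in [0, total); find the first cumulative weight
    # strictly greater than it.
    r = random.random() * total
    return objects[bisect.bisect_right(cumulative, r)]

So weighted_pick([('Mary', 30), ('John', 10), ('Tom', 45), ('Jane', 15)])
returns a single name, with 'Tom' the most likely.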

Unfortunately generating Poisson variates isn't straightforward, and the
approach I use requires the generation of gamma and beta variates (both
available in the random module though).  This approach is due to Ahrens
and Dieter and is in Knuth, Vol.2.  So yes, there is a clever algorithm
that uses functions in the random module (Knuth, Ahrens and Dieter are
quite clever).  But it's not simple.
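
By way of illustration, a rough, untuned sketch of that recursive
scheme (gamma variates for the Poisson part, beta variates for the
binomial part, with small cases handled directly) might look like the
code below.  Treat it as a sketch from memory rather than a faithful
transcription of Knuth:

import math
import random

def binomial_variate(n, p):
    # Binomial(n, p) via beta variates; recursive scheme from Knuth,
    # Vol. 2, falling back to Bernoulli trials for small n.
    count = 0
    while n > 10:
        a = 1 + n // 2
        b = n + 1 - a
        # x is distributed as the a-th smallest of n uniforms
        x = random.betavariate(a, b)
        if x >= p:
            n = a - 1
            p = p / x
        else:
            count += a
            n = n - a
            p = (p - x) / (1.0 - x)
    return count + sum(1 for _ in range(n) if random.random() < p)

def poisson_variate(mu):
    # Poisson(mu) via gamma (and beta) variates, falling back to the
    # multiplication method for small means.
    count = 0
    while mu > 10:
        m = int(7.0 / 8.0 * mu)
        # waiting time to the m-th event of a unit-rate Poisson process
        x = random.gammavariate(m, 1.0)
        if x < mu:
            count += m          # at least m events by "time" mu
            mu = mu - x
        else:
            # fewer than m events; given the m-th arrival at x, the
            # first m - 1 arrivals are uniform on (0, x)
            return count + binomial_variate(m - 1, mu / x)
    limit = math.exp(-mu)
    k = 0
    prod = random.random()
    while prod > limit:
        k += 1
        prod *= random.random()
    return count + k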

The numpy random module has a function for Poisson variates, which would
make things easier for you (I don't off-hand know which algorithm is
used).  Then it's just a case of choosing an appropriate constant of
proportionality for the Poisson parameters.  Try N - k, where k is the
number of categories, for k <= N**0.5; otherwise, N - N**0.5 - (k -
N**0.5)**0.5.
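
Putting the pieces together, a sketch of the whole scheme (Poisson
counts with the above constant, regenerated until the total does not
exceed N, then topped up with individual observations via the
cumulative distribution method) could look something like this;
multinomial_sample is just an illustrative name, and probs is assumed
to sum to 1:

import bisect
import random

import numpy

def multinomial_sample(probs, n):
    # Returns a list of counts, one per category, summing to exactly n.
    k = len(probs)
    # constant of proportionality suggested above
    if k <= n ** 0.5:
        c = n - k
    else:
        c = n - n ** 0.5 - (k - n ** 0.5) ** 0.5
    c = max(c, 0.0)  # guard for tiny n (not part of the rule above)
    # regenerate until the Poisson counts do not overshoot n
    while True:
        counts = [int(numpy.random.poisson(c * p)) for p in probs]
        total = sum(counts)
        if total <= n:
            break
    # top up to exactly n with individual observations (cumulative
    # distribution plus binary search, as above)
    cumulative = []
    running = 0.0
    for p in probs:
        running += p
        cumulative.append(running)
    for _ in range(n - total):
        i = bisect.bisect_right(cumulative, random.random() * running)
        counts[i] += 1
    return counts

As with multinomial_variate above, the result is a list of counts in
the same order as probs, which you can zip back against the objects.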

Duncan


