random.sample with large weighted sample-sets?

Sun Feb 16 11:01:53 EST 2014

On 16/02/14 05:08, Ben Finney wrote:
> Tim Chase <python.list at tim.thechases.com> writes:
>
>> I'm not coming up with the right keywords to find what I'm hunting.
>> I'd like to randomly sample a modestly compact list with weighted
>> distributions, so I might have
>>
>>    data = (
>>      ("apple", 20),
>>      ("orange", 50),
>>      ("grape", 30),
>>      )
>
> That's not a list, it's a tuple. I think you want a list.
>
> When you want a sequence where each position has a semantic meaning, use
> a tuple (such as ‘("apple", 20)’). Each item has a meaning *because of*
> the position it's in; if the items were in a different order, they'd
> mean different things.
>
> When you want a sequence where the positions don't have a special
> meaning – each item means exactly the same no matter if you change the
> order – that's sometimes called a “homogeneous” sequence, and you want a
> list.
>
> So a “record” should be represented as a tuple, and a “table” of records
> should be represented as a list of tuples:
>
>      records = [
>              ("apple", 20),
>              ("orange", 50),
>              ("grape", 30),
>              ]
>
>> and I'd like to random.sample() it as if it was a 100-element list.
>

[snip]

That's a description of sampling without replacement. The probabilities
change as items are sampled, e.g. the probability of the first item
being "apple" is 20/100, but the probability that the second sampled item
is "apple" is either 19/99 or 20/99, depending on whether the first
sampled item was an apple. The following (selection sampling, due to
Knuth) will generate indices into a notional list of items.

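import random

def indices(n, pop):
    # Generate n increasing indices, sampled without replacement, into
    # a notional list expanded from pop, a sequence of (item, frequency)
    # pairs such as [("apple", 20), ("orange", 50), ...].
    N = sum(tup[1] for tup in pop)   # size of the notional list
    i = m = 0                        # i: indices seen, m: indices chosen
    while m < n:
        u = random.random()
        if (N-i)*u >= n-m:
            i += 1                   # skip index i
        else:
            yield i                  # select index i
            i += 1
            m += 1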

>>> list(indices(3, [("apple", 20),("orange", 50),("grape", 30)]))
[8, 27, 78]
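
For a population this small, a simple cross-check is to build the
100-element list literally and hand it to random.sample(). This is only
a sketch for comparison: sample_expanded is a name I made up, and it
uses memory proportional to the sum of the frequencies, which is exactly
what the generator above avoids.

```python
import random

def sample_expanded(n, pop, rng=random):
    # Hypothetical helper: materialise the notional list, one entry per
    # unit of frequency, then sample from it without replacement.
    expanded = [item for item, count in pop for _ in range(count)]
    return rng.sample(expanded, n)

if __name__ == "__main__":
    data = [("apple", 20), ("orange", 50), ("grape", 30)]
    print(sample_expanded(3, data))   # output varies from run to run
```

Unlike the index generator, the result here comes back in random order
rather than sorted by position.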

The indices are generated in increasing order, so it could easily be
extended to generate items or (item, count) pairs.
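
One way that extension might look, as a rough sketch: because the
indices only ever increase, a single forward pass over pop is enough to
map each index to its item. The names items and item_counts, and the
rng parameter, are my own additions; the generator is repeated so the
snippet runs on its own. It assumes n is no larger than the total
frequency.

```python
import random
from itertools import groupby

def indices(n, pop, rng=random):
    # Same selection-sampling generator as in the post.
    N = sum(count for _, count in pop)
    i = m = 0
    while m < n:
        if (N - i) * rng.random() >= n - m:
            i += 1
        else:
            yield i
            i += 1
            m += 1

def items(n, pop, rng=random):
    # Map each increasing index to the item whose range contains it.
    pos = 0                            # current position in pop
    upper = pop[0][1] if pop else 0    # first index past pop[pos]
    for idx in indices(n, pop, rng):
        while idx >= upper:            # advance to the right item
            pos += 1
            upper += pop[pos][1]
        yield pop[pos][0]

def item_counts(n, pop, rng=random):
    # Equal items come out adjacent, so groupby can tally them.
    # Items that were not sampled at all are omitted.
    return [(item, sum(1 for _ in group))
            for item, group in groupby(items(n, pop, rng))]

if __name__ == "__main__":
    data = [("apple", 20), ("orange", 50), ("grape", 30)]
    print(list(items(3, data)))    # output varies from run to run
    print(item_counts(10, data))
```

The items come out grouped in the order they appear in pop, so shuffle
the result if the order of the sample matters.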

There might be something more efficient based on the hypergeometric 
distribution (generate a number of apples, then a number of oranges 
given the number of sampled apples, then a number of grapes given the 
number of sampled apples and oranges, etc.).
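
A minimal sketch of that idea, with no claim that it is the fastest way
to do it. The random module has no hypergeometric generator, so the
helper below (hypergeometric, my name) draws one by exact inversion of
the counting formula using math.comb, which needs Python 3.8 or later.
sample_counts and the rng parameter are likewise just names for this
sketch.

```python
import random
from math import comb

def hypergeometric(good, bad, draws, rng=random):
    # Number of "good" items among `draws` items drawn without
    # replacement from good + bad items. Exact integer inversion:
    # pick one of the comb(good + bad, draws) equally likely samples
    # and find which count of good items it falls under.
    r = rng.randrange(comb(good + bad, draws))
    acc = 0
    for k in range(max(0, draws - bad), min(good, draws) + 1):
        acc += comb(good, k) * comb(bad, draws - k)
        if r < acc:
            return k
    raise AssertionError("unreachable: the terms sum to the total")

def sample_counts(n, pop, rng=random):
    # Return [(item, number sampled), ...] for a sample of size n.
    remaining = sum(count for _, count in pop)
    if not 0 <= n <= remaining:
        raise ValueError("sample size out of range")
    result = []
    for item, count in pop:
        remaining -= count    # items in the categories still to come
        k = hypergeometric(count, remaining, n, rng)
        result.append((item, k))
        n -= k                # draws left for the later categories
    return result

if __name__ == "__main__":
    data = [("apple", 20), ("orange", 50), ("grape", 30)]
    print(sample_counts(3, data))   # output varies from run to run
```

This makes one hypergeometric draw per distinct item rather than one
random number per index considered, which is where any saving would
come from when the frequencies are large and the items few.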

Duncan