random.sample with large weighted sample-sets?
duncan smith
buzzard at invalid.invalid
Sun Feb 16 11:01:53 EST 2014
On 16/02/14 05:08, Ben Finney wrote:
> Tim Chase <python.list at tim.thechases.com> writes:
>
>> I'm not coming up with the right keywords to find what I'm hunting.
>> I'd like to randomly sample a modestly compact list with weighted
>> distributions, so I might have
>>
>> data = (
>> ("apple", 20),
>> ("orange", 50),
>> ("grape", 30),
>> )
>
> That's not a list, it's a tuple. I think you want a list.
>
> When you want a sequence where each position has a semantic meaning, use
> a tuple (such as ‘("apple", 20)’). Each item has a meaning *because of*
> the position it's in; if the items were in a different order, they'd
> mean different things.
>
> When you want a sequence where the positions don't have a special
> meaning – each item means exactly the same no matter if you change the
> order – that's sometimes called a “homogeneous” sequence, and you want a
> list.
>
> So a “record” should be represented as a tuple, and a “table” of records
> should be represented as a list of tuples:
>
> records = [
> ("apple", 20),
> ("orange", 50),
> ("grape", 30),
> ]
>
>> and I'd like to random.sample() it as if it was a 100-element list.
>
[snip]
That's a description of sampling without replacement. The probabilities
change as items are sampled. e.g. The probability of the first item
being "apple"is 20/100. But the probability that the second sampled item
is "apple" is either 19/99 or 20/99, depending on the value of the first
sampled item. The following (due to Knuth) will generate indices into a
notional list of items.
def indices(n, pop):
# generates indices into a
# population list containing
# items with frequencies in pop
# [("apple", 10), ("orange", 50), ...]
N = sum(tup[1] for tup in pop)
i = m = 0
while m < n:
u = random.random()
if (N-i)*u >= n-m:
i += 1
else:
yield i
i += 1
m += 1
>>> list(indices(3, [("apple", 20),("orange", 50),("grape", 30)]))
[8, 27, 78]
>>>
The indices are generated in order, so it could easily be extended to
generate items or item count pairs.
There might be something more efficient based on the hypergeometric
distribution (generate a number of apples, then a number of oranges
given the number of sampled apples, then a number of grapes given the
number of sampled apples and oranges, etc.).
Duncan
More information about the Python-list
mailing list