random.sample with large weighted sample-sets?

Sun Feb 16 11:12:57 EST 2014

Tim Chase wrote:

> On 2014-02-16 04:12, Terry Reedy wrote:
>> On 2/15/2014 11:41 PM, Tim Chase wrote:
>> >    data = (
>> >      ("apple", 20),
>> >      ("orange", 50),
>> >      ("grape", 30),
>> >      )
> 
> To Ben, yes, this was just some sample data; the original gets built
> from an external (i.e., client-supplied, thus the need to gracefully
> support crazy-large numbers) data source and is indeed actually a list
> rather than a tuple. When entering static data like this, I often
> default to an outer tuple (rather than list) just as a hint/reminder
> to myself that I don't expect this to change at runtime (and have
> Python yell at me if I accidentally try).
> 
>> If you actually start with date in this form, write the few lines
>> needed to produce the form below.
>> 
>> import bisect
>> import random
>> 
>> data = [
>>    (0, 'apple'),
>>    (20, 'orange'),
>>    (70, 'grape'),
>> ]
>> 
>> for i in range(10):
>>      r = random.randrange(0, 100)
>>      i = bisect.bisect(data, (r, 'zzzzz')) - 1
>>      print(data[i][1])
> 
> Trying to read what may be implicit assumptions in your code:
> 
> 1) your code calculates "100" as sum(item[0] for item in data)
> 
> 2) the data has to be sorted for bisect to work
> 
> 3) you meant to write "(10, 'apple')" rather than 0.  With my original
> example code, a 0-probability shouldn't ever show up in the sampling,
> where it looks like it might when using this sample code. In my
> particular use case, I can limit/ensure that 0-probability items never
> appear in the list, filtering them upon loading.
> 
> 4) that "zzzzzz" is some arbitrary value that should come after any
> string that could appear in the data; perhaps using some custom
> "InfinityString" class where everything compared to it is always less
> than it.
> 
> So it would be
> 
>   class InfinityString:
>     def __gt__(self, other): True
>     __ge__ = __gt__
>     def __lt__(self, other): False
>     __eq__ = __le__ = __ne__ = __lt__
>   infinity_string = InfinityString()
>   data = load_data() # list of (quantity, value) tuples
>   data.sort()
>   total = sum(qty for qty, value in data)
>   for i in range(num_to_sample):
>     r = random.randrange(0, total)
>     i = bisect.bisect(data, (r, infinity_string)) - 1
>     use(data[i][1])
> 
> Some long-running testing on this code seems to show that if two
> items have the same probability, bisect only appears to find the last
> one.  Tested with
> 
>   data = [
>     (10, "apple"),
>     (20, "banana"), # I never get any bananas, even after thousands of
>     iterations (20, "grape"),
>     (50, "orange"),
>     ]

I think it becomes simpler if you make an intermediate list with the 
cumulated probabilities. You can then avoid the InfinityString gymnastics:

import random, bisect

def cumulated(probs):
    sigma = 0
    cumprobs = []
    for p in probs:
        sigma += p
        cumprobs.append(sigma)
    return cumprobs

def pick(cumprobs):
    return bisect.bisect(cumprobs, random.randrange(cumprobs[-1]))

data = [
    (10, "apple"),
    (20, "banana"),
    (20, "grape"),
    (50, "orange"),
    ]

cumprobs = cumulated(p for p, k in data)

# check for off-by-one bugs
bins = [0] * len(cumprobs)
for i in range(cumprobs[-1]):
    bins[bisect.bisect(cumprobs, i)] += 1
assert bins == [p for p, k in data]

# use it
bins = [0] * len(cumprobs)
for i in range(10000):
    bins[pick(cumprobs)] += 1

for item, bin in zip(data, bins):
    print("{} --> {}".format(item, bin))