random.sample with large weighted sample-sets?

Sun Feb 16 09:22:50 EST 2014

On 2014-02-16 04:12, Terry Reedy wrote:
> On 2/15/2014 11:41 PM, Tim Chase wrote:
> >    data = (
> >      ("apple", 20),
> >      ("orange", 50),
> >      ("grape", 30),
> >      )

To Ben, yes, this was just some sample data; the original gets built
from an external (i.e., client-supplied, thus the need to gracefully
support crazy-large numbers) data source and is indeed actually a list
rather than a tuple. When entering static data like this, I often
default to an outer tuple (rather than list) just as a hint/reminder
to myself that I don't expect this to change at runtime (and have
Python yell at me if I accidentally try).

> If you actually start with date in this form, write the few lines
> needed to produce the form below.
> 
> import bisect
> import random
> 
> data = [
>    (0, 'apple'),
>    (20, 'orange'),
>    (70, 'grape'),
> ]
> 
> for i in range(10):
>      r = random.randrange(0, 100)
>      i = bisect.bisect(data, (r, 'zzzzz')) - 1
>      print(data[i][1])

Trying to read what may be implicit assumptions in your code:

1) your code calculates "100" as sum(item[0] for item in data)

2) the data has to be sorted for bisect to work

3) you meant to write "(10, 'apple')" rather than 0.  With my original
example code, a 0-probability shouldn't ever show up in the sampling,
where it looks like it might when using this sample code. In my
particular use case, I can limit/ensure that 0-probability items never
appear in the list, filtering them upon loading.

4) that "zzzzzz" is some arbitrary value that should come after any
string that could appear in the data; perhaps using some custom
"InfinityString" class where everything compared to it is always less
than it.

So it would be

  class InfinityString:
    def __gt__(self, other): True
    __ge__ = __gt__
    def __lt__(self, other): False
    __eq__ = __le__ = __ne__ = __lt__
  infinity_string = InfinityString()
  data = load_data() # list of (quantity, value) tuples
  data.sort()
  total = sum(qty for qty, value in data)
  for i in range(num_to_sample):
    r = random.randrange(0, total)
    i = bisect.bisect(data, (r, infinity_string)) - 1
    use(data[i][1])

Some long-running testing on this code seems to show that if two
items have the same probability, bisect only appears to find the last
one.  Tested with

  data = [
    (10, "apple"),
    (20, "banana"), # I never get any bananas, even after thousands of iterations
    (20, "grape"),
    (50, "orange"),
    ]

Thanks,

-tkc