random.sample with large weighted sample-sets?
Tim Chase
python.list at tim.thechases.com
Sun Feb 16 09:22:50 EST 2014
On 2014-02-16 04:12, Terry Reedy wrote:
> On 2/15/2014 11:41 PM, Tim Chase wrote:
> > data = (
> > ("apple", 20),
> > ("orange", 50),
> > ("grape", 30),
> > )
To Ben, yes, this was just some sample data; the original gets built
from an external (i.e., client-supplied, thus the need to gracefully
support crazy-large numbers) data source and is indeed actually a list
rather than a tuple. When entering static data like this, I often
default to an outer tuple (rather than list) just as a hint/reminder
to myself that I don't expect this to change at runtime (and have
Python yell at me if I accidentally try).
> If you actually start with date in this form, write the few lines
> needed to produce the form below.
>
> import bisect
> import random
>
> data = [
> (0, 'apple'),
> (20, 'orange'),
> (70, 'grape'),
> ]
>
> for i in range(10):
> r = random.randrange(0, 100)
> i = bisect.bisect(data, (r, 'zzzzz')) - 1
> print(data[i][1])
Trying to read what may be implicit assumptions in your code:
1) your code calculates "100" as sum(item[0] for item in data)
2) the data has to be sorted for bisect to work
3) you meant to write "(10, 'apple')" rather than 0. With my original
example code, a 0-probability shouldn't ever show up in the sampling,
where it looks like it might when using this sample code. In my
particular use case, I can limit/ensure that 0-probability items never
appear in the list, filtering them upon loading.
4) that "zzzzzz" is some arbitrary value that should come after any
string that could appear in the data; perhaps using some custom
"InfinityString" class where everything compared to it is always less
than it.
So it would be
class InfinityString:
def __gt__(self, other): True
__ge__ = __gt__
def __lt__(self, other): False
__eq__ = __le__ = __ne__ = __lt__
infinity_string = InfinityString()
data = load_data() # list of (quantity, value) tuples
data.sort()
total = sum(qty for qty, value in data)
for i in range(num_to_sample):
r = random.randrange(0, total)
i = bisect.bisect(data, (r, infinity_string)) - 1
use(data[i][1])
Some long-running testing on this code seems to show that if two
items have the same probability, bisect only appears to find the last
one. Tested with
data = [
(10, "apple"),
(20, "banana"), # I never get any bananas, even after thousands of iterations
(20, "grape"),
(50, "orange"),
]
Thanks,
-tkc
More information about the Python-list
mailing list