[Python-ideas] random.sample should work better with iterators
Abe Dillon
abedillon at gmail.com
Tue Jun 26 20:36:51 EDT 2018
The docs on random.sample indicate that it works with iterators:
> To choose a sample from a range of integers, use a range()
> <https://docs.python.org/3/library/stdtypes.html#range> object as an
> argument. This is especially fast and space efficient for sampling from a
> large population: sample(range(10000000),k=60).
However, when I try to use iterators other than range, like so:
    random.sample(itertools.product(range(height), range(width)),
                  0.5*height*width)
I get:
TypeError: Population must be a sequence or set. For dicts, use list(d).
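A minimal sketch of the failure and the usual workaround, assuming small hypothetical values for height and width. Materializing the iterator into a list satisfies the sequence requirement, at the cost of holding the whole population in memory. The sample size is also converted to an int here, since random.sample expects an integer count:

```python
import itertools
import random

height, width = 4, 6           # hypothetical grid dimensions
k = int(0.5 * height * width)  # sample size as an integer

# Passing the iterator directly is rejected with a TypeError.
try:
    random.sample(itertools.product(range(height), range(width)), k)
except TypeError as exc:
    print("TypeError:", exc)

# Workaround: build a list first, then sample from it.
population = list(itertools.product(range(height), range(width)))
picked = random.sample(population, k)
print(len(picked))
```

This works, but it defeats the purpose of an iterator when the population is too large to store.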
I don't know if Python Ideas is the right channel for this, but this seems
overly constrained. The inability to handle dictionaries is especially
puzzling.
Randomly sampling from some population is often done because the entire
population is impractically large, which is also a motivation for using
iterators, so it seems natural that one would be able to sample from an
iterator. A naive implementation could use a heap queue:
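    import heapq
    import random

    def stream():
        # Endless supply of random keys in [0.0, 1.0)
        while True:
            yield random.random()

    def sample(population, size):
        # Min-heap of (key, element) pairs; empty tuples are placeholders
        q = [tuple()]*size
        for el in zip(stream(), population):
            # Keep the pairs with the largest keys seen so far
            if el > q[0]:
                heapq.heapreplace(q, el)
        # Drop placeholders left over when the population is smaller than size
        return [el[1] for el in q if el]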
It would also be helpful to add a ratio version of the function:
    def sample(population, size=None, *, ratio=None):
        assert None in (size, ratio), "can't specify both sample size and ratio"
        if ratio:
            return [el for el in population if random.random() < ratio]
        ...
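One property of the ratio form worth illustrating: each element is kept independently with probability ratio, so the size of the result is itself random rather than exactly ratio times the population size. A small sketch, using a hypothetical standalone helper named ratio_sample:

```python
import random

def ratio_sample(population, ratio):
    # Hypothetical helper: keep each element independently with probability `ratio`.
    return [el for el in population if random.random() < ratio]

# Repeated runs over the same 100-element iterator give different lengths.
sizes = [len(ratio_sample(iter(range(100)), 0.5)) for _ in range(5)]
print(sizes)  # lengths are random, around 50 on average
```

If an exact sample size is required, the size form is the one to use; the ratio form only fixes the expected size.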