[Python-ideas] random.sample should work better with iterators

Abe Dillon abedillon at gmail.com
Tue Jun 26 20:36:51 EDT 2018


The docs on random.sample indicate that it works with iterators:

> To choose a sample from a range of integers, use a range() 
> <https://docs.python.org/3/library/stdtypes.html#range> object as an 
> argument. This is especially fast and space efficient for sampling from a 
> large population: sample(range(10000000),k=60).


However, when I try to use iterators other than range, like so:

random.sample(itertools.product(range(height), range(width)),
              int(0.5*height*width))

I get:

TypeError: Population must be a sequence or set. For dicts, use list(d).
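
For this particular grid case, the range() advice quoted from the docs can be applied directly. The following is only a sketch of a workaround: sample flat indices from a range and map each one back to a coordinate pair. The name sample_grid is hypothetical, not part of the random module.

```python
import random

def sample_grid(height, width, k):
    """Sample k distinct (row, col) cells without building the full grid."""
    # range() is a lazy sequence, so random.sample accepts it directly
    indices = random.sample(range(height * width), k)
    # divmod maps each flat index back to its (row, col) pair
    return [divmod(i, width) for i in indices]

cells = sample_grid(4, 6, 12)
```

This only helps when the population can be indexed arithmetically; it does not address arbitrary iterators.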

I don't know if Python Ideas is the right channel for this, but this seems 
overly constrained. The inability to handle dictionaries is especially 
puzzling.

Randomly sampling from some population is often done because the entire 
population is impractically large, which is also a motivation for using 
iterators, so it seems natural that one would be able to sample from an 
iterator. A naive implementation could use a heap queue:

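import heapq
import random

def stream():
    # Endless supply of random keys in [0.0, 1.0)
    while True:
        yield random.random()

def sample(population, size):
    # Min-heap of (key, element) pairs; empty tuples sort below any pair
    q = [tuple()]*size
    for el in zip(stream(), population):
        # Keep the pairs with the largest keys seen so far
        if el > q[0]:
            heapq.heapreplace(q, el)
    # Drop unused placeholders when the population has fewer than size items
    return [el[1] for el in q if el]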
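
For comparison, a different single-pass technique is reservoir sampling (Algorithm R), which avoids the heap and the random keys. This is a minimal sketch rather than a proposed implementation, and the name reservoir_sample is hypothetical.

```python
import itertools
import random

def reservoir_sample(iterable, size):
    """Pick up to `size` items uniformly from an iterable in one pass."""
    reservoir = []
    for i, item in enumerate(iterable):
        if i < size:
            # Fill the reservoir with the first `size` items
            reservoir.append(item)
        else:
            # randint is inclusive on both ends, so j is uniform over
            # the i + 1 items seen so far; the new item is kept with
            # probability size / (i + 1)
            j = random.randint(0, i)
            if j < size:
                reservoir[j] = item
    return reservoir

# Works on an iterator that random.sample rejects
pairs = reservoir_sample(itertools.product(range(3), range(4)), 5)
```

Note that the order of the result is not itself random (early items tend to stay in their original slots), so the list would need shuffling if a random ordering matters.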

It would also be helpful to add a ratio version of the function: 

def sample(population, size=None, *, ratio=None):
    assert None in (size, ratio), "can't specify both sample size and ratio"
    if ratio is not None:
        return [el for el in population if random.random() < ratio]
    ...
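
The ratio idea can also be sketched on its own as a lazy filter, so that the result never has to be held in memory either. The name bernoulli_sample and the range check are assumptions made for illustration, not part of the proposal above.

```python
import random

def bernoulli_sample(population, ratio):
    """Lazily keep each item independently with probability `ratio`."""
    if not 0.0 <= ratio <= 1.0:
        raise ValueError("ratio must be between 0 and 1")
    # random.random() is in [0.0, 1.0), so ratio=1.0 keeps every item
    # and ratio=0.0 keeps none
    return (el for el in population if random.random() < ratio)

# The number of items kept varies from run to run; only its
# expected value is ratio * len(population)
kept = list(bernoulli_sample(range(1000), 0.5))
```

Unlike a fixed-size sample, this gives a sample whose size is only approximately ratio times the population size.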

