[Python-ideas] random.sample should work better with iterators

Abe Dillon abedillon at gmail.com
Wed Jun 27 15:50:41 EDT 2018


Let me start off by saying I agree with Tim Peters that it would be best to 
implement these changes in a new function (if ever).

On Tuesday, June 26, 2018 at 8:06:35 PM UTC-5, Steven D'Aprano wrote:
>
> range is not an iterator. 
>

My misunderstanding of the details of range objects was, indeed, a huge 
contributing factor to my confusion. I assumed range was more like a 
generator function when I initially discovered that random.sample doesn't 
permit iterators, however; the reason I'm proposing a version of 
random.sample that accepts iterators is still sound. It's even in the text 
for why one should use a range object:

To choose a sample from a range of integers, use a range() 
> <https://docs.python.org/3/library/stdtypes.html#range> object as an 
> argument. *This is especially fast and space efficient for sampling from 
> a large population*


As I claimed before: A major use-case for sampling is to avoid working with 
an impractically large population. This is also a major use-case for 
iterators.

On Tuesday, June 26, 2018 at 8:06:35 PM UTC-5, Steven D'Aprano wrote:

> > this seems overly constrained. The inability to handle dictionaries is 
> >  especially puzzling. 


> Puzzling in what way? 
>
> If sample() supported dicts, should it return the keys or the values or 
> both? 

Like in all other contexts where a dictionary is treated as a collection, 
it should be treated as a collection of keys. There are plenty of 
precedence of this:

d = dict(zip(names, ages))
chronological_names = sorted(d, key=d.get)
name_list, name_set = list(d), set(d)
print(*d)


On Tuesday, June 26, 2018 at 8:06:35 PM UTC-5, Steven D'Aprano wrote:

> Also consider this: 
>
> https://bugs.python.org/issue33098 
> <https://www.google.com/url?q=https%3A%2F%2Fbugs.python.org%2Fissue33098&sa=D&sntz=1&usg=AFQjCNHNqDR8TzlK_TlKFK5Lh39sfgqLJQ>
>  
>
I respectfully disagree with the conclusion of that issue. It goes against 
the "consenting adults" ethos of Python. As long as the performance 
implications are expressly documented and and maybe even a warning thrown, 
I don't see a reason to prevent people from using a useful function. You 
can't protect programmers from writing inefficient programs. Also, It seems 
like the dict interface could expose a way to get a sequence view of the 
keys. This would be very efficient given the current implementation of 
dictionaries in CPython. So, it's not like it's fundamentally impossible 
for random.choice to work efficiently with dicts, it's more of a 
implementation detail.

On Tuesday, June 26, 2018 at 8:06:35 PM UTC-5, Steven D'Aprano wrote:

> > Randomly sampling from some population is often done because the entire 
> > population is impractically large which is also a motivation for using 
> > iterators, so it seems natural that one would be able to sample from an 
> > iterator. A naive implementation could use a heap queue: 
> > 
> > import heapq 
> > import random 
> > 
> > def stream(): 
> >     while True: yield random.random() 
> > 
> > def sample(population, size): 
> >     q = [tuple()]*size 
> >     for el in zip(stream(), population): 
> >         if el > q[0]: heapq.heapreplace(q, el) 
> >     return [el[1] for el in q if el] 
>
> Is that an improvement over: 
>
> sample(list(itertools.slice(population, size))) 
>
> and if so, please explain.

 
Do you mean: sample(list(itertools.islice(population, size), size)?
If so, then I'll refer you to Tim Peter's response, otherwise: please 
clarify what you meant.

On Tuesday, June 26, 2018 at 8:06:35 PM UTC-5, Steven D'Aprano wrote:

> > It would also be helpful to add a ratio version of the function: 
> > 
> > def sample(population, size=None, *, ratio=None): 
> >     assert None in (size, ratio), "can't specify both sample size and 
> ratio" 
> >     if ratio: 
> >         return [el for el in population if random.random() < ratio] 
> >     ... 
>
> Helpful under what circumstances? 

 
I wasn't aware of the linear-time reservoir sampling algorithms that Tim 
Peters suggested. Those make the ratio proposal less helpful.
As you can see from the implementation I proposed, the ratio would be able 
to work with iterators of undetermined size in linear time, however; it 
wouldn't satisfy the valid subsampling criteria (unless you shuffle the 
output) and it would only return *roughly* ratio*len(population) elements 
instead of an exact number.

On Tuesday, June 26, 2018 at 8:06:35 PM UTC-5, Steven D'Aprano wrote:

> Don't let the source speak for itself. Explain what it means. I 
> understand what sample(population, size=100) does. What would 
> sample(population, ratio=0.25) do? 

 
It would return a sample of roughly 25% of the population.

[Stephen J. Turnbull]

> I argue below that *if* we were going to make the change, it should be 
> to consistently try list() on non-sequences.  But "not every 
> one-liner" and EIBTI: 

 
Converting the input to a list is exactly what I'm trying to avoid. I'd 
like to sample from an enormous file that won't fit in memory or populate 
5% of a large game-of-life grid without using up gigabytes of memory:

width, height, ratio = 100000, 100000, 0.05

live_set = {*random.sample(itertools.product(range(height), range(width)) , 
ratio*width*height)}

-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://mail.python.org/pipermail/python-ideas/attachments/20180627/b9d88e24/attachment.html>


More information about the Python-ideas mailing list