[Python-ideas] random.sample should work better with iterators

Franklin Lee leewangzhong+python at gmail.com
Wed Jun 27 12:58:14 EDT 2018


On Wed, Jun 27, 2018 at 3:11 AM, Antoine Pitrou <solipsis at pitrou.net> wrote:
> On Tue, 26 Jun 2018 23:52:55 -0500
> Tim Peters <tim.peters at gmail.com> wrote:
>>
>> In Python today, the easiest way to spell Abe's intent is, e.g.,
>>
>> >>> from heapq import nlargest # or nsmallest - doesn't matter
>> >>> from random import random
>> >>> nlargest(4, (i for i in range(100000)), key=lambda x: random())
>> [75260, 45880, 99486, 13478]
>> >>> nlargest(4, (i for i in range(100000)), key=lambda x: random())
>> [31732, 72288, 26584, 72672]
>> >>> nlargest(4, (i for i in range(100000)), key=lambda x: random())
>> [14180, 86084, 22639, 2004]
>>
>> That also arranges to preserve `sample()'s promise that all sub-slices of
>> the result are valid random samples too (because `nlargest` sorts by the
>> randomly generated keys before returning the list).
>
> How could slicing return an invalid random sample?

If the sample isn't randomly ordered. For example, if the returned sample is sorted, its first element is biased toward small population values, so a slice like result[:1] is no longer a uniform random sample of size 1.

from random import shuffle

def sample(population, k):
    population = list(population)
    shuffle(population)
    return sorted(population[:k])  # No, don't sort: it breaks the sub-slice promise
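
For an iterator-friendly sample that keeps that promise, one option (just a sketch, not something proposed in the thread) is reservoir sampling followed by a shuffle; the shuffle is what restores random order so every sub-slice is again a valid random sample. The name `sample_iter` is purely illustrative:

from random import randrange, shuffle

def sample_iter(iterable, k):
    # Reservoir sampling (Algorithm R): keep a uniform k-sample of a
    # stream of unknown length using O(k) memory.
    reservoir = []
    for i, item in enumerate(iterable):
        if i < k:
            reservoir.append(item)
        else:
            # Replace a reservoir slot with probability k / (i + 1).
            j = randrange(i + 1)
            if j < k:
                reservoir[j] = item
    if len(reservoir) < k:
        raise ValueError("sample larger than population")
    # Algorithm R does not leave the reservoir in random order, so
    # shuffle before returning to keep sub-slices valid random samples.
    shuffle(reservoir)
    return reservoir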

