[Python-ideas] random.sample should work better with iterators
Abe Dillon
abedillon at gmail.com
Wed Jun 27 15:50:41 EDT 2018
Let me start off by saying I agree with Tim Peters that it would be best to
implement these changes in a new function (if ever).
On Tuesday, June 26, 2018 at 8:06:35 PM UTC-5, Steven D'Aprano wrote:
>
> range is not an iterator.
>
My misunderstanding of the details of range objects was, indeed, a huge
contributing factor to my confusion. I assumed range was more like a
generator function when I initially discovered that random.sample doesn't
permit iterators. However, the reasoning behind proposing a version of
random.sample that accepts iterators is still sound. It's even in the
documentation's text for why one should use a range object:
To choose a sample from a range of integers, use a range()
> <https://docs.python.org/3/library/stdtypes.html#range> object as an
> argument. *This is especially fast and space efficient for sampling from
> a large population*
As I claimed before: A major use-case for sampling is to avoid working with
an impractically large population. This is also a major use-case for
iterators.
On Tuesday, June 26, 2018 at 8:06:35 PM UTC-5, Steven D'Aprano wrote:
> > this seems overly constrained. The inability to handle dictionaries is
> > especially puzzling.
> Puzzling in what way?
>
> If sample() supported dicts, should it return the keys or the values or
> both?
Like in all other contexts where a dictionary is treated as a collection,
it should be treated as a collection of keys. There is plenty of
precedent for this:
d = dict(zip(names, ages))
chronological_names = sorted(d, key=d.get)
name_list, name_set = list(d), set(d)
print(*d)
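To make the claim concrete, here is a minimal sketch of what key-based dict
support could look like (the `sample_from` wrapper is hypothetical, not part
of the stdlib; it just mirrors how list(), set(), sorted(), and iteration
already treat a dict as its keys):

```python
import random

def sample_from(population, k):
    # Hypothetical wrapper: treat a dict as a collection of its keys,
    # mirroring how sorted(), list(), set(), and * unpacking handle dicts.
    if isinstance(population, dict):
        population = list(population)  # materializes the keys
    return random.sample(population, k)

d = dict(zip(["ada", "bob", "cyd"], [36, 41, 29]))
picked = sample_from(d, 2)  # two distinct names drawn from d's keys
```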
On Tuesday, June 26, 2018 at 8:06:35 PM UTC-5, Steven D'Aprano wrote:
> Also consider this:
>
> https://bugs.python.org/issue33098
> <https://www.google.com/url?q=https%3A%2F%2Fbugs.python.org%2Fissue33098&sa=D&sntz=1&usg=AFQjCNHNqDR8TzlK_TlKFK5Lh39sfgqLJQ>
>
>
I respectfully disagree with the conclusion of that issue. It goes against
the "consenting adults" ethos of Python. As long as the performance
implications are expressly documented, and perhaps a warning raised,
I don't see a reason to prevent people from using a useful function. You
can't protect programmers from writing inefficient programs. Also, it seems
like the dict interface could expose a way to get a sequence view of the
keys. This would be very efficient given the current implementation of
dictionaries in CPython. So, it's not fundamentally impossible
for random.choice to work efficiently with dicts; it's more of an
implementation detail.
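Today the workaround is to materialize the keys yourself, which incurs
exactly the O(n) copy the issue objects to. A quick sketch of the current
state of affairs:

```python
import random

d = dict(zip("abcdef", range(6)))

# random.choice(d) raises TypeError today; the workaround copies
# all keys into a list first -- O(n) time and space per call:
key = random.choice(list(d))
value = d[key]
```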
On Tuesday, June 26, 2018 at 8:06:35 PM UTC-5, Steven D'Aprano wrote:
> > Randomly sampling from some population is often done because the entire
> > population is impractically large which is also a motivation for using
> > iterators, so it seems natural that one would be able to sample from an
> > iterator. A naive implementation could use a heap queue:
> >
> > import heapq
> > import random
> >
> > def stream():
> >     while True: yield random.random()
> >
> > def sample(population, size):
> >     q = [tuple()] * size
> >     for el in zip(stream(), population):
> >         if el > q[0]: heapq.heapreplace(q, el)
> >     return [el[1] for el in q if el]
>
> Is that an improvement over:
>
> sample(list(itertools.slice(population, size)))
>
> and if so, please explain.
Do you mean: sample(list(itertools.islice(population, size)), size)?
If so, then I'll refer you to Tim Peters' response; otherwise, please
clarify what you meant.
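For reference, a cleaned-up, runnable version of the heap-based idea quoted
above (the names are mine): pair each element with a uniform random key and
keep the `size` pairs with the largest keys.

```python
import heapq
import random

def sample_iter(population, size):
    # Tag each element with a uniform random key; the elements whose
    # keys land in the top `size` form a uniform sample of the stream.
    heap = []  # min-heap of (random_key, element) pairs
    for item in population:
        key = random.random()
        if len(heap) < size:
            heapq.heappush(heap, (key, item))
        elif key > heap[0][0]:
            heapq.heapreplace(heap, (key, item))
    return [item for _, item in heap]

print(sample_iter(iter(range(10**6)), 5))  # five distinct integers
```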
On Tuesday, June 26, 2018 at 8:06:35 PM UTC-5, Steven D'Aprano wrote:
> > It would also be helpful to add a ratio version of the function:
> >
> > def sample(population, size=None, *, ratio=None):
> >     assert None in (size, ratio), "can't specify both sample size and ratio"
> >     if ratio:
> >         return [el for el in population if random.random() < ratio]
> >     ...
>
> Helpful under what circumstances?
I wasn't aware of the linear-time reservoir sampling algorithms that Tim
Peters suggested. Those make the ratio proposal less helpful.
As you can see from the implementation I proposed, the ratio version would
work with iterators of undetermined size in linear time. However, it
wouldn't satisfy the valid-subsampling criterion (unless you shuffle the
output), and it would only return *roughly* ratio*len(population) elements
instead of an exact number.
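For completeness, the classic linear-time reservoir method (Algorithm R,
which I take to be the kind of approach Tim Peters was referring to) draws
an *exact* number of elements from an iterator of unknown length:

```python
import random

def reservoir_sample(iterable, k):
    # Algorithm R: keep the first k items, then replace each existing
    # item with decreasing probability k/(i+1) as later items arrive.
    reservoir = []
    for i, item in enumerate(iterable):
        if i < k:
            reservoir.append(item)
        else:
            j = random.randrange(i + 1)
            if j < k:
                reservoir[j] = item
    if len(reservoir) < k:
        raise ValueError("sample larger than population")
    return reservoir
```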
On Tuesday, June 26, 2018 at 8:06:35 PM UTC-5, Steven D'Aprano wrote:
> Don't let the source speak for itself. Explain what it means. I
> understand what sample(population, size=100) does. What would
> sample(population, ratio=0.25) do?
It would return a sample of roughly 25% of the population.
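Concretely, under the Bernoulli-style implementation sketched earlier, each
element is kept independently with probability `ratio`, so the sample size
is only approximate:

```python
import random

population = range(1000)
# Each element is kept independently with probability 0.25:
subsample = [x for x in population if random.random() < 0.25]
# len(subsample) is close to 250 but varies from run to run
print(len(subsample))
```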
[Stephen J. Turnbull]
> I argue below that *if* we were going to make the change, it should be
> to consistently try list() on non-sequences. But "not every
> one-liner" and EIBTI:
Converting the input to a list is exactly what I'm trying to avoid. I'd
like to sample from an enormous file that won't fit in memory or populate
5% of a large game-of-life grid without using up gigabytes of memory:
width, height, ratio = 100000, 100000, 0.05
live_set = set(random.sample(itertools.product(range(height), range(width)),
                             int(ratio * width * height)))