[issue37682] random.sample should support iterators

Fri Jul 26 02:09:21 EDT 2019

Raymond Hettinger <raymond.hettinger at gmail.com> added the comment:

ISTM that if a generator produces so much data that it is infeasible to fit in memory, then it will also take a long time to loop over it and generate a random value for each entry.  

FWIW, every time we've looked at reservoir sampling it has been less performant than what we have now.  The calls to randbelow() are the slowest part, so doing more calls makes the overall performance worse.  Also, doing more calls eats more entropy.  

In general, it is okay for functions to accept only sequences if they exploit indexing in some way.  For example, the current approach works great with sample(range(100_000_000_000), k=50).  We really don't have to make everything accept all iterators.  Besides, it is trivially easy to call list() if needed.
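
For illustration, a minimal sketch of the two cases mentioned above: sampling from a huge range() without materializing it, and calling list() on a non-sequence iterator first. The squares() generator is a hypothetical stand-in, not something from this issue.

```python
import random

# range() is a sequence, so sample() can index into it directly;
# the 100 billion values are never materialized in memory.
picks = random.sample(range(100_000_000_000), k=50)

def squares(n):
    """Hypothetical generator used only for this example."""
    for i in range(n):
        yield i * i

# A generator is not a sequence, so materialize it with list() first.
picks_from_gen = random.sample(list(squares(1000)), k=5)
```

The list() call costs memory proportional to the generator's output, which is the trade-off under discussion in this issue.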

Overall, I'm -1 on redesigning the sampling algorithm to accommodate non-sequence iterators.  AFAICT, it isn't important at all and, as Serhiy pointed out, writing your own reservoir sampling is easy to do.  Lastly, the standard library doesn't try to be all things to all people.  It is okay to leave many things for external packages -- we mostly provide a baseline of tools that cover common use cases and defer the rest to the Python ecosystem.
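
As a rough idea of what a hand-written version might look like, here is a sketch of basic reservoir sampling (commonly called Algorithm R). The name reservoir_sample() is hypothetical; it is not the code discussed in this issue and not part of the random module.

```python
import random

def reservoir_sample(iterable, k):
    """Pick k items uniformly from an iterable of unknown length.

    Hypothetical helper: a sketch of Algorithm R, not a stdlib function.
    """
    reservoir = []
    for i, item in enumerate(iterable):
        if i < k:
            # Fill the reservoir with the first k items.
            reservoir.append(item)
        else:
            # Item i replaces a reservoir slot with probability k / (i + 1).
            j = random.randrange(i + 1)
            if j < k:
                reservoir[j] = item
    if len(reservoir) < k:
        raise ValueError("sample larger than population")
    return reservoir

chosen = reservoir_sample((x * x for x in range(1000)), 10)
```

Note that this makes one randrange() call for every item after the first k, which is the extra cost described above. Also, the returned list is a uniform selection but not a random ordering, so shuffle it if order matters.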

----------
assignee:  -> rhettinger
versions:  -Python 2.7, Python 3.5, Python 3.6, Python 3.7, Python 3.8

_______________________________________
Python tracker <report at bugs.python.org>
<https://bugs.python.org/issue37682>
_______________________________________