[issue41311] Add a function to get a random sample from an iterable (reservoir sampling)

Raymond Hettinger report at bugs.python.org
Fri Jul 17 19:34:03 EDT 2020


Raymond Hettinger <raymond.hettinger at gmail.com> added the comment:

I've put more thought into the proposal and am going to recommend against it.

At its heart, this a CPython optimization to take advantage of list() being slower than a handful of islice() calls.  It also gains a speed benefit by dropping the antibias logic because random() is faster than _randbelow().  IMO, this doesn't warrant an API extension.  I'm not looking forward to years of teaching people that there are two separate APIs for sampling without replacement and that the second one is almost never what they should use.

A few years ago, GvR rejected adding a pre-sizing argument to dicts even though there were some cases where it gave improved performance.  His rationale was that it was challenging for a user to know when they were better off and when they weren't.  It added a new complication that easily led to suboptimal choices. IMO, this new API puts the users in a similar situation.  There are a number of cases where a person is worse off, sometimes much worse off.

This new code runs O(n) instead of O(k).  It eats more entropy. It loses the the antibias protections.  The API makes it less explicit that the entire input iterable is consumed.  It can only be beneficial is the input is not a sequence.  When k gets bigger, the repeated calls to islice() become more expensive than a single call to list.   And given the math library functions involved, I not even sure that this code can guarantee it gives the same results across platforms.

Even if the user makes a correct initial decision about which API to use, the decision can become invalidated when the population sizes or sample sizes change over time.

Lastly, giving users choices between two substantially similar tools typically makes them worse off.  It creates a new burden to learn, remember, and distinguish the two.  It's really nice that we currently have just one sample() and that it behaves well across a broad range of cases — you generally get a good result without having to think about it.  Presumably, that was the wisdom behind having one-way-to-do-it.

----------

_______________________________________
Python tracker <report at bugs.python.org>
<https://bugs.python.org/issue41311>
_______________________________________


More information about the Python-bugs-list mailing list