[issue41311] Add a function to get a random sample from an iterable (reservoir sampling)

Tim Peters report at bugs.python.org
Thu Jul 16 17:23:42 EDT 2020


Tim Peters <tim at python.org> added the comment:

Pro: focus on the "iterable" part of the title. If you want to, e.g., select 3 lines "at random" out of a multi-million-line text file, this kind of reservoir sampling allows to do that holding no more than one line in memory simultaneously. Materializing an iterable in a sequence is sometimes impossible. Yup, it takes O(number of elements) time, but so does anything else that has to look at every element of an iterable (including materializing an iterable in a sequence!).

So this would make most sense as a different function.

Con: over the years we've worked diligently to purge these algorithms of minor biases, leaving them as unbiased as the core generator. Introducing floating-point vagaries goes against that, and results may vary at times due to tiny differences in platform exp() and log() implementations.

Worse, I'm not sure that the _approach_ is beyond reproach: the claim that skips "follow a geometric distribution" is baffling to me. If the reservoir has size K, then when we're looking at the N'th element it should be skipped with probability 1-K/N. But when looking at the N+1'th, that changes to 1-K/(N+1). Then 1-K/(N+2), and so on. These probabilities are all different. In an actual geometric distribution, the skip probability is a constant.

Now 1-K/(N+i) is approximately equal to 1-K/(N+j) for i and j sufficiently close, and for K sufficiently smaller than N, so the skips may be well _approximated_ by a geometric distribution. But that's quite different than saying they _do_ follow a geometric distribution, and I see no reason to trust that the difference can't matter.

The simple algorithm on the Wikipedia page suffers none of these potential problems (it implements an obviously-sound approach, & our `randrange()` is unbiased and platform-independent).  But, for selecting 3 out of a million-element iterable, would invoke the core generator hundreds of thousands of times more often.

----------
nosy: +mark.dickinson, tim.peters

_______________________________________
Python tracker <report at bugs.python.org>
<https://bugs.python.org/issue41311>
_______________________________________


More information about the Python-bugs-list mailing list