[issue41131] Augment random.choices() with the alias method

Tim Peters report at bugs.python.org
Thu Jul 30 16:27:49 EDT 2020


Tim Peters <tim at python.org> added the comment:

I'm not clear on that the alias method is valuable here. Because of the preprocessing expense, it cries out for a class instead: an object that can retain the preprocessed tables, be saved to disk (pickled), restored later, and used repeatedly to make O(1) selections. I have such a class, which uses exact integer arithmetic (e.g., during preprocessing, there's at least one "too small" bin if and only if there's at least one "too large" bin), and is as unbiased as choice() and randrange(). Reasoning about float vagaries is much harder, and I see no error analysis here.

Most troubling: that due to rounding, reducing a "too large" bin results in a weight that spuriously compares <= bin_size. Then a "too small" bin with a genuinely tiny weight can be left with nothing to pair with, and so ends up aliasing itself, giving it a massively too-high probability of being picked.

So if you're determined to stick to "a function", I'd stick to the simple inverse CDF sampling method.  Unless the number of choices is "very" large, the log factor doesn't much matter.

That said, the alias numerics here could be improved some by working with the weights directly, "normalizing" them only at the end.  Instead of comparing weights to bin_size, compare them instead to the mean:

mean = total / n

(note: the next step to getting rid of floats entirely is to pre-multiply the weights by lcm(total, n) // total  - then their mean is exactly the integer lcm(total, n) // n).

The mean (instead of bin_size) suffers rounding errors then, but the original weights don't during the "hard part" of preprocessing.  At the end,

fn = n + 0.0
U = [weights[i] / total + i / fn for i in range(n)]

instead.  The "i / fn" also suffers just one rounding error as opposed to the two in the current "bin_width * i".  That can make a difference for n as small as 5:

>>> 3/5
0.6
>>> (1/5)*3
0.6000000000000001

----------

_______________________________________
Python tracker <report at bugs.python.org>
<https://bugs.python.org/issue41131>
_______________________________________


More information about the Python-bugs-list mailing list