[issue44080] Bias in random.choices(long_sequence)

Sat May 8 17:10:38 EDT 2021

New submission from Dennis Sweeney <sweeney.dennis650 at gmail.com>:

Problem: Random.choices() never returns sequence entries at large
odd indices. For example:

>>> import random
>>> [x % 2 for x in random.choices(range(2**53), k=20)]
[0, 0, 0, 0, 1, 0, 1, 1, 0, 0, 1, 0, 1, 0, 1, 1, 1, 1, 0, 1]
>>> [x % 2 for x in random.choices(range(2**54), k=20)]
[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]

This might require a 64-bit build.

Possible solutions:

1) Do nothing.

    Document the quirk, and recommend using
    [seq[randrange(len(seq))] for i in range(k)]

2) Use integer arithmetic rather than random().

    Using _randbelow() in the unweighted case alone breaks
    test_random.test_choices_algorithms(), but adding a case for
    `if isinstance(total, int)` and using _randbelow() there as
    well creates a difference between the reproducability
    of weights=[1]*n versus weights=[1.0]*n.

3) Raise an Exception or Warning when loss of precision is likely to occur.

    Something like `if n > 2.0 ** BPF: raise OverFlowError(...)`.
    All smaller integers are exactly representable, but that doesn't
    quite eliminate the bias (see below).

-----------------------

There is bias in random.choices() even when n <= 2**53:

>>> Counter(val % 3 for val in
...         choices(range(2**53 - 2**51), k=1_000_000))
Counter({1: 374976, 0: 333613, 2: 291411})

>>> Counter(val % 3 for val in
...         choices(range(2**53 - 2**51), k=1_000_000)
...         if val < 2**51)
Counter({0: 166669, 1: 83664, 2: 83293})

For some explanation, imagine floats have
only 3 bits of precision. Then random.choices() would do something
like this on a sequence of length 7:

# Random 3-significant-bit floats
def random():
    return randrange(8) / 8

# selecting from a pool of size 2**BPF - 1
>>> Counter(floor(random() * 7) for _ in range(1_000_000))
Counter({0: 249566, 5: 125388, 6: 125251, 2: 125202, 1: 125149, 3: 124994, 4: 124450})

Issue: both 0/8 and 7/8 fall in [0, 1).

Similar biases:

>>> Counter(floor(randrange(16)/16*15) for _ in range(1_000_000))
Counter({0: 124549, 5: 62812, 14: 62807, 6: 62766, 10: 62740, 3: 62716, 12: 62658, 13: 62649, 9: 62473, 8: 62376, 2: 62373, 4: 62346, 11: 62293, 1: 62278, 7: 62164})

>>> Counter(floor(randrange(16)/16*14) for _ in range(1_000_000))
Counter({7: 125505, 0: 124962, 11: 62767, 9: 62728, 6: 62692, 10: 62684, 4: 62625, 3: 62465, 12: 62428, 13: 62397, 2: 62332, 8: 62212, 5: 62176, 1: 62027})

>>> def bias(bits, n):
...     c = Counter(floor(randrange(2**bits)/(2**bits) * n)
...                 for _ in range(1_000_000))
...     m = min(c.values())
...     preferred = [k for k, v in c.items() if v / m > 1.5]
...     preferred.sort()
...     return preferred

>>> bias(bits=4, n=15)
[0]
>>> bias(bits=4, n=14)
[0, 7]
>>> bias(bits=8, n=250)
[0, 41, 83, 125, 166, 208]

# Observation of which numbers may be biased toward
# I have no proof that this is universally true
>>> [250 * i // (2**8 - 250) for i in range(6)]
[0, 41, 83, 125, 166, 208]

If using the OverflowError approach, I think using a threshold of 2**BPF/2
would only make some "bins" (indices) have 1.5x the probability of
others rather than 2x, and thus would not remove the problem either.

----------
components: Library (Lib)
messages: 393285
nosy: Dennis Sweeney, rhettinger
priority: normal
severity: normal
status: open
title: Bias in random.choices(long_sequence)
type: behavior
versions: Python 3.10, Python 3.11

_______________________________________
Python tracker <report at bugs.python.org>
<https://bugs.python.org/issue44080>
_______________________________________