Testing random

Steven D'Aprano steve at pearwood.info
Sun Jun 7 14:24:14 EDT 2015


On Mon, 8 Jun 2015 01:51 am, Thomas 'PointedEars' Lahn wrote:

> Chris Angelico wrote:
> 
>> On Sun, Jun 7, 2015 at 8:40 PM, Thomas 'PointedEars' Lahn
>> <PointedEars at web.de> wrote:
>>> Cecil Westerhof wrote:
>>>> I wrote a very simple function to test random:
>>>>     def test_random(length, multiplier = 10000):
>>>>         number_list = length * [0]
>>>>         for i in range(length * multiplier):
>>>>             number_list[random.randint(0, length - 1)] += 1
>>>>         minimum = min(number_list)
>>>>         maximum = max(number_list)
>>>>         return (minimum, maximum, minimum / maximum)
>>>
>>> As there is no guarantee that every number will occur randomly, using a
>>> dictionary at first should be more efficient than a list:
>> 
>> Hmm, I'm not sure that's actually so. His code is aiming to get
>> 'multiplier' values in each box; for any serious multiplier (he starts
>> with 10 in the main code), you can be fairly confident that every
>> number will come up at least once.
> 
> The wording shows a common misconception: that random distribution would
> mean that it is guaranteed or more probable that every element of the set
> will occur at least once.  It is another common misconception that
> increasing the number of trials would increase the probability of that
> happening.  But that is not so.

You are badly mistaken.

Of course you are correct that a random distribution does not *guarantee*
that every element of the set is seen at least once. But as soon as you
call it "a common misconception" that increasing the number of trials
increases the probability of seeing every element, you are simply mistaken.
It's not a misconception at all, it is true. See below for calculations.


[...]
> If the distribution is even and the game is fair, every outcome *always*
> has
> the same probability of occurring.  As a result, every sequence of
> outcomes of equal length *always* has the same probability of occurring,
> and the probability for a particular sequence of equal length does _not_
> increase or
> decrease based on the number of occurrences of previous outcomes.  Those
> are *independent* events.

You are confused. Think about it more carefully.

To make it simple, let's randomly generate from the set 1, 2, 3, and do just
three samples. That's small enough to enumerate all the possibilities:

111 112 113 121 122 123 131 132 133
211 212 213 221 222 223 231 232 233
311 312 313 321 322 323 331 332 333

Each event (111 etc) has a probability of 1/27, but that's not important.
We're comparing the probability of all elements being seen, versus at least
one element being missed. I'm going to tag the events that contain all
three elements:

111 112 113 121 122 123* 131 132* 133
211 212 213* 221 222 223 231* 232 233
311 312* 313 321* 322 323 331 332 333

So there are six events out of 27, or 2/9, where all elements are seen in a
trial, versus 7/9 where at least one element will have a count of zero.

Putting it another way:

Pr(at least one element has zero count)
= Pr(1 has zero count) + Pr(2 has zero count) + Pr(3 has zero count)
  - Pr(1 and 2 both have zero counts) 
  - Pr(1 and 3 both have zero counts) 
  - Pr(2 and 3 both have zero counts)

= (2/3)**3 * 3 - (1/3)**3 * 3

which gives us 7/9 as confirmed by direct enumeration of all the
possibilities.

With four samples per trial:
Pr(at least one element has zero count) = 5/9

With five samples:
Pr(at least one element has zero count) = 31/81

Ten samples:
Pr(at least one element has zero count) = 341/6561 
or approximately 0.05

Twenty samples:
Pr(at least one element has zero count) = 349525/387420489
or approximately 0.0009

For 100 samples:
Pr(at least one element has zero count) = 7.4e-18 (approx)


which means that with 100 samples from the set (1, 2, 3), if you could run
one million trials every second, it would on average take you almost 4300
years to see a trial that had a zero count for one or more of the elements.

The probability of any element *not* showing up in a random trial with N
samples decreases rapidly as N increases. Elements with zero count are only
a factor when the number of samples is small, or the number of elements is
large.




-- 
Steven




More information about the Python-list mailing list