[Tutor] A file containing a string of 1 billion random digits.

Sun Jul 18 10:49:39 CEST 2010

On Sat, Jul 17, 2010 at 18:01, Steven D'Aprano <steve at pearwood.info> wrote:

> Having generated the digits, it might be useful to look for deviations
> from randomness. There should be approximately equal numbers of each
> digit (100,000,000 each of 0, 1, 2, ..., 9), of each digraph
> (10,000,000 each of 00, 01, 02, ..., 98, 99), trigraphs (1,000,000 each
> of 000, ..., 999) and so forth.
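
A minimal sketch of the counting Steven describes, assuming the digits are
already in a string. The name ngram_counts is made up for illustration, and
the windows overlap, so N digits give N - n + 1 windows of length n. This is
far too slow for the full billion digits as written; it only shows the idea.

```python
from collections import Counter

def ngram_counts(digits, n):
    """Count every overlapping n-digit window in the string `digits`."""
    return Counter(digits[i:i + n] for i in range(len(digits) - n + 1))

# Tiny stand-in for the real file, not actual data from it.
sample = "0123456789" * 3
singles = ngram_counts(sample, 1)
digraphs = ngram_counts(sample, 2)
print(singles["7"], digraphs["90"])
```

For truly random digits each of the 10**n possible n-graphs should turn up
about (N - n + 1) / 10**n times.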

I've been doing a bit of that. I found approx. equal numbers of each
digit (including the zeros :) ). Then I thought I'd look at pairs of
the same digit ('00', '11', and so on). See my script at
<http://tutoree7.pastebin.com/S9JzmmtY>. The results for the
1-billion-digit file start at line 78, and look good to me. I might try trigraphs
where the 2nd digit is 2 more than the first, and the third 2 more
than the 2nd. E.g. '024', '135', '791', '802'. Or maybe I've had
enough. BTW Steve, my script avoids the problem you mentioned, of
counting 2 '55's in a '555' string. I get only one, but 2 in '5555'.
See line 18, in the while loop.
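
A rough sketch of both ideas, for anyone reading without the pastebin open.
This is not the pastebin script itself; the function names are invented
here. The first function skips past a matched pair so that '555' yields one
pair and '5555' yields two. The second builds the ten trigraphs whose digits
each rise by 2, wrapping past 9.

```python
def count_same_digit_pairs(digits):
    """Count non-overlapping pairs of equal digits, scanning left to right."""
    count = 0
    i = 0
    while i < len(digits) - 1:
        if digits[i] == digits[i + 1]:
            count += 1
            i += 2  # skip both digits so '555' gives one pair, not two
        else:
            i += 1
    return count

def step_trigraphs(step=2):
    """Trigraphs whose digits each rise by `step`, wrapping past 9."""
    return ["%d%d%d" % (d, (d + step) % 10, (d + 2 * step) % 10)
            for d in range(10)]

print(count_same_digit_pairs("555"), count_same_digit_pairs("5555"))
print(step_trigraphs())
```

For a single fixed pair such as '55', Python's str.count already counts
non-overlapping occurrences, so '555'.count('55') is 1.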

I was surprised that I could read in the whole billion-digit file in one
gulp without running out of memory. Memory usage went to 80% (from the
usual 35%), but no higher except at first, when I saw 98% for a few
seconds, and then a drop to 78-80% where it stayed.
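
If memory ever did become a problem, one alternative is to read the file in
pieces. The sketch below assumes a plain text file of digits; the function
name and the chunk size are arbitrary choices for illustration. It handles
single digits only. Counting digraphs this way would also need care with
pairs that straddle two chunks.

```python
from collections import Counter

def digit_counts_chunked(path, chunk_size=1 << 20):
    """Count each digit in the file at `path`, reading chunk_size chars at a time."""
    counts = Counter()
    with open(path) as f:
        while True:
            chunk = f.read(chunk_size)
            if not chunk:  # empty string means end of file
                break
            counts.update(chunk)
    # Keep only the ten digits, dropping newlines or other stray characters.
    return dict((d, counts[d]) for d in "0123456789")
```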

> The interesting question is, if you measure a deviation from the
> equality (and you will), is it statistically significant? If so, is it
> because of a problem with the random number generator, or with my
> algorithm for generating the sample digits?

I was pretty good at statistics long ago -- almost became a
statistician -- but I've pretty much lost what I had. Still, I'd bet
that the deviations I've seen so far are not significant.
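
One standard way to check that bet for the single-digit counts is Pearson's
chi-square test. The sketch below is only an illustration: the observed
counts are made up, not taken from the real file, and the names are
invented. With ten categories there are 9 degrees of freedom, for which the
5% critical value is 16.919.

```python
def chi_square(observed, expected):
    """Pearson chi-square statistic when every category has the same expected count."""
    return sum((o - expected) ** 2 / expected for o in observed)

CRITICAL_5PCT_DF9 = 16.919  # chi-square critical value, 9 degrees of freedom, 5% level

# Made-up counts for digits 0 through 9, for illustration only.
observed = [1003, 987, 1012, 995, 1001, 990, 1008, 999, 1010, 995]
expected = float(sum(observed)) / len(observed)

stat = chi_square(observed, expected)
print(stat)
print(stat > CRITICAL_5PCT_DF9)  # True would suggest a significant deviation
```

A statistic below the critical value means counts this uneven are
unremarkable for random digits. The same test works for digraphs, with 100
categories and 99 degrees of freedom, given the matching critical value.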

Thanks for the stimulating challenge, Steve.

Dick