performance of script to write very long lines of random chars

Thu Apr 11 01:33:55 EDT 2013

On Thu, 11 Apr 2013 11:45:31 +1000, Chris Angelico wrote:

> On Thu, Apr 11, 2013 at 11:21 AM, gry <georgeryoung at gmail.com> wrote:
>> avail_chrs = '0123456789abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ!"#$%&\'()*+,-./:;<=>?@[\\]^_`{}'
> 
> Is this exact set of characters a requirement? For instance, would it be
> acceptable to instead use this set of characters?
> 
> avail_chrs = 'ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789+/'
> 
> Your alphabet has 92 characters, this one only 64... the advantage is
> that it's really easy to work with a 64-character set; in fact, for this
> specific set, it's the standard called Base 64, and Python already has a
> module for working with it. All you need is a random stream of eight-bit
> characters, which can be provided by os.urandom().

I was originally going to write that using the base64 module would 
introduce bias into the random strings, but after a little investigation, 
I don't think it does.

Or at least, if it does, the bias is fairly subtle, and not detectable by 
the simple technique I used: compare the mean, and the mean absolute 
deviation from the mean, of the raw bytes and of the encoded values.

# Python 3: iterating over a bytes object yields integers in range(256).
from os import urandom
from base64 import b64encode

# Statistics of the raw random bytes (values 0-255).
data = urandom(1000000)
m = sum(data)/len(data)
md = sum(abs(v - m) for v in data)/len(data)
print("Mean and mean deviation of urandom:", m, md)

# Drop the '=' padding, then map each base64 character back to its
# index (0-63) in the standard alphabet.
encoded = b64encode(data).strip(b'=')
chars = (b'ABCDEFGHIJKLMNOPQRSTUVWXYZabcdef'
         b'ghijklmnopqrstuvwxyz0123456789+/')
values = [chars.index(v) for v in encoded]
m = sum(values)/len(values)
md = sum(abs(v - m) for v in values)/len(values)
print("Mean and mean deviation of encoded data:", m, md)

When I run this, it prints:

Mean and mean deviation of urandom: 127.451652 63.95331188965717
Mean and mean deviation of encoded data: 31.477027511486245 15.991177272527072
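
As a reference point, the values a perfectly uniform source would give 
can be computed exactly. This is only a small sketch; the helper name 
uniform_stats is made up for illustration and is not part of any library.

```python
from fractions import Fraction

def uniform_stats(n):
    """Exact mean and mean absolute deviation for a uniform draw from range(n)."""
    values = range(n)
    # Use Fraction so the arithmetic is exact rather than floating point.
    mean = Fraction(sum(values), n)
    mean_dev = sum(abs(v - mean) for v in values) / n
    return float(mean), float(mean_dev)

print(uniform_stats(256))  # raw bytes, values 0-255
print(uniform_stats(64))   # base64 symbol indices, values 0-63
```

This prints (127.5, 64.0) and (31.5, 16.0).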

I would expect 127.5 and 64 for the raw bytes, and 31.5 and 16 for the 
encoded values, so we're pretty close. That's not to say that there 
aren't any other biases or correlations in the encoded data, but after a 
simplistic test, it looks okay to me.
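
A slightly stronger check, still far from a proper randomness test, is to 
count how often each of the 64 symbols turns up. The following is a rough 
sketch only: the names symbol_counts and ALPHABET are mine, and I am 
assuming the input length is a multiple of 3 so that b64encode produces 
no '=' padding and every output symbol carries a full six random bits.

```python
from base64 import b64encode
from collections import Counter
from os import urandom

ALPHABET = (b'ABCDEFGHIJKLMNOPQRSTUVWXYZabcdef'
            b'ghijklmnopqrstuvwxyz0123456789+/')

def symbol_counts(n_bytes=300000):
    """Count occurrences of each base64 symbol in encoded urandom data."""
    if n_bytes % 3:
        # A length that is not a multiple of 3 would add '=' padding.
        raise ValueError("n_bytes must be a multiple of 3")
    encoded = b64encode(urandom(n_bytes))
    # Iterating over bytes yields integers, so the keys are byte values.
    return Counter(encoded)

counts = symbol_counts()
total = sum(counts.values())
expected = total / len(ALPHABET)
# Chi-squared statistic against a uniform distribution over 64 symbols.
chi2 = sum((counts.get(c, 0) - expected) ** 2 / expected for c in ALPHABET)
print("symbols seen:", len(counts), "chi-squared:", chi2)
```

With 64 symbols there are 63 degrees of freedom, so a uniform source 
should typically give a chi-squared value somewhere around 63. This only 
looks at single-symbol frequencies and says nothing about correlations 
between neighbouring symbols.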

-- 
Steven