performance of script to write very long lines of random chars
Steven D'Aprano
steve+comp.lang.python at pearwood.info
Thu Apr 11 01:33:55 EDT 2013
On Thu, 11 Apr 2013 11:45:31 +1000, Chris Angelico wrote:
> On Thu, Apr 11, 2013 at 11:21 AM, gry <georgeryoung at gmail.com> wrote:
>> avail_chrs =
>> '0123456789abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ!"#$%&
>> \'()*+,-./:;<=>?@[\\]^_`{}'
>
> Is this exact set of characters a requirement? For instance, would it be
> acceptable to instead use this set of characters?
>
> avail_chrs =
> 'ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789+/'
>
> Your alphabet has 92 characters, this one only 64... the advantage is
> that it's really easy to work with a 64-character set; in fact, for this
> specific set, it's the standard called Base 64, and Python already has a
> module for working with it. All you need is a random stream of eight-bit
> characters, which can be provided by os.urandom().
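For reference, here is a minimal sketch of how I read Chris's suggestion. The function name random_base64_line is my own invention for illustration, not something from the original script, and I'm assuming the goal is simply "N random characters from the Base64 alphabet":

```python
from base64 import b64encode
from os import urandom


def random_base64_line(length):
    """Return `length` random characters from the Base64 alphabet.

    Hypothetical helper, for illustration only.
    """
    # Each Base64 character carries 6 bits, so we need at least
    # ceil(length * 6 / 8) == ceil(length * 3 / 4) random bytes.
    nbytes = (length * 3 + 3) // 4
    encoded = b64encode(urandom(nbytes)).decode("ascii")
    # Slicing drops any '=' padding and any final character that was
    # only partly filled with random bits.
    return encoded[:length]


print(random_base64_line(60))
```

Asking for just enough bytes and then slicing means the padding never ends up in the output.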
I was originally going to write that using the base64 module would
introduce bias into the random strings, but after a little investigation,
I don't think it does.
Or at least, if it does, it's a fairly subtle bias, and not detectable by
the simple technique I used: inspect the mean, and the mean deviation
from the mean.
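# Python 3 only: iterating over bytes yields ints, and / is true division.
from os import urandom
from base64 import b64encode

# Mean and mean deviation of the raw random bytes (values 0..255).
data = urandom(1000000)
m = sum(data)/len(data)
md = sum(abs(v - m) for v in data)/len(data)
print("Mean and mean deviation of urandom:", m, md)

# Encode, drop the '=' padding, and map each character back to its
# 6-bit value (0..63) via its position in the Base64 alphabet.
encoded = b64encode(data).strip(b'=')
chars = (b'ABCDEFGHIJKLMNOPQRSTUVWXYZabcdef'
         b'ghijklmnopqrstuvwxyz0123456789+/')
values = [chars.index(v) for v in encoded]
m = sum(values)/len(values)
md = sum(abs(v - m) for v in values)/len(values)
print("Mean and mean deviation of encoded data:", m, md)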
When I run this, it prints:
Mean and mean deviation of urandom: 127.451652 63.95331188965717
Mean and mean deviation of encoded data: 31.477027511486245 15.991177272527072
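As a baseline, the exact mean and mean deviation of a perfectly uniform distribution over 0..n-1 can be computed directly. A minimal sketch (the helper name uniform_stats is mine, purely for illustration):

```python
def uniform_stats(n):
    """Exact mean and mean deviation of the integers 0..n-1,
    each taken exactly once (i.e. a perfectly uniform distribution).

    Hypothetical helper, for illustration only.
    """
    values = range(n)
    m = sum(values) / n
    md = sum(abs(v - m) for v in values) / n
    return m, md


print("Uniform over 256 values:", *uniform_stats(256))  # 127.5 64.0
print("Uniform over 64 values:", *uniform_stats(64))    # 31.5 16.0
```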
I would expect about 127.5 and 64 for the raw bytes, and about 31.5 and 16
for the encoded values, so we're pretty close. That's not to say that there
aren't any other biases or correlations in the encoded data, but after a
simplistic test, it looks okay to me.
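One slightly less simplistic check is to count how often each of the 64 symbols turns up and compute a chi-square statistic against a uniform distribution. This is only a sketch: chi_square_uniform is a name I made up, and I'm feeding it a multiple of 3 bytes so that b64encode produces no padding. It still says nothing about correlations between neighbouring characters.

```python
from base64 import b64encode
from collections import Counter
from os import urandom

ALPHABET = (b'ABCDEFGHIJKLMNOPQRSTUVWXYZabcdef'
            b'ghijklmnopqrstuvwxyz0123456789+/')


def chi_square_uniform(encoded):
    """Chi-square statistic of symbol counts against a uniform
    distribution over the 64 Base64 symbols.

    Hypothetical helper, for illustration only.
    """
    counts = Counter(encoded)  # iterating over bytes yields ints
    expected = len(encoded) / len(ALPHABET)
    return sum((counts[c] - expected) ** 2 / expected for c in ALPHABET)


# 750000 bytes is a multiple of 3, so there is no '=' padding and
# every output character carries a full 6 random bits.
encoded = b64encode(urandom(3 * 250000))
stat = chi_square_uniform(encoded)
print("chi-square statistic (63 degrees of freedom):", stat)
```

With 64 symbols there are 63 degrees of freedom, so for genuinely uniform data the statistic should land somewhere around 63, give or take a standard deviation of roughly 11. A value far above that would suggest some symbols are favoured.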
--
Steven