randomly generate n of each of two types

petercable at gmail.com
Mon Feb 12 04:23:04 EST 2007


>
> This again has the "costs" I referred to:
> creating a potentially large sequence,
> and shuffling it.

I thought I would see if I could do better, so I wrote this:

<code>
import random

def index_types(n, typelist=[True, False]):
    # Rejection sampling: pick a random type index and emit that type
    # only if fewer than n of it have been produced so far.
    # Works for a typelist of any length.
    numtypes = len(typelist)
    total = float(n*numtypes)
    counts = [0] * numtypes
    while total > 0:
        index = int(random.random() * numtypes)
        if counts[index] < n:
            counts[index] += 1
            total -= 1
            yield typelist[index]

def shuffle_types(n, typelist=[True, False]):
    # Build the whole list up front, shuffle it in place, then yield the items.
    types = typelist*n
    random.shuffle(types)
    for next_type in types:
        yield next_type

def random_types(n, typelist=[True, False]):
    # Intended: choose type0 with probability (remaining type0) / (total remaining).
    type0, type1 = typelist
    ct0, ct1 = 0, 0
    while ct0+ct1 < 2*n:
        if random.random() < ((n-ct0)/(2*n-ct0-ct1)):
            next_type = type0
            ct0 += 1
        else:
            next_type = type1
            ct1 += 1
        yield next_type

def test_index(n):
    for x in index_types(n): pass

def test_shuffle(n):
    for x in shuffle_types(n): pass

def test_random(n):
    for x in random_types(n): pass

if __name__ == '__main__':
    from time import sleep
    sleep(10)   # give the machine a moment to settle before timing
    from timeit import Timer
    for function in ['test_index', 'test_shuffle', 'test_random']:
        for nvalue in [1000, 10000, 100000, 1000000]:
            t = Timer("%s(%d)" % (function, nvalue),
                      "from __main__ import %s" % function)
            print function, 'n =', nvalue, ':', t.timeit(number=100)

</code>

Which yielded the following results:

pete at pete-desktop:~$ python test.py
test_index n = 1000 : 0.599834918976
test_index n = 10000 : 5.78634595871
test_index n = 100000 : 56.1441719532
test_index n = 1000000 : 562.577429056
test_shuffle n = 1000 : 0.420483827591
test_shuffle n = 10000 : 4.62663197517
test_shuffle n = 100000 : 46.3557109833
test_shuffle n = 1000000 : 498.563408852
test_random n = 1000 : 0.440869092941
test_random n = 10000 : 4.77765703201
test_random n = 100000 : 47.6845810413
test_random n = 1000000 : 485.233494043

Which shows my function is the slowest (doh!) and your second function
is the fastest. However, shuffle doesn't take much longer to build
and randomize a fairly large list (4.99 secs vs 4.85 secs for your
second function). In my defense, I wanted to see whether I could write
the function to take an arbitrarily long typelist, rather than limiting
it to two items.
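
For what it's worth, here is a rough sketch (my own, untimed, and the
name weighted_types is just what I called it; it assumes it sits in the
same module as the code above, so random is already imported) of how the
proportional approach might be generalized to an arbitrary typelist, so
you keep the speed of your second function without the rejection loop:

<code>
def weighted_types(n, typelist=[True, False]):
    # Pick each next item with probability proportional to how many of
    # that type are still left to hand out.
    remaining = [n] * len(typelist)
    total = n * len(typelist)
    while total > 0:
        r = random.random() * total
        for index, count in enumerate(remaining):
            r -= count
            if r < 0:
                break
        remaining[index] -= 1
        total -= 1
        yield typelist[index]
</code>

For two types this should behave like your second function (once the
division issue below is fixed); for longer typelists the inner loop costs
a little per item, but it never throws a draw away the way index_types does.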

I also noticed your second function is not working as intended:

>>> r = [x for x in test.random_types(10)]
>>> r
[False, False, False, False, False, False, False, False, False, False,
True, True, True, True, True, True, True, True, True, True]

I think it needs a cast to a float. With integer division,
(n-ct0)/(2*n-ct0-ct1) is 0 until type1 is used up (and 1 after that), so
you always get n of one type followed by n of the other:

<code>
 if random.random() < (float(n-ct0)/(2*n-ct0-ct1)):
</code>
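
As an alternative to the cast (just a thought), from __future__ import
division at the top of the module makes / return a float everywhere in
that file, so the expression can stay as it was. Either way, a quick spot
check (assuming it runs in the same module as the functions above, after
the fix) is to print a small sample and confirm the two types come out
interleaved rather than in one solid run of each:

<code>
# spot check after applying the fix: this should almost always print a
# mixed ordering, not ten Falses followed by ten Trues
print list(random_types(10))
</code>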

I was too impatient to wait for the n=1000000 results, so here they are
without them:

pete at pete-desktop:~$ python test.py
test_index n = 1000 : 0.583498001099
test_index n = 10000 : 5.80185317993
test_index n = 100000 : 58.8963599205
test_shuffle n = 1000 : 0.431984901428
test_shuffle n = 10000 : 4.75261592865
test_shuffle n = 100000 : 48.1326880455
test_random n = 1000 : 0.697184085846
test_random n = 10000 : 4.41986989975
test_random n = 100000 : 45.7090520859

Once again, the difference is negligible. My question is: does this
function reduce the "randomness" of the data? Doesn't it push the output
towards an alternating pattern of True, False? It seems to me that
calling shuffle, or perhaps chaining PRNGs as suggested by Steven

> If you really care about shuffling large lists, you can chain PRNGs. For
> instance, you might use shuffle() to generate a random batch of items,
> then use a completely different PRNG to shuffle the list again.

would produce the most "random" data...
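
If it helps, here is a rough way to check that empirically (my own quick
sketch; the names are mine, and it assumes it is pasted into the same
module as the functions above, with the division fix applied): for a
small n, tally how often each distinct arrangement comes out of
shuffle_types versus random_types. If the proportional method really
drifted towards an alternating pattern, the alternating orderings should
turn up noticeably more often than 1 in 20 for n = 3.

<code>
from collections import defaultdict  # Python 2.5+

def arrangement_counts(generator, n, trials=20000):
    # Tally how often each distinct ordering of n Trues and n Falses
    # comes up, as a crude uniformity check.
    counts = defaultdict(int)
    for _ in xrange(trials):
        counts[tuple(generator(n))] += 1
    return counts

if __name__ == '__main__':
    for gen in (shuffle_types, random_types):
        counts = arrangement_counts(gen, 3)
        # n = 3 gives C(6,3) = 20 possible arrangements, so a uniform
        # generator should put roughly trials/20 = 1000 in each bucket
        print gen.__name__, min(counts.values()), max(counts.values())
</code>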

Pete



