sampling without replacement

bearophileHUGS at lycos.com bearophileHUGS at lycos.com
Fri Apr 18 09:51:55 EDT 2008


Alexy>But in Python it's very slow...<

I'm the first one to say that CPython is slow, but almost any language
is slow if you use such wrong algorithms like you do.
There are many ways to solve your problem efficiently, one of such
ways, among the simpler ones is to to not modify the original list:

>>> from random import shuffle, seed
>>> items = list("abcdefghijklm")
>>> seed(10)
>>> shuffle(items)
>>> it_items = iter(items)
>>> it_items.next()
'i'
>>> it_items.next()
'd'
>>> it_items.next()
'l'
>>> it_items.next()
'b'
>>> it_items.next()
'j'
>>> it_items.next()
'a'
>>> it_items.next()
'e'

If you don't want to extract the same element twice across different
runs of the program, then you may create a class like this:

from random import shuffle, seed

class Sampler(object):
    def __init__(self, items, init_seed=1):
        self.items = list(items)
        self.last = len(self.items) - 1
        self.init_seed = init_seed
        seed(init_seed)
        shuffle(self.items)
    def __repr__(self):
        return repr(self.items[:self.last+1])
    def next(self):
        if self.last < 0:
            raise StopIteration
        self.last -= 1
        return self.items[self.last+1]
    def save(self, filename):
        pass
        # saves self.last and self.init_seed on disk

samp = Sampler("abcdefghijklm")
print samp
print samp.next()
print samp.next()
print samp.next()

That class code is raw, you may want to improve it in some ways.

Bye,
bearophile



More information about the Python-list mailing list