sampling without replacement

braver deliverable at gmail.com
Thu Apr 17 02:37:32 EDT 2008


Greetings -- I am doing sampling without replacement, taking random
elements out of an array.  The picked element is then removed from the
array.  When my arrays were on the order of 10,000 elements long,
everything was fast.  But when I increased them to 1,000,000, it
suddenly took hours.  I tracked it down to the line I first threw in
to cut out element i:

a = a[:i] + a[i+1:]

It was indeed the weakest link.  But even when I replace it with, e.g.,

a.pop(i)

it is still slow.
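
For what it's worth, I believe both forms copy or shift the tail of the
list on every call, so a single deletion costs time proportional to the
list length.  A rough timing sketch along those lines (the list size and
index are arbitrary placeholders):

import timeit

# Both deletion forms touch the tail of the list on every call,
# so each one is linear in the list length.
setup = "a = list(range(1000000)); i = 500000"

print(timeit.timeit("b = a[:i] + a[i+1:]", setup=setup, number=20))
print(timeit.timeit("a.pop(i); a.append(0)", setup=setup, number=20))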

I wonder what approach can be taken to speed this up?  Basically, the
algorithm picks an element from the array at random and checks its
size in a separate hashtable i=>size.  If the size fits into an
existing accumulator, i.e. is below a threshold, we take it and delete
it from the array as above.  Thus random.sample alone is not enough:
we may not be able to use an element we pick, so we go element by
element, maintaining cumulative statistics as we go.
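
For reference, a stripped-down sketch of that loop -- the names sizes,
threshold, and the max_tries guard are simplified stand-ins for my real
data, not the actual code:

import random

# Rough sketch of the selection loop described above.  `sizes` maps an
# element to its size, `threshold` is the accumulator capacity, and
# `max_tries` is only a guard so the sketch cannot spin forever.
def fill_accumulator(a, sizes, threshold, max_tries=10000):
    picked, total = [], 0
    for _ in range(max_tries):
        if not a:
            break
        i = random.randrange(len(a))     # random candidate position
        size = sizes[a[i]]
        if total + size <= threshold:    # does it fit the accumulator?
            total += size
            picked.append(a.pop(i))      # remove it -- O(n), the slow part
    return picked, total

# e.g.: a = list(range(10)); sizes = dict((x, 1) for x in a)
#       fill_accumulator(a, sizes, 5)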

Using an array is natural here as it represents "without replacement"
-- we take an element by removing it from the array.  But in Python
it's very slow...  What approaches are there to implement a shrinking
array with random deletions, on the order of millions of elements?
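
One idea that might help (untested sketch): since the order of the
remaining elements does not matter for uniform random sampling, the
picked element could be overwritten with the last one and the list
popped from the end, which should make each deletion O(1) instead of
O(n):

import random

# Possible O(1) removal: move the last element into slot i and pop the
# end, instead of shifting the whole tail left by one.
def remove_at(a, i):
    x = a[i]        # the element we actually picked
    a[i] = a[-1]    # overwrite its slot with the last element
    a.pop()         # drop the now-duplicated last slot -- O(1)
    return x

# e.g. replacing a.pop(i) in the loop above with remove_at(a, i)

With something like that the whole run should be roughly linear in the
number of removals rather than quadratic -- but maybe there is a better
or more idiomatic way?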

Cheers,
Alexy


