sampling from frequency distribution / histogram without replacement

duncan smith duncan at invalid.invalid
Mon Jan 14 19:40:25 EST 2019


On 14/01/2019 22:59, Gregory Ewing wrote:
> duncan smith wrote:
>> Hello,
>>       Just checking to see if anyone has attacked this problem before
>> for cases where the population size is unfeasibly large.
> 
> The fastest way I know of is to create a list of cumulative
> frequencies, then generate uniformly distributed numbers and
> use a binary search to find where they fall in the list.
> That's O(log n) per sample in the size of the list once it's
> been set up.
> 
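For concreteness, the with-replacement version of that scheme might
look something like this (a minimal sketch; draw() is just an
illustrative name, not anything from the standard library):

    import bisect
    import itertools
    import random

    def draw(freqs, n, rng=random):
        # Build the cumulative frequencies once; each draw is then one
        # uniform variate plus a binary search, O(log k) for k
        # categories.
        cumulative = list(itertools.accumulate(freqs))
        total = cumulative[-1]
        return [bisect.bisect_right(cumulative, rng.random() * total)
                for _ in range(n)]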

That's the sort of thing I've been thinking about. But once I'd found
the relevant category, I'd need to reduce its frequency by 1 and
correspondingly update the cumulative frequencies. Alternatively, I
could add an extra step where I select a unit from the relevant
category with probability equal to the proportion of non-sampled units
remaining in the category. I could maybe set up an alias table and do
something similar.
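
The rejection variant might look like the following sketch: sample a
category from the original (static) cumulative table, then accept it
with probability remaining/original, so nothing needs updating after
setup (sample_wor is an illustrative name):

    import bisect
    import itertools
    import random

    def sample_wor(freqs, n, rng=random):
        # Static setup: cumulative sums over the original frequencies.
        cumulative = list(itertools.accumulate(freqs))
        total = cumulative[-1]
        remaining = list(freqs)
        sample = []
        while len(sample) < n:
            i = bisect.bisect_right(cumulative, rng.random() * total)
            # Accept with probability (non-sampled units in category i)
            # / (original units in category i); accepted draws then
            # land on categories in proportion to their remaining
            # frequencies, as required.
            if rng.random() * freqs[i] < remaining[i]:
                remaining[i] -= 1
                sample.append(i)
        return sample

The expected number of rejections per draw grows as the sample does
(each trial succeeds with probability sum(remaining)/total), so this
should stay cheap while the sample is a small fraction of the
population.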

The other thing I was thinking about was iterating through the
categories (ideally from largest frequency to smallest frequency),
generating the numbers to be sampled from the current category and the
remaining categories (using numpy.random.hypergeometric). With a few
large frequencies and lots of small frequencies that could be quite
quick (on average). Alternatively, I could partition the categories into
two sets, generate the number to be sampled from each partition, then
partition the partitions, and so on, binary-search style.
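
A sketch of the first (sequential) variant, assuming n never exceeds
the total frequency (counts_wor is an illustrative name):

    import numpy as np

    def counts_wor(freqs, n):
        # Number sampled from each category, visiting categories
        # largest-first.  Assumes 0 <= n <= sum(freqs).
        freqs = np.asarray(freqs)
        counts = np.zeros(len(freqs), dtype=int)
        rest = int(freqs.sum())            # total of unvisited categories
        for i in np.argsort(freqs)[::-1]:  # largest frequency first
            if n == 0:
                break
            rest -= int(freqs[i])          # total of the other unvisited
            if rest == 0:
                counts[i] = n              # last category takes the rest
                break
            # Draws from this category, given n draws spread over this
            # category (ngood) and everything still to come (nbad).
            k = int(np.random.hypergeometric(int(freqs[i]), rest, n))
            counts[i] = k
            n -= k
        return counts

The recursive version would apply the same hypergeometric split to the
totals of two groups of categories and then recurse into each group.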

I suppose I'll try both the alias table + rejection step and the
recursive partitioning approach and see how they turn out. Cheers.

Duncan


