sampling from frequency distribution / histogram without replacement

Fri Jan 18 21:29:54 EST 2019

On 14/01/2019 20:11, duncan smith wrote:
> Hello,
>       Just checking to see if anyone has attacked this problem before
> for cases where the population size is unfeasibly large. i.e. The number
> of categories is manageable, but the sum of the frequencies, N,
> precludes simple solutions such as creating a list, shuffling it and
> using the first n items to populate the sample (frequency distribution /
> histogram).
> 
> I note that numpy.random.hypergeometric will allow me to generate a
> sample when I only have two categories, and that I could probably
> implement some kind of iterative / partitioning approach calling this
> repeatedly. But before I do I thought I'd ask if anyone has tackled this
> before. Can't find much on the web. Cheers.
> 
> Duncan
> 

After much tinkering I came up with the following:

import numpy as np

def hypgeom_variate(freqs, n):
    # recursive partitioning approach
    sample = [0] * len(freqs)
    cumsum = np.cumsum(list(chain([0], freqs)))
    if n > cumsum[-1]:
        raise ValueError('n cannot be greater than population size')
    hypergeometric = np.random.hypergeometric
    argslist = [(0, len(freqs), 0, cumsum[-1], n)]
    for i, k, ci, ck, m in argslist:
        if k == i + 1:
            sample[i] = m
        else:
            j = (i + k) // 2
            cj = cumsum[j]
            x = hypergeometric(cj - ci, ck - cj, m, 1)[0]
            y = m-x
            if x:
                argslist.append((i, j, ci, cj, x))
            if y:
                argslist.append((j, k, cj, ck, y))
    return sample

Cheers.

Duncan