sampling from frequency distribution / histogram without replacement

Tue Jan 15 12:34:04 EST 2019

On 15/01/2019 02:41, Spencer Graves wrote:
> 
> 
> On 2019-01-14 18:40, duncan smith wrote:
>> On 14/01/2019 22:59, Gregory Ewing wrote:
>>> duncan smith wrote:
>>>> Hello,
>>>>        Just checking to see if anyone has attacked this problem before
>>>> for cases where the population size is unfeasibly large.
>>> The fastest way I know of is to create a list of cumulative
>>> frequencies, then generate uniformly distributed numbers and
>>> use a binary search to find where they fall in the list.
>>> That's O(log n) per sample in the size of the list once it's
>>> been set up.
>>>
>> That's the sort of thing I've been thinking about. But once I'd found
>> the relevant category I'd need to reduce its frequency by 1 and
>> correspondingly update the cumulative frequencies. Alternatively, I
>> could add an extra step where I selected a unit from the relevant
>> category with probability equal to the proportion of non-sampled units
>> from the category. I could maybe set up an alias table and do something
>> similar.
>>
>> The other thing I was thinking about was iterating through the
>> categories (ideally from largest frequency to smallest frequency),
>> generating the numbers to be sampled from the current category and the
>> remaining categories (using numpy.random.hypergeometric). With a few
>> large frequencies and lots of small frequencies that could be quite
>> quick (on average). Alternatively I could partition the categories into
>> two sets, generate the number to be sampled from each partition, then
>> partition the partitions etc. binary search style.
>>
>> I suppose I'll try the both the alias table + rejection step and the
>> recursive partitioning approach and see how they turn out. Cheers.
> 
> 
>       R has functions "sample" and "sample.int";  see
> "https://www.rdocumentation.org/packages/base/versions/3.5.2/topics/sample".
> You can call R from Python,
> "https://sites.google.com/site/aslugsguidetopython/data-analysis/pandas/calling-r-from-python".
> 
> 
> 
>       These are in the "base" package.  I believe they have been an
> important part of the base R language almost since its inception and
> have been used extensively.  You'd have to work really hard to do
> better, in my judgment.
> 
> 
>       Spencer Graves
> 
> 
> DISCLAIMER:  I'm primarily an R guy and only use Python when I can't
> find a sensible way to do what I want in R.
>>
>> Duncan
> 

Despite being a statistician I'm primarily a Python guy and only use R
when I can't find a sensible way to do what I want in Python :-). The
problem with the R solution is that it doesn't seem to get round the
issue of having an unfeasibly large population size, but a reasonable
number of categories. It turns out I've already coded up a few reservoir
based algorithms for sampling without replacement that work with data
streams. So I can get round the space issues, but it still means
processing each data point rather than generating the sample frequencies
directly.

After much searching all I've been able to find is the approach I
suggested above, iterating through the frequencies. My implementation:

import numpy

def hypgeom_variate(freqs, n):
    sample = [0] * len(freqs)
    nbad = sum(freqs)
    hypergeometric = numpy.random.hypergeometric
    for i, ngood in enumerate(freqs):
        nbad -= ngood
        x = hypergeometric(ngood, nbad, n, 1)[0]
        if x:
            sample[i] = x
            n -= x
        if not n:
            break
    return sample

Duncan