sampling from frequency distribution / histogram without replacement

Ian Hobson hobson42 at gmail.com
Tue Jan 15 12:59:32 EST 2019


Hi,

If I understand your problem you can do it in two passes through the 
population.

First, however, lets work through taking a sample of 2 from 7 to 
demonstrate the method.

Take the first element with a probability of 2/7. (Note 1).
If you took it, you only want 1 more, so the probability drops to 1/6.
If you didn't take it you want 2 from 6, so probability goes to 2/6.
Take the next in the population with probability 1/6 or 2/6 as appropriate.
Continue in similar manner until the probability
drops to 0 (when you have your whole sample). When the
denominator drops to zero the population is expired.

Your first pass has to categorise the population and create your 
histogram, (index N) of frequencies Y(N).

Then divide up the sample size you wish to take into the histogram,
giving array X(N) of sample sizes. X(N) need not be integer.

Then pass through the population again, for each entry:
    Compute the N it falls in the histogram.
    Take this entry as a sample with a probability of X(N)/Y(N).  Note 2.
    If the element was taken, decrement X(N).
    Decrement Y(N).
    step to next element.

Note 1 - In most languages you can generate a pseudo-random number
with a uniform distribution from 0 to Y(N)-1. Take the element if it is 
in range 0 to floor(X(N))-1.

Note 2 - X(N) need not be integer, but you can't actually take a sample 
of 6.5 out of 1000. You will either run out of population having taken 
6, or, if you take 7, the probability will go negative, and no more 
should be taken (treat as zero). The number taken in slot N will be 
floor(X(N)) or ceiling(X(N)). The average over many tries will however 
be X(N).

Sorry I did not come back to you sooner. It took a while to drag the 
method out of my memory from some 35 years ago when I was working on an 
audit package. That was where I learned two things you may be interested in.

1) Auditors significantly under sample. Our Auditors actually took 
samples that were between 10% and 25% of what was necessary to support 
their claims.

2) Very very few standard pseudo-random number generators are actually 
any good.

Regards

Ian

On 14/01/2019 20:11, duncan smith wrote:
> Hello,
>        Just checking to see if anyone has attacked this problem before
> for cases where the population size is unfeasibly large. i.e. The number
> of categories is manageable, but the sum of the frequencies, N,
> precludes simple solutions such as creating a list, shuffling it and
> using the first n items to populate the sample (frequency distribution /
> histogram).
> 
> I note that numpy.random.hypergeometric will allow me to generate a
> sample when I only have two categories, and that I could probably
> implement some kind of iterative / partitioning approach calling this
> repeatedly. But before I do I thought I'd ask if anyone has tackled this
> before. Can't find much on the web. Cheers.
> 
> Duncan
> 

-- 
Ian Hobson
Tel (+351) 910 418 473



More information about the Python-list mailing list