[Numpy-discussion] sample without replacement

Robert Kern robert.kern at gmail.com
Tue Dec 21 14:53:11 EST 2010


On Mon, Dec 20, 2010 at 10:28, Alan G Isaac <alan.isaac at gmail.com> wrote:
> I want to sample *without* replacement from a vector
> (as with Python's random.sample).  I don't see a direct
> replacement for this, and I don't want to carry two
> PRNG's around.  Is the best way something like  this?
>
>        permutation(myvector)[:samplesize]

For one of my personal projects, I copied over the mtrand package and
added a method to RandomState for doing this kind of thing using
reservoir sampling.

http://en.wikipedia.org/wiki/Reservoir_sampling

    def subset_reservoir(self, long nselected, long ntotal, object size=None):
        """ Sample a given number integers from the set [0, ntotal) without
        replacement using a reservoir algorithm.

        Parameters
        ----------
        nselected : int
            The number of integers to sample.
        ntotal : int
            The size of the set to sample from.
        size : int, sequence of ints, or None
            The number of subsets to sample or a shape tuple. An axis of the
            length nselected will be appended to a shape.

        Returns
        -------
        out : ndarray
            The sampled subsets. The order of the items is not necessarily
            random. Use a slice from the result of permutation() if you need the
            order of the items to be randomized.
        """
        cdef long total_size, length, i, j, u
        cdef cnp.ndarray[cnp.int_t, ndim=2] out
        if size is None:
            shape = (nselected,)
            total_size = nselected
            length = 1
        elif isinstance(size, int):
            shape = (size, nselected)
            total_size = size * nselected
            length = size
        else:
            shape = size + (nselected,)
            length = 1
            for i from 0 <= i < len(size):
                length *= size[i]
            total_size = length * nselected
        out = np.empty((length, nselected), dtype=int)
        for i from 0 <= i < length:
            for j from 0 <= j < nselected:
                out[i,j] = j
            for j from nselected <= j < ntotal:
                u = <long>rk_interval(j+1, self.internal_state)
                if u < nselected:
                    out[i,u] = j
        return out.reshape(shape)

-- 
Robert Kern

"I have come to believe that the whole world is an enigma, a harmless
enigma that is made terrible by our own mad attempt to interpret it as
though it had an underlying truth."
  -- Umberto Eco



More information about the NumPy-Discussion mailing list