random.sample with large weighted sample-sets?

duncan smith buzzard at invalid.invalid
Sun Feb 16 12:38:38 EST 2014


On 16/02/14 16:35, Charles Allen wrote:
> How efficient does this thing need to be?
>
> You can always just turn it into a two-dimensional sampling problem by
> thinking of the data as a function f(x=item), generating a random x=xr
> in [0,x], then generating a random y in [0,max(f(x))].  The xr is
> accepted if 0 < y <= max(f(xr)), or rejected (and another attempt made) if
> y > max(f(xr)).
>

You can avoid rejection by constructing an alias table. A list can be 
constructed such that each list element contains a pair of values and a 
cutoff. e.g.

[("apple", 20), ("orange", 50), ("grape", 30)]

would become (using one particular algorithm)

[(("apple", "orange"), 0.6),
  (("orange", "apple"), 1.0),
  (("grape", "orange"), 0.9)]

Generate a random index, then select one of the values on the basis of 
the cutoff. For short enough lists you can generate a single 0-1 random 
variate, u, and use int(n*u) for the index and compare n*u - int(n*u) to 
the cutoff, where n is the length of the list. It's still sampling with 
replacement though.

Duncan





More information about the Python-list mailing list