[Tutor] fast sampling with replacement

Dave Angel davea at ieee.org
Sun Feb 21 15:51:47 CET 2010



Luke Paireepinart wrote:
> Can you explain what your function is doing and also post some test code to
> profile it?
>
> On Sat, Feb 20, 2010 at 10:22 AM, Andrew Fithian <afith13 at gmail.com> wrote:
>
>   
>> Hi tutor,
>>
>> I'm have a statistical bootstrapping script that is bottlenecking on a
>> python function sample_with_replacement(). I wrote this function myself
>> because I couldn't find a similar function in python's random library. This
>> is the fastest version of the function I could come up with (I used
>> cProfile.run() to time every version I wrote) but it's not fast enough, can
>> you help me speed it up even more?
>>
>> import random
>> def sample_with_replacement(list):
>>     l = len(list) # the sample needs to be as long as list
>>     r = xrange(l)
>>     _random = random.random
>>     return [list[int(_random()*l)] for i in r] # using
>> list[int(_random()*l)] is faster than random.choice(list)
>>
>> FWIW, my bootstrapping script is spending roughly half of the run time in
>> sample_with_replacement() much more than any other function or method.
>> Thanks in advance for any advice you can give me.
>>
>> -Drew
>>
>> _______________________________________________
>> Tutor maillist  -  Tutor at python.org
>> To unsubscribe or change subscription options:
>> http://mail.python.org/mailman/listinfo/tutor
>>
>>
>>     
list and l are poor names for locals.  The former because it's a 
built-in type, and the latter because it looks too much like a 1.

You don't say how big these lists are, but I'll assume they're large 
enough that the extra time spent creating the 'l' and 'r' variables is 
irrelevant.

I suspect you could gain some speed by using random.randrange instead of 
multiplying random.random by the length.

And depending on how the caller is using the data, you might gain some 
by returning a generator expression instead of a list.  Certainly you 
could reduce the memory footprint.

I wonder why you assume the output list has to be the same size as the 
input list.  Since you're sampling with replacement, you're not using 
the whole list anyway.  So I'd have defined the function to take a 
second argument, the length of desired array.  And if you could accept a 
generator instead of a list, you don't care how long it is, so let it be 
infinite.

(untested)
def sample(mylist):
    mylistlen = len(mylist)
    randrange = random.randrange
    while True:
          yield mylist[ randrange(0, mylistlen)]

DaveA



More information about the Tutor mailing list