For loop searching takes too long!

Thu Jan 28 22:02:01 EST 2010

Steve Holden wrote:
> Dave Angel wrote:
>> elsa wrote:
>>> Hi guys,
>>>
>>> I've got a problem with my program, in that the code just takes too
>>> long to run. Here's what I'm doing. If anyone has any tips, they'd be
>>> much appreciated!
>>>
>>> So, say I have a list of lists that looks something like this (I'm
>>> using a list of lists, rather than a list of tuples, as I need it to
>>> be mutable):
>>>
>>> myList = [[786,0],[45, 1],[673,1],...................[23,46]]
>>>
>>> there are enough entries in the outer list, that the sum of myList[i]
>>> [0] across all i could be as high as 10^7.
>>>
>>> Now, what I need to do is randomly choose one myList[i], however the
>>> distribution of my random choice needs to be proportional to the
>>> values of myList[i][0]. So, for this list above, I'd have a much
>>> higher chance of choosing myList[0] than myList[1].
>>>
>>> Here is how I'm doing it at the moment:
>>>
>>> def chooseI(myList):
>>>     mySum=0
>>>     choice = random.choice(range(1,sum([i[0] for i in myList])+1))
>>>     for i in range(len(myList)):
>>>         mySum+=myList[i][0]
>>>         if mySum>=choice:
>>>             return i
>>>             break
>>>
>>> This works just fine if sum([i[0] for i in myList]) is less than
>>> 10,000, say. However if its up around 10^7, the whole thing crashes.
>>> Is there a faster way of doing this, that doesn't involve as many
>>> computational steps?
>>>
>>> Thanks!
>>>
>>> elsa
>>>
>>>   
>> "the whole thing crashes"  -- give us the specific crash you see,
>> including the stack trace.  "Crash" is far too vague.
>>
>> At first glance you could be running out of memory.  But not with the
>> values you're quoting.  If a typical average value for i[0] is 50,
>> that's only 200k elements.  Anyway, you could print out the len(myList)
>> to see how big the list is.
>>
>> There are quicker ways to do the sum, but the first thing to change is
>> the random.choice() stuff.  Once you have a sum, you can use
>> random.randint() on it directly.  No need to make a range list.
>>
>> Another approach might be to build a new list with a running sum in it. 
>> Then after doing the randint() on the last (highest) item, you can do a
>> bisect.bisect on that list.  The index that returns will be your return
>> value.
>>
> Your approach seems to  validate an informal impression I had that with
> N choices one ought to be able to build a binary tree where each
> decision went to left or right according to whether a number drawn from
> a linear [0,1) distribution was greater that some arbitrary margin, or not.
> 
> There are then arithmetical questions of exactly how to arrive at the
> correct threshold values for each bifurcation.  If the most probable
> paths are deliberately placed at the top of the tree then the most
> frequent cases will be fastest to generate, also.
> 
Bad form, I know, to reply to oneself, but since I was egotistical
enough to read what I wrote again when it was published I must also
record the conjecture that the resulting alphabet, expressed as the
binary history of the bifurcations chosen for any given symbol, must
surely be a Hamming code to be most efficient.

No proof is adduced.

regards
 Steve
-- 
Steve Holden           +1 571 484 6266   +1 800 494 3119
PyCon is coming! Atlanta, Feb 2010  http://us.pycon.org/
Holden Web LLC                 http://www.holdenweb.com/
UPCOMING EVENTS:        http://holdenweb.eventbrite.com/