For loop searching takes too long!

Thu Jan 28 22:09:06 EST 2010

Steve Holden wrote:
> Dave Angel wrote:
>   
>> elsa wrote:
>>     
>>> Hi guys,
>>>
>>> I've got a problem with my program, in that the code just takes too
>>> long to run. Here's what I'm doing. If anyone has any tips, they'd be
>>> much appreciated!
>>>
>>> So, say I have a list of lists that looks something like this (I'm
>>> using a list of lists, rather than a list of tuples, as I need it to
>>> be mutable):
>>>
>>> myList = [[786,0],[45, 1],[673,1],...................[23,46]]
>>>
>>> there are enough entries in the outer list, that the sum of myList[i]
>>> [0] across all i could be as high as 10^7.
>>>
>>> Now, what I need to do is randomly choose one myList[i], however the
>>> distribution of my random choice needs to be proportional to the
>>> values of myList[i][0]. So, for this list above, I'd have a much
>>> higher chance of choosing myList[0] than myList[1].
>>>
>>> Here is how I'm doing it at the moment:
>>>
>>> def chooseI(myList):
>>>     mySum=0
>>>     choice = random.choice(range(1,sum([i[0] for i in myList])+1))
>>>     for i in range(len(myList)):
>>>         mySum+=myList[i][0]
>>>         if mySum>=choice:
>>>             return i
>>>             break
>>>
>>> This works just fine if sum([i[0] for i in myList]) is less than
>>> 10,000, say. However if its up around 10^7, the whole thing crashes.
>>> Is there a faster way of doing this, that doesn't involve as many
>>> computational steps?
>>>
>>> Thanks!
>>>
>>> elsa
>>>
>>>   
>>>       
>> "the whole thing crashes"  -- give us the specific crash you see,
>> including the stack trace.  "Crash" is far too vague.
>>
>> At first glance you could be running out of memory.  But not with the
>> values you're quoting.  If a typical average value for i[0] is 50,
>> that's only 200k elements.  Anyway, you could print out the len(myList)
>> to see how big the list is.
>>
>> There are quicker ways to do the sum, but the first thing to change is
>> the random.choice() stuff.  Once you have a sum, you can use
>> random.randint() on it directly.  No need to make a range list.
>>
>> Another approach might be to build a new list with a running sum in it. 
>> Then after doing the randint() on the last (highest) item, you can do a
>> bisect.bisect on that list.  The index that returns will be your return
>> value.
>>
>>     
> Your approach seems to  validate an informal impression I had that with
> N choices one ought to be able to build a binary tree where each
> decision went to left or right according to whether a number drawn from
> a linear [0,1) distribution was greater that some arbitrary margin, or not.
>
> There are then arithmetical questions of exactly how to arrive at the
> correct threshold values for each bifurcation.  If the most probable
> paths are deliberately placed at the top of the tree then the most
> frequent cases will be fastest to generate, also.
>
> regards
>   
If the list will remain the same, it may be worth pre-computing the 
running sum into it, and then the function to get a random value just 
has to do the bisect.  But as you say, if the values are not evenly 
distributed, a binary split (which is what bisect does) might not give 
the best average time.

I doubt if it would help for small lists (Python code would be slower 
than the presumably C code in bisect), but perhaps for larger lists, an 
index something like this would be better:

Assuming the list had 40 items, and the sum was 1000 for the sake of 
example.   you'd build a tree with 500 as the top node.  that would 
"point" not to 20, but to whatever index had a value that was closest to 
500.  Then you'd build child nodes that similarly split each portion.  
And you'd repeat it till the remaining size was fairly small, at which 
point you'd use standard bisect logic.

On a different tack, someone could probably improve on the bisect() 
function, for cases where the list is a sorted list of numbers.  
Typically a binary algorithm is used because a comparison function gives 
either less than, equal, or greater.  But if you know they are all 
well-behaved numbers, you could do a formula to narrow down the choices 
faster.

Let's say that at some point you know the value is between items[5] and 
items[105], and the values respectively are 40 and 140.  You're 
searching for a value of 50.  Instead of trying items[55], you do a 
scale factor, and do   5 + ( 105-5 ) * (50-40) / (140-40).  So you'd try 
element 15 next.

All these values are plus/minus 1, but I think I've come close to an 
algorithm.  Kind of an analogy to Newton's method.

DaveA