help with making my code more efficient

Fri Dec 21 22:19:37 EST 2012

On 12/21/2012 03:36 PM, Larry.Martell at gmail.com wrote:
> On Friday, December 21, 2012 10:57:19 AM UTC-7, Larry.... at gmail.com wrote:
>> <snip>
>> Didn't know about bisect. Thanks. I thought it would be my savior for sure. But unfortunaly when I added that, it blows up with out of memory. 
> The out of memory error had nothing to do with using bisect. I had introduced a typo that I really though would have caused a variable referenced before assignment error. But it did not do that, and instead somehow caused all the memory in my machine to get used up. When I fixed that, it worked really well with bisect. The code that was taking 2 hours was down to 20 minutes, and even better, a query that was taking 40 minutes was down to 8 seconds. 
>
> Thanks very much for all your help. 
>
>

Congratulations. But you're not done. A fourfold improvement isn't
nearly as much as I'd expect.  When you go from a n-squared algorithm to
an n-log-n, for n of 600k, I'd be disappointed in less than a 100 fold
improvement.  Assuming we're talking just about time spent in the cdata
 = list-comprehension list.

It's also probably possible to simplify, and get some speedup from other
parts of the code other than that one loop.  For example, if the bisect
is really working right, it's probably unnecessary to even copy the
original cdata.  You said it was sorted.  So bisect it directly, and
skip the messageTimes intermediate data structure.  It may not help, but
i suspect it will.

>
>> This was the code I had:
>>  
>> times = messageTimes[tup[0]]
>>
>> le = bisect.bisect_right(times, tup[1])
>>
>> ge = bisect.bisect_left(times, tup[1])
>>
>> return (le and tup[1]-times[le-1] <= tdiff) or (ge != len(times) and times[ge]-tup[1] <= tdiff)

I'm stymied making further suggestions to this fragment.  Without seeing
the changes you apparently made to messageTimes, I can't even be sure
this is equivalent.  And I wonder if even looking up tup[1] is
worthwhile, since your caller already knows the index in cdata.  If you
used cdata directly, you might not need a sort at all.

I wish you could post some sample data, and identify which records are
intended to be True.  Obviously you could simplify, and just use ints
where it's using datetimes here.  But if your rule is simply accept all
records that come in a cluster with no delta more than a threshold, then
you don't even need any search at all.  Just pass the index in, and
compare the datetime with its predecessor and successor.

Could we see the whole fragment as you originally had it, but with the
optimizations you've done so far?  The more I look at that determine()
function, the more futile it seems.  I just don't understand what the
criteria are supposed to be.  Your list-comp loops through all of cdata,
and for each item, passes that item to determine().  determine() loops
through a copy of cdata, checking to see if any of them is within tdiff
of tup.  But won't that always return True, since you'll encounter the
record and compare it to itself?

-- 

DaveA