help with making my code more efficient

Fri Dec 21 23:47:28 EST 2012

On Friday, December 21, 2012 8:19:37 PM UTC-7, Dave Angel wrote:
> On 12/21/2012 03:36 PM, Larry.Martell at gmail.com wrote:
> 
> > On Friday, December 21, 2012 10:57:19 AM UTC-7, Larry.... at gmail.com wrote:
> 
> >> <snip>
> 
> >> Didn't know about bisect. Thanks. I thought it would be my savior for sure. But unfortunaly when I added that, it blows up with out of memory. 
> 
> > The out of memory error had nothing to do with using bisect. I had introduced a typo that I really though would have caused a variable referenced before assignment error. But it did not do that, and instead somehow caused all the memory in my machine to get used up. When I fixed that, it worked really well with bisect. The code that was taking 2 hours was down to 20 minutes, and even better, a query that was taking 40 minutes was down to 8 seconds. 
> 
> >
> 
> > Thanks very much for all your help. 
> 
> Congratulations. But you're not done. A fourfold improvement isn't
> nearly as much as I'd expect.  When you go from a n-squared algorithm to
> an n-log-n, for n of 600k, I'd be disappointed in less than a 100 fold 
> improvement.  Assuming we're talking just about time spent in the cdata 
>  = list-comprehension list.
> 
> It's also probably possible to simplify, and get some speedup from other
> parts of the code other than that one loop.  For example, if the bisect
> is really working right, it's probably unnecessary to even copy the
> original cdata.  You said it was sorted.  So bisect it directly, and
> skip the messageTimes intermediate data structure.  It may not help, but
> i suspect it will.
> >
> 
> >> This was the code I had:
> >> 
> >> times = messageTimes[tup[0]]
> 
> >>
> 
> >> le = bisect.bisect_right(times, tup[1])
> >> ge = bisect.bisect_left(times, tup[1])
> >> return (le and tup[1]-times[le-1] <= tdiff) or (ge != len(times) and times[ge]-tup[1] <= tdiff)
> 
> 
> 
> I'm stymied making further suggestions to this fragment.  Without seeing
> the changes you apparently made to messageTimes, I can't even be sure 
> this is equivalent.  And I wonder if even looking up tup[1] is 
> worthwhile, since your caller already knows the index in cdata.  If you
> used cdata directly, you might not need a sort at all.
>  
> I wish you could post some sample data, and identify which records are 
> intended to be True.  Obviously you could simplify, and just use ints 
> where it's using datetimes here.  But if your rule is simply accept all
> records that come in a cluster with no delta more than a threshold, then
> you don't even need any search at all.  Just pass the index in, and
> compare the datetime with its predecessor and successor.
> 
> Could we see the whole fragment as you originally had it, but with the 
> optimizations you've done so far?  The more I look at that determine() 
> function, the more futile it seems.  I just don't understand what the
> criteria are supposed to be.  Your list-comp loops through all of cdata,
> and for each item, passes that item to determine().  determine() loops
> through a copy of cdata, checking to see if any of them is within tdiff
> of tup.  But won't that always return True, since you'll encounter the
> record and compare it to itself?

I think you're misunderstanding what I need to do. I have a set of rows from the database with tool, time, and message. The user has specified a string and a time threshold. From that set of rows I need to return all the rows that contain the user's string and all the other rows that match the tool from the matched rows and have a time within the threshold.

cdata has all the rows. messageTimes has the times of all the matched messages, keyed by tool. In determine() I don't look though cdata - it gets one element from cdata and I see if that should be selected because it either matches the user's message, or it's within the threshold of one that did match.

Here's my original code:

# record time for each message matching the specified message for each tool 
messageTimes = {} 
for row in cdata:   # tool, time, message 
    if self.message in row[2]: 
        messageTimes[row[0], row[1]] = 1 

# now pull out each message that is within the time diff for each matched message 
# as well as the matched messages themselves 

def determine(tup): 
    if self.message in tup[2]: return True      # matched message 

    for (tool, date_time) in messageTimes: 
        if tool == tup[0]: 
            if abs(date_time-tup[1]) <= tdiff: 
               return True 

    return False 

cdata[:] = [tup for tup in cdata if determine(tup)] 

Here's the code now:

       # Parse data and make a list of the time for each message matching the specified message for each tool
        messageTimes = defaultdict(list)    # a dict with sorted lists

        for row in cdata:   # tool, time, message
            if self.message in row[2]:
                messageTimes[row[0]].append(row[1])

        # now pull out each message that is within the time context for each matched message
        # as well as the matched messages themselves

        # return true is we should keep this message
        def determine(tup):
            if self.message in tup[2]: return True     # matched message
            if seconds == 0: return False                # no time context specified

            times = messageTimes[tup[0]]              # get the list of matched messages for this tool

            le = bisect.bisect_right(times, tup[1])   # find time less than or equal to tup[1]
            ge = bisect.bisect_left(times, tup[1])    # find time greater then to equal to tup[1]
            return (le and tup[1]-times[le-1] <= tdiff) or (ge != len(times) and times[ge]-tup[1] <= tdiff)

        cdata = [tup for tup in cdata if determine(tup)]

In my test case, cdata started with 600k rows, 30k matched the users string, and a total of 110k needed to be returned (which is what cdata ended up with) - the 30k that matched the string, and 80k that were within the time threshold. 

I think the point you may have missed is the tool - I only return a row if it's the same tool as a matched message and within the threshold. 

I hope I've explained this better. Thanks again.