help with making my code more efficient
Larry.Martell at gmail.com
Larry.Martell at gmail.com
Fri Dec 21 23:47:28 EST 2012
On Friday, December 21, 2012 8:19:37 PM UTC-7, Dave Angel wrote:
> On 12/21/2012 03:36 PM, Larry.Martell at gmail.com wrote:
>
> > On Friday, December 21, 2012 10:57:19 AM UTC-7, Larry.... at gmail.com wrote:
>
> >> <snip>
>
> >> Didn't know about bisect. Thanks. I thought it would be my savior for sure. But unfortunaly when I added that, it blows up with out of memory.
>
> > The out of memory error had nothing to do with using bisect. I had introduced a typo that I really though would have caused a variable referenced before assignment error. But it did not do that, and instead somehow caused all the memory in my machine to get used up. When I fixed that, it worked really well with bisect. The code that was taking 2 hours was down to 20 minutes, and even better, a query that was taking 40 minutes was down to 8 seconds.
>
> >
>
> > Thanks very much for all your help.
>
> Congratulations. But you're not done. A fourfold improvement isn't
> nearly as much as I'd expect. When you go from a n-squared algorithm to
> an n-log-n, for n of 600k, I'd be disappointed in less than a 100 fold
> improvement. Assuming we're talking just about time spent in the cdata
> = list-comprehension list.
>
> It's also probably possible to simplify, and get some speedup from other
> parts of the code other than that one loop. For example, if the bisect
> is really working right, it's probably unnecessary to even copy the
> original cdata. You said it was sorted. So bisect it directly, and
> skip the messageTimes intermediate data structure. It may not help, but
> i suspect it will.
> >
>
> >> This was the code I had:
> >>
> >> times = messageTimes[tup[0]]
>
> >>
>
> >> le = bisect.bisect_right(times, tup[1])
> >> ge = bisect.bisect_left(times, tup[1])
> >> return (le and tup[1]-times[le-1] <= tdiff) or (ge != len(times) and times[ge]-tup[1] <= tdiff)
>
>
>
> I'm stymied making further suggestions to this fragment. Without seeing
> the changes you apparently made to messageTimes, I can't even be sure
> this is equivalent. And I wonder if even looking up tup[1] is
> worthwhile, since your caller already knows the index in cdata. If you
> used cdata directly, you might not need a sort at all.
>
> I wish you could post some sample data, and identify which records are
> intended to be True. Obviously you could simplify, and just use ints
> where it's using datetimes here. But if your rule is simply accept all
> records that come in a cluster with no delta more than a threshold, then
> you don't even need any search at all. Just pass the index in, and
> compare the datetime with its predecessor and successor.
>
> Could we see the whole fragment as you originally had it, but with the
> optimizations you've done so far? The more I look at that determine()
> function, the more futile it seems. I just don't understand what the
> criteria are supposed to be. Your list-comp loops through all of cdata,
> and for each item, passes that item to determine(). determine() loops
> through a copy of cdata, checking to see if any of them is within tdiff
> of tup. But won't that always return True, since you'll encounter the
> record and compare it to itself?
I think you're misunderstanding what I need to do. I have a set of rows from the database with tool, time, and message. The user has specified a string and a time threshold. From that set of rows I need to return all the rows that contain the user's string and all the other rows that match the tool from the matched rows and have a time within the threshold.
cdata has all the rows. messageTimes has the times of all the matched messages, keyed by tool. In determine() I don't look though cdata - it gets one element from cdata and I see if that should be selected because it either matches the user's message, or it's within the threshold of one that did match.
Here's my original code:
# record time for each message matching the specified message for each tool
messageTimes = {}
for row in cdata: # tool, time, message
if self.message in row[2]:
messageTimes[row[0], row[1]] = 1
# now pull out each message that is within the time diff for each matched message
# as well as the matched messages themselves
def determine(tup):
if self.message in tup[2]: return True # matched message
for (tool, date_time) in messageTimes:
if tool == tup[0]:
if abs(date_time-tup[1]) <= tdiff:
return True
return False
cdata[:] = [tup for tup in cdata if determine(tup)]
Here's the code now:
# Parse data and make a list of the time for each message matching the specified message for each tool
messageTimes = defaultdict(list) # a dict with sorted lists
for row in cdata: # tool, time, message
if self.message in row[2]:
messageTimes[row[0]].append(row[1])
# now pull out each message that is within the time context for each matched message
# as well as the matched messages themselves
# return true is we should keep this message
def determine(tup):
if self.message in tup[2]: return True # matched message
if seconds == 0: return False # no time context specified
times = messageTimes[tup[0]] # get the list of matched messages for this tool
le = bisect.bisect_right(times, tup[1]) # find time less than or equal to tup[1]
ge = bisect.bisect_left(times, tup[1]) # find time greater then to equal to tup[1]
return (le and tup[1]-times[le-1] <= tdiff) or (ge != len(times) and times[ge]-tup[1] <= tdiff)
cdata = [tup for tup in cdata if determine(tup)]
In my test case, cdata started with 600k rows, 30k matched the users string, and a total of 110k needed to be returned (which is what cdata ended up with) - the 30k that matched the string, and 80k that were within the time threshold.
I think the point you may have missed is the tool - I only return a row if it's the same tool as a matched message and within the threshold.
I hope I've explained this better. Thanks again.
More information about the Python-list
mailing list