help with making my code more efficient

Thu Dec 20 21:49:43 EST 2012

On 12/20/2012 09:39 PM, Mitya Sirenef wrote:
> On 12/20/2012 08:46 PM, Larry.Martell at gmail.com wrote:
>> On Thursday, December 20, 2012 6:17:04 PM UTC-7, Dave Angel wrote:
>>> On 12/20/2012 07:19 PM, Larry.Martell at gmail.com wrote:
>>>
>>>> I have a list of tuples that contains a tool_id, a time, and a 
>>>> message. I want to select from this list all the elements where the 
>>>> message matches some string, and all the other elements where the 
>>>> time is within some diff of any matching message for that tool.
>>>> Here is how I am currently doing this:
>>> No, it's not.  This is a fragment of code, without enough clues as to
>>>
>>> what else is going.  We can guess, but that's likely to make a mess.
>> Of course it's a fragment - it's part of a large program and I was 
>> just showing the relevant parts.
>>
>>> First question is whether this code works exactly correctly?
>> Yes, the code works. I end up with just the rows I want.
>>
>>> Are you only concerned about speed, not fixing features?
>> Don't know what you mean by 'fixing features'. The code does what I 
>> want, it just takes too long.
>>
>>> As far as I can tell, the logic that includes the time comparison is 
>>> bogus.
>> Not at all.
>>
>>> You don't do  anything there to worry about the value of tup[2], 
>>> just whether some
>>> item has a nearby time.  Of course, I could misunderstand the spec.
>> The data comes from a database. tup[2] is a datetime column. tdiff 
>> comes from a datetime.timedelta()
>>
>>> Are you making a global called 'self' ? That name is by convention only
>>> used in methods to designate the instance object.  What's the attribute
>>> self?
>> Yes, self is my instance object. self.message contains the string of 
>> interest that I need to look for.
>>
>>> Can cdata have duplicates, and are they significant?
>> No, it will not have duplicates.
>>
>>> Are you including  the time building that as part of your 2 hour 
>>> measurement?
>> No, the 2 hours is just the time to run the
>>
>> cdata[:] = [tup for tup in cdata if determine(tup)]
>>
>>> Is the list sorted in any way?
>> Yes, the list is sorted by tool and datetime.
>>
>>> Chances are your performance bottleneck is the doubly-nested loop.  You
>>> have a list comprehension at top-level code, and inside it calls a
>>> function that also loops over the 600,000 items.  So the inner loop 
>>> gets
>>> executed 360 billion times.  You can cut this down drastically by some
>>> judicious sorting, as well as by having a map of lists, where the 
>>> map is
>>> keyed by the tool.
>> Thanks. I will try that.
>>
>>>> # record time for each message matching the specified message for 
>>>> each tool
>>>> messageTimes = {}
>>> You're building a dictionary;  are you actually using the value (1), or
>>>   is only the key relevant?
>> Only the keys.
>>
>>> A set is a dict without a value.
>> Yes, I could use a set, but I don't think that would make it 
>> measurably faster.
>>
>>> But more mportantly, you never look up anything in this dictionary.  
>>> So why
>>> isn't it a list?  For that matter, why don't you just use the
>>> messageTimes list?
>> Yes, it could be a list too.
>>>> for row in cdata:   # tool, time, message
>>>>      if self.message in row[2]:
>>>>          messageTimes[row[0], row[1]] = 1
>>>> # now pull out each message that is within the time diff for each 
>>>> matched message
>>>> # as well as the matched messages themselves
>>>> def determine(tup):
>>>>      if self.message in tup[2]: return True      # matched message
>>>>      for (tool, date_time) in messageTimes:
>>>>          if tool == tup[0]:
>>>>              if abs(date_time-tup[1]) <= tdiff:
>>>>                 return True
>>>>      return False
>>>>          cdata[:] = [tup for tup in cdata if determine(tup)]
>>>
>>>
>>> As the code exists, there's no need to copy the list.  Just do a simple
>>> bind.
>> This statement is to remove the items from cdata that I don't want. I 
>> don't know what you mean by bind. I'm not familiar with that python 
>> function.
>>
>>>
>>>
>>>> This code works, but it takes way too long to run - e.g. when cdata 
>>>> has 600,000 elements (which is typical for my app) it takes 2 hours 
>>>> for this to run.
>>>> Can anyone give me some suggestions on speeding this up?
>
>
> This code probably is not faster but it's simpler and may be easier 
> for you to work with
> to experiment with speed-improving changes:
>
>
> diffrng = 1
>
> L = [
>      # id, time, string
>      (1, 5, "ok"),
>      (1, 6, "ok"),
>      (1, 7, "no"),
>      (1, 8, "no"),
>      ]
>
> match_times = [t[1] for t in L if "ok" in t[2]]
>
> def in_range(timeval):
>     return bool( min([abs(timeval-v) for v in match_times]) <= diffrng )
>
> print([t for t in L if in_range(t[1])])
>
>
> But it really sounds like you could look into optimizing the db
> query and db indexes, etc.
>
>

Actually, it might be slower.. this version of in_range should be better:

def in_range(timeval):
     return any( abs(timeval-v) <= diffrng for v in match_times )

-- 
Lark's Tongue Guide to Python: http://lightbird.net/larks/