help with making my code more efficient
Mitya Sirenef
msirenef at lightbird.net
Thu Dec 20 21:49:43 EST 2012
On 12/20/2012 09:39 PM, Mitya Sirenef wrote:
> On 12/20/2012 08:46 PM, Larry.Martell at gmail.com wrote:
>> On Thursday, December 20, 2012 6:17:04 PM UTC-7, Dave Angel wrote:
>>> On 12/20/2012 07:19 PM, Larry.Martell at gmail.com wrote:
>>>
>>>> I have a list of tuples that contains a tool_id, a time, and a
>>>> message. I want to select from this list all the elements where the
>>>> message matches some string, and all the other elements where the
>>>> time is within some diff of any matching message for that tool.
>>>> Here is how I am currently doing this:
>>> No, it's not. This is a fragment of code, without enough clues as to
>>>
>>> what else is going. We can guess, but that's likely to make a mess.
>> Of course it's a fragment - it's part of a large program and I was
>> just showing the relevant parts.
>>
>>> First question is whether this code works exactly correctly?
>> Yes, the code works. I end up with just the rows I want.
>>
>>> Are you only concerned about speed, not fixing features?
>> Don't know what you mean by 'fixing features'. The code does what I
>> want, it just takes too long.
>>
>>> As far as I can tell, the logic that includes the time comparison is
>>> bogus.
>> Not at all.
>>
>>> You don't do anything there to worry about the value of tup[2],
>>> just whether some
>>> item has a nearby time. Of course, I could misunderstand the spec.
>> The data comes from a database. tup[2] is a datetime column. tdiff
>> comes from a datetime.timedelta()
>>
>>> Are you making a global called 'self' ? That name is by convention only
>>> used in methods to designate the instance object. What's the attribute
>>> self?
>> Yes, self is my instance object. self.message contains the string of
>> interest that I need to look for.
>>
>>> Can cdata have duplicates, and are they significant?
>> No, it will not have duplicates.
>>
>>> Are you including the time building that as part of your 2 hour
>>> measurement?
>> No, the 2 hours is just the time to run the
>>
>> cdata[:] = [tup for tup in cdata if determine(tup)]
>>
>>> Is the list sorted in any way?
>> Yes, the list is sorted by tool and datetime.
>>
>>> Chances are your performance bottleneck is the doubly-nested loop. You
>>> have a list comprehension at top-level code, and inside it calls a
>>> function that also loops over the 600,000 items. So the inner loop
>>> gets
>>> executed 360 billion times. You can cut this down drastically by some
>>> judicious sorting, as well as by having a map of lists, where the
>>> map is
>>> keyed by the tool.
>> Thanks. I will try that.
>>
>>>> # record time for each message matching the specified message for
>>>> each tool
>>>> messageTimes = {}
>>> You're building a dictionary; are you actually using the value (1), or
>>> is only the key relevant?
>> Only the keys.
>>
>>> A set is a dict without a value.
>> Yes, I could use a set, but I don't think that would make it
>> measurably faster.
>>
>>> But more mportantly, you never look up anything in this dictionary.
>>> So why
>>> isn't it a list? For that matter, why don't you just use the
>>> messageTimes list?
>> Yes, it could be a list too.
>>>> for row in cdata: # tool, time, message
>>>> if self.message in row[2]:
>>>> messageTimes[row[0], row[1]] = 1
>>>> # now pull out each message that is within the time diff for each
>>>> matched message
>>>> # as well as the matched messages themselves
>>>> def determine(tup):
>>>> if self.message in tup[2]: return True # matched message
>>>> for (tool, date_time) in messageTimes:
>>>> if tool == tup[0]:
>>>> if abs(date_time-tup[1]) <= tdiff:
>>>> return True
>>>> return False
>>>> cdata[:] = [tup for tup in cdata if determine(tup)]
>>>
>>>
>>> As the code exists, there's no need to copy the list. Just do a simple
>>> bind.
>> This statement is to remove the items from cdata that I don't want. I
>> don't know what you mean by bind. I'm not familiar with that python
>> function.
>>
>>>
>>>
>>>> This code works, but it takes way too long to run - e.g. when cdata
>>>> has 600,000 elements (which is typical for my app) it takes 2 hours
>>>> for this to run.
>>>> Can anyone give me some suggestions on speeding this up?
>
>
> This code probably is not faster but it's simpler and may be easier
> for you to work with
> to experiment with speed-improving changes:
>
>
> diffrng = 1
>
> L = [
> # id, time, string
> (1, 5, "ok"),
> (1, 6, "ok"),
> (1, 7, "no"),
> (1, 8, "no"),
> ]
>
> match_times = [t[1] for t in L if "ok" in t[2]]
>
> def in_range(timeval):
> return bool( min([abs(timeval-v) for v in match_times]) <= diffrng )
>
> print([t for t in L if in_range(t[1])])
>
>
> But it really sounds like you could look into optimizing the db
> query and db indexes, etc.
>
>
Actually, it might be slower.. this version of in_range should be better:
def in_range(timeval):
return any( abs(timeval-v) <= diffrng for v in match_times )
--
Lark's Tongue Guide to Python: http://lightbird.net/larks/
More information about the Python-list
mailing list