[Tutor] Please look at my wordFrequency.py

Kent Johnson kent37 at tds.net
Tue Oct 11 19:37:44 CEST 2005


Kent Johnson wrote:
> Dick Moores wrote:
> 
>> Kent Johnson wrote at 03:24 10/11/2005:
>>
>>> Dick Moores wrote:
>>>
>>>> (Execution took about 30 sec. with my computer.)
>>>
>>> That's way too long
>>
>> How long would you expect? I've already made some changes but haven't 
>> seen the time change much.
>
> A couple of seconds at most, unless you are running it on some dog 
> computer. It's just not that much text and you should be able to process 
> it in a couple of passes at most.

OK, I couldn't resist. I took your program and ran it on my computer; it took about 38 seconds and got the same results as you. Then I made the changes I outlined, and a few other similar ones, and got it down to 34 seconds. Finally I made the change suggested by John Fouhy, accumulating the counts in a dict, and the time went down to 0.23 seconds.

>> Also, I'm puzzled that whether or not psyco is employed makes no 
>> difference in the time. Can you explain why?
>
> My guess is it's because you have so many O(n^2) elements in the code. 
> You have to get your algorithm to be O(n).

In particular this code:
for word in L:
    k = L.count(word)           # scans all of L for every word
    if (k, word) not in F:      # scans F, which keeps growing
        F.append((k, word))

L.count(word) has to scan through the entire list L, comparing every element against word. So for each word you are making 26140 string compares, and the total number of compares is 26140*26140, or 683,299,600. That's a lot!
Then, for each word, you scan F for a match, this time doing tuple compares. The number of compares grows with the length of F, which ends up holding about 3700 entries; each scan covers half of them on average, so that's about 26140*3700/2, or 48,359,000 compares.

Compare this to the dictionary version, which iterates over L just once, doing one dictionary lookup and write per word.
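
For reference, the dict version looks something like this (just a sketch; I'm assuming L is the word list and F the list of (count, word) pairs from your program):

counts = {}
for word in L:
    # One dict lookup and one write per word; the work grows
    # linearly with the number of words instead of quadratically.
    counts[word] = counts.get(word, 0) + 1

# Build the (count, word) pairs in one more pass, over the dict.
F = [(k, word) for word, k in counts.items()]

counts.get(word, 0) returns 0 the first time a word is seen, so there is no need for a separate membership test.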

The reason psyco doesn't make much difference is that all the time is spent in list.count(), which is already C code; psyco only speeds up the Python code around it.
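
You can see that for yourself with the standard profile module. This is just a sketch (count_pairs wraps your loop, and the word list is made up):

import profile

def count_pairs(L):
    # The original quadratic loop, wrapped so it can be profiled.
    F = []
    for word in L:
        k = L.count(word)
        if (k, word) not in F:
            F.append((k, word))
    return F

words = ['apple', 'banana', 'cherry', 'date'] * 2000

# Nearly all of the reported time lands in list.count (or in
# count_pairs itself, since count() runs as C code below it),
# which is exactly the part psyco can't touch.
profile.run('count_pairs(words)')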

Kent


