[Tutor] Please look at my wordFrequency.py

Tue Oct 11 18:36:59 CEST 2005

Dick Moores wrote:
> Kent Johnson wrote at 03:24 10/11/2005:
> 
>>Dick Moores wrote:
>>
>>>(Execution took about 30 sec. with my computer.)
>>
>>That's way too long
> 
> 
> How long would you expect? I've already made some changes but haven't 
> seen the time change much.

A couple of seconds at most, unless you are running it on some dog of a computer. It's just not that much text, and you should be able to process it in a couple of passes at most.

What changes have you made? Several of the changes already posted should have a noticeable effect, I think. What is your current code?

>>>5) Ideally, abbreviations that end in a period, such as U.N., e.g., 
>>
>>i.e.,
>>
>>>viz. op. cit., Mr. (Am. E.), etc., should not be stripped of their final
>>>periods (whereas other words that end a sentence SHOULD be stripped). I
>>>tried making and using a Python list of these, but it was too tough to
>>>write the code to use it. Any ideas?
>>
>>You should be able to do this with regular expressions or searching in 
>>the word. You want to test for a word that ends with a period but 
>>doesn't include any other periods. Something like
>>if word.endswith('.') and '.' not in word[:-1]:
>>  word = word[:-1]
> 
> 
> Nice! That takes care of U.N., e.g., i.e., but not viz., op. cit., or Mr.

Ah, right. I don't know how you could handle those except by looking each word up in a dictionary of known abbreviations. At least they will only appear in the word list once, without the trailing period.
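
A rough sketch of the lookup idea, combined with the internal-period test quoted above. The names ABBREVIATIONS and strip_period are made up for illustration, a set is used here in place of a dictionary, and the abbreviation list is only a sample you would have to fill in yourself:

```python
# Hypothetical sample of abbreviations that should keep their final period.
ABBREVIATIONS = {"viz.", "op.", "cit.", "Mr.", "etc."}

def strip_period(word):
    """Drop a sentence-ending period, but keep it on abbreviations."""
    if word in ABBREVIATIONS:
        return word
    # Words with an internal period (U.N., e.g., i.e.) are left alone.
    if word.endswith('.') and '.' not in word[:-1]:
        return word[:-1]
    return word
```

Membership tests on a set (or dictionary) take about the same time no matter how many abbreviations you add, which is not true of a list.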

>>Other notes:
>>Use re.split() to do all the splits at once. Something like
>>  L = re.split(r'\s+|--|/', textAsString)
> 
> 
> Don't understand this yet. I'll work on it.

OK, it's a regular expression that will match any one of
 \s+ one or more whitespace characters, e.g. space, tab, newline
 -- two hyphens in a row
 / a slash

re.split() then splits the string on each match.
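
Here is a small example you can try at the interactive prompt. The sample strings are invented; only the pattern comes from the suggestion above:

```python
import re

text = "one two--three/four\nfive"
words = re.split(r'\s+|--|/', text)
print(words)  # ['one', 'two', 'three', 'four', 'five']

# Adjacent separators produce empty strings, which is why the
# empty elements still have to be filtered out afterwards.
spaced = re.split(r'\s+|--|/', "a -- b")
print(spaced)  # ['a', '', '', 'b']
```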
> 
> 
>>#remove empty elements in L
>>while "" in L:
>>    L.remove("")
>>The above iterates L twice for each empty word!
> 
> 
> I don't get the twice. Could you spell it out, please?

The test /"" in L/ searches the list for an empty string; that's one pass.
L.remove("") then searches the list again for the empty string before removing it; that's two.
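
To spell it out with a toy list (the sample data is invented), here are the loop and a single-pass filter side by side. Both give the same result:

```python
sample = ["a", "", "b", "", "c"]

# Loop version: each trip through the loop scans the list twice.
slow = list(sample)
while "" in slow:       # scan 1: look for an empty string
    slow.remove("")     # scan 2: find it again, delete it, shift the rest down

# Single pass: build a new list, keeping only the non-empty words.
fast = [w for w in sample if w]

print(slow == fast)  # True
```
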
> 
> 
>>The remove() calls are expensive too because the remaining elements of L 
>>must be shifted down. Do the whole thing in one pass over L with
>>    L = [ w for w in L if w ]
>>You only need to remove empty elements once, when the rest of the 
>>processing is done.
> 
> 
> Got it. But using this doesn't seem to make much difference in the time.
> 
> Also, I'm puzzled that whether or not psyco is employed makes no 
> difference in the time. Can you explain why?

My guess is it's because you have so many O(n^2) operations in the code, where the work grows with the square of the number of words. You have to get your algorithm to be O(n).
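
I haven't seen your current code, so this is only a sketch of what an O(n) count could look like, not a rewrite of your program. The function name word_frequency is made up; it visits each word exactly once and uses a dictionary for the counts:

```python
def word_frequency(words):
    """Count each word in a single pass over the list."""
    counts = {}
    for w in words:
        if w:  # skip empty strings instead of removing them first
            counts[w] = counts.get(w, 0) + 1
    return counts

freq = word_frequency(["the", "cat", "", "the", "hat"])
print(freq["the"])  # 2
```

Dictionary lookups don't get slower as the dictionary grows, so the total work stays proportional to the number of words.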

> 
> 
>>for e in saveRemovedForLaterL:
>>    L.append(e)
>>could be
>>L.extend(saveRemovedForLaterL)
> 
> 
> Are you recommending L.extend(), or is it just another way to do it?

Recommending. Look for ways to eliminate loops. If you can't eliminate them, move them into C code in the runtime, which is what extend() does.
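
A quick comparison with invented sample data. Note that extend() takes the whole list as its argument, not a single element:

```python
L = ["a", "b"]
saveRemovedForLaterL = ["c", "d"]

# Explicit loop: one Python-level append() call per element.
looped = list(L)
for e in saveRemovedForLaterL:
    looped.append(e)

# extend() is handed the whole list; the looping happens inside the runtime.
extended = list(L)
extended.extend(saveRemovedForLaterL)

print(looped == extended)  # True
```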

> 
> Thanks very much for your help, Kent.

No problem!

Kent
> 
> Dick 