[Tutor] Please look at my wordFrequency.py
Kent Johnson
kent37 at tds.net
Tue Oct 11 18:36:59 CEST 2005
Dick Moores wrote:
> Kent Johnson wrote at 03:24 10/11/2005:
>
>>Dick Moores wrote:
>>
>>>(Execution took about 30 sec. with my computer.)
>>
>>That's way too long
>
>
> How long would you expect? I've already made some changes but haven't
> seen the time change much.
A couple of seconds at most, unless you are running it on some dog computer. It's just not that much text and you should be able to process it in a couple of passes at most.
What changes have you made? Several of the changes already posted should have a noticeable effect, I think. What is your current code?
>>>5) Ideally, abbreviations that end in a period, such as U.N., e.g.,
>>
>>i.e.,
>>
>>>viz. op. cit., Mr. (Am. E.), etc., should not be stripped of their final
>>>periods (whereas other words that end a sentence SHOULD be stripped). I
>>>tried making and using a Python list of these, but it was too tough to
>>>write the code to use it. Any ideas?
>>
>>You should be able to do this with regular expressions or searching in
>>the word. You want to test for a word that ends with a period but
>>doesn't include any other periods. Something like
>>if word.endswith('.') and '.' not in word[:-1]:
>> word = word[:-1]
>
>
> Nice! That takes care of U.N., e.g., i.e., but not viz., op. cit., or Mr.
Ah, right. I don't know how you could handle that except with a dictionary. At least they will only appear in the word list once, without the trailing period.
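A minimal sketch of that lookup idea, assuming a hand-maintained collection of abbreviations. The names ABBREVIATIONS and strip_final_period are made up for illustration, and a set stands in for the dictionary because only membership matters here:

```python
# Hypothetical, hand-maintained collection; extend it as needed.
ABBREVIATIONS = {"viz.", "op.", "cit.", "Mr.", "etc."}

def strip_final_period(word):
    """Drop a sentence-ending period but keep periods on abbreviations."""
    if word in ABBREVIATIONS:
        return word                      # known abbreviation: keep as-is
    if word.endswith('.') and '.' not in word[:-1]:
        return word[:-1]                 # ordinary word ending a sentence
    return word                          # U.N., e.g., i.e., or no period at all

words = ["Mr.", "Smith", "left.", "U.N.", "viz."]
cleaned = [strip_final_period(w) for w in words]
```

Membership tests on a set (or dictionary) take roughly constant time, so this check does not add another pass over the word list.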
>>Other notes:
>>Use re.split() to do all the splits at once. Something like
>> L = re.split(r'\s+|--|/', textAsString)
>
>
> Don't understand this yet. I'll work on it.
OK, it's a regular expression that will match any one of
\s+   one or more whitespace characters, e.g. space, tab, newline
--    two hyphens in a row
/     a slash
re.split() then splits the string on each match.
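Here is a small example of that split, using sample strings I made up rather than your actual text:

```python
import re

text = "one  two--three/four\nfive"
L = re.split(r'\s+|--|/', text)
# L is ['one', 'two', 'three', 'four', 'five']

# Adjacent separators leave empty strings behind, which is
# why the empty elements have to be removed afterwards.
gaps = re.split(r'\s+|--|/', "a -- b")
# gaps is ['a', '', '', 'b']
```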
>
>
>>#remove empty elements in L
>>while "" in L:
>> L.remove("")
>>The above iterates L twice for each empty word!
>
>
> I don't get the twice. Could you spell it out, please?
The test ("" in L) searches the list for an empty string; that's one.
L.remove("") then searches the list again for the empty string before removing it; that's two.
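To make the difference concrete, here is a rough side-by-side sketch on a made-up list. The names slow and fast are only for illustration:

```python
L = ["a", "", "b", "", "", "c"]

# Quadratic: every trip through the loop scans the list for the test,
# scans it again inside remove(), then shifts the tail down one slot.
slow = list(L)
while "" in slow:
    slow.remove("")

# Linear: a single pass that keeps only the non-empty strings.
fast = [w for w in L if w]
```

Both give the same list; the second just gets there in one pass no matter how many empty strings there are.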
>
>
>>The remove() calls are expensive too because the remaining elements of L
>>must be shifted down. Do the whole thing in one pass over L with
>> L = [ w for w in L if w ]
>>You only need to remove empty elements once, when the rest of the
>>processing is done.
>
>
> Got it. But using this doesn't seem to make much difference in the time.
>
> Also, I'm puzzled that whether or not psyco is employed makes no
> difference in the time. Can you explain why?
My guess is it's because you have so many O(n^2) elements in the code. You have to get your algorithm to be O(n).
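I haven't seen your current code, so this is only a guess at the general shape of an O(n) version, not a rewrite of your script. The function name word_frequencies and the sample text are made up:

```python
import re

def word_frequencies(text):
    """Count words with one split and one pass over the result."""
    counts = {}
    for word in re.split(r'\s+|--|/', text):
        if word:                                   # skip empty strings
            counts[word] = counts.get(word, 0) + 1 # dictionary lookup, no list search
    return counts

freq = word_frequencies("the cat--the dog/the end")
# freq is {'the': 3, 'cat': 1, 'dog': 1, 'end': 1}
```

The point is that each word is touched a fixed number of times and nothing searches or shifts a list inside the loop.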
>
>
>>for e in saveRemovedForLaterL:
>> L.append(e)
>>could be
>>L.extend(e)
>
>
> Are you recommending L.extend(e), or is it just another way to do it?
Recommending, but pass it the whole list: L.extend(saveRemovedForLaterL), not L.extend(e). Look for ways to eliminate loops. If you can't eliminate them, move them into C code in the runtime, which is what extend() does.
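A short sketch with made-up list contents, showing that the loop and a single extend() call on the whole list give the same result:

```python
L = ["a", "b"]
saveRemovedForLaterL = ["U.N.", "e.g."]

# Explicit Python-level loop, one append() call per element.
looped = list(L)
for e in saveRemovedForLaterL:
    looped.append(e)

# One call; the loop runs inside the list implementation.
extended = list(L)
extended.extend(saveRemovedForLaterL)

# Passing a single string instead adds its characters one by one.
chars = []
chars.extend("e.g.")
# chars is ['e', '.', 'g', '.']
```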
>
> Thanks very much for your help, Kent.
No problem!
Kent
>
> Dick