[Tutor] Most common words in a text file

Mats Wichmann mats at wichmann.us
Sun Oct 1 10:23:42 EDT 2017


On 09/30/2017 11:12 AM, Sri G. wrote:
> I'm learning programming with Python.
> 
> I’ve written the code below for finding the most common words in a text
> file that has about 1.1 million words. It's working fine, but I believe
> there is always room for improvement.
> 
> When run, the function in the script gets a text file from the command-line
> argument sys.argv[1], opens the file in read mode, converts the text to
> lowercase, makes a list of words from the text after removing any
> whitespaces or empty strings, and stores the list elements as dictionary
> keys and values in a collections.Counter object. Finally, it returns a
> dictionary of the most common words and their counts. The
> words.most_common() method gets its argument from the optional top
>  parameter.
> 
> import sysimport collections
> def find_most_common_words(textfile, top=10):
>     ''' Returns the most common words in the textfile.'''
> 
>     textfile = open(textfile)
>     text = textfile.read().lower()
>     textfile.close()
>     words = collections.Counter(text.split()) # how often each word appears
> 
>     return dict(words.most_common(top))
> 
> filename = sys.argv[1]
> top_five_words = find_most_common_words(filename, 5)
> 
> I need your comments please.

Others have made some pertinent comments already.

How much you spend time to "improve" a bit of code depends on what
you're going to do with it.  If you've solved your problem, and it's a
one-shot: don't worry much about it. Nothing wrong with a bit of
throwaway code (although things do sometimes take on a life much longer
than you intended, I can say from experience!!!)

I'd ask a question or two to think about:  first off, if you know the
intended use of this function always will be to get a "top 10 list" -
then why convert it back to a dict (unsorted) to return after
Counter.most_common() has just given you a sorted list?  You're most
likely going to have to take your top_five_words dictionary and turn it
back into something sorted to report back out on the counts.

Second, if your function is likely to be called from many places in your
code, is it too specialized?  Can you design it as a more general API
that could be used for other things than this specific purpose?  For
example, perhaps you just want to return the Counter instance without
the further processing of "most_common", and let the client decide what
it wants to do with that object.



More information about the Tutor mailing list