[Tutor] Most common words in a text file

Cameron Simpson cs at cskk.id.au
Sat Sep 30 17:54:32 EDT 2017


On 30Sep2017 22:42, Sri G. <srigalibe at gmail.com> wrote:
>I’ve written the code below for finding the most common words in a text
>file that has about 1.1 million words. It's working fine, but I believe
>there is always room for improvement.
>
>When run, the function in the script gets a text file from the command-line
>argument sys.argv[1], opens the file in read mode, converts the text to
>lowercase, makes a list of words from the text after removing any
>whitespaces or empty strings, and stores the list elements as dictionary
>keys and values in a collections.Counter object. Finally, it returns a
>dictionary of the most common words and their counts. The
>words.most_common() method gets its argument from the optional top
> parameter.

Thank you, this is a nice clear explaination.

>import collections

It is common to import specific names from modules; whether you just import the 
module or specific names depends on your needs. However, in this case, instead 
of importing "collections" and creating a counter with:

  collections.Counter()

I would import the name "Counter" and use it like this:

  from collections import Counter
  ...
  ... Counter() ...

>def find_most_common_words(textfile, top=10):

It is often good form to externalise the "10" as a constant (Python doesn't 
have constants per se, but it has a convention for value which would be 
constants in other languages):

  DEFAULT_TOP_SIZE = 10

  def find_most_common_words(textfile, top=DEFAULT_TOP_SIZE):

In a library of functions you can then put all the "constants" up the top where 
they can be seen or modified, and reference then lower down where the functions 
are defined.

>    ''' Returns the most common words in the textfile.'''
>    textfile = open(textfile)

You shouldn't reuse "textfile" as a file. Keep the filename value distinct. I 
tend to use names like "textpath" for the filename and "textfile" for the open 
file, eg:

  textfile = open(textpath)

Of course you change the parameter name to match.

>    textfile = open(textfile)
>    text = textfile.read().lower()
>    textfile.close()

The preferred way of opening files looks like this:

  with open(textpath) as textfile:
    ... work on the data from textfile ...
    ...
    ...

This automatically closes textfile on exiting the "with" suite, even if an 
exception occurs (which is very handy in situations where the exception might 
be caught in outer code - you never miss closing the file). It is also shorter 
and more readable.

>    text = textfile.read().lower()

This code reads the entire text into memory. It is more frugal to read the file 
progressively, which is good because it scales to files of any size. This 
concern is prominent in my mind because I started programming on systems with a 
16 bit address space, so 64K max including the program itself _and_ the OS. But 
it applies even today when many data files are very large.

So one might process file file as lines of text:

  C = Counter()
  for line in textfile:
    words = line.lower().split()
    C.update(words)

This avoids the need for more than a single line to be stored in memory at any 
given time.

Cheers,
Cameron Simpson <cs at cskk.id.au> (formerly cs at zip.com.au)


More information about the Tutor mailing list