[Tutor] Simple counter to determine frequencies of words in a document

Steven D'Aprano steve at pearwood.info
Sat Nov 20 12:10:58 CET 2010


Josep M. Fontana wrote:

> def countWords(a_list):
>     words = {}
>     for i in range(len(a_list)):
>         item = a_list[i]
>         count = a_list.count(item)
>         words[item] = count
>     return sorted(words.items(), key=lambda item: item[1], reverse=True)
> with open('output.txt', 'a') as token_freqs:
>     with open('input.txt', 'r') as out_tokens:
>         token_list = countWords(out_tokens.read())
>         token_freqs.write(token_list)


When you run that code, are you SURE that it merely results in the 
output file being blank? When I run it, I get an obvious error:

Traceback (most recent call last):
   File "<stdin>", line 4, in <module>
TypeError: argument 1 must be string or read-only character buffer, not list

Don't you get this error too?


The first problem is that file.write() doesn't take a list as argument,
it requires a string. You feed it a list of (word, frequency) pairs. You
need to decide how you want to format the output.
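
As a rough illustration of what "deciding on a format" might look like,
here is a small sketch. The sample pairs are made up for the example, and
the "word count" line layout is just one choice among many:

```python
# Hypothetical sample data: a list of (word, frequency) pairs,
# the same shape that countWords returns.
pairs = [("the", 3), ("cat", 1)]

# Option 1: build one "word count" line per pair, then join the lines
# into a single string that file.write() will accept.
lines = ["%s %s\n" % (word, count) for word, count in pairs]
text = "".join(lines)

# Option 2: str() of the whole list. This is also a string, but it is
# really only useful for debugging, not as readable output.
debug_text = str(pairs)

print(text)
print(debug_text)
```

Either string can be passed to file.write(); the list itself cannot.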

The second problem is that you don't actually generate word frequencies,
you generate character frequencies. When you read a file with read(), you
get a single string, not a list of words, and iterating over a string
gives you its characters one at a time:

>>> for item in "hello":
...     print(item)
...
h
e
l
l
o
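
To get words rather than characters, the string has to be split first.
A minimal sketch, using a made-up sentence in place of the file contents:

```python
# Hypothetical stand-in for what f.read() might return.
text = "the cat sat on the mat"

# Iterating over (or calling list() on) a string yields single characters.
letters = list(text)

# split() with no arguments breaks the string on runs of whitespace,
# giving a list of words.
words = text.split()

print(letters[:3])
print(words)

# count() on the list of words counts whole words...
print(words.count("the"))
# ...while count() on the string counts substring occurrences.
print(text.count("t"))
```

Note that split() does nothing about punctuation or capitalisation, so
"Cat", "cat" and "cat." would all be counted as different words.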


Your countWords function itself is reasonable, apart from some stylistic 
issues, and some inefficiencies which are unnoticeable for small numbers 
of words, but will become extremely costly for large lists of words. 
Ignoring that, here's my suggested code: you might like to look at the 
difference between what I have written, and what you have, and see if 
you can tell why I've written what I have.


def countWords(wordlist):
    # Map each distinct word to the number of times it occurs.
    word_table = {}
    for word in wordlist:
        count = wordlist.count(word)
        word_table[word] = count
    # Return (word, count) pairs, most frequent first.
    return sorted(
        word_table.items(), key=lambda item: item[1], reverse=True
    )

def getWords(filename):
    # Read the whole file and split it on whitespace into words.
    with open(filename, 'r') as f:
        words = f.read().split()
    return words

def writeTable(filename, table):
    # Write one "word count" line per entry.
    with open(filename, 'w') as f:
        for word, count in table:
            f.write("%s %s\n" % (word, count))


words = getWords('input.txt')
table = countWords(words)
writeTable('output.txt', table)



For bonus points, you might want to think about why countWords will be 
so inefficient for large word lists, although you probably won't see any 
problems until you're dealing with thousands or tens of thousands of words.
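
If you want to check your reasoning afterwards: wordlist.count(word) has
to walk the entire list every time it is called, and it is called once
per word, so n words cost roughly n*n comparisons. One possible way
around that is to keep a running tally in a single pass. The function
name below is my own invention for this sketch, not something from your
code:

```python
def count_words_single_pass(wordlist):
    # Hypothetical alternative to countWords: visit each word once and
    # bump its tally, instead of rescanning the list with count().
    word_table = {}
    for word in wordlist:
        # get() returns 0 the first time a word is seen.
        word_table[word] = word_table.get(word, 0) + 1
    # Same return shape as countWords: (word, count) pairs, most
    # frequent first.
    return sorted(
        word_table.items(), key=lambda item: item[1], reverse=True
    )

# Made-up sample data for illustration.
sample = ["b", "a", "b", "c", "b", "a"]
result = count_words_single_pass(sample)
print(result)
```

The standard library's collections.Counter does much the same job, if
you would rather not write the loop yourself.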


-- 
Steven

