[Tutor] Simple counter to determine frequencies of words in adocument

Sat Nov 20 10:49:58 CET 2010

Alan Gauld wrote:

> The loop is a bit clunky. it would be clearer just to iterate over
> a_list:
> 
> for item in a_list:
>    words[item] = a_list.count(item)

This is a very inefficient approach because you repeat counting the number 
of occurrences of a word that appears N times N times:

>>> words = {}
>>> a_list = "in the garden on the bank behind the tree".split()
>>> for word in a_list:
...     print "counting", word
...     words[word] = a_list.count(word)
...
counting in
counting the # <-- 1
counting garden
counting on
counting the # <-- 2
counting bank
counting behind
counting the # <-- 3
counting tree

>>> words
{'on': 1, 'garden': 1, 'tree': 1, 'behind': 1, 'in': 1, 'the': 3, 'bank': 1}

To avoid the duplicate effort you can check if the word was already counted:

>>> words2 = {}
>>> for word in a_list:
...     if word not in words2:
...             print "counting", word
...             words2[word] = a_list.count(word)
...
counting in
counting the
counting garden
counting on
counting bank
counting behind
counting tree
>>> words == words2
True

Inside the count() method you are still implicitly iterating over the entire 
list once for every distinct word. I would instead prefer counting manually 
while iterating over the list once. This has the advantage that it will even 
work if you don't keep the whole sequence of words in a list in memory (e. 
g. if you read them from a file one line at a time):

>>> words3 = {}
>>> for word in a_list:
...     if word in words3:
...             words3[word] += 1
...     else:
...             words3[word] = 1
...
>>> words3 == words
True

Finally there's collections.defaultdict or, in Python 2.7, 
collections.Counter when you are more interested in the result than the way 
to achieve it.

Peter