[Tutor] Simple counter to determine frequencies of words in adocument

Peter Otten __peter__ at web.de
Sat Nov 20 10:49:58 CET 2010


Alan Gauld wrote:

> The loop is a bit clunky. it would be clearer just to iterate over
> a_list:
> 
> for item in a_list:
>    words[item] = a_list.count(item)

This is a very inefficient approach because you repeat counting the number 
of occurrences of a word that appears N times N times:

>>> words = {}
>>> a_list = "in the garden on the bank behind the tree".split()
>>> for word in a_list:
...     print "counting", word
...     words[word] = a_list.count(word)
...
counting in
counting the # <-- 1
counting garden
counting on
counting the # <-- 2
counting bank
counting behind
counting the # <-- 3
counting tree

>>> words
{'on': 1, 'garden': 1, 'tree': 1, 'behind': 1, 'in': 1, 'the': 3, 'bank': 1}

To avoid the duplicate effort you can check if the word was already counted:

>>> words2 = {}
>>> for word in a_list:
...     if word not in words2:
...             print "counting", word
...             words2[word] = a_list.count(word)
...
counting in
counting the
counting garden
counting on
counting bank
counting behind
counting tree
>>> words == words2
True

Inside the count() method you are still implicitly iterating over the entire 
list once for every distinct word. I would instead prefer counting manually 
while iterating over the list once. This has the advantage that it will even 
work if you don't keep the whole sequence of words in a list in memory (e. 
g. if you read them from a file one line at a time):

>>> words3 = {}
>>> for word in a_list:
...     if word in words3:
...             words3[word] += 1
...     else:
...             words3[word] = 1
...
>>> words3 == words
True

Finally there's collections.defaultdict or, in Python 2.7, 
collections.Counter when you are more interested in the result than the way 
to achieve it.

Peter





More information about the Tutor mailing list