[Tutor] Simple counter to determine frequencies of words in adocument
Peter Otten
__peter__ at web.de
Sat Nov 20 10:49:58 CET 2010
Alan Gauld wrote:
> The loop is a bit clunky. it would be clearer just to iterate over
> a_list:
>
> for item in a_list:
> words[item] = a_list.count(item)
This is a very inefficient approach because you repeat counting the number
of occurrences of a word that appears N times N times:
>>> words = {}
>>> a_list = "in the garden on the bank behind the tree".split()
>>> for word in a_list:
... print "counting", word
... words[word] = a_list.count(word)
...
counting in
counting the # <-- 1
counting garden
counting on
counting the # <-- 2
counting bank
counting behind
counting the # <-- 3
counting tree
>>> words
{'on': 1, 'garden': 1, 'tree': 1, 'behind': 1, 'in': 1, 'the': 3, 'bank': 1}
To avoid the duplicate effort you can check if the word was already counted:
>>> words2 = {}
>>> for word in a_list:
... if word not in words2:
... print "counting", word
... words2[word] = a_list.count(word)
...
counting in
counting the
counting garden
counting on
counting bank
counting behind
counting tree
>>> words == words2
True
Inside the count() method you are still implicitly iterating over the entire
list once for every distinct word. I would instead prefer counting manually
while iterating over the list once. This has the advantage that it will even
work if you don't keep the whole sequence of words in a list in memory (e.
g. if you read them from a file one line at a time):
>>> words3 = {}
>>> for word in a_list:
... if word in words3:
... words3[word] += 1
... else:
... words3[word] = 1
...
>>> words3 == words
True
Finally there's collections.defaultdict or, in Python 2.7,
collections.Counter when you are more interested in the result than the way
to achieve it.
Peter
More information about the Tutor
mailing list