large dictionary creation takes a LOT of time.
Kent Johnson
kent37 at tds.net
Fri Apr 29 06:49:32 EDT 2005
possibilitybox wrote:
> this code here:
>
>
> def wordcount(lines):
>     for i in range(len(lines)/8):
>         words = lines[i].split(" ")
>         if not locals().has_key("frequency"):
>             frequency = {}
>         for word in words:
>             if frequency.has_key(word):
>                 frequency[word] += 1
>             else:
>                 frequency[word] = 1
>     return frequency
>
> wordcount(lines)
>
> is taking over six minutes to run on a two megabyte text file. I
> realize that's a big text file, a really big one (it's actually the
> full text of Don Quixote). I'm trying to figure out why. Is there a
> better way for me to do a frequency count of all the words in the text?
> It seems to me like this should scale linearly, but perhaps it isn't?
> I don't know much about algorithmic complexity. If someone could give
> a breakdown of this function's complexity as well, I'd be much obliged.
>
> lines is expected to be a list of lines as provided by file.readlines()
>
Here is a little cleaner version. It takes about a second to run on my PC. What hardware are you
running on?
path = 'DonQuixote.txt'

frequency = {}
for line in open(path):
    for word in line.split():
        if frequency.has_key(word):
            frequency[word] += 1
        else:
            frequency[word] = 1

print len(frequency), 'words'
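If you'd rather avoid the explicit has_key() branch, dict.get() can
supply a default of 0 for words that haven't been seen yet. A minimal
variation on the version above (same assumed filename, untimed):

path = 'DonQuixote.txt'

frequency = {}
for line in open(path):
    for word in line.split():
        # get() returns 0 the first time a word is seen
        frequency[word] = frequency.get(word, 0) + 1

print len(frequency), 'words'

Either way the work is linear in the number of words, since dict
lookups and inserts are amortized O(1), so a 2 MB file should only
take on the order of a second.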
Kent