large dictionary creation takes a LOT of time.

possibilitybox possibilitybox at gmail.com
Fri Apr 29 02:51:08 EDT 2005


This code here:


def wordcount(lines):
    # only look at the first eighth of the lines
    for i in range(len(lines)/8):
        # split each line on single spaces
        words = lines[i].split(" ")
        # create the dictionary the first time through the loop
        if not locals().has_key("frequency"):
            frequency = {}
        for word in words:
            if frequency.has_key(word):
                frequency[word] += 1
            else:
                frequency[word] = 1
    return frequency

wordcount(lines)

is taking over six minutes to run on a two megabyte text file.  I
realize that's a big text file, a really big one (it's actually the
full text of Don Quixote), and I'm trying to figure out why it's so
slow.  Is there a better way for me to do a frequency count of all the
words in the text?  It seems to me like this should scale linearly,
but perhaps it isn't?  I don't know much about algorithmic complexity.
If someone could give a breakdown of this function's complexity as
well, I'd be much obliged.
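
Here is roughly how I'm measuring the time (just a sketch; assume
lines has already been read in):

import time

t0 = time.time()
frequency = wordcount(lines)
t1 = time.time()
print("wordcount took %.1f seconds" % (t1 - t0))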

lines is expected to be a list of lines as provided by file.readlines().
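
For reference, the kind of single-pass version I would expect to be
linear looks something like this (just a sketch; the filename is a
placeholder):

def wordcount_fast(lines):
    frequency = {}                  # build the dictionary once, up front
    for line in lines:
        for word in line.split():   # split on any whitespace
            frequency[word] = frequency.get(word, 0) + 1
    return frequency

lines = open("quixote.txt").readlines()
frequency = wordcount_fast(lines)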



