large dictionary creation takes a LOT of time.
Paul Rubin
Fri Apr 29 03:00:31 EDT 2005
"possibilitybox" <possibilitybox at gmail.com> writes:
> this code here:
>
>
> def wordcount(lines):
>     for i in range(len(lines)/8):
>         words = lines[i].split(" ")
>         if not locals().has_key("frequency"):
>             frequency = {}
>         for word in words:
>             if frequency.has_key(word):
>                 frequency[word] += 1
>             else:
>                 frequency[word] = 1
>     return frequency
> wordcount(lines)
>
> is taking over six minutes to run on a two-megabyte text file. I
> realize that's a big text file, a really big one (it's actually the
> full text of Don Quixote). I'm trying to figure out why it's so slow.
> Is there a better way for me to do a frequency count of all the words in the text?
2MB is not that large. Your method is ok and shouldn't be that slow
unless you're on a pretty slow PC. Could your machine be short of
memory and paging a lot? You could tweak the code somewhat by moving
the initialization of the frequency dict out of the loop and combining
a few other statements. Also you should use xrange instead of range,
to avoid allocating a big list in memory:
def wordcount(lines):
    frequency = {}
    for i in xrange(len(lines)/8):
        for word in lines[i].split():
            frequency[word] = 1 + frequency.get(word, 0)
    return frequency

wordcount(lines)
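For what it's worth, here's a rough sketch of how you might drive that
end to end; the filename and the top-ten printout are just made up for
illustration, not part of your program:

    # Hypothetical driver: read the whole file into a list of lines,
    # count words with the function above, then show the ten most
    # common words.  "don_quixote.txt" is a made-up filename.
    lines = open("don_quixote.txt").readlines()
    frequency = wordcount(lines)

    # Sort (word, count) pairs by descending count and print the top ten.
    top = sorted(frequency.items(), key=lambda item: item[1], reverse=True)
    for word, count in top[:10]:
        print word, count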
> It seems to me like this should scale linearly, but perhaps it doesn't?
> I don't know much about algorithmic complexity. If someone could give
> a breakdown of this function's complexity as well, I'd be much obliged.
It should be close to linear, or at worst n log n, depending on what
happens when dicts have to be enlarged as the # of elements increases.
Why are you only processing 1/8th of the lines?
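If the 1/8th was just an experiment, a sketch of a full-file version
might look like the following (reading the file a line at a time; dict
insertion is amortized constant time, so the whole thing stays roughly
linear in the total number of words; the filename is again made up):

    def wordcount_all(fileobj):
        # Count every word in the file, one line at a time, so no big
        # intermediate list is built.
        frequency = {}
        for line in fileobj:
            for word in line.split():
                frequency[word] = 1 + frequency.get(word, 0)
        return frequency

    frequency = wordcount_all(open("don_quixote.txt"))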