large dictionary creation takes a LOT of time.

Paul Rubin http
Fri Apr 29 03:00:31 EDT 2005


"possibilitybox" <possibilitybox at gmail.com> writes:

> this code here:
> 
> 
> def wordcount(lines):
>     for i in range(len(lines)/8):
>         words = lines[i].split(" ")
>         if not locals().has_key("frequency"):
>             frequency = {}
>         for word in words:
>             if frequency.has_key(word):
>                 frequency[word] += 1
>             else:
>                 frequency[word] = 1
>     return frequency
> wordcount(lines)
> 
> is taking over six minutes to run on a two megabyte text file.  i
> realize that's a big text file, a really big one (it's actually the
> full text of don quixote.).  i'm trying to figure out how.  is there a
> better way for me to do a frequency count of all the words in the text?

2MB is not that large.  Your method is ok and shouldn't be that slow
unless you're on a pretty slow PC.  Could your machine be short of
memory and paging a lot?  You could tweak the code somewhat by moving
the initialization of the frequency dict out of the loop and combining
a few other statements.  Also, you should use xrange instead of range
to avoid allocating a big list in memory:

    def wordcount(lines):
        frequency = {}                      # create the dict once, outside the loop
        for i in xrange(len(lines)/8):      # xrange doesn't build the whole index list
            for word in lines[i].split():   # split() with no argument splits on any whitespace
                frequency[word] = 1 + frequency.get(word, 0)
        return frequency
    wordcount(lines)
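
For reference, a rough usage sketch (the file name and the choice of
showing the ten most common words are just examples, not anything from
your post):

    lines = open("quixote.txt").readlines()
    frequency = wordcount(lines)
    # sort the (word, count) pairs by count, highest first
    for word, count in sorted(frequency.items(),
                              key=lambda item: item[1],
                              reverse=True)[:10]:
        print word, count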

>  it seems to me like this should scale linearly, but perhaps it isn't?
> i don't know much about algorithmic complexity.  if someone could give
> a breakdown of this functions complexity as well i'd be much obliged.

It should be close to linear, or at worst n log n, depending on what
happens when dicts have to be enlarged as the # of elements increases.
Why are you only processing 1/8th of the lines?
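
If you actually mean to count every line, you can drop the slicing and
the indexing (and the range/xrange question) entirely by iterating over
the list directly; a minimal sketch of the same counting loop:

    def wordcount(lines):
        frequency = {}
        for line in lines:              # every line, no len()/8 slice
            for word in line.split():
                frequency[word] = 1 + frequency.get(word, 0)
        return frequency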


