help make it faster please

Larry Bates larry.bates at websafe.com
Thu Nov 10 13:46:03 EST 2005


pkilambi at gmail.com wrote:
> I wrote this function which does the following: after reading lines
> from a file, it splits them and counts word occurrences with a hash
> table. For some reason this is quite slow; can someone help me make
> it faster?
> f = open(filename)
> lines = f.readlines()
> def create_words(lines):
>     cnt = 0
>     spl_set = '[",;<>{}_&?!():-[\.=+*\t\n\r]+'
>     for content in lines:
>         words=content.split()
>         countDict={}
>         wordlist = []
>         for w in words:
>             w=string.lower(w)
>             if w[-1] in spl_set: w = w[:-1]
>             if w != '':
>                 if countDict.has_key(w):
>                     countDict[w]=countDict[w]+1
>                 else:
>                     countDict[w]=1
>             wordlist = countDict.keys()
>             wordlist.sort()
>         cnt += 1
>         if countDict != {}:
>             for word in wordlist:
>                 print (word + ' ' + str(countDict[word]) + '\n')
> 
The way this is written, you create a new countDict object for
every line of the file; it's not clear that this is what you
meant to do.
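For what it's worth, here is a minimal sketch of the one-dictionary
version (the function name is just for illustration, and dict.get
also saves the has_key lookup-then-store dance):

```python
def count_words(lines):
    # One dictionary for the whole file, created once outside the loop.
    totals = {}
    for line in lines:
        for w in line.lower().split():
            # get() returns 0 for unseen words, so no membership test is needed.
            totals[w] = totals.get(w, 0) + 1
    return totals
```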

Also, you are sorting wordlist for every line instead of once for
the entire file, because the sort sits inside the loop that is
processing lines (in fact, inside the inner loop over words, so it
runs once per word).

You also do some extra work by testing for an empty dictionary:

wordlist = countDict.keys()

then

if countDict != {}:
    for word in wordlist:

If countDict is empty, then wordlist will be empty as well, so the
test is unnecessary: looping over an empty list simply does nothing.

You increment cnt but never use it.

I don't think spl_set will do what you want, but I haven't modified
it: it is an ordinary string, so `w[-1] in spl_set` is a
character-membership test, not a pattern match, and it only trims
one trailing character anyway.  To split on all of those characters
you are going to need regular expressions, not split().
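Something like this, for example, using the re module to pull out
runs of word characters directly instead of splitting on whitespace
and trimming trailing punctuation (the pattern here is illustrative,
not a fixed-up version of spl_set):

```python
import re

# Match runs of letters, digits, and apostrophes; everything else
# (punctuation, whitespace) acts as a separator.
word_re = re.compile(r"[a-z0-9']+")

def words_in(line):
    return word_re.findall(line.lower())
```

words_in("Hello, world! It's me.") gives ['hello', 'world', "it's", 'me'].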


Modified code:

def create_words(lines):
    spl_set = '[",;<>{}_&?!():-[\.=+*\t\n\r]+'
    countDict={}
    for content in lines:
        words=content.split()
        for w in words:
            w=w.lower()
            if w[-1] in spl_set: w = w[:-1]
            if w:
                if countDict.has_key(w):
                    countDict[w]=countDict[w]+1
                else:
                    countDict[w]=1

    return countDict


import time
filename=r'C:\cygwin\usr\share\vim\vim63\doc\version5.txt'
f = open(filename)
lines = f.readlines()
start_time=time.time()
countDict=create_words(lines)
stop_time=time.time()
elapsed_time=stop_time-start_time
wordlist = countDict.keys()
wordlist.sort()
for word in wordlist:
    print "word=%s count=%i" % (word, countDict[word])

print "Elapsed time in create_words function=%.2f seconds" % elapsed_time

I ran this against a 551K text file and it runs in 0.11 seconds
on my machine (3.0 GHz P4).
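As an aside, newer Python versions ship collections.Counter, which
does this counting bookkeeping for you; a sketch along these lines
(combined with the regular-expression idea above, names illustrative):

```python
import re
from collections import Counter

def word_counts(lines):
    # Counter.update adds one to the count of each word in the iterable.
    word_re = re.compile(r"[a-z0-9']+")
    counts = Counter()
    for line in lines:
        counts.update(word_re.findall(line.lower()))
    return counts
```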

Larry Bates


