help make it faster please

Fri Nov 11 09:53:08 EST 2005

 <pkilambi at gmail.com> wrote:
>Oh sorry indentation was messed here...the
>wordlist = countDict.keys()
>wordlist.sort()
>should be outside the word loop.... now
>def create_words(lines):
>    cnt = 0
>    spl_set = '[",;<>{}_&?!():-[\.=+*\t\n\r]+'
>    for content in lines:
>        words=content.split()
>        countDict={}
>        wordlist = []
>        for w in words:
>            w=string.lower(w)
>            if w[-1] in spl_set: w = w[:-1]
>            if w != '':
>                if countDict.has_key(w):
>                    countDict[w]=countDict[w]+1
>                else:
>                    countDict[w]=1
>        wordlist = countDict.keys()
>        wordlist.sort()
>        cnt += 1
>        if countDict != {}:
>            for word in wordlist: print (word+' '+str(countDict[word])+'\n')
>
>ok now this is the correct question I am asking...

(a) You might be better off doing:
    words = content.lower().split()
    for w in words:
        ...
instead of calling lower() on each separate word (and note that most
functions from string are deprecated in favour of string methods).
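
As a rough sketch of (a), assuming each element of lines is a plain
string: lowercase the whole line once, then split it. The helper name
lowered_words is made up for illustration and is not in the original
code.

```python
def lowered_words(content):
    # One lower() call on the whole line, then split on whitespace,
    # rather than one lower() call per word.
    return content.lower().split()


print(lowered_words("The Quick brown FOX"))
```

The result is an ordinary list of lowercase words, so the rest of the
loop body can stay as it is.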

(b) spl_set isn't doing what you might think it is -- it looks like
you've written it as a regexp but you're using it as a character set.
What you might want is:
    spl_set = '",;<>{}_&?!():-[\.=+*\t\n\r'
and
        while w and w[-1] in spl_set: w = w[:-1]
(the "w and" test avoids an IndexError when a word is nothing but
punctuation). That loop can be written:
        w = w.rstrip(spl_set)
(which by my timings is faster if you have multiple characters from
spl_set at the end of your word, but slower if you have 0 or 1).
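
A small sketch comparing the two forms, written for current Python. The
names SPL_SET, strip_loop and strip_rstrip are invented for this
example, and the backslash is doubled so the string literal is valid
without relying on an unknown escape. The timeit call at the end is only
a starting point for measuring on your own words; no particular result
is implied.

```python
import timeit

# Same characters as the corrected spl_set, backslash written as '\\'.
SPL_SET = '",;<>{}_&?!():-[\\.=+*\t\n\r'


def strip_loop(w):
    # Drop trailing characters one at a time; the "w and" guard stops
    # the loop once the word has been emptied.
    while w and w[-1] in SPL_SET:
        w = w[:-1]
    return w


def strip_rstrip(w):
    # rstrip() treats its argument as a set of characters to remove.
    return w.rstrip(SPL_SET)


# Both forms should agree on every input.
for sample in ["word", "word.", "word?!)", "..."]:
    assert strip_loop(sample) == strip_rstrip(sample)

# Time each form on a word with several trailing characters.
for func in (strip_loop, strip_rstrip):
    elapsed = timeit.timeit(lambda: func("word?!)"), number=10000)
    print(func.__name__, elapsed)
```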

-- 
\S -- siona at chiark.greenend.org.uk -- http://www.chaos.org.uk/~sion/
  ___  |  "Frankly I have no feelings towards penguins one way or the other"
  \X/  |    -- Arthur C. Clarke
   her nu becomeþ se bera eadward ofdun hlæddre heafdes bæce bump bump bump