help make it faster please

Sun Nov 13 14:21:24 EST 2005

Ron Adam wrote:

> The \w does make a small difference, but not as much as I expected.

that's probably because your benchmark has a lot of dubious overhead:

> word_finder = re.compile('[\w@]+', re.I)

no need to force case-insensitive search here; \w looks for both lower-
and uppercase characters.
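
a quick way to check this, as a minimal sketch (the sample string is made up for illustration and is not from the original benchmark):

```python
import re

sample = "Hello WORLD foo@Example.com"  # hypothetical sample text

plain = re.compile(r'[\w@]+')
ignore_case = re.compile(r'[\w@]+', re.I)

# \w already covers lower- and uppercase letters, so the flag
# does not change which substrings are matched
same = plain.findall(sample) == ignore_case.findall(sample)
words = plain.findall(sample)
print(same, words)
```

both patterns should return the same list of words here, with the original case preserved.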

>      for match in word_finder.finditer(string.lower()):

since you're using a case-insensitive RE, that lower() call is not necessary
for matching (it only matters if you want "The" and "the" counted as one word).

>          word = match.group(0)

and findall() is of course faster than finditer() + m.group().
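
for a pattern without groups the two approaches give the same words; the sketch below only illustrates that equivalence (the sample text is an assumption, and no timing is claimed):

```python
import re

text = "spam eggs spam"  # hypothetical sample text
pattern = re.compile(r'[\w@]+')

# finditer() yields match objects; each word needs a further group() call
via_finditer = [m.group(0) for m in pattern.finditer(text)]

# findall() returns the matched strings directly (the pattern has no groups)
via_findall = pattern.findall(text)

print(via_finditer == via_findall)
```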

>          t = time.clock()
>          for line in lines.splitlines():
>              countDict = foo(line)
>          tt = time.clock()-t

and if you want performance, why are you creating a new dictionary for
each line in the sample?
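
if the per-line loop is wanted anyway, one dictionary can be shared by all lines. a minimal sketch, where count_words_in_lines is a hypothetical helper name and not part of the original benchmark:

```python
import re

find_words = re.compile(r'[\w@]+').findall

def count_words_in_lines(lines):
    # hypothetical helper: a single dictionary accumulates counts
    # for every line, instead of building a new one per line
    counts = {}
    for line in lines.splitlines():
        for word in find_words(line):
            counts[word] = counts.get(word, 0) + 1
    return counts

result = count_words_in_lines("a b\nb c\n")
print(result)
```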

here's a more optimized RE word finder:

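import re

# bind the compiled pattern's findall method once, up front
word_finder_2 = re.compile(r'[\w@]+').findall

def count_words_2(string, word_finder=word_finder_2):
    # word_finder is a default argument, so the loop avoids a global lookup
    countDict = {}
    for word in word_finder(string):
        countDict[word] = countDict.get(word, 0) + 1
    return countDict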

with your original test on a slow machine, I get

    count_words: 0.29868684 (best of 3)
    count_words_2: 0.17244873 (best of 3)

if I call the function once, on the entire sample string, I get

    count_words: 0.23096036 (best of 3)
    count_words_2: 0.11690620 (best of 3)
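
a "best of 3" comparison of the per-line and whole-string calls could be set up roughly like this with timeit. this is only a sketch: the sample text, the repeat counts, and the helper names per_line and whole_string are assumptions, not the original test, and the numbers it prints will differ from the ones above.

```python
import re
import timeit

word_finder = re.compile(r'[\w@]+').findall

def count_words(string, word_finder=word_finder):
    counts = {}
    for word in word_finder(string):
        counts[word] = counts.get(word, 0) + 1
    return counts

# made-up sample text, not the data used in the original benchmark
sample = "the quick brown fox\njumps over the lazy dog\n" * 200

def per_line():
    # one call (and one new dictionary) per line
    for line in sample.splitlines():
        count_words(line)

def whole_string():
    # a single call on the entire sample
    return count_words(sample)

# run each variant 10 times per measurement, keep the best of 3 measurements
best_per_line = min(timeit.repeat(per_line, number=10, repeat=3))
best_whole = min(timeit.repeat(whole_string, number=10, repeat=3))

print("per line:     %.8f (best of 3)" % best_per_line)
print("whole string: %.8f (best of 3)" % best_whole)
```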

</F>