help make it faster please

Sun Nov 13 17:03:13 EST 2005

Fredrik Lundh wrote:
> Ron Adam wrote:
> 
> 
>>The \w does make a small difference, but not as much as I expected.
> 
> 
> that's probably because your benchmark has a lot of dubious overhead:

I think it does what the OP described, but that may not be what he 
really needs.

Although the test to find best of n, instead was finding worse of n. 
Which explains why I was getting a larger variance than I thought I 
should have been getting.

>>word_finder = re.compile('[\w@]+', re.I)
> 
> 
> no need to force case-insensitive search here; \w looks for both lower-
> and uppercase characters.

But the dictionary keys need to be either upper or lower otherwise you 
count 'The' separately from 'the'.

>>     for match in word_finder.finditer(string.lower()):
> 
> 
> since you're using a case-insensitive RE, that lower() call is not necessary.
> 
> 
>>         word = match.group(0)
> 
> 
> and findall() is of course faster than finditer() + m.group().

Cool,  I don't use re that often so I just used what was posted to test 
against.

> 
>>         t = time.clock()
>>         for line in lines.splitlines():
>>             countDict = foo(line)
>>         tt = time.clock()-t
> 
> 
> and if you want performance, why are you creating a new dictionary for
> each line in the sample?

Because that's what the OP apparently wanted.  A line by line word 
count.  I originally did it to get an the over all count and then change 
it so it matched the re version that was posted.

> here's a more optimized RE word finder:
> 
> word_finder_2 = re.compile('[\w@]+').findall
> 
> def count_words_2(string, word_finder=word_finder_2):
>      # avoid global lookups
>      countDict = {}
>      for word in word_finder(string):
>          countDict[word] = countDict.get(word,0) + 1
>      return countDict
> 
> with your original test on a slow machine, I get
> 
>     count_words: 0.29868684 (best of 3)
>     count_words_2: 0.17244873 (best of 3)
> 
> if I call the function once, on the entire sample string, I get
> 
>     count_words: 0.23096036 (best of 3)
>     count_words_2: 0.11690620 (best of 3)
> 
> </F>

Wow, a lot bigger difference than on my machine.  <curious>  An athlon 
64 3000+ on winxp.  I'm not sure how much difference that would make?

This is what I get after adding the above version to it, with the 
lower(). There's not quite as big a difference as you get, but the find 
all version is still faster than both the others.

Cheers,
    Ron

Character count: 100000
Word count: 16477
Average word size: 6.06906597075
word_counter: 0.06245989 (best of 3)
count_words: 0.07309812 (best of 3)
count_words_2: 0.04981024 (best of 3)

And as count all words...

Character count: 100000
Word count: 16477
Average word size: 6.06906597075
word_counter: 0.05325006 (best of 3)
count_words: 0.05910528 (best of 3)
count_words_2: 0.03748158 (best of 3)

They all improve, but the re find all version is clearly better.

#####################

import string
import re
import time
import random

# Create a really ugly n length string to test with.
# The word length are
n = 100000
random.seed(1)
lines = ''.join([ random.choice(string.ascii_letters * 2
                   + '_@$&*()#/<>' + '    \n' * 6) for x in range(n) ])
print 'Character count:', n
print 'Word count:', len(lines.split())
print 'Average word size:', float(n)/len(lines.split())

letters = string.lowercase + string.digits + '_@'

def word_iter(text, letters=letters):
     wd = ''
     for c in text + ' ':
         if c in letters:
             wd += c
         elif wd != '':
             yield wd
             wd = ''

def word_counter(text):
     countDict={}
     for wd in word_iter(text.lower()):
         if wd in countDict:
             countDict[wd] += 1
         else:
             countDict[wd] = 1
     return countDict

word_finder = re.compile('[\w@]+', re.I).finditer

def count_words(string, word_finder=word_finder):
     # avoid global lookups
     countDict = {}
     for match in word_finder(string.lower()):
         word = match.group(0)
         countDict[word] = countDict.get(word,0) + 1
     return countDict

word_finder_2 = re.compile('[\w@]+').findall

def count_words_2(string, word_finder=word_finder_2):
      # avoid global lookups
      countDict = {}
      for word in word_finder(string.lower()):
          countDict[word] = countDict.get(word,0) + 1
      return countDict

foos = [word_counter, count_words, count_words_2]
r1 = r2 = None
for foo in foos:
     best_time = 1000000  # too large to be useful on purpose
     for n in range(3):
         t = time.clock()
         #for line in lines.splitlines():
         countDict = foo(lines)
         tt = time.clock()-t
         best_time = min(tt, best_time)

     r1 = r2
     r2 = countDict
     if r1 != None:
         # change to 1 if assert fails to find problem
         if 0:
             for k in r1.keys():
                 if r1[k] != r2[k]:
                     print k,r1[k],r2[k]
         assert r1 == r2

     print '%s: %.8f (best of %d)' \
           % (foo.__name__, best_time, n+1)