Counting occurences of words in a list of strings

Tue May 24 23:45:02 EDT 2005

Travers Naran wrote:
> Here's the basic idea.  I have a dictionary of substrings (the substrings 
> stored as keys).  I have a list of strings.  I want to find out, for each 
> word in the dictionary, how many times the substring occurs non-overlapping 
> occurrences there are in the list of strings.  This is the best I could 
> come up with:
> 
>          for j in self.dict.keys():
>              c = 0
>              for i in tc.strings:
>                  c += i.count(j)
>              self.dict[j]['count'] += c
> 
> I tried Boyer-Moore-Horspool as someone else suggested in the group, but it 
> was even slower (probably because of the time spent setting up the 
> pattern).  Even when I split the algorithm in two, doing the set-up before 
> the inner loop and only testing the pattern in the inner loop, it was still 
> slow.
> 
> So, am I doing this the optimal way or am I overlooking something _really_ 
> simple to do?
> 
> Statistics on the data:
> The dictionary has 3000+ entries in it.
> There are several thousand strings.
> The strings average 10-26 characters in length.

Find a character that doesn't exist in any of self.dict.keys(). '\x00' 
will probably do.

whole = '\x00'.join(tc.strings)
for key in self.dict.keys():
     self.dict[key]['count'] = whole.count(key)

-- 
Robert Kern
rkern at ucsd.edu

"In the fields of hell where the grass grows high
  Are the graves of dreams allowed to die."
   -- Richard Harter