Counting occurences of words in a list of strings

Wed May 25 00:24:55 EDT 2005

Travers Naran wrote:
> Here's the basic idea.  I have a dictionary of substrings (the 
> substrings stored as keys).  I have a list of strings.  I want to find 
> out, for each word in the dictionary, how many times the substring 
> occurs non-overlapping occurrences there are in the list of strings.  
> This is the best I could come up with:
> 
>         for j in self.dict.keys():
>             c = 0
>             for i in tc.strings:
>                 c += i.count(j)
>             self.dict[j]['count'] += c
> 
> I tried Boyer-Moore-Horspool as someone else suggested in the group, but 
> it was even slower (probably because of the time spent setting up the 
> pattern).  Even when I split the algorithm in two, doing the set-up 
> before the inner loop and only testing the pattern in the inner loop, it 
> was still slow.

Are you comparing BMH implemented in Python with str.count() implemented 
in C?

> 
> So, am I doing this the optimal way or am I overlooking something 
> _really_ simple to do?
> 
> Statistics on the data:
> The dictionary has 3000+ entries in it.
> There are several thousand strings.
> The strings average 10-26 characters in length.

1. A pure Python suggestion:

Presuming there is a character SEP that cannot appear in the keys in 
your query dictionary, try this:

SEP = '\n' # for example
big_string_count = SEP.join(tc.strings).count
for query_key in self.dict.keys():
     self.dict[query_key]['count'] += big_string_count(query_key)

Why do we have += in the above line? Is there anything else being added 
in by not-shown code? If not, the following would be much more 
comprehensible, and faster:

self.dict_count = {}
SEP = '\n' # for example
big_string_count = SEP.join(tc.strings).count
for query_key in self.dict.keys():
     self.dict_count[query_key] = big_string_count(query_key)

Checking your requirements for "non-overlapping": if you have 'foobar' 
and 'barzot' in the dictionary, and a string contains 'foobarzot', your 
code will count 1 for 'foobar' and 1 for 'barzot'. Is that OK?

2. Next, if you don't mind using a C-extension, look at Danny Yoo's 
ahocorasick module, which will search in parallel for your 3000+ keys.

http://hkn.eecs.berkeley.edu/~dyoo/python/ahocorasick/

Under one definition of non-overlapping, you will be able to use one of 
the findall methods; under the other definition, you will need to use 
one of the search methods and restart at (matched_position + 1).

3. If you want to roll your own, start with Gonzalo Navarro's 
publications: http://www.dcc.uchile.cl/~gnavarro/subj-multiple.html

HTH,
John