Counting occurences of words in a list of strings
John Machin
sjmachin at lexicon.net
Wed May 25 00:24:55 EDT 2005
Travers Naran wrote:
> Here's the basic idea. I have a dictionary of substrings (the
> substrings stored as keys). I have a list of strings. I want to find
> out, for each word in the dictionary, how many times the substring
> occurs non-overlapping occurrences there are in the list of strings.
> This is the best I could come up with:
>
> for j in self.dict.keys():
> c = 0
> for i in tc.strings:
> c += i.count(j)
> self.dict[j]['count'] += c
>
> I tried Boyer-Moore-Horspool as someone else suggested in the group, but
> it was even slower (probably because of the time spent setting up the
> pattern). Even when I split the algorithm in two, doing the set-up
> before the inner loop and only testing the pattern in the inner loop, it
> was still slow.
Are you comparing BMH implemented in Python with str.count() implemented
in C?
>
> So, am I doing this the optimal way or am I overlooking something
> _really_ simple to do?
>
> Statistics on the data:
> The dictionary has 3000+ entries in it.
> There are several thousand strings.
> The strings average 10-26 characters in length.
1. A pure Python suggestion:
Presuming there is a character SEP that cannot appear in the keys in
your query dictionary, try this:
SEP = '\n' # for example
big_string_count = SEP.join(tc.strings).count
for query_key in self.dict.keys():
self.dict[query_key]['count'] += big_string_count(query_key)
Why do we have += in the above line? Is there anything else being added
in by not-shown code? If not, the following would be much more
comprehensible, and faster:
self.dict_count = {}
SEP = '\n' # for example
big_string_count = SEP.join(tc.strings).count
for query_key in self.dict.keys():
self.dict_count[query_key] = big_string_count(query_key)
Checking your requirements for "non-overlapping": if you have 'foobar'
and 'barzot' in the dictionary, and a string contains 'foobarzot', your
code will count 1 for 'foobar' and 1 for 'barzot'. Is that OK?
2. Next, if you don't mind using a C-extension, look at Danny Yoo's
ahocorasick module, which will search in parallel for your 3000+ keys.
http://hkn.eecs.berkeley.edu/~dyoo/python/ahocorasick/
Under one definition of non-overlapping, you will be able to use one of
the findall methods; under the other definition, you will need to use
one of the search methods and restart at (matched_position + 1).
3. If you want to roll your own, start with Gonzalo Navarro's
publications: http://www.dcc.uchile.cl/~gnavarro/subj-multiple.html
HTH,
John
More information about the Python-list
mailing list