Counting occurences of words in a list of strings
Robert Kern
rkern at ucsd.edu
Tue May 24 23:45:02 EDT 2005
Travers Naran wrote:
> Here's the basic idea. I have a dictionary of substrings (the substrings
> stored as keys). I have a list of strings. I want to find out, for each
> word in the dictionary, how many times the substring occurs non-overlapping
> occurrences there are in the list of strings. This is the best I could
> come up with:
>
> for j in self.dict.keys():
> c = 0
> for i in tc.strings:
> c += i.count(j)
> self.dict[j]['count'] += c
>
> I tried Boyer-Moore-Horspool as someone else suggested in the group, but it
> was even slower (probably because of the time spent setting up the
> pattern). Even when I split the algorithm in two, doing the set-up before
> the inner loop and only testing the pattern in the inner loop, it was still
> slow.
>
> So, am I doing this the optimal way or am I overlooking something _really_
> simple to do?
>
> Statistics on the data:
> The dictionary has 3000+ entries in it.
> There are several thousand strings.
> The strings average 10-26 characters in length.
Find a character that doesn't exist in any of self.dict.keys(). '\x00'
will probably do.
whole = '\x00'.join(tc.strings)
for key in self.dict.keys():
self.dict[key]['count'] = whole.count(key)
--
Robert Kern
rkern at ucsd.edu
"In the fields of hell where the grass grows high
Are the graves of dreams allowed to die."
-- Richard Harter
More information about the Python-list
mailing list