Counting occurences of words in a list of strings
Travers Naran
tnaran at no-more-virii-please.direct.ca
Wed May 25 01:24:22 EDT 2005
John Machin wrote:
> Are you comparing BMH implemented in Python with str.count() implemented
> in C?
Tee-hee. Yes. I figured out that was the problem pretty quickly and just
went back to count().
> 1. A pure Python suggestion:
>
> Presuming there is a character SEP that cannot appear in the keys in
> your query dictionary, try this:
>
> SEP = '\n' # for example
> big_string_count = SEP.join(tc.strings).count
> for query_key in self.dict.keys():
> self.dict[query_key]['count'] += big_string_count(query_key)
>
> Why do we have += in the above line? Is there anything else being added
> in by not-shown code? If not, the following would be much more
> comprehensible, and faster:
There is. Because the program can loop over mutliple input files, it needs
to count it across the files. But that's a minor thing.
> Checking your requirements for "non-overlapping": if you have 'foobar'
> and 'barzot' in the dictionary, and a string contains 'foobarzot', your
> code will count 1 for 'foobar' and 1 for 'barzot'. Is that OK?
Exactly. Let me be more precise: non-overlapping with itself. I.e., if I
has "aa", then "aaaa" should count 2. But foorbar and barzot both coming
back 1 in your example is exactly the behavoir I want.
> 2. Next, if you don't mind using a C-extension, look at Danny Yoo's
> ahocorasick module, which will search in parallel for your 3000+ keys.
>
> http://hkn.eecs.berkeley.edu/~dyoo/python/ahocorasick/
>
> Under one definition of non-overlapping, you will be able to use one of
> the findall methods; under the other definition, you will need to use
> one of the search methods and restart at (matched_position + 1).
Nifty! Thanks muchly.
> 3. If you want to roll your own, start with Gonzalo Navarro's
> publications: http://www.dcc.uchile.cl/~gnavarro/subj-multiple.html
I don't suffer from NMH syndrome. If ahocorasick does the job, or even
count, I'm OK with that.
More information about the Python-list
mailing list