Counting occurences of words in a list of strings

Travers Naran tnaran at no-more-virii-please.direct.ca
Wed May 25 01:24:22 EDT 2005


John Machin wrote:
> Are you comparing BMH implemented in Python with str.count() implemented 
> in C?

Tee-hee.  Yes.  I figured out that was the problem pretty quickly and just 
went back to count().

> 1. A pure Python suggestion:
> 
> Presuming there is a character SEP that cannot appear in the keys in 
> your query dictionary, try this:
> 
> SEP = '\n' # for example
> big_string_count = SEP.join(tc.strings).count
> for query_key in self.dict.keys():
>     self.dict[query_key]['count'] += big_string_count(query_key)
> 
> Why do we have += in the above line? Is there anything else being added 
> in by not-shown code? If not, the following would be much more 
> comprehensible, and faster:

There is.  Because the program can loop over mutliple input files, it needs 
to count it across the files.  But that's a minor thing.

> Checking your requirements for "non-overlapping": if you have 'foobar' 
> and 'barzot' in the dictionary, and a string contains 'foobarzot', your 
> code will count 1 for 'foobar' and 1 for 'barzot'. Is that OK?

Exactly.  Let me be more precise: non-overlapping with itself.  I.e., if I 
has "aa", then "aaaa" should count 2.  But foorbar and barzot both coming 
back 1 in your example is exactly the behavoir I want.

> 2. Next, if you don't mind using a C-extension, look at Danny Yoo's 
> ahocorasick module, which will search in parallel for your 3000+ keys.
> 
> http://hkn.eecs.berkeley.edu/~dyoo/python/ahocorasick/
> 
> Under one definition of non-overlapping, you will be able to use one of 
> the findall methods; under the other definition, you will need to use 
> one of the search methods and restart at (matched_position + 1).

Nifty!  Thanks muchly.

> 3. If you want to roll your own, start with Gonzalo Navarro's 
> publications: http://www.dcc.uchile.cl/~gnavarro/subj-multiple.html

I don't suffer from NMH syndrome.  If ahocorasick does the job, or even 
count, I'm OK with that.



More information about the Python-list mailing list