Regular Expressions: large amount of or's

André Søreng andreis at stud.cs.uit.no
Wed Mar 2 05:35:21 EST 2005


Daniel Yoo wrote:
> Kent Johnson <kent37 at tds.net> wrote:
> 
> :> Given a string, I want to find all occurrences of
> :> certain predefined words in that string. Problem is, the list of
> :> words that should be detected can be in the order of thousands.
> :> 
> :> With the re module, this can be solved something like this:
> :> 
> :> import re
> :> 
> :> r = re.compile("word1|word2|word3|.......|wordN")
> :> r.findall(some_string)
> 
> The internal data structure that encodes that set of keywords is
> probably humongous.  An alternative approach to this problem is to
> tokenize your string into words, and then check to see if each word is
> in a defined list of "keywords".  This works if your keywords are
> single words:
> 
> ###
> keywords = set([word1, word2, ...])
> matchingWords = set(re.findall(r'\w+', some_string)).intersection(keywords)
> ###
> 
> Would this approach work for you?
> 
> 
> 
> Otherwise, you may want to look at a specialized data structure for
> doing multiple keyword matching; I had an older module that wrapped
> around a suffix tree:
> 
>     http://hkn.eecs.berkeley.edu/~dyoo/python/suffix_trees/
> 
> Thankfully, it looks like other folks have written their own
> implementations of suffix trees:
> 
>     http://cs.haifa.ac.il/~shlomo/suffix_tree/
> 
> Another approach is something called the Aho-Corasick algorithm:
> 
>     http://portal.acm.org/citation.cfm?doid=360825.360855
> 
> though I haven't been able to find a nice Python module for this yet.
> 
> 
> Best of wishes to you!
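
The tokenize-and-intersect approach Daniel sketches above can be written as a
small self-contained function (the example data here is my own, not from the
original post):

```python
import re

def find_keyword_words(text, keywords):
    """Return the set of keywords that occur as whole words in text."""
    # \w+ tokenizes the text into words; the set intersection then runs
    # in roughly O(len(text)) time regardless of how many keywords there are.
    return set(re.findall(r'\w+', text)) & set(keywords)

print(find_keyword_words("spam and eggs and spam", ["spam", "ham", "eggs"]))
```

Note this only matches whole words, not keywords that appear as substrings of
longer words; for the substring case you need something like a suffix tree or
Aho-Corasick.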

Thanks, it seems like the Aho-Corasick algorithm is along the lines of
what I was looking for, though I have not read the article completely yet.

Also:
http://alexandria.tue.nl/extra1/wskrap/publichtml/200407.pdf

provided several alternative algorithms.
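
Since no ready-made Python module was mentioned in the thread, here is a
minimal sketch of the Aho-Corasick idea: a trie of the keywords augmented with
failure links, so all keywords are matched in a single pass over the text. The
function names are my own, and this is an illustration rather than a tuned
implementation:

```python
from collections import deque

def build_automaton(keywords):
    """Build an Aho-Corasick automaton: a keyword trie with failure links."""
    # Each node: outgoing character transitions, a failure link, and the
    # list of keywords that end at (or are suffixes ending at) this node.
    trie = [{"next": {}, "fail": 0, "out": []}]
    for word in keywords:
        node = 0
        for ch in word:
            if ch not in trie[node]["next"]:
                trie.append({"next": {}, "fail": 0, "out": []})
                trie[node]["next"][ch] = len(trie) - 1
            node = trie[node]["next"][ch]
        trie[node]["out"].append(word)
    # BFS to compute failure links: each node's link points to the longest
    # proper suffix of its path that is also a path in the trie.
    queue = deque(trie[0]["next"].values())
    while queue:
        node = queue.popleft()
        for ch, child in trie[node]["next"].items():
            queue.append(child)
            fail = trie[node]["fail"]
            while fail and ch not in trie[fail]["next"]:
                fail = trie[fail]["fail"]
            trie[child]["fail"] = trie[fail]["next"].get(ch, 0)
            # Inherit matches that end at the failure target as well.
            trie[child]["out"] += trie[trie[child]["fail"]]["out"]
    return trie

def find_keywords(trie, text):
    """Return (start_index, word) pairs for every keyword occurrence."""
    node = 0
    hits = []
    for i, ch in enumerate(text):
        while node and ch not in trie[node]["next"]:
            node = trie[node]["fail"]
        node = trie[node]["next"].get(ch, 0)
        for word in trie[node]["out"]:
            hits.append((i - len(word) + 1, word))
    return hits

trie = build_automaton(["he", "she", "his", "hers"])
print(find_keywords(trie, "ushers"))
```

Building the automaton is linear in the total length of the keywords, and
searching is linear in the length of the text plus the number of matches, which
is exactly what makes it attractive over a giant `word1|word2|...|wordN`
alternation.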

André




More information about the Python-list mailing list