Regular Expressions: large amount of or's

Wed Mar 2 02:23:51 EST 2005

Kent Johnson <kent37 at tds.net> wrote:

:> Given a string, I want to find all ocurrences of
:> certain predefined words in that string. Problem is, the list of
:> words that should be detected can be in the order of thousands.
:> 
:> With the re module, this can be solved something like this:
:> 
:> import re
:> 
:> r = re.compile("word1|word2|word3|.......|wordN")
:> r.findall(some_string)

The internal data structure that encodes that set of keywords is
probably humongous.  An alternative approach to this problem is to
tokenize your string into words, and then check to see if each word is
in a defined list of "keywords".  This works if your keywords are
single words:

###
keywords = set([word1, word2, ...])
matchingWords = set(re.findall(r'\w+')).intersection(keywords)
###

Would this approach work for you?

Otherwise, you may want to look at a specialized data structure for
doing mutiple keyword matching; I had an older module that wrapped
around a suffix tree:

    http://hkn.eecs.berkeley.edu/~dyoo/python/suffix_trees/

It looks like other folks, thankfully, have written other
implementations of suffix trees:

    http://cs.haifa.ac.il/~shlomo/suffix_tree/

Another approach is something called the Aho-Corasick algorithm:

    http://portal.acm.org/citation.cfm?doid=360825.360855

though I haven't been able to find a nice Python module for this yet.

Best of wishes to you!