Regular Expressions: large amount of or's

Tue Mar 1 16:19:31 EST 2005

On Tue, 01 Mar 2005 22:04:15 +0100, André Søreng <wsoereng at tiscali.no> wrote:
> Kent Johnson wrote:
> > André Søreng wrote:
> >
> >>
> >> Hi!
> >>
> >> Given a string, I want to find all occurrences of
> >> certain predefined words in that string. Problem is, the list of
> >> words that should be detected can be in the order of thousands.
> >>
> >> With the re module, this can be solved something like this:
> >>
> >> import re
> >>
> >> r = re.compile("word1|word2|word3|.......|wordN")
> >> r.findall(some_string)
> >>
> >> Unfortunately, when having more than about 10 000 words in
> >> the regexp, I get a regular expression runtime error when
> >> trying to execute the findall function (compile works fine, but slow).
> >>
> >> I don't know if using the re module is the right solution here, any
> >> suggestions on alternative solutions or data structures which could
> >> be used to solve the problem?
> >
> >
> > If you can split some_string into individual words, you could look them
> > up in a set of known words:
> >
> > known_words = set("word1 word2 word3 ....... wordN".split())
> > found_words = [word for word in some_string.split() if word in known_words]
> >
> > Kent
> >
> >>
> >> André
> >>
> 
> That is not exactly what I want. It should discover if some of
> the predefined words appear as substrings, not only as equal
> words. For instance, after matching "word2sgjoisejfisaword1yguyg", word2
> and word1 should be detected.

Show some initiative, man!

>>> known_words = set(["word1", "word2"])
>>> found_words = [word for word in known_words if word in "word2sgjoisejfisaword1yguyg"]
>>> found_words
['word1', 'word2']
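
If you want the same idea as something reusable, here is a minimal sketch. The name find_known_words is made up for this example; it is not part of re or anything else in the standard library. The result is sorted because set iteration order is not guaranteed.

```python
def find_known_words(text, known_words):
    """Return the known words that occur anywhere in text, sorted."""
    # `word in text` is a substring test, so each known word
    # costs one scan of text.
    return sorted(word for word in known_words if word in text)


known_words = {"word1", "word2", "word3"}
found = find_known_words("word2sgjoisejfisaword1yguyg", known_words)
print(found)  # ['word1', 'word2']
```

The work grows with the number of words times the length of the string, so with thousands of words and long inputs this may well be slow. It only shows the idea.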

Peace
Bill Mill
bill.mill at gmail.com