Regular Expressions: large amount of or's

Wed Mar 2 04:31:15 EST 2005

Bill Mill wrote:
> On Tue, 01 Mar 2005 22:04:15 +0100, André Søreng <wsoereng at tiscali.no> wrote:
> 
>>Kent Johnson wrote:
>>
>>>André Søreng wrote:
>>>
>>>
>>>>Hi!
>>>>
>>>>Given a string, I want to find all occurrences of
>>>>certain predefined words in that string. Problem is, the list of
>>>>words that should be detected can be in the order of thousands.
>>>>
>>>>With the re module, this can be solved something like this:
>>>>
>>>>import re
>>>>
>>>>r = re.compile("word1|word2|word3|.......|wordN")
>>>>r.findall(some_string)
>>>>
>>>>Unfortunately, with more than about 10 000 words in
>>>>the regexp, I get a regular expression runtime error when
>>>>I call the findall function (compile works, but slowly).
>>>>
>>>>I don't know if using the re module is the right solution here, any
>>>>suggestions on alternative solutions or data structures which could
>>>>be used to solve the problem?
>>>
>>>
>>>If you can split some_string into individual words, you could look them
>>>up in a set of known words:
>>>
>>>known_words = set("word1 word2 word3 ....... wordN".split())
>>>found_words = [ word for word in some_string.split() if word in
>>>known_words ]
>>>
>>>Kent
>>>
>>>
>>>>André
>>>>
>>
>>That is not exactly what I want. It should discover if some of
>>the predefined words appear as substrings, not only as equal
>>words. For instance, after matching "word2sgjoisejfisaword1yguyg", word2
>>and word1 should be detected.
> 
> 
> Show some initiative, man!
> 
> 
> >>> known_words = set(["word1", "word2"])
> >>> found_words = [word for word in known_words if word in "word2sgjoisejfisaword1yguyg"]
> >>> found_words
> ['word1', 'word2']
> 
> Peace
> Bill Mill
> bill.mill at gmail.com

Yes, but I was looking for a solution that scales. Searching
through the same string once per word, 10 000+ times, does not seem
like a suitable solution.

André
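
One approach that might scale better is to scan the string only once
with an Aho-Corasick automaton: a trie of the known words with failure
links, so every word is matched in a single pass over the text. The
following is only a minimal sketch under that assumption. It does not
use the re module, and the names build_automaton and find_words are
made up for illustration.

```python
from collections import deque


def build_automaton(words):
    """Build a trie with failure links (Aho-Corasick) from the words."""
    goto = [{}]   # goto[state][char] -> next state
    fail = [0]    # fail[state] -> state to fall back to on a mismatch
    out = [[]]    # out[state] -> words that end at this state
    for word in set(words):
        if not word:
            continue  # an empty word would match everywhere; skip it
        state = 0
        for ch in word:
            nxt = goto[state].get(ch)
            if nxt is None:
                nxt = len(goto)
                goto[state][ch] = nxt
                goto.append({})
                fail.append(0)
                out.append([])
            state = nxt
        out[state].append(word)
    # Breadth-first pass: shallower states get their failure links first.
    queue = deque(goto[0].values())
    while queue:
        state = queue.popleft()
        for ch, nxt in goto[state].items():
            queue.append(nxt)
            f = fail[state]
            while f and ch not in goto[f]:
                f = fail[f]
            fail[nxt] = goto[f].get(ch, 0)
            # Inherit words that end at the fallback state as well.
            out[nxt] = out[nxt] + out[fail[nxt]]
    return goto, fail, out


def find_words(text, automaton):
    """Return (start_index, word) for every occurrence, in one pass."""
    goto, fail, out = automaton
    state = 0
    found = []
    for i, ch in enumerate(text):
        while state and ch not in goto[state]:
            state = fail[state]
        state = goto[state].get(ch, 0)
        for word in out[state]:
            found.append((i - len(word) + 1, word))
    return found


automaton = build_automaton(["word1", "word2"])
print(find_words("word2sgjoisejfisaword1yguyg", automaton))
```

The automaton is built once from the word list and can then be reused
for any number of strings. Each search walks the text a single time
instead of once per word, and overlapping matches are reported too.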