Regular Expressions: large amount of or's

André Søreng wsoereng at tiscali.no
Tue Mar 1 16:04:15 EST 2005


Kent Johnson wrote:
> André Søreng wrote:
> 
>>
>> Hi!
>>
>> Given a string, I want to find all occurrences of
>> certain predefined words in that string. Problem is, the list of
>> words that should be detected can be in the order of thousands.
>>
>> With the re module, this can be solved something like this:
>>
>> import re
>>
>> r = re.compile("word1|word2|word3|.......|wordN")
>> r.findall(some_string)
>>
>> Unfortunately, with more than about 10 000 words in
>> the regexp, I get a regular expression runtime error when
>> calling findall (compile succeeds, but is slow).
>>
>> I don't know if using the re module is the right solution here, any
>> suggestions on alternative solutions or data structures which could
>> be used to solve the problem?
> 
> 
> If you can split some_string into individual words, you could look them 
> up in a set of known words:
> 
> known_words = set("word1 word2 word3 ....... wordN".split())
> found_words = [word for word in some_string.split() if word in known_words]
> 
> Kent
> 
>>
>> André
>>

That is not exactly what I want. It should detect the predefined
words wherever they appear as substrings, not only as whole
words. For instance, when matching "word2sgjoisejfisaword1yguyg", both
word2 and word1 should be detected.
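
One way to get this substring behaviour without building a huge
alternation is an Aho-Corasick automaton: a trie of the words plus
"failure" links, so the string is scanned once no matter how many
words there are. The following is only a minimal sketch of that idea,
not something proposed in this thread; the names build_automaton and
find_words are made up for illustration, and it assumes the words are
non-empty plain strings.

```python
from collections import deque

def build_automaton(words):
    """Build trie transitions, failure links and output lists."""
    goto = [{}]   # goto[node] maps a character to the next node
    fail = [0]    # fail[node] is the node to fall back to on a mismatch
    out = [[]]    # out[node] lists the words that end at this node
    for word in words:
        if not word:
            continue  # assumption: empty words are ignored
        node = 0
        for ch in word:
            nxt = goto[node].get(ch)
            if nxt is None:
                nxt = len(goto)
                goto[node][ch] = nxt
                goto.append({})
                fail.append(0)
                out.append([])
            node = nxt
        out[node].append(word)

    # Breadth-first pass: depth-1 nodes fall back to the root (node 0),
    # deeper nodes derive their link from their parent's link.
    queue = deque(goto[0].values())
    while queue:
        node = queue.popleft()
        for ch, child in goto[node].items():
            f = fail[node]
            while f and ch not in goto[f]:
                f = fail[f]
            fail[child] = goto[f].get(ch, 0)
            # Inherit words that end at the fallback node (shorter suffixes).
            out[child] = out[child] + out[fail[child]]
            queue.append(child)
    return goto, fail, out

def find_words(text, automaton):
    """Return (start_index, word) for every occurrence, overlaps included."""
    goto, fail, out = automaton
    node = 0
    found = []
    for i, ch in enumerate(text):
        while node and ch not in goto[node]:
            node = fail[node]
        node = goto[node].get(ch, 0)
        for word in out[node]:
            found.append((i - len(word) + 1, word))
    return found

automaton = build_automaton(["word1", "word2", "word3"])
matches = find_words("word2sgjoisejfisaword1yguyg", automaton)
print(matches)  # [(0, 'word2'), (17, 'word1')]
```

The automaton is built once and can be reused for many strings. For
a few hundred words, a plain loop such as
[w for w in words if w in some_string] may well be fast enough and is
much simpler.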


