Regular Expressions: large amount of or's

Thu Mar 3 12:35:29 EST 2005

On Tue, 1 Mar 2005 15:03:50 -0500, Tim Peters <tim.peters at gmail.com>
wrote:

>[André Søreng]
>> Given a string, I want to find all ocurrences of
>> certain predefined words in that string. Problem is, the list of
>> words that should be detected can be in the order of thousands.
>>
>> With the re module, this can be solved something like this:
>>
>> import re
>>
>> r = re.compile("word1|word2|word3|.......|wordN")
>> r.findall(some_string)
>>
>> Unfortunately, when having more than about 10 000 words in
>> the regexp, I get a regular expression runtime error when
>> trying to execute the findall function (compile works fine, but slow).
>>
>> I don't know if using the re module is the right solution here, any
>> suggestions on alternative solutions or data structures which could
>> be used to solve the problem?
>
>Put the words you're looking for into a set (or as the keys of a dict
>in older Pythons; the values in the dict are irrelevant).
>
>I don't know what you mean by "word", so write something that breaks
>your string into what you mean by words.  Then:
>
>for word in something_that_produces_words(the_string):
>    if word in set_of_words:
>        # found one
>

I have the same problem.
Unfortunately the meaning of a "word" depends on the word.
As an example I would like to count the number of occurrences of
movies titles in some text.

Maybe lex is more optimized?
Unfortunately is seems that there are no lex versions that generate
python (or PyRex) code.

Thanks and regards   Manlio Perillo