Regular Expressions: large amount of or's

Tim Peters tim.peters at gmail.com
Tue Mar 1 15:03:50 EST 2005


[André Søreng]
> Given a string, I want to find all ocurrences of
> certain predefined words in that string. Problem is, the list of
> words that should be detected can be in the order of thousands.
>
> With the re module, this can be solved something like this:
>
> import re
>
> r = re.compile("word1|word2|word3|.......|wordN")
> r.findall(some_string)
>
> Unfortunately, when having more than about 10 000 words in
> the regexp, I get a regular expression runtime error when
> trying to execute the findall function (compile works fine, but slow).
>
> I don't know if using the re module is the right solution here, any
> suggestions on alternative solutions or data structures which could
> be used to solve the problem?

Put the words you're looking for into a set (or as the keys of a dict
in older Pythons; the values in the dict are irrelevant).

I don't know what you mean by "word", so write something that breaks
your string into what you mean by words.  Then:

for word in something_that_produces_words(the_string):
    if word in set_of_words:
        # found one

This takes expected-case time proportional to the number of words in
the string, + setup time proportional to the number of "interesting"
words (the time needed to create a set or dict from them).  10,000
interesting words won't even start to strain it.



More information about the Python-list mailing list