Regular Expressions: large amount of or's
Nick Craig-Wood
nick at craig-wood.com
Wed Mar 2 01:48:16 EST 2005
André Søreng <wsoereng at tiscali.no> wrote:
> Given a string, I want to find all occurrences of
> certain predefined words in that string. Problem is, the list of
> words that should be detected can be in the order of thousands.
>
> With the re module, this can be solved something like this:
>
> import re
>
> r = re.compile("word1|word2|word3|.......|wordN")
> r.findall(some_string)
>
> Unfortunately, when having more than about 10 000 words in
> the regexp, I get a regular expression runtime error when
> trying to execute the findall function (compile works fine, but
> slow).
I wrote a regexp optimiser for exactly this case. It factors out the
common prefixes, so the regexp has far fewer top-level alternatives.
E.g. a regexp for all 5 letter words starting with re:
$ grep -c '^re' /usr/share/dict/words
2727
$ grep '^re' /usr/share/dict/words | ./words-to-regexp.pl 5
re|re's|reac[ht]|rea(?:d|d[sy]|l|lm|m|ms|p|ps|r|r[ms])|reb(?:el|u[st])|rec(?:ap|ta|ur)|red|red's|red(?:id|o|s)|ree(?:d|ds|dy|f|fs|k|ks|l|ls|ve)|ref|ref's|refe[dr]|ref(?:it|s)|re(?:gal|hab|(?:ig|i)n|ins|lax|lay|lic|ly|mit|nal|nd|nds|new|nt|nts|p)|rep's|rep(?:ay|el|ly|s)|rer(?:an|un)|res(?:et|in|t|ts)|ret(?:ch|ry)|re(?:use|v)|rev's|rev(?:el|s|ue)
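As you can see, it's not perfect: for example, re, re's, red and red's
are left as separate alternatives instead of being factored together.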
Find it at http://www.craig-wood.com/nick/pub/words-to-regexp.pl
Yes, it's Perl and rather kludgy, but it may give you ideas!
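
For anyone wanting to try the same idea in Python, here is a minimal
sketch. It is not a port of words-to-regexp.pl and makes no claim to use
the same algorithm: it builds a trie of the words and emits one regexp
with the shared prefixes factored out. The names build_trie,
trie_to_regex and words_to_regex are made up for illustration, and I have
not checked whether this avoids the runtime error for 10 000+ words on
any particular Python version.

```python
import re


def build_trie(words):
    """Build a nested-dict trie; the key '' marks the end of a word."""
    trie = {}
    for word in words:
        node = trie
        for ch in word:
            node = node.setdefault(ch, {})
        node[""] = True
    return trie


def trie_to_regex(node):
    """Turn a trie node into a regexp fragment with shared prefixes factored out."""
    ends_here = "" in node
    # One alternative per child character, in sorted order for stable output.
    alternatives = [
        re.escape(ch) + trie_to_regex(node[ch])
        for ch in sorted(key for key in node if key)
    ]
    if not alternatives:
        return ""  # leaf: nothing more to match
    if len(alternatives) == 1 and not ends_here:
        return alternatives[0]  # single path, no grouping needed
    body = "(?:" + "|".join(alternatives) + ")"
    # If a word ends at this node, the rest is optional (greedy, so the
    # longest word is preferred).
    return body + "?" if ends_here else body


def words_to_regex(words):
    """Return a regexp source string matching any of the given words."""
    return trie_to_regex(build_trie(words))


# Hypothetical sample data, just to show usage.
pattern = re.compile(words_to_regex(["read", "reads", "real", "realm", "rebel"]))
print(pattern.pattern)
print(pattern.findall("he reads about the realm of a rebel"))
```

Like the plain word1|word2|... form, this matches words anywhere in the
string, including inside longer words; wrap the result in \b(?:...)\b if
you only want whole words. An empty word list gives an empty pattern,
which matches the empty string everywhere, so guard against that case.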
--
Nick Craig-Wood <nick at craig-wood.com> -- http://www.craig-wood.com/nick