Regular Expressions: large amount of or's

Nick Craig-Wood nick at craig-wood.com
Wed Mar 2 01:48:16 EST 2005


André Søreng <wsoereng at tiscali.no> wrote:
>  Given a string, I want to find all occurrences of
>  certain predefined words in that string. Problem is, the list of
>  words that should be detected can be in the order of thousands.
> 
>  With the re module, this can be solved something like this:
> 
>  import re
> 
>  r = re.compile("word1|word2|word3|.......|wordN")
>  r.findall(some_string)
> 
>  Unfortunately, when having more than about 10 000 words in
>  the regexp, I get a regular expression runtime error when
>  trying to execute the findall function (compile works fine, but
>  slow).

I wrote a regexp optimiser for exactly this case.

E.g. a regexp for all words of up to 5 characters starting with "re":

$ grep -c '^re' /usr/share/dict/words
2727

$ grep '^re' /usr/share/dict/words  | ./words-to-regexp.pl 5

re|re's|reac[ht]|rea(?:d|d[sy]|l|lm|m|ms|p|ps|r|r[ms])|reb(?:el|u[st])|rec(?:ap|ta|ur)|red|red's|red(?:id|o|s)|ree(?:d|ds|dy|f|fs|k|ks|l|ls|ve)|ref|ref's|refe[dr]|ref(?:it|s)|re(?:gal|hab|(?:ig|i)n|ins|lax|lay|lic|ly|mit|nal|nd|nds|new|nt|nts|p)|rep's|rep(?:ay|el|ly|s)|rer(?:an|un)|res(?:et|in|t|ts)|ret(?:ch|ry)|re(?:use|v)|rev's|rev(?:el|s|ue)

As you can see, it's not perfect.

Find it at http://www.craig-wood.com/nick/pub/words-to-regexp.pl

Yes, it's Perl and rather kludgy, but it may give you ideas!
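
For a Python flavour of the same idea, here is a minimal sketch. It is not a port of words-to-regexp.pl and it does not produce the same output as above. It just puts the words into a trie and emits each shared prefix once, so the regexp engine has far fewer alternatives to try at each position. The names words_to_regexp, _emit and _END are made up for this example.

```python
import re

_END = ""  # hypothetical marker key: "a word ends at this trie node"


def _emit(node):
    """Return the regexp fragment for everything below one trie node."""
    is_end = _END in node
    branches = [
        re.escape(ch) + _emit(child)
        for ch, child in sorted(node.items())
        if ch != _END
    ]
    if not branches:
        return ""  # leaf: nothing can follow
    if len(branches) == 1 and not is_end:
        return branches[0]  # single mandatory continuation, no group needed
    group = "(?:" + "|".join(branches) + ")"
    # If a word may also stop here, the rest is optional. The greedy "?"
    # makes the engine try the longer word first.
    return group + "?" if is_end else group


def words_to_regexp(words):
    """Build one regexp that matches exactly the given words."""
    trie = {}
    for word in words:
        node = trie
        for ch in word:
            node = node.setdefault(ch, {})
        node[_END] = {}
    return _emit(trie)


print(words_to_regexp(["re", "red", "reds"]))  # re(?:d(?:s)?)?
```

The result can be passed to re.compile() and used with findall() in the usual way. Whether this gets past the runtime error with 10 000 words will depend on the word list and the Python version, so treat it as a starting point rather than a guaranteed fix.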

-- 
Nick Craig-Wood <nick at craig-wood.com> -- http://www.craig-wood.com/nick


