fast regex

Sat May 8 10:08:55 EDT 2010

Tim Chase wrote:
> James wrote:
> > [Tim had written:]
> If the keys in your word_list are more than just words, then the
> regexp may not find them all, and thus not replace them all.  In
> that case you may have to resort to my 2nd regexp which builds
> the 5k branch regexp from your actual dictionary keys:
>
> >>  r = re.compile(r'\b(%s)\b' % (
> >>    '|'.join(re.escape(s) for s in words_list.keys())
> >>    ),
> >>    re.IGNORECASE)
>
> This method on the above dictionary (modified)
>
>    d = {
>      'hello': 'goodbye',
>      'world': 'python',
>      'stuff with spaces?': 'tadah!',
>      }
>
> would create a regexp of
>
>    \b(about|world|stuff\ with\ spaces\?)\b

There's a subtle issue of the order of the keywords. Pythons
(ir)regular expressions do not strictly guarantee to find the longest
match. For example,

re.findall(r'\b(about|about\-faced)\b',
        'does about-faced match?')

returns ['about'], which is the first complete match, not the longest.
Sorting the keywords into longest-first order avoids the problem.

> This has considerable performance implications as len(word_list)
> grows, unless you can figure a way to determine that some
> replacements are more probable than others and push them to the
> front of this regexp, but that's more complex and requires
> knowledge of your word-list.

There is another method which is to factor out common prefixes, so the
re module's search-and-backtrack engine never has a choice of multiple
paths. Instead of:

    \b(abolished|abolition)\b

we'd use:

    \b(aboli(shed|tion)\b

The RE-generator is of course more complex than Tim's one-liner, but I
happen to have code, which I'll include below. It's probably academic,
as I'd agree with Tim that his first solution is the better candidate
at this point.

With my giant prefix-combined RE's, I can re.compile one with 4000
words randomly chosen from a Unix words file, but 5000 results in
"regular expression code size limit exceeded". Tim's version which
doesn't combine prefixes tops out a little lower. This is on 32-bit
Windows, standard distribution. One could, of course, build multiple
smaller RE's, but that starts to suck.

-Bryan Olson

from random import sample
import re

def build_big_re(str_list):
     """ Build a non-backtracking regular expression
        matching any of the words in str_list.
     """

     def trie_insert(table, str):
         if str:
             submap = table.setdefault(str[0], {})
             trie_insert(submap, str[1:])
         else:
             table[""] = {}

     def sequentialize(table, lst):
         if table:
             keys = table.keys()
             keys.sort(key=len, reverse=True)
             lst.append('(?:')
             seperator = ''
             for key in keys:
                 lst.append(seperator + re.escape(key))
                 submap = table[key]
                 while len(submap) == 1:
                     key = submap.keys()[0]
                     submap = submap[key]
                     lst.append(re.escape(key))
                 sequentialize(submap, lst)
                 seperator = '|'
             lst.append(')')

     table = {}
     for str in str_list:
         trie_insert(table, str)
     lst = [r'\b']
     sequentialize(table, lst)
     lst.append(r'\b')
     re_str = "".join(lst)
     # print "RE is: '%s'\n" % re_str
     return re.compile(re_str, re.IGNORECASE)

def simple_re(str_list):
    str_list = sorted(str_list, key=len, reverse=True)
    re_str = r'\b(%s)\b' % (
         '|'.join(re.escape(s) for s in str_list))
    # print "RE is", re_str
    return re.compile(re_str, re.IGNORECASE)

def testy():

    words_file = r'/usr/share/dict/words' # Common Unix
    # words_file = r'C:\z\wordlist.txt' # Just my box

    nkeywords = 3500

    from random import sample
    with open(words_file, 'r') as f:
        allwords = [w.strip() for w in f.readlines()]
    keywords = sample(allwords, nkeywords)
    bigtext = ' '.join(allwords)

    print 'Building simple RE...'
    simpre = simple_re(keywords)
    print 'Run simple re...'
    z1 = simpre.findall(bigtext)
    print 'Done.\n'

    print 'Building complex RE...'
    compre = build_big_re(keywords)
    print 'Run complex re...'
    z2 = compre.findall(bigtext)
    print 'Done.'

    assert z1 == z2

testy()