Does my RE rock, or suck?

Sat Jul 10 13:49:33 EDT 2004

Thanks for responses Gustavo.  That's most of what I need to
know.

Gustavo Niemeyer wrote:
[...]
 > Given the context (person names), your worst case will probably be
 > failing every name.

The names were just for demo.  The actual application is to
match resource paths with the most-specific handler.

[...]
 > It looks like a sane use of regular expressions.

Since it's not obviously insane, I'll include the code below,
in case anyone else wants to use it or QA it for free.

--Bryan

"""
     The Problem:  We're given a set of "start strings".  Later
     we'll be called (many times) with a "path string", and we
     need to find the longest of the start strings that is a
     prefix of the given path string.

     Method: Build a Python regular expression (RE) from the
     start strings, such that re.search(path) will match the
     desired prefix.  The RE has no loop-like things, and never
     has two branches beginning with the same character.
     Searching should be fast in all cases.

"""

import re

def build_prefix_re(str_list):
     """ Given a sequence of strings, return a compiled RE such
     that build_prefix_re(str_list).search(x) will match the
     longest string in str_list that is a prefix of x.
     """

     def trie_insert(map, str):
         if str:
             submap = map.setdefault(str[0], {})
             trie_insert(submap, str[1:])
         else:
             map[""] = {}

     def sequentialize(map, lst):
         if map:
             keys = map.keys()
             #  Order so that longer matches are first
             keys.sort()
             keys.reverse()
             lst.append('(?:')
             seperator = ''
             for key in keys:
                 lst.append(seperator + re.escape(key))
                 submap = map[key]
                 while len(submap) == 1:
                     #  While follow-set is singleton, condense.
                     key = submap.keys()[0]
                     submap = submap[key]
                     lst.append(re.escape(key))
                 sequentialize(submap, lst)
                 seperator = '|'
             lst.append(')')

     map = {}
     for str in str_list:
         trie_insert(map, str)
     lst = ['^']
     sequentialize(map, lst)
     re_str = "".join(lst)
     # print "Prefix finding RE is: '%s'\n" % re_str
     return re.compile(re_str)

if __name__ == '__main__':
     slist = ["a.b/cde/fg", "bchij", "bivwd", "cdj", "cdjwv", "cdjxi"]
     prematcher = build_prefix_re(slist)
     for str in slist:
         match = prematcher.search(str)
         assert(match.group(0) == str)
         s = str + 'extraneous'
         match = prematcher.search(s)
         assert(match.group(0) == str)
     for str in ('', 'cd', 'bb', 'cdi', 'xbchij', 'bchiij'):
         match = prematcher.search(str)
         assert not match, "BAD, found '%s'" % str