Does my RE rock, or suck?
Bryan Olson
fakeaddress at nowhere.org
Sat Jul 10 13:49:33 EDT 2004
Thanks for responses Gustavo. That's most of what I need to
know.
Gustavo Niemeyer wrote:
[...]
> Given the context (person names), your worst case will probably be
> failing every name.
The names were just for demo. The actual application is to
match resource paths with the most-specific handler.
[...]
> It looks like a sane use of regular expressions.
Since it's not obviously insane, I'll include the code below,
in case anyone else wants to use it or QA it for free.
--Bryan
"""
The Problem: We're given a set of "start strings". Later
we'll be called (many times) with a "path string", and we
need to find the longest of the start strings that is a
prefix of the given path string.
Method: Build a Python regular expression (RE) from the
start strings, such that re.search(path) will match the
desired prefix. The RE has no loop-like things, and never
has two branches beginning with the same character.
Searching should be fast in all cases.
"""
import re
def build_prefix_re(str_list):
""" Given a sequence of strings, return a compiled RE such
that build_prefix_re(str_list).search(x) will match the
longest string in str_list that is a prefix of x.
"""
def trie_insert(map, str):
if str:
submap = map.setdefault(str[0], {})
trie_insert(submap, str[1:])
else:
map[""] = {}
def sequentialize(map, lst):
if map:
keys = map.keys()
# Order so that longer matches are first
keys.sort()
keys.reverse()
lst.append('(?:')
seperator = ''
for key in keys:
lst.append(seperator + re.escape(key))
submap = map[key]
while len(submap) == 1:
# While follow-set is singleton, condense.
key = submap.keys()[0]
submap = submap[key]
lst.append(re.escape(key))
sequentialize(submap, lst)
seperator = '|'
lst.append(')')
map = {}
for str in str_list:
trie_insert(map, str)
lst = ['^']
sequentialize(map, lst)
re_str = "".join(lst)
# print "Prefix finding RE is: '%s'\n" % re_str
return re.compile(re_str)
if __name__ == '__main__':
slist = ["a.b/cde/fg", "bchij", "bivwd", "cdj", "cdjwv", "cdjxi"]
prematcher = build_prefix_re(slist)
for str in slist:
match = prematcher.search(str)
assert(match.group(0) == str)
s = str + 'extraneous'
match = prematcher.search(s)
assert(match.group(0) == str)
for str in ('', 'cd', 'bb', 'cdi', 'xbchij', 'bchiij'):
match = prematcher.search(str)
assert not match, "BAD, found '%s'" % str
More information about the Python-list
mailing list