Does my RE rock, or suck?
Michael Hudson
mwh at python.net
Thu Jul 8 08:24:08 EDT 2004
Bryan Olson <fakeaddress at nowhere.org> writes:
> Here's my algorithmic problem:
>
> Given a set of 'trigger' strings, and one 'path' string,
> find the longest of the triggers that is a prefix of the
> path.
>
> The trigger strings may number in the dozens or hundreds, but
> the set is more-or-less static; I'll be matching many paths
> against the same set of triggers. Matching a path should be
> fast in the average case, and non-awful even if an adversary
> designed the path to be evil. This is for a web-server-thing
> that registers handlers for paths that begin with certain
> strings, like most of the cool web servers do.
>
> I've written a function that takes a list of trigger strings,
> and outputs a Python regular expression whose match() should
> find the longest matching trigger. For example, given:
>
> triggers = ["Alice", "Bob", "John", "John Smith",
>             "John Smith Jr", "John Smith Sr"]
>
> My re-generator returns:
>
> re.compile('^(John( Smith( (Sr|Jr)|)|)|Bob|Alice)')
>
> Some RE details: I actually use '(?:' instead of just '(', but
> no need to worry about that now. There are never *'s or +'s or
> other loop-like things in the generated REs. The REs list
> choices in reverse-lexicographic order; as the doc says, "REs
> separated by "|" are tried from left to right." The order
> should ensure finding the longest matching trigger. In a
> couple of minutes of testing, it seems to work.
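
A rough, unfactored version of such a generator -- sort the triggers
into reverse order, escape them, and join them with '|' -- might look
like this (my sketch, not necessarily what your code does):

import re

def make_re(triggers):
    # Reverse-lexicographic order puts every trigger before any of
    # its own prefixes, so the engine tries longer matches first.
    ts = list(triggers)
    ts.sort()
    ts.reverse()
    return re.compile('^(?:%s)' % '|'.join([re.escape(t) for t in ts]))
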
>
>
> I've been burned using Python's RE module in the past. Before I
> sink more time into this, I was wondering if someone could tell
> me if the technique makes sense. In particular:
>
> Is this likely to be fast in normal cases? Is there a
> clearly better approach?
Well, the re module is an NFA engine, not a DFA engine, so I'd expect
this re to run in time proportional to the number of triggers.
"Dozens to hundreds" sounds like a fairly small number in context,
but: time it! Only you know what fast enough is.
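
Something like this would give you a number (untested, and with the
pattern written out by hand for the example triggers rather than
generated):

import re, timeit

# Hand-rolled equivalent of the generated RE for the example
# triggers, alternatives in reverse-lexicographic order.
pat = re.compile('^(?:John(?: Smith(?: (?:Sr|Jr))?)?|Bob|Alice)')

t = timeit.Timer("pat.match('John Smith Jr and then some')",
                 "from __main__ import pat")
print t.timeit(100000), "seconds for 100000 matches"
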
Here's a cute approach that should have running time independent of
the number of triggers:

def build_dict(prefix_list):
    # Bucket the prefixes by their first character; each bucket
    # collects the tails (everything after that character).
    d = {}
    for p in prefix_list:
        if len(p) > 0:
            d.setdefault(p[0], []).append(p[1:])
    # Recurse on each bucket to build a trie of nested dicts; an
    # empty tail means a complete trigger ends here, marked by ''.
    r = {}
    for p in d:
        r[p] = build_dict(d[p])
        if '' in d[p]:
            r[p][''] = 1
    return r
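
To illustrate the shape of the thing (assuming I've transcribed it
correctly; dict key order in the repr may differ):

print build_dict(["a", "ab"])
# {'a': {'': 1, 'b': {'': 1}}} -- nested dicts forming a trie, with
# a '' key marking each node where a complete trigger ends
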
def longest_prefix(s, d):
    # Walk the trie a character at a time; j remembers the length
    # of the longest complete trigger seen so far.
    i = 0
    j = 0
    for c in s:
        if c in d:
            d = d[c]
            i += 1
            if '' in d:
                j = i
        else:
            return s[:j]
    return s[:j]

if __name__ == '__main__':
    d = build_dict(["Alice", "Bob", "John", "John Smith",
                    "John Smith Jr", "John Smith Sr"])
    print longest_prefix("John Smith", d)
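
A path that runs on past a trigger should still come out right
(again assuming faithful transcription):

print longest_prefix("John Smith Jr, Esq.", d)
# prints: John Smith Jr
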
Here's a less cute but probably saner approach:

def prepare(prefix_list):
    # Group the triggers by length: d maps each length to a dict
    # of all triggers of that length.
    d = {}
    for p in prefix_list:
        d.setdefault(len(p), {})[p] = p
    # l is the list of lengths present, longest first.
    l = d.keys()
    l.sort()
    l.reverse()
    return d, l

def longest_prefix2(s, d, l):
    # Try each trigger length, longest first; the first slice of
    # s that is a known trigger of that length is the answer.
    for i in l:
        r = d[i].get(s[:i])
        if r is not None:
            return r
    return ''

if __name__ == '__main__':
    d, l = prepare(["Alice", "Bob", "John", "John Smith",
                    "John Smith Jr", "John Smith Sr"])
    print longest_prefix2("John Smith", d, l)
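
What prepare() builds, for a couple of triggers (my sketch of the
intermediate data, not verified output):

d, l = prepare(["Bob", "Alice"])
# d == {3: {'Bob': 'Bob'}, 5: {'Alice': 'Alice'}}
# l == [5, 3] -- lengths to try, longest first
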
This should run in time proportional to the number of distinct
trigger lengths (and use rather less memory than my first version).
> What would be the worst-case time?
>
> Am I likely to exceed recursion depth?
Not in 2.4, where the RE engine is no longer recursive :-) And I
doubt it before that, actually.
> Any other gotcha? Am I likely to find it when I compile the
> RE, or when I try to match a path?
>
>
> Thanks for any help answering these.
This thread from a year or so back might interest:
http://groups.google.com/groups?hl=en&lr=&ie=UTF-8&threadm=886c5e4b.0210151635.123359bb@posting.google.com
Cheers,
mwh
--
: Giant screaming pieces of excrement, they are.
I have a feeling that some of the people in here have a
MUCH more exciting time relieving themselves than I do.
-- Mike Sphar & Dave Brown, asr