Does my RE rock, or suck?
Michael Hudson
mwh at python.net
Thu Jul 8 08:24:08 EDT 2004
Bryan Olson <fakeaddress at nowhere.org> writes:
> Here's my algorithmic problem:
>
> Given a set of 'trigger' strings, and one 'path' string,
> find the longest of the triggers that is a prefix of the
> path.
>
> The trigger strings may number in the dozens or hundreds, but
> the set is more-or-less static; I'll be matching many paths
> against the same set of triggers. Matching a path should be
> fast in the average case, and non-awful even if an adversary
> designed the path to be evil. This is for a web-server-thing
> that registers handlers for paths that begin with certain
> strings, like most of the cool web servers do.
>
> I've written a function that takes a list of trigger strings,
> and outputs a Python regular expression whose match() should
> find the longest matching trigger. For example, given:
>
> triggers = ["Alice", "Bob", "John", "John Smith",
>             "John Smith Jr", "John Smith Sr"]
>
> My re-generator returns:
>
> re.compile('^(John( Smith( (Sr|Jr)|)|)|Bob|Alice)')
>
> Some RE details: I actually use '(?:' instead of just '(', but
> no need to worry about that now. There are never *'s or +'s or
> other loop-like things in the generated REs. The REs list
> choices in reverse-lexicographic order; as the doc says, "REs
> separated by "|" are tried from left to right." The order
> should ensure finding the longest matching trigger. In a
> couple of minutes of testing, it seems to work.
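
A rough, unfactored version of such a generator -- sort the triggers
into reverse order, escape them, and join them with '|' -- might look
like this (my sketch, not necessarily what your code does):

import re

def make_re(triggers):
    # Reverse-lexicographic order puts every trigger before any of
    # its own prefixes, so the engine tries longer matches first.
    ts = list(triggers)
    ts.sort()
    ts.reverse()
    return re.compile('^(?:%s)' % '|'.join([re.escape(t) for t in ts]))
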
>
>
> I've been burned using Python's RE module in the past. Before I
> sink more time into this, I was wondering if someone could tell
> me if the technique makes sense. In particular:
>
> Is this likely to be fast in normal cases? Is there a
> clearly better approach?
Well, the re module is an NFA engine, not a DFA engine, so I'd expect
this re to run in time proportional to the number of triggers.
"Dozens to hundreds" sounds like a fairly small number in context,
but: time it! Only you know what fast enough is.
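
Something like this would give you a number (untested, and with the
pattern written out by hand for the example triggers rather than
generated):

import re, timeit

# Hand-rolled equivalent of the generated RE for the example
# triggers, alternatives in reverse-lexicographic order.
pat = re.compile('^(?:John(?: Smith(?: (?:Sr|Jr))?)?|Bob|Alice)')

t = timeit.Timer("pat.match('John Smith Jr and then some')",
                 "from __main__ import pat")
print t.timeit(100000), "seconds for 100000 matches"
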
Here's a cute approach that should have running time independent of
the number of triggers:

def build_dict(prefix_list):
    # Bucket the prefixes by their first character; each bucket
    # collects the tails (everything after that character).
    d = {}
    for p in prefix_list:
        if len(p) > 0:
            d.setdefault(p[0], []).append(p[1:])
    # Recurse on each bucket to build a trie of nested dicts; an
    # empty tail means a complete trigger ends here, marked by ''.
    r = {}
    for p in d:
        r[p] = build_dict(d[p])
        if '' in d[p]:
            r[p][''] = 1
    return r
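
To illustrate the shape of the thing (assuming I've transcribed it
correctly; dict key order in the repr may differ):

print build_dict(["a", "ab"])
# {'a': {'': 1, 'b': {'': 1}}} -- nested dicts forming a trie, with
# a '' key marking each node where a complete trigger ends
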
def longest_prefix(s, d):
    # Walk the trie a character at a time; j remembers the length
    # of the longest complete trigger seen so far.
    i = 0
    j = 0
    for c in s:
        if c in d:
            d = d[c]
            i += 1
            if '' in d:
                j = i
        else:
            return s[:j]
    return s[:j]

if __name__ == '__main__':
    d = build_dict(["Alice", "Bob", "John", "John Smith",
                    "John Smith Jr", "John Smith Sr"])
    print longest_prefix("John Smith", d)
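
A path that runs on past a trigger should still come out right
(again assuming faithful transcription):

print longest_prefix("John Smith Jr, Esq.", d)
# prints: John Smith Jr
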
Here's a less cute but probably saner approach:

def prepare(prefix_list):
    # Group the triggers by length: d maps each length to a dict
    # of all triggers of that length.
    d = {}
    for p in prefix_list:
        d.setdefault(len(p), {})[p] = p
    # l is the list of lengths present, longest first.
    l = d.keys()
    l.sort()
    l.reverse()
    return d, l

def longest_prefix2(s, d, l):
    # Try each trigger length, longest first; the first slice of
    # s that is a known trigger of that length is the answer.
    for i in l:
        r = d[i].get(s[:i])
        if r is not None:
            return r
    return ''

if __name__ == '__main__':
    d, l = prepare(["Alice", "Bob", "John", "John Smith",
                    "John Smith Jr", "John Smith Sr"])
    print longest_prefix2("John Smith", d, l)
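
What prepare() builds, for a couple of triggers (my sketch of the
intermediate data, not verified output):

d, l = prepare(["Bob", "Alice"])
# d == {3: {'Bob': 'Bob'}, 5: {'Alice': 'Alice'}}
# l == [5, 3] -- lengths to try, longest first
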
This should run in time proportional to the number of distinct
trigger lengths (and use rather less memory than my first version).
> What would be the worst-case time?
>
> Am I likely to exceed recursion depth?
Not in 2.4, where the RE engine is no longer recursive :-) And I
doubt it before that, actually.
> Any other gotcha? Am I likely to find it when I compile the
> RE, or when I try to match a path?
>
>
> Thanks for any help answering these.
This thread from a year or so back might interest:
http://groups.google.com/groups?hl=en&lr=&ie=UTF-8&threadm=886c5e4b.0210151635.123359bb@posting.google.com
Cheers,
mwh
--
: Giant screaming pieces of excrement, they are.
I have a feeling that some of the people in here have a
MUCH more exciting time relieving themselves than I do.
-- Mike Sphar & Dave Brown, asr