Does my RE rock, or suck?

Bryan Olson fakeaddress at nowhere.org
Thu Jul 8 02:25:16 EDT 2004


Here's my algorithmic problem:

     Given a set of 'trigger' strings, and one 'path' string,
     find the longest of the triggers that is a prefix of the
     path.

The trigger strings may number in the dozens or hundreds, but
the set is more-or-less static; I'll be matching many paths
against the same set of triggers.  Matching a path should be
fast in the average case, and non-awful even if an adversary
designed the path to be evil.  This is for a web-server-thing
that registers handlers for paths that begin with certain
strings, like most of the cool web servers do.

I've written a function that takes a list of trigger strings,
and outputs a Python regular expression that should re.find()
the longest matching trigger.  For example, given:

     triggers = ["Alice", "Bob" "John", "John Smith",
         "John Smith Jr", "John Smith Sr"]

My re-generator returns:

     re.compile('^(Mary|John( Smith( (Sr|Jr)|)|)|Bob)')

Some RE details: I actually use '(?:' instead of just '(',  but
no need to worry about that now.  There are never *'s or +'s or
other loop-like things in the generated RE's.  The RE's list
choices in reverse-lexicographic order; as the doc says, "REs
separated by "|" are tried from left to right."  The order
should ensure finding the longest matching trigger.  In (a
couple minutes of testing, it seems to work.


I've been burned using Python's RE module in the past.  Before I
sink more time into this, I was wondering if someone could tell
me if the technique makes sense.  In particular:

     Is this likely to be fast in normal cases?  Is there a
     clearly better approach?

     What would be the worst-case time?

     Am I likely to exceed recursion depth?

     Any other gotcha?  Am I likely to find it when I compile the
     RE, or when I try to match a path?


Thanks for any help answering these.

-- 
--Bryan



More information about the Python-list mailing list