prefix matching

Wed May 26 16:43:28 EDT 2004

Jeff Epler wrote:
> The last time this question went around, I learned that REs like
>     a|b|c
> essentially tries the alternatives one after another, rather than
> compiling into an FSM

that depends somewhat on what a, b, and c happens to be.

for example,

if all alternatives consist of a single literal character, the entire
subexpression is replaced with a character set ("a|b|c" becomes
"[abc]")

for alternatives that start with literal text or a character set, the
engine never checks alternatives that cannot possible match (if
you feed "aha" to "a...|b...|c...", the second and third alternative
are never checked).

if all alternatives share a common prefix, that prefix will be checked
before any alternative is tried; if the prefix matches, only the suffixes
will be checked for each alternative. ("aa|ab|ac" becomes "a(?:a|b|c)"
becomes "a[abc]")

when searching, the engine uses a KMP-style overlap table to skip
over places where the prefix cannot possibly match.  (which explains
why re.search can sometimes run faster than string.find)

</F>