greedy match wanted

Thu Mar 3 15:00:36 EST 2005

This is very similar to some common problems in developing pyparsing
grammars.  One of my early attempts at parsing CORBA IDL listed the
valid parameter passing mechanisms as ('in' | 'out' | 'inout').  As you
can imagine, I never successfully matched an 'inout', since the 'in'
match came first.  (Pyparsing offers an alternative greedy alternation
operator '^', which tests all alternatives and then selects the longest
match - however, '^' can have performance issues.)  Of course, the
simplest recourse is just to change the definition to ('inout' | 'in' |
'out'), but I wanted to provide a more general mechanism (since I was
getting e-mails from pyparsing users also having this same problem).
The simplest approach is to just sort the list by order, longest to
shortest, but this removes any option the developer may want to try to
front-load the list with options that are expected to occur more
frequently (for example, if defining a list of potential relational
operators, forcing ('=' | '<' | '>' | '<=' | '>=' | '!=') to become
('<=' | '>='| '!=' | '=' | '<' | '>') may cause unnecessary tests of
infrequently used operators).

I ended up adding the oneOf helper to pyparsing, which takes a single
string of space-delimited literals, and then returns a list of literals
separated by '|' operators, with literals minimally reordered, that is,
only reorderded if a longer literal masks some shorter literal.  Also,
accidental duplicates are stripped. Here is the implementation of oneOf
('|' operators generate a MatchFirst object):

.def oneOf( strs, caseless=False ):
.    """Helper to quickly define a set of alternative Literals, and
makes sure to do
.       longest-first testing when there is a conflict, regardless of
the input order,
.       but returns a MatchFirst for best performance.
.    """
.    if caseless:
.        isequal = ( lambda a,b: a.upper() == b.upper() )
.        parseElementClass = CaselessLiteral
.    else:
.        isequal = ( lambda a,b: a == b )
.        parseElementClass = Literal
.
.    symbols = strs.split()
.    i = 0
.    while i < len(symbols)-1:
.        cur = symbols[i]
.        for j,other in enumerate(symbols[i+1:]):
.            if ( isequal(other, cur) ):
.                del symbols[i+j+1]
.                break
.            elif ( isequal(other[:len(cur)],cur) ):
.                del symbols[i+j+1]
.                symbols.insert(i,other)
.                cur = other
.                break
.        else:
.            i += 1
.
.    return MatchFirst( [ parseElementClass(sym) for sym in symbols ] )
.

You could pre-process your regex with something similar, so you can
avoid having to do this (error-prone) checking yourself.

-- Paul