Looking for a regexp generator based on a set of known string representative of a string set

Fri Sep 15 19:03:24 EDT 2006

"Paul McGuire" <ptmcg at austin.rr._bogus_.com> wrote in message 
news:OTgMg.15795$o42.8868 at tornado.texas.rr.com...
> "Andy Dingley" <dingbat at codesmiths.com> wrote in message 
> news:1157730937.621029.259320 at i3g2000cwc.googlegroups.com...
>>
>> vbfoobar at gmail.com wrote:
>>
>>> I am looking for python code that takes as input a list of strings
>>> [...] and outputs the python regular expression
>>
>>    (s1|s2|s3|s4|s5)
>> for strings of "s1" etc.
>>
>> Regex compilers are themselves quite good at optimising beyond this
>>
>
> It turns out this problem is a little trickier, especially when one of 
> your strings is a leading subset of another.  For instance, let's say we 
> are looking for comparison operators, one of <, >, =, >=, <=, or !=. 
> Simply concatenating with intervening '|' characters gives this regexp:
>
> "<|>|=|<=|>=|!="
>
> However, the leading '<' and '>' alternatives mask the later '<=', '<>', 
> and '>=' alternatives, and so the regexp never matches the longer options 
> (I was not able to find a greediness switch that would override this).  So 
> when searching "a >= b" we get this:
>
>>>> re.findall("<|>|=|<=|>=|!=", "a >= b")
> ['>', '=']
>
> By moving the longer option to the front of the regexp, the longer option 
> is no longer masked by the shorter:
>
>>>> re.findall(">=|<|>|=|<=|!=", "a >= b")
> ['>=']
>
>
> You also can't just concatenate input strings, since it is very likely 
> they will contain one of the magic re symbols ()[]?*./\+, etc.  So 
> re.escape needs to be called to add the necessary '\'s.
>
> Here is an extract from pyparsing's oneOf function that does something 
> similar, that handles the leading substring masking problem, and escapes 
> the input strings, before concatenating them to a valid regexp.  Of 
> course, the simplest approach would be to just sort the symbols by 
> descending length, but you may have some a priori knowledge that 'X' is a 
> very common match, and want that option tested as early as possible.  So 
> this method only reorders strings if there is a masking problem.
>
>
> def createREfrom( symbols ):  #symbols is a list of strings
>    isequal = ( lambda a,b: a == b )
>    masks = ( lambda a,b: b.startswith(a) )
>    i = 0
>    while i < len(symbols)-1:
>        cur = symbols[i]
>        for j,other in enumerate(symbols[i+1:]):
>            if ( isequal(other, cur) ):
>                del symbols[i+j+1]
>                break
>            elif ( masks(cur, other) ):
>                del symbols[i+j+1]
>                symbols.insert(i,other)
>                cur = other
>                break
>        else:
>            i += 1
>    return "|".join( [ re.escape(sym) for sym in symbols] )
>
>>>> print createREfrom(["ABC","ABCDEF","ABCGHI"])
> ABCDEF|ABCGHI|ABC
>>>> print createREfrom("> < = <= >= != << >> <<< >>>".split())
> \>\=|\>\>\>|\>\>|\>|\<\=|\<\<\<|\<\<|\<|\=|\!\=
>>>> re.findall( createREfrom("> < = <= >= != << >> <<< >>>".split()), "a <= 
>>>> b")
> ['<=']
>
>
> Note, this does not do any optimization, such as collapsing 
> "ABCDEF|ABCGHI" to "ABC(DEF|GHI)".  I think there are some recipes in the 
> Python cookbook for such optimization.
>
> -- Paul
>
>