Looking for a regexp generator based on a set of known string representative of a string set
Paul McGuire
ptmcg at austin.rr._bogus_.com
Fri Sep 8 12:29:34 EDT 2006
"Andy Dingley" <dingbat at codesmiths.com> wrote in message
news:1157730937.621029.259320 at i3g2000cwc.googlegroups.com...
>
> vbfoobar at gmail.com wrote:
>
>> I am looking for python code that takes as input a list of strings
>> [...] and outputs the python regular expression
>
> (s1|s2|s3|s4|s5)
> for strings of "s1" etc.
>
> Regex compilers are themselves quite good at optimising beyond this
>
It turns out this problem is a little trickier, especially when one of your
strings is a leading subset of another. For instance, let's say we are
looking for comparison operators, one of <, >, =, >=, <=, or !=. Simply
concatenating with intervening '|' characters gives this regexp:
"<|>|=|<=|>=|!="
However, the leading '<' and '>' alternatives mask the later '<=', '<>', and
'>=' alternatives, and so the regexp never matches the longer options (I was
not able to find a greediness switch that would override this). So when
searching "a >= b" we get this:
>>> re.findall("<|>|=|<=|>=|!=", "a >= b")
['>', '=']
By moving the longer option to the front of the regexp, the longer option is
no longer masked by the shorter:
>>> re.findall(">=|<|>|=|<=|!=", "a >= b")
['>=']
You also can't just concatenate input strings, since it is very likely they
will contain one of the magic re symbols ()[]?*./\+, etc. So re.escape
needs to be called to add the necessary '\'s.
Here is an extract from pyparsing's oneOf function that does something
similar, that handles the leading substring masking problem, and escapes the
input strings, before concatenating them to a valid regexp. Of course, the
simplest approach would be to just sort the symbols by descending length,
but you may have some a priori knowledge that 'X' is a very common match,
and want that option tested as early as possible. So this method only
reorders strings if there is a masking problem.
def createREfrom( symbols ): #symbols is a list of strings
isequal = ( lambda a,b: a == b )
masks = ( lambda a,b: b.startswith(a) )
i = 0
while i < len(symbols)-1:
cur = symbols[i]
for j,other in enumerate(symbols[i+1:]):
if ( isequal(other, cur) ):
del symbols[i+j+1]
break
elif ( masks(cur, other) ):
del symbols[i+j+1]
symbols.insert(i,other)
cur = other
break
else:
i += 1
return "|".join( [ re.escape(sym) for sym in symbols] )
>>> print createREfrom(["ABC","ABCDEF","ABCGHI"])
ABCDEF|ABCGHI|ABC
>>> print createREfrom("> < = <= >= != << >> <<< >>>".split())
\>\=|\>\>\>|\>\>|\>|\<\=|\<\<\<|\<\<|\<|\=|\!\=
>>> re.findall( createREfrom("> < = <= >= != << >> <<< >>>".split()), "a <=
>>> b")
['<=']
Note, this does not do any optimization, such as collapsing "ABCDEF|ABCGHI"
to "ABC(DEF|GHI)". I think there are some recipes in the Python cookbook
for such optimization.
-- Paul
More information about the Python-list
mailing list