Seeking regex optimizer

Sun Jun 18 18:06:44 EDT 2006

On 19/06/2006 6:30 AM, Paddy wrote:
> Kay Schluehr wrote:
>> I have a list of strings ls = [s_1,s_2,...,s_n] and want to create a
>> regular expression sx from it, such that sx.match(s) yields a SRE_Match
>> object when s starts with an s_i for one i in [0,...,n].  There might
>> be relations between those strings: s_k.startswith(s_1) -> True or
>> s_k.endswith(s_1) -> True.

Kay, what is the relevance of one string being a suffix of another? I 
don't see how that could affect the outcome.

  An extreme case would be ls = ['a', 'aa',
>> ...,'aaaa...ab']. For this reason SRE_Match should provide the longest
>> possible match.
>>
>> Is there a Python module able to create an optimized regex rx from ls
>> for the given constraints?

Optimised with respect to what? speed? ease of construction?

I presume that you will ensure that the members of the list are unique.

Note that the Python regex engine will consider each candidate in 
Paddy's solution left to right until it gets a match or reaches the end 
(that's why the reverse sort is needed to get longest possible match). 
This is worst-case O(N) where N is the total of the lengths of all the 
strings in your list.

As far as I can see, this is the only basic solution (modulo one or two 
twiddles -- see below) using Python's re.

How many strings in your list? Average/maximum length? Likelihood of 
ls[i].startswith(ls[j]) == True? unicode or str?

Your requirements are rather constrained: "sx.match(s) yields a 
SRE_Match object" ... why do you need this? Surely all you need is 
matched_len (which may be zero) such that s[:matched_len] is the matched 
prefix.

I would have thought the way to approach this would be a simple 
character-by-character tree/trie-driven lookup. This would be worst case 
O(n) where n is the length of the longest string in your list. There may 
well be a Python-callable gadget for this on the net somewhere. Google 
"Danny Yoo ahocorasick" for a Python-callable solution to a similar but 
more complex problem.

A cheap hack using Python's re: divide the problem according to first 
character:

prefix_dict_match = {
     'a': re.compile('alpaca|alligator').match,
     'f': re.compile('foobar|foo').match,
     'z': re.compile('zoo|zebra').match,
     }
if s and s[0] in prefix_dict_match:
     match_obj = prefix_dict_match[s[0]](s)
else:
     match_obj = None

>>
>> Regards,
>> Kay
> 
> A start would be:
>   regexp = "^(" + "|".join(sorted(ls, reverse=True)) + ")"
> But the above does not work if you have special characters in your
> strings.

Paddy, fixing that problem, and "optimising" by removing the redundant 
^() metacharacters:

regexp = "|".join(map(re.escape, sorted(ls, reverse=True)))

Hoping some of this helps,
Regards,
John