Seeking regex optimizer

Mirco Wahab wahab at chemie.uni-halle.de
Sun Jun 18 17:58:22 EDT 2006


Thus spoke Kay Schluehr (on 2006-06-18 19:07):

> I have a list of strings ls = [s_1,s_2,...,s_n] and want to create a
> regular expression sx from it, such that sx.match(s) yields a SRE_Match
> object when s starts with an s_i for one i in [0,...,n].  There might
> be relations between those strings: s_k.startswith(s_1) -> True or
> s_k.endswith(s_1) -> True. An extreme case would be ls = ['a', 'aa',
> ...,'aaaa...ab']. For this reason SRE_Match should provide the longest
> possible match.

With some ideas from Kay and Paddy, I tried to get
along with Python in doing this.

If it's allowed to spread the individual strings
into alternations, the following would imho do:


#!/usr/bin/python
# -*- coding: iso-8859-15 -*-
text = r'this is a text containing aaaöüöaaaµaaa and more'
lstr = [
      'a',
      'aa',
      'aaaaa',
      'aaaöüöaaaµaaa',
      'aaaaaaaaaaaaaaab'
]

import re
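# Longest alternatives first: re takes the first alternative that matches.
# re.escape() keeps any regex metacharacters in the strings literal.
pat  = re.compile(
               '(' +
               '|'.join(re.escape(s) for s in sorted(lstr, key=len, reverse=True)) +
               ')')

# Collect all hits in the text and put the longest one first.
hits = sorted(pat.findall(text), key=len, reverse=True)
print('Longest: ' + hits[0])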



This will print 'aaaöüöaaaµaaa' from the text
and won't complain about the special characters used.
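
To come back to Kay's original question (a match at the start of s, with
the longest s_i winning), the same idea can be wrapped into a small helper.
This is only a sketch: the name build_prefix_matcher and the sample strings
are made up for illustration, and it relies on the fact that Python's re
tries alternatives from left to right and takes the first one that matches,
so sorting longest first should be enough.

```python
import re

def build_prefix_matcher(strings):
    """Hypothetical helper: alternation of literal strings, longest first."""
    ordered = sorted(strings, key=len, reverse=True)
    # re.escape() makes every string match literally (e.g. the '.' in 'a.b').
    return re.compile('|'.join(re.escape(s) for s in ordered))

lstr = ['a', 'aa', 'aaaaa', 'a.b']      # sample data, not from the thread
sx = build_prefix_matcher(lstr)

m = sx.match('aaaaaa and more')         # match() anchors at the start of s
longest = m.group(0) if m else None     # 'aaaaa'

# For comparison: the same alternation in list order stops at the first
# alternative that matches, which here is the shortest one.
naive = re.compile('|'.join(re.escape(s) for s in lstr))
short = naive.match('aaaaaa and more').group(0)   # 'a'

print('sorted: ' + longest)
print('unsorted: ' + short)
```

sx.match(s) returns None when s starts with none of the strings, so the
result has to be checked before calling group().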

In Perl, you could build up the regex
by dynamic evaluation (??{..}), but I
didn't manage to get this working in Python,
so here is (in Perl) how I thought it would work:

my $text = "this is a text containing aaaöüöaaaµaaa and more";
my @lstr = (
      'a',
      'aa',
      'aaaaa',
      'aaaöüöaaaµaaa',
      'aaaaaaaaaaaaaaab',
   );

my $re = qr{
            (??{
                join '|',
                   map { quotemeta }
                      sort{ length $b <=> length $a }
                         @lstr
             })
            }x;

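# Sort the hits by length (a plain sort would compare them as strings)
# and keep the first, i.e. the longest one.
my ($hit) = sort { length $b <=> length $a } $text =~ /$re/g;
print "Longest: $hit\n";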


Maybe the experts can bring some light to it.

Regards

Mirco


