Seeking regex optimizer

Tue Jun 20 11:12:52 EDT 2006

Mirco Wahab wrote:
> Hi, are you the A.Dalke from the Schulten group (VMD) as
> listed here: http://www.ks.uiuc.edu/Overview/People/former.cgi

Yes.  But I left there nearly a decade ago.

>   # naive regex '\d+9'
>   # find some number only if it ends by 9
>   my $str="1009900000000000000000000000000000000000";
>
>   print "non-atomic backtracks: $1\n" if $str =~ / (  \d+9 )/x;
>   print "atomic: matches $1\n"     if $str =~ / ( (?>\d+)9 )/x;

What makes me less than enthusiastic about various new regexp
features is that they can be written other ways.  For example,

>>> import re
>>> pat = re.compile("([0-8]*9)+")
>>> pat.match("1009900000000000000000000000000000000000")
<_sre.SRE_Match object at 0x590a0>
>>> _.group(0)
'10099'
>>> pat = re.compile("([0-8]*9)+(?!\d)")
>>> pat.match("1009900000000000000000000000000000000000")
>>>

They are different ways to reach the same goal.  Understanding
the alternatives requires some non-trivial understanding about
how regexp engines work.  The diversity of solutions with no
clear preference of one over the other overly promotes
idiosyncratic patterns and requires that everyone reading the
regexps must understand all the special syntax contructions
even when a person's working vocabulary only uses a subset
of those.

> Second line would not backtrack and not find the 9 behind \d+
> because atomic grouping prevents that, so line 1 matches
> 10099 and line 2 nothing (but very fast, which is the point).

While my pattern is not as fast.  It's linearly slower than yours
as it does the backtracking.  (Unless the engine is very smart.)
That factor is not something I'm concerned with.

I can also get performance without backtracking (though there
is setup for backtracking) using

>>> pat = re.compile("([0-8]*9)+([0-8]*)")
>>> m = pat.match("1009900000000000000000000000000000000000")
>>> if m.group(2): print "does not end in a 9"
...
does not end in a 9
>>>
>>> m = pat.match("10099000000000000000000000000000000000009a")
>>> if m.group(2): print "does not end in a 9"
...
>>>

> Why is there a named group count
> restriction in Python? Any reasons
> one should know?

While not definite, I think it introduces an ambiguity with octal
escape codes.  (*That's* something to omit in Python 3000!)
Here's the quote from perlre

       There is no limit to the number of captured substrings that you
may
       use.  However Perl also uses \10, \11, etc. as aliases for \010,
\011,
       etc.  (Recall that 0 means octal, so \011 is the character at
number 9
       in your coded character set; which would be the 10th character,
a hori-
       zontal tab under ASCII.)  Perl resolves this ambiguity by
interpreting
       \10 as a backreference only if at least 10 left parentheses have
opened
       before it.  Likewise \11 is a backreference only if at least 11
left
       parentheses have opened before it.  And so on.  \1 through \9
are
       always interpreted as backreferences.

Python doesn't have "\10", "\11", etc. as aliases for "\010", "\011",
etc.

>>> re.compile("\10")
<_sre.SRE_Pattern object at 0x575c0>
>>> pat = re.compile(r"\10")
Traceback (most recent call last):
  File "<stdin>", line 1, in ?
  File
"/Library/Frameworks/Python.framework/Versions/2.4/lib/python2.4/sre.py",
line 180, in compile
    return _compile(pattern, flags)
  File
"/Library/Frameworks/Python.framework/Versions/2.4/lib/python2.4/sre.py",
line 227, in _compile
    raise error, v # invalid expression
sre_constants.error: bogus escape: '\\10'
>>> pat = re.compile(r"\010")
>>>

so there is no ambiguity until reaching \100.

>>> pat = re.compile(r"\100")
>>> pat.match("@")
<_sre.SRE_Match object at 0x232560>
>>>

Most likely /F (who implemented the regex engine) decided
to refuse the temptation to guess the meaning, eg by
parens counting as Perl does.

Try this variation of your program to see how Perl changes
the meaning of "\100" based on the input string

for (my $i=98; $i<=103; $i++) {
  my $re = '(\d+)\D' x $i . '\100';
  my $text = join ' ', 0..$i;
  $text .= " @";

  print "Testing $i\n";
  if ($text =~ /$re/) {
    print "  Number of matches: $+\n";
    print "  Last group: ${$i}\n";
  } else {
    print "No matches\n";
  }
}

I get

Testing 98
  Number of matches: 98
  Last group: 98
Testing 99
  Number of matches: 99
  Last group: 99
Testing 100
No matches
Testing 101
No matches
Testing 102
No matches
Testing 103
No matches

                                Andrew
                                dalke at dalkescientific.com