Seeking regex optimizer

Mirco Wahab peace.is.our.profession at gmx.de
Tue Jun 20 09:05:38 EDT 2006


Thus spoke andrewdalke at gmail.com (on 2006-06-20 01:39):

Hi, are you the A.Dalke from the Schulten group (VMD) as
listed here: http://www.ks.uiuc.edu/Overview/People/former.cgi

> Replying to me Mirco Wahab wrote:
>> If you pull the strings into (?>( ... )) (atomic groups),
>> this would't happen.
> 
> Given that Python's re engine doesn't support this feature
> it doesn't really help the original poster's problem.

Yeah, right. I misunderstood the problem
 - where it was meant to be 'literally'
a problem of capture group id's. I took
it metaphorical, which meant to me
'problems by backtracking' (which was wrong).

> Even if some future Python did support it, the limit
> to 100 named groups is unaffected by backtracking.

Right for that one, what I had in mind was something
like the following:

  # naive regex '\d+9'
  # find some number only if it ends by 9
  my $str="1009900000000000000000000000000000000000";

  print "non-atomic backtracks: $1\n" if $str =~ / (  \d+9 )/x;
  print "atomic: matches $1\n"     if $str =~ / ( (?>\d+)9 )/x;

Second line would not backtrack and not find the 9 behind \d+
because atomic grouping prevents that, so line 1 matches
10099 and line 2 nothing (but very fast, which is the point).

>>>> import re
>>>> re.compile("(.)"*100)
> Traceback (most recent call last):
> ...
> ...
> AssertionError: sorry, but this version only supports 100 named groups
>>>>

I tried:

  my $re = '(\d+)\D' x 10000;
  my $text = join ' ', 0..10000;

  print $+ if $text =~ /$re/;

  print "\n", $1001;

which prints:
  9999
  1000

Why is there a named group count
restriction in Python? Any reasons
one should know?

Regards & thanks

Mirco



More information about the Python-list mailing list