re bug

Thomas Rast foo.bar at freesurf.ch.invalid
Tue Oct 5 07:46:19 EDT 2004


klaus_neuner82 at yahoo.de (Klaus Neuner) writes:

> The regex rx compiles. It can be used to search the strings str1 and
> str2. Yet, when used with str3, the program will not terminate for
> days.

Here's a more readable version of the regexp in question (I assume the
newlines were introduced by your mail program and not part of the
original string), which should be equivalent to your form if compiled
with the re.VERBOSE flag:

'''
(?:^|\ )
(?P<ALLES>
   (?:
      [^/ ]*/[^/ ]*/
      (?:                 # why a group?
         cn
         (?: :[^/ #]+ )*
      )
   )*
   [^/ ]*/[^/ ]*/
   (?:                    # why a group?
      cn
      (?: :[^: /#]+ )* 
      :2
      (?: :[^: /#]+ )*
      :3
      (?: :[^: /#]+ )*
   )
)
(?=(\ |$))
'''

Judging from the number of '*' and '+' quantifiers, the long search
time may be due to excessive backtracking as the regexp engine tries
to find a match.

[Now that I'm looking this through for the N-th time over, maybe it's
not, maybe you really found a bug.  But I've already written the part
below and don't want to waste the time I invested in analysing your
regexp by not posting it...]

> At the time being, I can neither avoid these regular expressions, nor
> can I afford to wait days for the result of my program.

Your example imposes a very strong structure on a possible match: It
consists of small "tokens" (for lack of a better word) delimited by
'/' and ':'.  Try simplifying the regexp to something like (re.VERBOSE
again):

'''
(
 [^ /]*/[^ /]*/    # two "tokens" followed by a slash each
 cn
 (:[^ /:]+)*       # "tokens" that start with a colon
)*
'''

then .split() the string you're trying to search (this is a far more
efficient way of expressing your "boundary conditions") and .match()
against each of the words.  If you have a match, check if it fits the
rest of the requirements, i.e. look for the :2 and :3 tokens etc.

HTH
Thomas

-- 
If you want to reply by mail, substitute my first and last name for
'foo' and 'bar', respectively, and remove '.invalid'.



More information about the Python-list mailing list