Freeze problem with Regular Expression

Thu Jun 26 14:29:28 EDT 2008

On Thu, 26 Jun 2008 11:20:01 -0500, Peter Pearson wrote:
> On 25 Jun 2008 15:20:04 GMT, Kirk <noreply at yahoo.com> wrote:
[snip]
>> the following regular expression matching seems to enter in a infinite 
>> loop:
[snip]
>> import re
>> text = ' MSX INTERNATIONAL HOLDINGS ITALIA srl (di seguito MSX ITALIA) 
>> una '
>> re.findall('[^A-Z|0-9]*((?:[0-9]*[A-Z]+[0-9|a-z|\-]*)+\s*[a-z]*\s*(?:[0-9]
>> *[A-Z]+[0-9|a-z|\-]*\s*)*)([^A-Z]*)$', text)
[snip]
>
> If it will help some smarter person identify the problem, it can
> be simplified to this:
>
> re.findall('[^X]*((?:0*X+0*)+\s*a*\s*(?:0*X+0*\s*)*)([^X]*)$',
>            "XXXXXXXXXXXXXXXXX (X" )
>
> This doesn't actually hang, it just takes a long time.  The
> time taken increases quickly as the chain of X's gets longer.

... and can be further reduced to this:

re.findall('(X+)+(X+)*$', "XXXXXXXXXXXXXXXXXXXy" )

which bears considerable resemblance to a regexp that was called
to my attention inconjunction with this report on pathological
regular expressions:

  http://www.cs.rice.edu/~scrosby/hash/

that resulted in my measuring the following data for the
regular expression '(x+x+)+y' in strings consisting of many
x's and maybe a terminal y:

           seconds to scan ...
           -------------------
 # x's     xxx...xy    xxx...x
 -----     --------    -------
   10          0.0        0.0
   11          0.0        0.0
   12          0.0        0.0
   13          0.0        0.01
   14          0.0        0.0
   15          0.0        0.01
   16          0.0        0.03
   17          0.0        0.04
   18          0.0        0.09
   19          0.0        0.16
   20          0.0        0.33
   21          0.0        0.65
   22          0.0        1.3
   23          0.0        2.58
   24          0.0        5.18
   25          0.0        10.27
   26          0.0        20.41

Bottom line: given any regular-expression matcher, you can
find a regular expression and a family of strings such that
the matching time increases exponentially in the length of
the string.

-- 
To email me, substitute nowhere->spamcop, invalid->net.