Freeze problem with Regular Expression
Peter Pearson
ppearson at nowhere.invalid
Thu Jun 26 14:29:28 EDT 2008
On Thu, 26 Jun 2008 11:20:01 -0500, Peter Pearson wrote:
> On 25 Jun 2008 15:20:04 GMT, Kirk <noreply at yahoo.com> wrote:
[snip]
>> the following regular expression matching seems to enter in a infinite
>> loop:
[snip]
>> import re
>> text = ' MSX INTERNATIONAL HOLDINGS ITALIA srl (di seguito MSX ITALIA)
>> una '
>> re.findall('[^A-Z|0-9]*((?:[0-9]*[A-Z]+[0-9|a-z|\-]*)+\s*[a-z]*\s*(?:[0-9]
>> *[A-Z]+[0-9|a-z|\-]*\s*)*)([^A-Z]*)$', text)
[snip]
>
> If it will help some smarter person identify the problem, it can
> be simplified to this:
>
> re.findall('[^X]*((?:0*X+0*)+\s*a*\s*(?:0*X+0*\s*)*)([^X]*)$',
> "XXXXXXXXXXXXXXXXX (X" )
>
> This doesn't actually hang, it just takes a long time. The
> time taken increases quickly as the chain of X's gets longer.
... and can be further reduced to this:
re.findall('(X+)+(X+)*$', "XXXXXXXXXXXXXXXXXXXy" )
which bears considerable resemblance to a regexp that was called
to my attention inconjunction with this report on pathological
regular expressions:
http://www.cs.rice.edu/~scrosby/hash/
that resulted in my measuring the following data for the
regular expression '(x+x+)+y' in strings consisting of many
x's and maybe a terminal y:
seconds to scan ...
-------------------
# x's xxx...xy xxx...x
----- -------- -------
10 0.0 0.0
11 0.0 0.0
12 0.0 0.0
13 0.0 0.01
14 0.0 0.0
15 0.0 0.01
16 0.0 0.03
17 0.0 0.04
18 0.0 0.09
19 0.0 0.16
20 0.0 0.33
21 0.0 0.65
22 0.0 1.3
23 0.0 2.58
24 0.0 5.18
25 0.0 10.27
26 0.0 20.41
Bottom line: given any regular-expression matcher, you can
find a regular expression and a family of strings such that
the matching time increases exponentially in the length of
the string.
--
To email me, substitute nowhere->spamcop, invalid->net.
More information about the Python-list
mailing list