regex searching hangs

fortepianissimo fortepianissimo at yahoo.com.tw
Sun Feb 8 14:29:35 EST 2004


Thanks Tim,

--- Tim Peters <tim.one at comcast.net> wrote:
> [Fortepianissimo]
> > Could someone explains why the following code hangs (Python 2.3.3)?
> 
> Yes, but it's not hung, and the answer is involved.  Jeffrey Friedl
> wrote a
> book about it, "Mastering Regular Expressions" (O'Reilly).  Chapter 4
> is
> available (free) online, and is relevant:
> 
>     http://www.oreilly.com/catalog/regex/chapter/ch04.html

Good pointer - guess this is new stuff added in the second ed...

> > --- CODE STARTS ---
> > import re
> >
> > p=re.compile(r'\S+(?:\.\S+)+\.com')
> >
> > t='......................................'
> > p.search(t)
> > --- CODE ENDS ---
> 
> Run this instead, and you should see that the failure to match takes
> time
> exponential in the number of periods (add a period, and the time
> roughly
> doubles):
> (... )
> Semantically, the nested loop
> 
>     (?:\.\S+)+
> 
> part matches every string of length >= 2 that starts with a period
> and doesn't contain whitespace.  

Yup I now see the problem. This coupled with the prefix "\S+" forces
Python to "start" matching "(?:\.\S+)+" from any position, and the
trailing "\.com" forces it to "stop" matching "(?:\.\S+)+" at any
position and to start matching "\.com". Removing "\.com" greatly speeds
up the process. In short, this gives gozillion possibilities for
matching.

> > is halting problem a problem here?
> 
> Not in theory, but in practice a match attempt that takes longer than
> the
> age of the universe may be a minor annoyance <wink>.

It would be nice if a "cut-off" value can be passed in so as long as #
of branches reaches the value, the search stops and returns something
to indicate a pre-mature termination.

__________________________________
Do you Yahoo!?
Yahoo! Finance: Get your refund fast by filing online.
http://taxes.yahoo.com/filing.html




More information about the Python-list mailing list