regex searching hangs
fortepianissimo
fortepianissimo at yahoo.com.tw
Sun Feb 8 14:29:35 EST 2004
Thanks Tim,
--- Tim Peters <tim.one at comcast.net> wrote:
> [Fortepianissimo]
> > Could someone explains why the following code hangs (Python 2.3.3)?
>
> Yes, but it's not hung, and the answer is involved. Jeffrey Friedl
> wrote a
> book about it, "Mastering Regular Expressions" (O'Reilly). Chapter 4
> is
> available (free) online, and is relevant:
>
> http://www.oreilly.com/catalog/regex/chapter/ch04.html
Good pointer - guess this is new stuff added in the second ed...
> > --- CODE STARTS ---
> > import re
> >
> > p=re.compile(r'\S+(?:\.\S+)+\.com')
> >
> > t='......................................'
> > p.search(t)
> > --- CODE ENDS ---
>
> Run this instead, and you should see that the failure to match takes
> time
> exponential in the number of periods (add a period, and the time
> roughly
> doubles):
> (... )
> Semantically, the nested loop
>
> (?:\.\S+)+
>
> part matches every string of length >= 2 that starts with a period
> and doesn't contain whitespace.
Yup I now see the problem. This coupled with the prefix "\S+" forces
Python to "start" matching "(?:\.\S+)+" from any position, and the
trailing "\.com" forces it to "stop" matching "(?:\.\S+)+" at any
position and to start matching "\.com". Removing "\.com" greatly speeds
up the process. In short, this gives gozillion possibilities for
matching.
> > is halting problem a problem here?
>
> Not in theory, but in practice a match attempt that takes longer than
> the
> age of the universe may be a minor annoyance <wink>.
It would be nice if a "cut-off" value can be passed in so as long as #
of branches reaches the value, the search stops and returns something
to indicate a pre-mature termination.
__________________________________
Do you Yahoo!?
Yahoo! Finance: Get your refund fast by filing online.
http://taxes.yahoo.com/filing.html
More information about the Python-list
mailing list