regex searching hangs

Tim Peters tim.one at comcast.net
Sun Feb 8 13:44:21 EST 2004


[Fortepianissimo]
> Could someone explains why the following code hangs (Python 2.3.3)?

Yes, but it's not hung, and the answer is involved.  Jeffrey Friedl wrote a
book about it, "Mastering Regular Expressions" (O'Reilly).  Chapter 4 is
available (free) online, and is relevant:

    http://www.oreilly.com/catalog/regex/chapter/ch04.html

Python's regexp engine is what he calls "Traditional NFA" there.

> --- CODE STARTS ---
> import re
>
> p=re.compile(r'\S+(?:\.\S+)+\.com')
>
> t='......................................'
> p.search(t)
> --- CODE ENDS ---

Run this instead, and you should see that the failure to match takes time
exponential in the number of periods (add a period, and the time roughly
doubles):

import re
from time import clock as now

p = re.compile(r'\S+(?:\.\S+)+\.com')

for n in range(40):
    target = '.' * n
    start = now()
    p.search(target)
    finish = now()
    print "%2d" % n, finish - start

> From the syntax and semantics of regex, the entire t should be
> consumed by the pattern '\S+(?:\.\S+)+'

Right.

> and the search() call should return None.

Also right.

> Is search() doing a depth-first search and can't pulling
> itself out?

It's doing a backtracking search, and your nested looping is forcing it to
try an exponential number of possibilities that can't lead to an *overall*
match in the end.  Semantically, the nested loop

    (?:\.\S+)+

part matches every string of length >= 2 that starts with a period and
doesn't contain whitespace.  That may or may not be what you intended to
match, but, if it is, then

    \.\S+

is a much more efficent way to write that part.  It doesn't help either in
the original that \S also matches \., so (?:\.\S+)+ is highly ambiguous.  If
you didn't intend that the innermost \S could suck up periods too, then the
more precise rewrite of that part as (?:\.[^\s.]+)+ would help a lot on its
own.

> ...
> BTW, I don't think any regex matching should ever hang in any
> circumstances - correct me if I'm wrong:

It shouldn't hang, but it may take a very long time.

> is halting problem a problem here?

Not in theory, but in practice a match attempt that takes longer than the
age of the universe may be a minor annoyance <wink>.





More information about the Python-list mailing list