[issue40879] Strange regex cycle

Tim Peters report at bugs.python.org
Fri Jun 5 21:21:10 EDT 2020


Tim Peters <tim at python.org> added the comment:

The repr truncates the pattern string, for display, if it's "too long". The only visual clue about that, though, is that the display is missing the pattern string's closing quote, as in the output you showed here. If you look at url_pat.pattern, though, you'll see that nothing has been lost.
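A small sketch of that behavior, for anyone who wants to check it locally. The pattern here is a made-up stand-in (the real url_pat is not reproduced), and the exact cutoff length is a CPython implementation detail, so the code only compares lengths rather than relying on a specific number:

```python
import re

# Hypothetical long pattern standing in for url_pat; any long pattern will do.
long_pattern = "a" * 500
pat = re.compile(long_pattern)

shown = repr(pat)          # what the interactive prompt displays
stored = pat.pattern       # what the compiled object actually holds

print(len(shown), len(stored))
print(shown[-20:])         # tail of the display: look for the missing closing quote
print(stored == long_pattern)
```

The display is shorter than the pattern, but the .pattern attribute still holds the full string.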

I'm not sure why it does that.  As I vaguely recall, some years ago there was a crusade to limit maximum repr sizes because long output was considered to be "a security issue" (e.g., DoS attacks via tricking logging/auditing facilities into writing giant strings when recording reprs).

In any case, that's all there is to that part.

For the rest, it's exceedingly unlikely that there's actually an infinite loop. Instead there's a messy regexp with multiple nested quantifiers, which are notorious for exhibiting exponential-time behavior, especially in non-matching cases. Such regexps can be rewritten to have linear-time behavior instead, but it's an effort I personally have no interest in pursuing here. See Jeffrey Friedl's "Mastering Regular Expressions" book for detailed explanations.
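To illustrate the general effect (not the pattern from this report), here is a minimal sketch using the textbook nested-quantifier pattern (a+)+$ against input that cannot match. The patterns, the helper name time_match, and the input sizes are assumptions chosen for illustration; timings will vary by machine and Python version, so no specific numbers are claimed:

```python
import re
import time

# Textbook nested quantifier: many ways to split a run of 'a's among
# repetitions of the group, all of which get tried when the match fails.
nested = re.compile(r"(a+)+$")
# A rewrite without the nested quantifier that accepts the same strings.
flat = re.compile(r"a+$")

def time_match(pattern, text):
    """Return (match result, elapsed seconds) for pattern.match(text)."""
    start = time.perf_counter()
    result = pattern.match(text)
    return result, time.perf_counter() - start

for n in (10, 14, 18):
    text = "a" * n + "b"      # trailing 'b' guarantees failure
    _, slow = time_match(nested, text)
    _, fast = time_match(flat, text)
    print(f"n={n:2d}  nested={slow:.6f}s  flat={fast:.6f}s")
```

Both patterns reject the same inputs; the nested one should take noticeably longer as n grows, which can look like a hang on longer strings even though the search would eventually finish.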

The reason I have no interest: it's almost always a losing idea to try to parse any aspect of HTML with regexps. Use an HTML parser instead (or for URLs specifically, see urllib.parse).
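As a rough sketch of that alternative, the following uses the standard library's html.parser to collect link targets and urllib.parse.urlsplit to take one apart. The class name LinkCollector, the helper extract_links, and the sample markup are all made up for illustration; the original poster's actual input is not shown in this thread:

```python
from html.parser import HTMLParser
from urllib.parse import urlsplit

class LinkCollector(HTMLParser):
    """Collect the href value of every <a> tag fed to the parser."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) pairs; value may be None.
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)

def extract_links(html_text):
    """Return all href values found in html_text, in document order."""
    collector = LinkCollector()
    collector.feed(html_text)
    collector.close()
    return collector.links

# Made-up sample markup.
sample = '<p>See <a href="https://example.com/docs/page.html?lang=en#intro">docs</a>.</p>'
links = extract_links(sample)
parts = urlsplit(links[0])
print(links)
print(parts.scheme, parts.netloc, parts.path, parts.query, parts.fragment)
```

No regexp is involved, so there is no backtracking to go wrong, and the URL components come back already separated.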

----------
nosy: +tim.peters

_______________________________________
Python tracker <report at bugs.python.org>
<https://bugs.python.org/issue40879>
_______________________________________

