+ in regular expression

MRAB python at mrabarnett.plus.com
Fri Oct 5 12:07:47 EDT 2012


On 2012-10-05 16:27, Evan Driscoll wrote:
> On 10/05/2012 04:23 AM, Duncan Booth wrote:
>> A regular expression element may be followed by a quantifier.
>> Quantifiers are '*', '+', '?', '{n}', '{n,m}' (and lazy quantifiers
>> '*?', '+?', '{n,m}?'). There's nothing in the regex language which says
>> you can follow an element with two quantifiers.
> In fact, *you* did -- the first sentence of that paragraph! :-)
>
> \s is a regex, so you can follow it with a quantifier and get \s{6}.
> That's also a regex, so you should be able to follow it with a quantifier.
>
> I can understand that you can create a grammar that excludes it. I'm
> actually really interested to know if anyone knows whether this was a
> deliberate decision and, if so, what the reason is. (And if not --
> should it be considered a (low priority) bug?)
>
> Was it because such patterns often reveal a mistake? Because "\s{6}+"
> has other meanings in different regex syntaxes and the designers didn't
> want confusion? Because it was simpler to parse that way? Because the
> "hey you recognize regular expressions by converting it to a finite
> automaton" story is a lie in most real-world regex implementations (in
> part because they're not actually regular expressions) and repeated
> quantifiers cause problems with the parsing techniques that actually get
> used?
>
You rarely want to repeat a repeated element. It can also result in
catastrophic backtracking unless you're _very_ careful.
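
As a rough illustration (a sketch, not part of the original exchange): with
Python's re module, a repeated element can still be repeated by wrapping it
in a non-capturing group. The helper time_nested and the pattern
r"(?:a+)+b" are made up for this example. They time a failing match of a
nested quantifier so the growth in running time can be observed. Timings
depend on the machine and on the re version, so none are shown.

```python
import re
import time

# Grouping makes "repeat a repeated element" explicit: one or more
# runs of exactly six whitespace characters.
grouped = re.compile(r"(?:\s{6})+")
matched = grouped.match(" " * 13).group()
# 13 spaces hold two complete runs of six, so 12 characters are matched.
print(len(matched))


def time_nested(n):
    """Time a failing match of a nested quantifier on n 'a's followed by 'c'.

    Hypothetical helper for illustration only.
    """
    subject = "a" * n + "c"
    start = time.perf_counter()
    result = re.match(r"(?:a+)+b", subject)  # can never match: there is no 'b'
    return result, time.perf_counter() - start


# Small sizes only; a backtracking engine may slow down sharply as n grows.
for n in (10, 14, 18):
    result, elapsed = time_nested(n)
    print(n, result, f"{elapsed:.4f}s")
```

The inner a+ and the outer + can split the same run of 'a's in many
different ways, and a backtracking matcher may try all of them before
giving up.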

In many other regex implementations (including mine), "*+", "++" and
"?+" are possessive quantifiers, much as "*?", "+?" and "??" are lazy
quantifiers.
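
A minimal sketch of the difference, assuming a Python whose re module
accepts possessive quantifiers (the standard library added them in 3.11;
older versions reject "*+" with an error, which the hypothetical helper
try_compile below turns into None):

```python
import re


def try_compile(pattern):
    """Return the compiled pattern, or None if this re version rejects it."""
    try:
        return re.compile(pattern)
    except re.error:
        return None


greedy = re.compile(r"a*a")
possessive = try_compile(r"a*+a")  # None on versions without possessive support

# Greedy a* first takes all three 'a's, then gives one back so the
# final 'a' in the pattern can match.
print(greedy.match("aaa").group())

if possessive is not None:
    # Possessive a*+ never gives anything back, so the final 'a'
    # has nothing left to match and the whole match fails.
    print(possessive.match("aaa"))
else:
    print("possessive quantifiers not supported by this re version")
```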

You could, of course, ask why adding "?" after a quantifier doesn't
make it optional, e.g. why r"\s{6}?" doesn't mean the same as
r"(?:\s{6})?", or why r"\s{0,6}?" doesn't mean the same as
r"(?:\s{0,6})?".
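
The difference can be checked with Python's re module. This is a small
sketch; the sample strings and variable names are made up for illustration.

```python
import re

six = " " * 6 + "x"    # six spaces, then a non-space character
three = " " * 3 + "x"  # only three spaces

# r"\s{6}?" is the lazy form of {6}: it still needs exactly six characters.
lazy_exact = re.match(r"\s{6}?", six).group()
lazy_exact_short = re.match(r"\s{6}?", three)  # no match at all

# r"(?:\s{6})?" makes the whole group optional: six characters or none.
optional_exact = re.match(r"(?:\s{6})?", three).group()

# r"\s{0,6}?" is lazy: with nothing after it, it takes as little as it can.
lazy_range = re.match(r"\s{0,6}?", six).group()

# r"(?:\s{0,6})?" wraps a greedy {0,6}, which takes as much as it can.
optional_range = re.match(r"(?:\s{0,6})?", six).group()

print(repr(lazy_exact), lazy_exact_short, repr(optional_exact))
print(repr(lazy_range), repr(optional_range))
```

So a trailing "?" changes how much the quantifier prefers to take, not
whether the quantified element may be absent.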
