[docs] [issue19055] Clarify docs for re module: why * does not match as many repetitions as possible.
Akira Li
report at bugs.python.org
Mon Aug 11 11:27:32 CEST 2014
Akira Li added the comment:
tl;dr: added patch that clarifies Python re behavior. Please, review.
---
The documented behavior is not clear: why (a|ab)* is not equivalent to
(a|ab)(a|ab) for aba if the docs say "as many repetitions as are
possible"?
And it is not obvious (it is not the only possible behavior) e.g.,
POSIX [1] expects the longest match, PCRE [2] group may be atomic
(/possesive quantifiers), or there is ungreedy repetition:
$ echo abac | grep -o '\(a\|ab\)* # Basic Regular Expression
aba
$ echo abac | grep -Eo '(a|ab)*' # Extended Regular Expression
aba
$ echo abac | grep -Po '(a|ab)*' # PCRE (like Python regexes)
a
a
$ echo abac | grep -Po '(a|ab)(a|ab)' # PCRE (two repeats)
aba
$ echo abac | grep -Po '(a|ab)*c' # PCRE (re-eval to match the rest)
abac
$ echo abAc | grep -iPo '(a|ab)*+c' # PCRE (possessive quantifiers)
Ac
$ echo abac | grep -Po '(a|ab)*?' # PCRE (ungreedy, zero repeats)
# matches nothing
where BREs, EREs are defined in [1] that says:
If the pattern permits a variable number of matching characters and
thus there is more than one such sequence starting at that point,
*the longest such sequence is matched*.
and:
Consistent with the whole match being the longest of the leftmost
matches, each subpattern, from left to right, *shall match the
longest possible string.*
Python regexes work like PCRE [2] that says:
The matching process tries each alternative in turn, from left to
right, and *the first one that succeeds is used*. If the
alternatives are within a subpattern (defined below), "succeeds"
means matching the rest of the main pattern as well as the
alternative in the subpattern.
It explains why (a|ab)* matches only a in aba ("the first one that
succeeds") and why at the same time (a|ab)*c matches abac: (a|ab) may
match ab in the latter case because the main pattern would fail
otherwise i.e., the advanced definition of "succeeds" is used:
""succeeds" means matching the rest of the main pattern ...".
Python docs [3] are similar but do not contain the "advanced"
"succeeds" definition:
REs separated by ``'|'`` are tried from left to right. When one
pattern completely matches, that branch is accepted. This means
that once ``A`` matches, ``B`` will not be tested further, even if
it would produce a longer overall match. In other words, the
``'|'`` operator is never greedy.
It explains why (a|ab) matches a in aba.
'*' behavior [2]:
By default, the quantifiers are "greedy", that is, they match as
much as possible (up to the maximum number of permitted times),
without causing the rest of the pattern to fail.
and again Python docs [3] are similar:
``'*'`` Causes the resulting RE to match 0 or more repetitions of
the preceding RE, as many repetitions as are possible.
It implies that (a|ab)* should be equivalent to (a|ab)(a|ab) to match
aba -- TWO REPETITIONS ARE MORE THAN ONE: "as many repetitions as are
possible". But it matches only 'a' i.e., it works as (a|ab) -- only
one repetition instead of two. In reality, (a|ab)* along works like a*.
I've uploaded a documentation patch that makes Python documentation
for '|' more similar to PCRE and clarifies '*' definition as R. David
Murray suggested in msg198126. Please, review.
[1] http://pubs.opengroup.org/onlinepubs/9699919799/basedefs/V1_chap09.html
[2] http://man7.org/linux/man-pages/man3/pcrepattern.3.html
[3] http://hg.python.org/cpython/file/18a311479e8b/Doc/library/re.rst
----------
components: -Regular Expressions
keywords: +patch
nosy: +akira
title: Regular expressions: * does not match as many repetitions as possible. -> Clarify docs for re module: why * does not match as many repetitions as possible.
versions: +Python 3.5 -Python 2.7, Python 3.3
Added file: http://bugs.python.org/file36340/re-docs-repetitions.patch
_______________________________________
Python tracker <report at bugs.python.org>
<http://bugs.python.org/issue19055>
_______________________________________
More information about the docs
mailing list