[issue35859] Capture behavior depends on the order of an alternation

Fri Mar 1 08:31:17 EST 2019

Ma Lin <malincns at 163.com> added the comment:

The PR11756 is prepared.
I force-pushed the patch in four steps, hope you can review it easier:
https://github.com/python/cpython/pull/11756/commits

🔴 Step 1, test-cases

Show the wrong behaviors before this fix, the corresponding test-case will be updated in next steps.

🔴 Step 2, MARK_PUSH(lastmark) macro bug

This bug was reported in this issue.

MARK_PUSH(lastmark) macro didn't protect MARK-0 if it was the only available mark.

Note that if skip this step and apply Step 3, it happens to pass the test. But this is indeed a bug that should be fixed.

🔴 Step 3, jump JUMP_MIN_UNTIL_3 needs LASTMARK_SAVE() and MARK_PUSH()

This bug was triggered by a pattern which provided by Serhiy Storchaka:
>>> re.match(r'(ab?)*?b', 'ab').group(1)
''
Expected result: 'a'

This is due to the lack of protection before JUMP.
sre has these JUMPs, the current protection is written in [], the purpose of this JUMP is written in ().

in op SRE_OP_MAX_UNTIL 🔹 (...)*
JUMP_MAX_UNTIL_1 [no protect] (try a repeat, to min limit of repeat)
JUMP_MAX_UNTIL_2 [LASTMARK_SAVE, MARK_PUSH] (try a repeat, until max limit of repeat)
JUMP_MAX_UNTIL_3 [no protect] (try the tail of this repeat)

in op SRE_OP_MIN_UNTIL 🔹 (...)*?
JUMP_MIN_UNTIL_1 [no protect] (try a repeat, to min limit of repeat)
JUMP_MIN_UNTIL_2 [LASTMARK_SAVE] (try the tail of this repeat)
JUMP_MIN_UNTIL_3 [no protect] (try a repeat, until max limit of repeat)

in op SRE_OP_REPEAT_ONE 🔹 .*
JUMP_REPEAT_ONE_1 [LASTMARK_SAVE] (try the tail of this repeat)
JUMP_REPEAT_ONE_2 [LASTMARK_SAVE] (try the tail of this repeat)

in op SRE_OP_MIN_REPEAT_ONE 🔹 .*?
JUMP_MIN_REPEAT_ONE [LASTMARK_SAVE] (try the tail of this repeat) 

in op SRE_OP_BRANCH 🔹 ...|...
JUMP_BRANCH [LASTMARK_SAVE in any case, MARK_PUSH if in a repeat] (try a branch)

in op SRE_OP_ASSERT 🔹 (?=...)
JUMP_ASSERT [no protect] (try to match a sub-pattern)

in op SRE_OP_ASSERT_NOT 🔹 (?!=...)
JUMP_ASSERT_NOT [no protect] (try to match a sub-pattern)

These protections have not been changed since commit caf1c9dfe779 (2003-4-27).
Why it came out like this? There is a note in the message of commit be733ee7fb7e (2003-4-20):

> Gustavo Niemeyer wrote:
> As a note. It seems that there are other places that require the
> "protection" of LASTMARK_SAVE()/LASTMARK_RESTORE(), and are just
> waiting for someone to find how to break them. Particularly, I
> believe that every recursion of SRE_MATCH() should be protected
> by these macros. I won't do that right now since I'm not completely
> sure about this, and we don't have much time for testing until
> the next release.

Now we found some test-cases to break them, it seems JUMP_MIN_UNTIL_3 should be protected.

I tried to summarize a rule of protection:

if is_backtrack_point: # may backtrack if the tail fail
    LASTMARK_SAVE()
elif is_repeat_body:   # is a sub-pattern inside (...)* or (...)*?
    LASTMARK_SAVE()
    MARK_PUSH()

I did some experiments, it seems this rule works.
Since JUMP_MIN_UNTIL_3 is a repeat body, it needs protection, same as JUMP_MAX_UNTIL_2.

With this rule, JUMP_MAX_UNTIL_3 should LASTMARK_SAVE() as well, but this is not needed. sre uses stack to simulate recursive call, in fact JUMP_MAX_UNTIL_3 is protected by JUMP_MAX_UNTIL_2 from the previous SRE_OP_MAX_UNTIL execution.

🔴 Step 4, JUMP_ASSERT_NOT needs LASTMARK_SAVE()

This bug is about negative assertion, initially reported in issue725149, but they didn't fix it for some reasons:

> Gustavo Niemeyer wrote:
> it changes the current behavior, which is also compatible to how
> perl works.
> ...
> This way, we can think further about this, and look for an elegant
> solution to fix that support, certainly including some algorithm
> to check for half-marked groups.

IMO we can fix it, here is two test-cases:

Case-A, negative assertion:
re.match(r'(?!(..)c)', 'ab').group(1)

Case-B, negative assertion in a repeat:
re.match(r'(?:(?!(ab)c).)*', 'ab').group(1)

                    Case-A         Case-B
PHP 7.3.2           NULL           NULL
Java 11.0.2         null           null
Perl 5.26.1         "ab"           "ab"
Ruby 2.6.1          nil            nil
Go 1.12             doesn't support lookaround
Rust 1.32.0         doesn't support lookaround
Node.js 10.15.1     undefined      undefined
regex 2019.2.21     None           None
re before patch     "ab"           "b"
re after patch      None           None

In Case-A, sre looks compatible with Perl.
But in Case-B, sre is definitely wrong, so we can follow the mainstream behavior.

🔴 Interesting sidelights 1

Found a Perl bug:

perl -le 'print defined($1)?"\"$1\"":"undef",",",defined($2)?"\"$2\"":"undef" if "ab" =~ /((ab?)*?b)/'
"ab",undef

Expected result: "ab","a"

All other engines (except unpatched sre) return the correct result.

🔴 Interesting sidelights 2

>>> re.match(r'(((a)|b)*)', 'ab').groups()
('ab', 'b', 'a')

Maybe the correct result is ('ab', 'b', None), only Node.js 10.15.1 returns this.
All other engines return the same result as sre.

----------
Added file: https://bugs.python.org/file48181/test_scripts_for_7_engines.zip

_______________________________________
Python tracker <report at bugs.python.org>
<https://bugs.python.org/issue35859>
_______________________________________