[Patches] [ python-Patches-712900 ] sre fixes for lastindex and minimizing repeats+assertions
SourceForge.net
noreply@sourceforge.net
Tue, 22 Apr 2003 12:21:46 -0700
Patches item #712900, was opened at 2003-03-31 10:54
Message generated for change (Comment added) made by glchapman
You can respond by visiting:
https://sourceforge.net/tracker/?func=detail&atid=305470&aid=712900&group_id=5470
Category: Library (Lib)
Group: Python 2.3
Status: Open
Resolution: None
Priority: 5
Submitted By: Greg Chapman (glchapman)
Assigned to: Nobody/Anonymous (nobody)
Summary: sre fixes for lastindex and minimizing repeats+assertions
Initial Comment:
The attached patch fixes two bugs in _sre.c; it also
does a bit of reorganization.
First the bugs. 672491 points out that lastindex is
calculated differently in 2.3 than in previous versions.
This patch restores the previous behavior. Since
lastindex cannot be restored (when backtracking) from
lastmark alone, it is now saved and restored
independently (by the LASTMARK_SAVE and
RESTORE macros).
The second bug appears when minimizing repeats are
combined with assertions:
>>> re.match('([ab]*?)(?=(b)?)c', 'abc').groups()
('ab', 'b')
The second group should be None, since the 'b' is
consumed by the first group. To fix this, it is necessary
to save lastmark before attempting to match the tail in
OP_MIN_UNTIL and to restore it if the tail fails to match.
The reorganization has to do with the handling of the
SRE_STATE's lastmark and mark array. The mark
array tracks the start and end of capturing groups;
lastmark is the highest index in the array so far
encountered. Previously, whenever lastmark was
restored back to a lower value (in 2.3a2 this is done in
the lastmark_restore function), the tail of the mark array
was NULLed out (using memset). This patch adopts the
rule that all indexes greater than lastmark are invalid, so
restoring lastmark does not also require clearing the
tail. To ensure that indexes <= lastmark have valid
pointers, OP_MARK checks if lastmark is being
increased by more than one; if so, it NULLs out the
intervening pointers. This rule also required changes to
the GROUPREF opcodes and the state_getslice
function to ensure that they do not access indexes
greater than lastmark. For consistency, lastmark is
now initialized to –1, to indicate that no entries in the
mark array are valid.
Needless to say, the reorganization is not necessary to
fix the bugs; it may be a bad idea. It seems to be
marginally faster than a version that fixes the bugs but is
similar to the current code (including a memset inside
the LASTMARK_RESTORE macro).
One other thing. I have removed a test for string ==
Py_None from state_getslice, since I can’t find any way
for string to be Py_None at that point (string is always
the object providing the text to be searched; if it were
Py_None, an exception should be raised by the
getstring function called by state_init). Perhaps I
missed something?
----------------------------------------------------------------------
>Comment By: Greg Chapman (glchapman)
Date: 2003-04-22 11:21
Message:
Logged In: YES
user_id=86307
Gustavo, in my understanding, it doesn't matter if the first part
of MAX_UNTIL is protected because if its call to
SRE_MATCH returns 0, it (the first part of MAX_UNTIL)
simply returns 0 to its caller. If that caller is a backtrack
point (a point which may resume matching even though a call
to SRE_MATCH returned 0), it should have already saved
lastmark, etc. and so should be ready to restore them to its
saved values. If that caller is not a backtrack point, it is
either the original invocation of SRE_MATCH (in which case it
doesn't matter what is in the state since the match has failed)
or it is someplace which is going to return 0 to its caller.
FYI, I have uploaded a couple of more sre bug
reports/patches in the last couple of days; I'd appreciate it if
you look at them.
http://www.python.org/sf/725106
http://www.python.org/sf/725149
In particular, the second patch changes ASSERT_NOT so
that (I hope) it no longer violates the assumptions above.
----------------------------------------------------------------------
Comment By: Gustavo Niemeyer (niemeyer)
Date: 2003-04-22 07:23
Message:
Logged In: YES
user_id=7887
Greg, what I meant is that I *think* that every SRE_MATCH()
recursion with *failing* possibilities should be protected
by SAVE/RESTORE(), as it can potentially change
lastmark/lastindex before returning, depending on what is
being executed.
Do you belive the first part of MAX_UNTIL shouldn't be
protected, even though it could match a subpattern including
marks and fail? If so, can you explain?
OTOH, I agree with you that MAX_UNTIL and MIN_UNTIL are not
coherent. I'll fix that in the safe side for now, including
mark protection in both. If we end up discovering it was not
necessary, we simply drop it.
----------------------------------------------------------------------
Comment By: Greg Chapman (glchapman)
Date: 2003-04-20 04:33
Message:
Logged In: YES
user_id=86307
Gustavo, thanks for reworking and applying the patch. And
thanks for catching the bug in the MIN_REPEAT_ONE where I
was not calling lastmark_restore in the right place; it does need
to be inside the loop (though I think it could be at the bottom --
it's not necessary if SRE_COUNT returns 0).
I note in your checkin message you are concerned about
whether all calls to SRE_MATCH need to be protected by
LASTMATCH_SAVE/RESTORE. These should only be
necessary where SRE_MATCH returns 0 but nevertheless
matching continues (i.e., a backtrack point or an alternative). So
they are clearly not necessary for ASSERT and REPEAT.
ASSERT_NOT is an interesting case. Right now, a capturing
group (inside the negative assertion) which matches before the
subpattern fails is reported as having matched; this is the way
Perl works. You could argue that it would be more consistent for
these groups always to report as unmatched (Jeffrey Friedl
implies as much in the first edition of his book), in which case
the ASSERT_NOT should be surrounded by
LASTMARK_SAVE/RESTORE.
Anyway, it should not be necessary to protect the first part of the
MAX_UNTIL and MIN_UNTIL opcodes (the part which assures
the repeat has matched the minimum number of times); if this
part fails, it causes this invocation of SRE_MATCH to return 0. I
note that you have protected this part of MAX_UNTIL but not
MIN_UNTIL. We should probably be consistent here and remove
the protection from MAX_UNTIL.
----------------------------------------------------------------------
Comment By: Gustavo Niemeyer (niemeyer)
Date: 2003-04-19 23:47
Message:
Logged In: YES
user_id=7887
Greg, thank you very much for the patch. I've included the
fixes without introducing the reorganization mentioned, for
the sake of stability. Also, the second fix mentioned in
your report don't fix the mentioned problem anymore, because
of the change introduced by patch #720991. The new fix
wasn't complicated though, and is included as well.
About the reorganization, I don't have a strong opinion
about it, and I'm also not sure about the real impact the
new "mark reset" code will cause in the module. IMO we need
to test it further.
Andrew, that explains why the second fix greg mentions
doesn't work anymore. Thanks for mentioning it.
The changes were applied as:
Modules/_sre.c: 2.91
Lib/test/re_tests.py: 1.35
Lib/test/test_sre.py: 1.41
----------------------------------------------------------------------
Comment By: Andrew I MacIntyre (aimacintyre)
Date: 2003-04-19 21:08
Message:
Logged In: YES
user_id=250749
Note that Guido checked in patch #720991, which incorporates
some of Greg's work (at least that's what patch author Gary
Herron notes) on April 14.
That checkin makes _sre.c (more) sensitive to optimisation
settings for me (see my comment against that patch).
----------------------------------------------------------------------
Comment By: Gustavo Niemeyer (niemeyer)
Date: 2003-04-19 18:35
Message:
Logged In: YES
user_id=7887
Backing out the changes was not enough. I'll evaluate your
solution.
----------------------------------------------------------------------
Comment By: Gustavo Niemeyer (niemeyer)
Date: 2003-04-19 17:51
Message:
Logged In: YES
user_id=7887
Greg, thanks for mentioning that. I have fixed the first bug
by myself, by backing out some of the changes I made in
_sre.c:2.84. Can you please check it out and verify if it
has the behavior you'd expect now?
----------------------------------------------------------------------
Comment By: Gustavo Niemeyer (niemeyer)
Date: 2003-04-19 15:53
Message:
Logged In: YES
user_id=7887
Greg, this patch doesn't seem to compile in the latest CVS.
Are you able to update it?
----------------------------------------------------------------------
You can respond by visiting:
https://sourceforge.net/tracker/?func=detail&atid=305470&aid=712900&group_id=5470