[Python-bugs-list] [Bug #128830] re is greedy with non-greedy operator

noreply@sourceforge.net noreply@sourceforge.net
Mon, 15 Jan 2001 09:04:26 -0800


Bug #128830, was updated on 2001-Jan-15 03:31
Here is a current snapshot of the bug.

Project: Python
Category: Regular Expressions
Status: Closed
Resolution: Wont Fix
Bug Group: None
Priority: 5
Submitted by: beroul
Assigned to: effbot
Summary: re is greedy with non-greedy operator

Details: In the program below, the pattern "<!--.*?-->" is used to match an
SGML comment.  Despite the use of the non-greedy operator '?', re fails to
find the shortest possible match, which would be the comment preceding
"<!ELEMENT bar..."; instead, it uses all the text preceding "<!ELEMENT
bar..." as the match for the comment pattern.

---

import re

dtd_text = """
<!--
The oranges attribute.
-->
<!ATTLIST foo
oranges CDATA #IMPLIED
>

<!--
The bar element.
-->
<!ELEMENT bar
   (#PCDATA)
>
"""

element_pattern = re.compile(r"(?P<comment><!--.*?-->\s+)"
                             r"(?P<tag_text><!ELEMENT"
                             r"\s+.*?>)",
                             re.DOTALL)

match = element_pattern.search(dtd_text)

if match:
    print("Matched comment:")
    print("----------------")
    print(match.group("comment"))
    print("Matched tag text:")
    print("-----------------")
    print(match.group("tag_text"))
else:
    print("No match found.")
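
A possible workaround, offered as a sketch and not as anything the tracker
recommends: the comment group can be kept from running across an earlier
"-->" by guarding each character with a negative lookahead.  The pattern
below assumes the intent is to pair each <!ELEMENT ...> with the comment
directly before it; the name strict_pattern is hypothetical.

```python
import re

dtd_text = """
<!--
The oranges attribute.
-->
<!ATTLIST foo
oranges CDATA #IMPLIED
>

<!--
The bar element.
-->
<!ELEMENT bar
   (#PCDATA)
>
"""

# (?:(?!-->).)*? consumes one character at a time, but only while the
# next three characters are not "-->", so the comment group cannot
# extend past the end of the comment it started in.
strict_pattern = re.compile(r"(?P<comment><!--(?:(?!-->).)*?-->\s+)"
                            r"(?P<tag_text><!ELEMENT\s+.*?>)",
                            re.DOTALL)

match = strict_pattern.search(dtd_text)
comment = match.group("comment")
tag_text = match.group("tag_text")
print(comment)
print(tag_text)
```

With this guard the attempt that starts at the first comment fails (it is
followed by <!ATTLIST, not <!ELEMENT), so the search moves on to the second
comment.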


Follow-Ups:

Date: 2001-Jan-15 09:04
By: effbot

Comment:
Python's RE search method doesn't look for the shortest possible match; it
looks for the *first* possible match.

(In other words, Python provides Perl-style semantics, not POSIX
semantics.)
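
The "first possible match" rule can be illustrated with a short sketch.  The
strings and variable names here are made up for the illustration; the point
is only that the start position is fixed before the non-greedy operator has
any say, and that '?' then affects how far the match extends from that start.

```python
import re

# The start position is chosen first: the engine tries index 0, then 1,
# and so on, and returns the first start from which a match succeeds.
text = "<a><b>foo"
lazy = re.search(r"<.*?>foo", text)
greedy = re.search(r"<.*>foo", text)

# Both matches begin at the first "<"; with only one "foo" in the text
# there is a single possible end, so the two results are identical.
lazy_span = lazy.span()
greedy_span = greedy.span()

# With two candidate end points the lazy quantifier does matter: it
# stops at the nearest ">foo", while the greedy one runs to the last.
text2 = "<a>foo<b>foo"
lazy2 = re.search(r"<.*?>foo", text2).group()
greedy2 = re.search(r"<.*>foo", text2).group()
```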
-------------------------------------------------------

Date: 2001-Jan-15 03:49
By: beroul

Comment:
Actually, here's a minimal example: given the string "<a><b>foo", the
pattern "<.*?>foo" will match the entire string, when it should match only
"<b>foo".
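
One way to obtain "<b>foo" for this minimal example, sketched here as an
illustration and not taken from the tracker, is to replace the dot with a
negated character class so that the bracketed part cannot cross a ">".  The
variable names are hypothetical.

```python
import re

text = "<a><b>foo"

# Reported behaviour: the match starts at the first "<", and .*? grows
# until ">foo" can follow, so it swallows "a><b".
reported = re.search(r"<.*?>foo", text).group()

# [^>]* cannot cross a ">", so the attempt starting at "<a>" fails and
# the search moves on to "<b>".
narrowed = re.search(r"<[^>]*>foo", text).group()
```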


-------------------------------------------------------

For detailed info, follow this link:
http://sourceforge.net/bugs/?func=detailbug&bug_id=128830&group_id=5470