[New-bugs-announce] [issue22353] re.findall() documentation lacks information about finding THE LAST iteration of repeated capturing group (greedy)

Mateusz Dobrowolny report at bugs.python.org
Sun Sep 7 14:35:32 CEST 2014


New submission from Mateusz Dobrowolny:

Python 3.4.1, Windows.
help(re.findall) shows me:
findall(pattern, string, flags=0)
    Return a list of all non-overlapping matches in the string.

    If one or more capturing groups are present in the pattern, return
    a list of groups; this will be a list of tuples if the pattern
    has more than one group.

    Empty matches are included in the result.
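
To make the three return shapes in that help text concrete, here is a minimal sketch on a made-up string (the sample text and variable names are only for illustration and are not part of the help output):

```python
import re

sample = 'a1 b2 c3'  # hypothetical sample string

# No capturing group: a list of the whole matches.
no_group = re.findall(r'[a-z]\d', sample)

# One capturing group: a list of that group's text for each match.
one_group = re.findall(r'[a-z](\d)', sample)

# Two capturing groups: a list of tuples, one tuple per match.
two_groups = re.findall(r'([a-z])(\d)', sample)

print(no_group)    # ['a1', 'b2', 'c3']
print(one_group)   # ['1', '2', '3']
print(two_groups)  # [('a', '1'), ('b', '2'), ('c', '3')]
```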

It seems that information about repeated (quantified) capturing groups, i.e. (regular_expression)*, is missing.
Please take a look at my example:

-------------EXAMPLE-------------
import re

text = 'To configure your editing environment, use the Editor settings page and its child pages. There is also a ' \
       'Quick Switch Scheme command that lets you change color schemes, themes, keymaps, etc. with a couple of ' \
       'keystrokes.'
print('Text to be searched: \n' + text)
print('\nSearching method: re.findall()')

regexp_result = re.findall(r'\w+(\s+\w+)', text)
print('\nRegexp rule: r\'\\w+(\\s+\\w+)\' \nFound: ' + str(regexp_result))
print('This works as expected: findall() returns a list of groups (\\s+\\w+), and the groups are from non-overlapping matches.')

regexp_result = re.findall(r'\w+(\s+\w+)*', text)
print('\nHow about repeating the group with a greedy *? Here we go: \nRegexp rule: r\'\\w+(\\s+\\w+)*\' \nFound: ' + str(regexp_result))
print('This is a little bit unexpected for me: findall() returns only THE LAST ITERATION of the repeated group, scanning from left to right.')

regexp_result_list = re.findall(r'(\w+(\s+\w+)*)', text)
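# Each result is a (whole match, last iteration of the inner group) tuple; keep only the whole match.
first_group = [whole for whole, _ in regexp_result_list]
print('\nThe solution is to put an extra group around the whole RE: \nRegexp rule: r\'(\\w+(\\s+\\w+)*)\' \nFound: ' + str(first_group))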
print('So finally I can get all strings I am looking for, just like expected from the FINDALL method, by accessing first elements in tuples.')
----------END OF EXAMPLE-------------
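
The same effect is easier to see on a shorter string. This is only a minimal sketch: the sample text and variable names are made up for illustration. It compares what finditer() reports as the whole match and as group 1 with what findall() returns for the same pattern:

```python
import re

sample = 'one two three, four'  # hypothetical short input
pattern = r'\w+(\s+\w+)*'

# group(0) is the whole match; group(1) holds only the last iteration
# of the repeated group, or None if the group never matched.
whole_and_last = [(m.group(0), m.group(1)) for m in re.finditer(pattern, sample)]
print(whole_and_last)  # [('one two three', ' three'), ('four', None)]

# findall() returns just group 1, with '' where the group did not take part.
last_only = re.findall(pattern, sample)
print(last_only)       # [' three', '']

# Wrapping the whole pattern in an outer group exposes the full match too.
wrapped = re.findall(r'(\w+(\s+\w+)*)', sample)
print(wrapped)         # [('one two three', ' three'), ('four', '')]
```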


I found the solution when practicing on this page:
http://regex101.com/#python
Entering:
REGULAR EXPRESSION: \w+(\s+\w+)*
TEST STRING: To configure your editing environment, use the Editor settings page and its child pages. There is also a Quick Switch Scheme command that lets you change color schemes, themes, keymaps, etc. with a couple of keystrokes.

it showed me on the right side with nice color-coding:
1st Capturing group (\s+\w+)*
Quantifier: Between zero and unlimited times, as many times as possible, giving back as needed [greedy]
Note: A repeated capturing group will only capture the last iteration. Put a capturing group around the repeated group to capture all iterations or use a non-capturing group instead if you're not interested in the data
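
The non-capturing alternative mentioned in that note can be sketched as follows. The sample string is made up, and the split() step is only one possible way to get at the individual words; it is my own assumption and not something regex101 suggests:

```python
import re

sample = 'one two three, four'  # hypothetical short input

# (?:...) groups without capturing, so findall() returns whole matches.
whole_matches = re.findall(r'\w+(?:\s+\w+)*', sample)
print(whole_matches)  # ['one two three', 'four']

# To see every repetition separately, split each whole match afterwards.
words_per_match = [match.split() for match in whole_matches]
print(words_per_match)  # [['one', 'two', 'three'], ['four']]
```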




I think the Python documentation should also mention this behaviour of repeated groups.

BTW: I have one extra question.
Searching for 'findall' in this tracker I found this issue:
http://bugs.python.org/issue3384

It looks like the ordering information is no longer in the 3.4.1 documentation. Shouldn't it be there?
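
For reference, the behaviour I would expect to see documented is that matches come back in the order they are found, scanning left to right. A quick check on a made-up string (a sketch only, the names are hypothetical):

```python
import re

sample = 'b a c a'  # hypothetical input; letters deliberately not sorted

# findall() keeps the order in which the matches occur in the string.
found = re.findall(r'[a-z]', sample)
print(found)   # ['b', 'a', 'c', 'a']

# finditer() yields the same matches; their start offsets only increase.
starts = [m.start() for m in re.finditer(r'[a-z]', sample)]
print(starts)  # [0, 2, 4, 6]
```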

Kind Regards

----------
assignee: docs at python
components: Documentation
messages: 226534
nosy: Mateusz.Dobrowolny, docs at python
priority: normal
severity: normal
status: open
title: re.findall() documentation lacks information about finding THE LAST iteration of repeated capturing group (greedy)
versions: Python 3.4

_______________________________________
Python tracker <report at bugs.python.org>
<http://bugs.python.org/issue22353>
_______________________________________

