Regexp finditer() fails to match some non-overlapping matches?

John Machin sjmachin at lexicon.net
Sat May 3 18:12:17 EDT 2003


philipj at telia.com (Philip Jägenstedt) wrote in message news:<313b626e.0305031036.1fcfeab2 at posting.google.com>...
> import re
> str="__Bullet lists__"
> pattern = r"^|__.+__"
> rules = re.compile(pattern)
> for m in rules.finditer(str):
>     print m.start(), m.end(), m.group()
> 
> In this case, the string "__Bullet lists__" will not be matched,
> because there is the zero-length match before it.
> 
> So what it boils down to is: why doesn't finditer() match both the
> beginning of the line, and some other thing that lives at the
> beginning of the line?
> 
> For example:
> 
> |_|_|b|o|l|d|_|_|
> 0 1 2 3 4 5 6 7 8
> 
> I'd like to have a zero-length match (0-0) since there are no # or *
> characters, and then the 0-8 match for __bold__. But, I cannot see how
> to do it.
> 
> I have the problem using Debian GNU/Linux testing, with python 2.2.1.
> If any other information is needed, do ask!

You will have the problem on any OS with any version of Python, not
only with finditer() but also with findall() and sub() ... and, I'll
cheerfully wager without having ever used them, the same applies to
the corresponding facilities in Perl, Ruby, etc. Likewise in any
text editor that supports regular expressions.  Try
g/your_pattern/s//foo/g in vi and see how many foos you get at the
start of each line.

The "problem" is that implementors *don't* regard your examples as
"non-overlapping" -- both matches start at the same position in the
text. An RE searching engine will always advance its input position by
one character after a zero-length match; otherwise it would find the
same empty match again and loop endlessly.
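
To make that rule concrete, here is a rough sketch that spells it out by
hand instead of relying on how any particular release of the re module
implements finditer(). The function name scan_advancing is made up for
illustration; only compile() and the pos argument of search() come from
the re module.

```python
import re

def scan_advancing(pattern, text):
    """Hand-rolled scan: collect match spans, stepping one character
    forward after every zero-length match (hypothetical helper)."""
    spans = []
    pos = 0
    # "<=" so that an empty match at the very end of the text is still seen.
    while pos <= len(text):
        m = pattern.search(text, pos)
        if m is None:
            break
        spans.append(m.span())
        if m.end() == m.start():
            # Zero-length match: move on by one character,
            # otherwise the same empty match would be found forever.
            pos = m.end() + 1
        else:
            pos = m.end()
    return spans

rules = re.compile(r"^|__.+__")
print(scan_advancing(rules, "__Bullet lists__"))   # [(0, 0)]
```

The empty match for ^ wins at position 0, the scan then restarts at
position 1, and from there "__.+__" can no longer match, which is the
behaviour you described.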

Your solution should be relatively simple: abandon the canned loop of
finditer(), write your own loop which searches for the next occurrence
of one of your multiple patterns of interest, then examines which one
or more are actually present starting at the match point. Be careful
about advancing the search point after a match: move past the longest
match found, and if every match there was zero-length, still step
forward by one character.
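
A minimal sketch of such a loop follows. It is only one way to do it,
and the names RULES, scan, "line_start" and "bold" are invented for the
example; the two patterns are the alternatives from your combined
pattern. It assumes the individual patterns carry no flags, since the
flags would be lost when the pattern strings are joined.

```python
import re

# Hypothetical rule table: (name, compiled pattern), tried in this order.
RULES = [
    ("line_start", re.compile(r"^")),
    ("bold", re.compile(r"__.+__")),
]

def scan(text, rules=RULES):
    """Return (start, end, name) for every rule that matches at each
    position where at least one rule matches."""
    # One combined pattern, used only to locate the next point of interest.
    combined = re.compile("|".join("(?:%s)" % rx.pattern for _, rx in rules))
    results = []
    pos = 0
    while pos <= len(text):
        hit = combined.search(text, pos)
        if hit is None:
            break
        start = hit.start()
        ends = []
        # Check each rule separately, anchored at the match point.
        for name, rx in rules:
            m = rx.match(text, start)
            if m is not None:
                results.append((m.start(), m.end(), name))
                ends.append(m.end())
        longest = max(ends, default=start)
        # Skip past the longest match; if nothing consumed any text,
        # step one character so the loop cannot stall.
        pos = longest if longest > start else start + 1
    return results

for start, end, name in scan("__bold__"):
    print(start, end, name)
# 0 0 line_start
# 0 8 bold
```

Because each rule is tested on its own with match(), a zero-length hit
at a position no longer hides a longer one that starts at the same
position.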

You may like to submit a documentation enhancement request, to the
effect that "non-overlapping" needs clarification.

HTH,
John



