Searching for Regular Expressions in a string WITH overlap

Matimus mccredie at gmail.com
Thu Nov 20 20:01:22 EST 2008


On Nov 20, 4:31 pm, Ben <bmn... at gmail.com> wrote:
> I apologize in advance for the newbie question.  I'm trying to figure
> out a way to find all of the occurrences of a regular expression in a
> string including the overlapping ones.
>
> For example, given the string 123456789
>
> I'd like to use the RE ((2)|(4))[0-9]{3} to get the following matches:
>
> 2345
> 4567
>
> Here's what I'm trying so far:
> <code>
> #!/usr/bin/env python
>
> import re, repr, sys
>
> string = "123456789"
>
> pattern = '(((2)|(4))[0-9]{3})'
>
> r1 = re.compile(pattern)
>
> stringList = r1.findall(string)
>
> for string in stringList:
>         print "string type is:", type(string)
>         print "string is:", string
> </code>
>
> Which produces:
> <code>
> string type is: <type 'tuple'>
> string is: ('2345', '2', '2', '')
> </code>
>
> I understand that the findall method only returns the non-overlapping
> matches.  I just haven't figured out a function that gives me the
> matches including the overlap.  Can anyone point me in the right
> direction?  I'd also really like to understand why it returns a tuple
> and what the '2', '2' refers to.
>
> Thanks for your help!
> -Ben

When a pattern contains more than one group, 'findall' returns a list
of tuples, one tuple per match, holding the text matched by each
group. A group is anything surrounded by parens, and the groups are
ordered by the position of the opening paren. So the first item is
the match for the parens you have around the whole expression, the
second is for the parens around '(2)|(4)', the third is for '(2)', and
the last is for '(4)', which is empty because that alternative did not
take part in the match.
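
To make the group numbering concrete, here is a small sketch that runs
your pattern next to a non-capturing version of it. The variable names
are only illustrative, and the print calls are written as functions so
the snippet also runs on Python 3.

```python
import re

text = "123456789"

# Four capturing groups, numbered by the position of each opening paren:
# 1 = whole expression, 2 = ((2)|(4)), 3 = (2), 4 = (4).
grouped = re.findall(r"(((2)|(4))[0-9]{3})", text)

# (?:...) groups without capturing, so findall returns each whole match
# as a plain string instead of a tuple.
ungrouped = re.findall(r"(?:2|4)[0-9]{3}", text)

print(grouped)    # [('2345', '2', '2', '')]
print(ungrouped)  # ['2345']
```

Either way only '2345' is found, because findall resumes scanning after
the end of the previous match and so never sees '4567'.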

I don't know of a way to find all overlapping strings automatically. I
would just search repeatedly, restarting one character past the start
of the previous match, something like this:

>>> import re
>>> text = "0123456789"
>>> p = re.compile(r"(?:2|4)[0-9]{3}")  # (?:...) groups the alternatives without capturing them
>>> start = 0
>>> found = []
>>> while True:
...     m = p.search(text, start)
...     if m is None:
...         break
...     start = m.start() + 1
...     found.append(m.group(0))
...
>>> found
['2345', '4567']
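
An alternative that avoids the explicit loop is to wrap the pattern in
a zero-width lookahead with a capturing group inside it. The lookahead
consumes no characters, so the scan advances one position at a time and
overlapping matches are reported. This is only a sketch: the helper
name find_overlapping is made up for illustration, and it assumes you
want at most one match per starting position, the same as the loop
above.

```python
import re

def find_overlapping(pattern, text):
    """Return the match found at every start position, overlaps included.

    Hypothetical helper, not part of the re module.
    """
    # (?=(...)) matches the empty string wherever `pattern` could start,
    # and group 1 captures what `pattern` would have matched there.
    lookahead = re.compile("(?=(" + pattern + "))")
    return [m.group(1) for m in lookahead.finditer(text)]

found = find_overlapping(r"(?:2|4)[0-9]{3}", "0123456789")
print(found)  # ['2345', '4567']
```

Group 1 is always the outer capture added by the helper, so any groups
inside the pattern you pass in do not change the result.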


Matt


