Regular Expression: Matching substring

John Machin sjmachin at lexicon.net
Thu Apr 13 02:17:19 EDT 2006


On 13/04/2006 12:33 PM, Kevin CH wrote:
> Hi,
> 
> I'm currently running into a confusion on regex and hopefully you guys
> can clear it up for me.
> 
> Suppose I have a regular expression (0|(1(01*0)*1))* and two test
> strings: 110_1011101_ and _101101_1. (The underscores are not part of
> the strings.  They are added to show that both strings have a substring
> that matches the pattern.)  Applying a match() function on the first
> string returns true while false for the second.

Perhaps you are using grep, or you have stumbled on the old, deprecated 
"regex" module and are using that instead of the "re" module. Probably 
not, though, as you are using only two plain-vanilla RE operations 
(| and *), which should work the same way everywhere. Perhaps you are 
having trouble with search() versus match() -- if so, read the section 
on this topic in the re docs. It's rather hard to tell what you are 
doing without seeing the code you are using.
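
As a rough illustration of that distinction (a sketch, not necessarily 
what your code does): in the re module, match() only requires the 
pattern to match at the start of the string, whereas fullmatch() 
(Python 3.4 and later) requires it to consume the whole string. The 
test strings are the two from the question with the underscores removed.

```python
import re

rx = re.compile(r"(0|(1(01*0)*1))*")

s1 = "1101011101"  # 110_1011101_ without the underscores
s2 = "1011011"     # _101101_1 without the underscores

for s in (s1, s2):
    m = rx.match(s)      # anchored at the start only; may stop early
    f = rx.fullmatch(s)  # must consume the entire string
    print(s, "match ->", m.span(), "fullmatch ->", f is not None)
```

A whole-string test such as fullmatch() would be consistent with the 
true/false result described in the question.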

>  The difference is the
> first one has unmatched chunk in the beginning

With re's match(), the whole string matches.

> while the second at the
> end.

With re's match(), the part you marked with underscores (at the 
*beginning*) matches.


>  How's the regex rule work here?

Let's abbreviate your pattern as (0|X)*
This means 0 or more occurrences of strings that match either 0 or X.

Case 1 gives us 11 matching X [it's a 1 followed by zero occurrences of 
(01*0) followed by a 1], then a 0, then 1011101 matching X [it's a 1 
followed by one occurrence of (01*0), namely 01110, followed by a 1].

Case 2 gives us 101101 matching X [it's a 1 followed by one occurrence 
of (01*0), namely 0110, followed by a 1] -- then there's a final 1 that 
doesn't match anything: it isn't a 0, and X needs at least two 1s.
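
The breakdown above can be reproduced with a small sketch that peels 
off one piece (a lone 0 or an X) at a time from the left. The names 
piece and split_pieces are made up for this illustration, and the 
greedy left-to-right loop is a simplification of what the regex engine 
does, although it gives the same pieces for these two strings.

```python
import re

# One "piece" of (0|X)*: either a lone 0 or X = 1(01*0)*1.
piece = re.compile(r"0|1(?:01*0)*1")

def split_pieces(s):
    """Greedily peel pieces off the left; return (pieces, leftover)."""
    pieces = []
    pos = 0
    while pos < len(s):
        m = piece.match(s, pos)  # try to match one piece starting at pos
        if m is None:
            break                # nothing matches here; the rest is leftover
        pieces.append(m.group(0))
        pos = m.end()            # a piece is never empty, so pos advances
    return pieces, s[pos:]

print(split_pieces("1101011101"))  # (['11', '0', '1011101'], '')
print(split_pieces("1011011"))     # (['101101'], '1')
```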

Here's some code and its output:

C:\junk>type kevinch.py
import re

rx = re.compile(r"(0|(1(01*0)*1))*")

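def doit(n, s):
    print("Case", n)
    m = rx.match(s)  # match() anchors at the start of s only
    if m:
        print("0123456789")  # index ruler for reading the spans
        print(s)
        # Group 0 is the whole match; groups 1-3 report their last repetition.
        for k in range(4):
            print("span(%d) -> %r" % (k, m.span(k)))
    else:
        print("... no match")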

s1 = "110_1011101_".replace('_', '')
s2 = "_101101_1".replace('_', '')
doit(1, s1)
doit(2, s2)

C:\junk>kevinch.py
Case 1
0123456789
1101011101
span(0) -> (0, 10)
span(1) -> (3, 10)
span(2) -> (3, 10)
span(3) -> (4, 9)
Case 2
0123456789
1011011
span(0) -> (0, 6)
span(1) -> (0, 6)
span(2) -> (0, 6)
span(3) -> (1, 5)

HTH,
John


