regex question

Paul McGuire ptmcg at austin.rr.com
Fri Apr 27 10:50:15 EDT 2007


On Apr 27, 9:10 am, proctor <12cc... at gmail.com> wrote:
> On Apr 27, 1:33 am, Paul McGuire <p... at austin.rr.com> wrote:
>
> > On Apr 27, 1:33 am, proctor <12cc... at gmail.com> wrote:
>
> > > hello,
>
> > > i have a regex:  rx_test = re.compile('/x([^x])*x/')
>
> > > which is part of this test program:
>
> > > ============
>
> > > import re
>
> > > rx_test = re.compile('/x([^x])*x/')
>
> > > s = '/xabcx/'
>
> > > if rx_test.findall(s):
> > >         print rx_test.findall(s)
>
> > > ============
>
> > > i expect the output to be ['abc'] however it gives me only the last
> > > single character in the group: ['c']
>
> > > C:\test>python retest.py
> > > ['c']
>
> > > can anyone point out why this is occurring?  i can capture the entire
> > > group by doing this:
>
> > > rx_test = re.compile('/x([^x]+)*x/')
> > > but why isn't the 'star' grabbing the whole group?  and why isn't each
> > > letter 'a', 'b', and 'c' present, either individually, or as a group
> > > (group is expected)?
>
> > > any clarification is appreciated!
>
> > > sincerely,
> > > proctor
>
> > As Josiah already pointed out, the * needs to be inside the grouping
> > parens.
>
> > Since re's do lookahead/backtracking, you can also write:
>
> > rx_test = re.compile('/x(.*?)x/')
>
> > The '?' is there to make sure the .* repetition stops at the first
> > occurrence of x/.
>
> > -- Paul
>
> i am working through an example from the oreilly book mastering
> regular expressions (2nd edition) by jeffrey friedl.  my post was a
> snippet from a regex to match C comments.   every 'x' in the regex
> represents a 'star' in actual usage, so that backslash escaping is not
> needed in the example (on page 275).  it looks like this:
>
> ===========
>
> /x([^x]|x+[^/x])*x+/
>
> it is supposed to match '/x', the opening delimiter, then
>
> (
> either anything that is 'not x',
>
> or,
>
> 'x' one or more times, 'not followed by a slash or an x'
> ) any number of times (the 'star')
>
> followed finally by the closing delimiter.
>
> ===========
>
> this does not seem to work in python the way i understand it should
> from the book, and i simplified the example in my first post to
> concentrate on just one part of the alternation that i felt was not
> acting as expected.
>
> so my question remains, why doesn't the star quantifier seem to grab
> all the data.  isn't findall() intended to return all matches?  i
> would expect either 'abc' or 'a', 'b', 'c' or at least just
> 'a' (because that would be the first match).  why does it give only
> one letter, and at that, the /last/ letter in the sequence??
>
> thanks again for replying!
>
> sincerely,
> proctor

Again, I'll repeat some earlier advice:  you need to move the '*'
inside the parens - you are still leaving it outside.  Also, get in
the habit of using raw string notation (that is, r"..." instead
of "...") when defining re strings - you don't have backslash
issues yet, but you will as soon as you start putting real '*'
characters in your expression.
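
To make the difference concrete, here is a minimal sketch using the
string from the original post.  When a capturing group is itself
repeated, the group is overwritten on each repetition, so only the
last one survives; that is where the lone 'c' comes from.  The names
`outside` and `inside` are just labels chosen for this illustration.

```python
import re

s = '/xabcx/'

# Star outside the parens: the group is re-captured on every repetition,
# so only the text of the final repetition is kept.
outside = re.compile(r'/x([^x])*x/')
print(outside.findall(s))   # ['c']

# Star inside the parens: the group covers the whole run of non-x characters.
inside = re.compile(r'/x([^x]*)x/')
print(inside.findall(s))    # ['abc']
```

findall() does return every match, but there is only one match in
'/xabcx/'; what it reports for that match is the contents of the group.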

However, when I test this,

import re

restr = r'/x(([^x]|x+[^/])*)x+/'
re_ = re.compile(restr)
print(re_.findall("/xabxxcx/ /x123xxx/"))

findall now starts to give a tuple for each "comment",

[('abxxc', 'xxc'), ('123xx', 'xx')]

so you have gone beyond my limited re skill, and will need help from
someone else.
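
One possible way around the tuples, offered as a sketch rather than a
definitive answer: findall returns a tuple whenever the pattern has
more than one capturing group, and the test pattern has two (the outer
parens and the inner alternation).  Writing the inner group as (?:...)
makes it non-capturing, so only the outer group is reported.  The
variable `result` is just a name used here for illustration.

```python
import re

# Same pattern as the test above, except the inner alternation uses (?:...)
# so it groups without capturing; only the outer parens capture.
restr = r'/x((?:[^x]|x+[^/])*)x+/'
re_ = re.compile(restr)
result = re_.findall("/xabxxcx/ /x123xxx/")
print(result)   # ['abxxc', '123xx']
```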

But I suggest you add some tests with multiple consecutive 'x'
characters in the middle of your comment, and multiple consecutive 'x'
characters before the closing delimiter.  In fact, from my
recollections of trying to implement this type of comment recognizer
by hand a long time ago in a job far, far away, test with both even
and odd numbers of 'x' characters.
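
A rough sketch of what such tests might look like, using the pattern
quoted from the book.  The non-capturing group, the helper name
`matches_whole`, and the particular test strings are all assumptions
made for this example, not anything taken from the book.

```python
import re

# The quoted pattern, with (?:...) so that nothing is captured and only
# whole matches matter (an assumption made for this sketch).
comment = re.compile(r'/x(?:[^x]|x+[^/x])*x+/')

def matches_whole(text):
    """Return True if the pattern matches the entire string."""
    m = comment.match(text)
    return m is not None and m.group(0) == text

cases = [
    ('/xx/', True),        # shortest possible comment
    ('/xxx/', True),       # one extra x before the closing slash
    ('/xabxxcx/', True),   # even run of x's in the middle
    ('/xabxxxcx/', True),  # odd run of x's in the middle
    ('/xabcxx/', True),    # even run of x's before the closing slash
    ('/xabcxxx/', True),   # odd run of x's before the closing slash
    ('/xabc/', False),     # never closed
]

for text, expected in cases:
    assert matches_whole(text) == expected, text
print("all cases behave as expected")
```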

-- Paul




More information about the Python-list mailing list