Backreference within a character class

Tim Peters tim_one at email.msn.com
Fri Feb 25 03:21:15 EST 2000


[Andrew M. Kuchling]
> If you're trying to match 3-character words with the same letters in
> positions 1 and 3, but not 2, then a lookahead negation would do it:
>
> pat = re.compile(r"(.)(?!\1).\1")
>
> The steps here are 1) matches a character 2) assert that the
> backreference \1 doesn't match at this point 3) consume the character,
> because assertions are zero-width and don't consume any characters,
> and 4) match \1.  (Alternatively, if the 3-character string is in
> variable S, 'if (S[0] == S[2] and S[0] != S[1])' would do it.)
>
> On a theoretical plane: If you wanted to match general strings of the
> form ABA, where A!=B and A,B are of arbitrary non-zero length, I think
> this isn't possible with regexes (of either Python or Perl varieties),
> because in step 3 you couldn't consume as many characters as were
> matched by the first group.  Anyone see a
> clever way I've missed?  (Another jeu d'esprit.)  You'd have to do it
> by matching the pattern r"(.+)(.+)\1", and then verifying that group 2
> != group 1 in Python code.
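
(A quick sanity check of the 3-character pattern quoted above, as a minimal sketch. The helper name is_aba_word is hypothetical, and fullmatch assumes Python 3.4 or later; neither comes from the quoted message.)

```python
import re

# The quoted pattern: one char, a char that differs from it, then the first char again.
pat = re.compile(r"(.)(?!\1).\1")

def is_aba_word(s):
    """Plain-Python version of the same test (hypothetical helper name)."""
    return len(s) == 3 and s[0] == s[2] and s[0] != s[1]

for word in ["aba", "aaa", "abc", "dad"]:
    # fullmatch anchors the pattern to the whole word, so the two tests should agree.
    assert bool(pat.fullmatch(word)) == is_aba_word(word)
```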

How about the obvious <wink> way?

    (.+)(?!\1)(.+?)\1

The point being that if the negative lookahead assertion succeeds, the
remaining string can't start with a copy of A, so it doesn't matter how many
chars B sucks up:  B can't equal A, else the assertion would have failed.

Example:

>>> import re
>>> p = re.compile(r"(.+)(?!\1)(.+?)\1")
>>> s = "Stichting Mathematisch Centrum, Amsterdam"
>>> i = 0
>>> while True:
...     m = p.search(s, i)
...     if not m:
...         break
...     print("A='%s' B='%s'" % m.groups())
...     i = m.end(0)
...
A='ti' B='ch'
A='n' B='g Mathematisch Ce'
A='t' B='rum, Ams'
>>>
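
The same scan can be written with finditer, which also walks the string in non-overlapping matches from left to right. This is a minimal sketch rather than anything from the original session: aba_matches is a hypothetical helper name, and fullmatch assumes Python 3.4 or later. It also checks the property the lookahead is relied on for, namely that B never begins with A.

```python
import re

p = re.compile(r"(.+)(?!\1)(.+?)\1")

def aba_matches(s):
    """Return (A, B) for each non-overlapping match (hypothetical helper)."""
    return [m.groups() for m in p.finditer(s)]

pairs = aba_matches("Stichting Mathematisch Centrum, Amsterdam")
for a, b in pairs:
    # The lookahead guarantees B does not begin with A, hence B != A.
    assert not b.startswith(a)
print(pairs)

# Observation not made in the post: the assertion rules out a little more
# than B == A. It rejects every split in which B merely starts with A, so
# "aaba" (A='a', B='ab') is not matched as a whole string ...
assert p.fullmatch("aaba") is None
# ... while matching first and comparing the groups afterwards accepts it.
m = re.fullmatch(r"(.+)(.+)\1", "aaba")
assert m.groups() == ("a", "ab") and m.group(1) != m.group(2)
```

Whether that stricter behaviour matters depends on whether B is allowed to begin with a copy of A.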

Cute:  applying that to the paragraph above, it finds A=" suc" at the starts
of " succeeds" and " sucks".  Backreferences are scary.

not-to-mention-irregular-ly y'rs  - tim


PS:  Next try to match ABA where A is not a substring of B.  Then where A is
not a substring of the reversal of B <wink>.  This kind of thing is easy in
Icon.

PPS:  Did you mean to say "and A,B are of [*the same*] arbitrary non-zero
length"?  Icon looks better all the time <wink>.
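
The puzzles in the PS can at least be checked in plain Python, without claiming any regex solution. What follows is a brute-force sketch under one assumption, that the whole string must have the form ABA; aba_splits and the three predicate names are hypothetical helpers, not anything from the thread.

```python
def aba_splits(s, ok):
    """Yield (A, B) for every way to write s as A + B + A with A and B
    non-empty and ok(A, B) true."""
    # A can be at most (len(s) - 1) // 2 chars long, leaving B at least one char.
    for n in range(1, (len(s) - 1) // 2 + 1):
        a, b = s[:n], s[n:len(s) - n]
        if s.endswith(a) and ok(a, b):
            yield a, b

def differs(a, b):
    return a != b

def not_substring(a, b):
    return a not in b

def not_in_reversal(a, b):
    return a not in b[::-1]

print(list(aba_splits("aaba", differs)))             # [('a', 'ab')]
print(list(aba_splits("aaba", not_substring)))       # []
print(list(aba_splits("abxbaab", not_substring)))    # [('ab', 'xba')]
print(list(aba_splits("abxbaab", not_in_reversal)))  # []
```

Passing the condition in as a predicate keeps the enumeration of splits separate from the rule being tested, so each variant in the PS is a one-line function.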





