Backreference within a character class
Tim Peters
tim_one at email.msn.com
Fri Feb 25 03:21:15 EST 2000
[Andrew M. Kuchling]
> If you're trying to match 3-character words with the same letters in
> positions 1 and 3, but not 2, then a lookahead negation would do it:
>
> pat = re.compile(r"(.)(?!\1).\1")
>
> The steps here are 1) matches a character 2) assert that the
> backreference \1 doesn't match at this point 3) consume the character,
> because assertions are zero-width and don't consume any characters,
> and 4) match \1. (Alternatively, if the 3-character string is in
> variable S, 'if (S[0] == S[2] and S[0] != S[1])' would do it.)
>
> On a theoretical plane: If you wanted to match general strings of the
> form ABA, where A!=B and A,B are of arbitrary non-zero length, I think
> this isn't possible with regexes (of either Python or Perl varieties),
> because in step 3 you couldn't consume as many characters as were
> matched by the first group. Anyone see a clever way I've missed?
> (Another jeu d'esprit.) You'd have to do it
> by matching the pattern r"(.+)(.+)\1", and then verifying that group 2
> != group 1 in Python code.
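For anyone following along, both of Andrew's suggestions can be sketched in a few lines. This assumes a Python recent enough to have re.fullmatch (3.4 or later, far newer than this thread); is_aba and find_aba_verified are made-up names for illustration, not anything in the re module.

```python
import re

# Andrew's pattern: one character, a different character, then the first again.
pat = re.compile(r"(.)(?!\1).\1")

def is_aba(word):
    """Plain-Python equivalent for a 3-character string (hypothetical helper)."""
    return len(word) == 3 and word[0] == word[2] and word[0] != word[1]

def find_aba_verified(s):
    """Match-then-verify fallback (hypothetical helper).

    Only the first match the engine finds is checked, so a None result
    does not prove that no other A/B split exists.
    """
    m = re.search(r"(.+)(.+)\1", s)
    if m and m.group(1) != m.group(2):
        return m.groups()
    return None

for word in ["aba", "aaa", "abc"]:
    # The regex and the plain comparison should agree on each word.
    print(word, bool(pat.fullmatch(word)), is_aba(word))

print(find_aba_verified("abcab"))
```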
How about the obvious <wink> way?
    (.+)(?!\1)(.+?)\1
The point being that if the negative lookahead assertion succeeds, \1 can't
match any prefix of the remaining string either, so it doesn't matter how
many chars B sucks up (B can't equal A, else the assertion would have
failed).
Example:
>>> import re
>>> p = re.compile(r"(.+)(?!\1)(.+?)\1")
>>> s = "Stichting Mathematisch Centrum, Amsterdam"
>>> i = 0
>>> while 1:
...     m = p.search(s, i)
...     if not m:
...         break
...     print("A='%s' B='%s'" % m.groups())
...     i = m.end(0)
...
A='ti' B='ch'
A='n' B='g Mathematisch Ce'
A='t' B='rum, Ams'
>>>
Cute: applying that to the paragraph above, it finds A=" suc" at the starts
of " succeeds" and " sucks". Backreferences are scary.
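The same scan can be sketched with finditer, with an assertion that spot-checks the claim about the lookahead on each match. This assumes Python 3; aba_matches is a made-up helper name, and the assertion is only a sanity check, not a proof.

```python
import re

p = re.compile(r"(.+)(?!\1)(.+?)\1")

def aba_matches(s):
    """Return (A, B) pairs for successive non-overlapping matches in s."""
    pairs = []
    for m in p.finditer(s):
        a, b = m.groups()
        # The lookahead fails wherever A matches again right after group 1,
        # so B cannot start with A, and in particular B != A.
        assert not b.startswith(a)
        pairs.append((a, b))
    return pairs

for a, b in aba_matches("Stichting Mathematisch Centrum, Amsterdam"):
    print("A=%r B=%r" % (a, b))
```

On the sample string this should give the same three A/B pairs as the interactive session, since finditer also resumes at the end of the previous match.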
not-to-mention-irregular-ly y'rs - tim
PS: Next try to match ABA where A is not a substring of B. Then where A is
not a substring of the reversal of B <wink>. This kind of thing is easy in
Icon.
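Not a regexp answer to the PS, but for comparison here is a brute-force sketch in plain Python. It assumes short inputs, since it tries every start position, length of A, and length of B. find_aba and its ok parameter are hypothetical names invented for this example.

```python
def find_aba(s, ok=lambda a, b: a not in b):
    """Return the first (A, B) such that A+B+A occurs in s and ok(A, B) is true.

    Candidates are tried by start position, then shortest A, then shortest B.
    Returns None if nothing qualifies.
    """
    n = len(s)
    for start in range(n):
        # A, B, and the second A must all fit, with B non-empty.
        for alen in range(1, (n - start - 1) // 2 + 1):
            a = s[start:start + alen]
            b_start = start + alen
            for blen in range(1, n - start - 2 * alen + 1):
                b = s[b_start:b_start + blen]
                tail = s[b_start + blen:b_start + blen + alen]
                if tail == a and ok(a, b):
                    return a, b
    return None

# Default test: A is not a substring of B.
print(find_aba("abcab"))
# Variant from the PS: A is not a substring of the reversal of B.
print(find_aba("abcab", ok=lambda a, b: a not in b[::-1]))
# The original A != B condition, for comparison.
print(find_aba("aaaa", ok=lambda a, b: a != b))
```

Passing the condition in as a function keeps the search loop the same for all three variants; only the test on (A, B) changes.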
PPS: Did you mean to say "and A,B are of [*the same*] arbitrary non-zero
length"? Icon looks better all the time <wink>.