[Tutor] regular expression question

Tue Apr 28 11:45:53 CEST 2009

On Tue, 28 Apr 2009 11:06:16 +0200,
Marek Spociński at go2.pl,  Poland <marek_sp at 10g.pl> wrote:

> > Hello,
> > 
> > The following code returns 'abc123abc45abc789jk'. How do I revise the
> > pattern so that the return value will be 'abc789jk'? In other words, I
> > want to find the pattern 'abc' that is closest to 'jk'. Here the string
> > '123', '45' and '789' are just examples. They are actually quite
> > different in the string that I'm working with. 
> > 
> > import re
> > s = 'abc123abc45abc789jk'
> > p = r'abc.+jk'
> > lst = re.findall(p, s)
> > print lst[0]
> 
> I suggest using r'abc.+?jk' instead.
> 
> The additional ? makes the preceding '.+' non-greedy, so instead of
> matching as long a string as it can, it matches as short a string as possible.

Non-greedy repetition will not work in this case, I guess:

from re import compile as Pattern
s = 'abc123abc45abc789jk'
p = Pattern(r'abc.+?jk')
print p.match(s).group()
==>
abc123abc45abc789jk

(Someone explain why?)
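
A likely explanation, offered as a sketch rather than a definitive account: the regex engine tries start positions from left to right and returns the first match it finds, so the match always begins at the first 'abc'. The '?' only limits how far '.+' extends to the right from that fixed start; it never moves the start forward. The second string t below is made up for illustration:

```python
import re

s = 'abc123abc45abc789jk'
t = 'abc1jk abc2jk'   # made-up string with two 'jk' endings

# The leftmost 'abc' always wins; laziness cannot shift the start.
lazy_s = re.search(r'abc.+?jk', s).group()

# Laziness does shorten the right-hand end when an earlier 'jk' exists.
lazy_t = re.search(r'abc.+?jk', t).group()
greedy_t = re.search(r'abc.+jk', t).group()

print(lazy_s)    # abc123abc45abc789jk
print(lazy_t)    # abc1jk
print(greedy_t)  # abc1jk abc2jk
```

In s there is only one 'jk', so the lazy and greedy patterns have the same single place to stop.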

My solution would be to explicitly exclude 'abc' from the sequence of characters matched by '.+'. To do this, put a negative lookahead (?!...) before '.':
p = Pattern(r'(abc((?!abc).)+jk)')
print p.findall(s)
==>
[('abc789jk', '9')]

But this is not exactly what you want: the inner () needed to express the exclusion is treated by findall as a group to be returned, so you also get the last character matched inside it.
To avoid that, use non-capturing parentheses (?:...). This also removes the need for parentheses around the whole pattern:
p = Pattern(r'abc(?:(?!abc).)+jk')
print p.findall(s)
==>
['abc789jk']
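
Since your real strings differ from the example, the same idea can be wrapped in a small helper. This is only a sketch: the name closest_span and its parameters are hypothetical, and it assumes the start and end markers are literal text (hence re.escape) and that the part in between contains no newline, because '.' does not match one by default.

```python
import re

def closest_span(text, start, end):
    """Find each `start`...`end` stretch that contains no further `start`."""
    # re.escape makes the markers literal, even if they contain
    # regex metacharacters such as '.' or '+'.
    start_re = re.escape(start)
    end_re = re.escape(end)
    # (?:(?!start).)+ consumes one character at a time, but only where
    # another `start` does not begin at that position.
    pattern = re.compile(start_re + r'(?:(?!' + start_re + r').)+' + end_re)
    return pattern.findall(text)

result = closest_span('abc123abc45abc789jk', 'abc', 'jk')
print(result)  # ['abc789jk']
```

Because the group is non-capturing, findall returns the whole matches as plain strings.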

Denis
------
la vita e estrany