Regular expression question -- exclude substring

Bengt Richter bokr at oz.net
Mon Nov 7 23:08:06 EST 2005


On Mon, 7 Nov 2005 16:38:11 -0800, James Stroud <jstroud at mbi.ucla.edu> wrote:

>On Monday 07 November 2005 16:18, google at fatherfrost.com wrote:
>> Ya, for some reason your non-greedy "?" doesn't seem to be taking.
>> This works:
>>
>> re.sub('(.*)(00.*?01) target_mark', r'\2', your_string)
>
>The non-greedy is actually acting as expected. This is because non-greedy 
>operators are "forward looking", not "backward looking". So the non-greedy 
>finds the start of the first start-of-the-match it comes accross and then 
>finds the first occurrence of '01' that makes the complete match, otherwise 
>the greedy operator would match .* as much as it could, gobbling up all '01's 
>before the last because these match '.*'. For example:
>
>py> rgx = re.compile(r"(00.*01) target_mark")
>py> rgx.findall('00 noise1 01 noise2 00 target 01 target_mark 00 dowhat 01')
>['00 noise1 01 noise2 00 target 01 target_mark 00 dowhat 01']
>py> rgx = re.compile(r"(00.*?01) target_mark")
>py> rgx.findall('00 noise1 01 noise2 00 target 01 target_mark 00 dowhat 01')
>['00 noise1 01 noise2 00 target 01', '00 dowhat 01']
>
>My understanding is that backward looking operators are very resource 
>expensive to implement.
>
If the delimiting strings are fixed, we can use plain python string methods, e.g.,
(not tested beyond what you see ;-)

 >>> s = "00 noise1 01 noise2 00 target 01 target_mark"

 >>> def findit(s, beg='00', end='01', tmk=' target_mark'):
 ...     start = 0
 ...     while True:
 ...         t = s.find(tmk, start)
 ...         if t<0: break
 ...         start = s.rfind(beg, start, t)
 ...         if start<0: break
 ...         e = s.find(end, start, t)
 ...         if e+len(end)==t: # _just_ after
 ...             yield s[start:e+len(end)]
 ...         start = t+len(tmk)
 ...
 >>> list(findit(s))
 ['00 target 01']
 >>> s2 = s + ' garbage noise3 00 almost 01  target_mark 00 success 01 target_mark'
 >>> list(findit(s2))
 ['00 target 01', '00 success 01']

(I didn't enforce exact adjacency the first time, obviously it would be more efficient
to search for end+tmk instead of tmk and back to beg and forward to end ;-)

If there can be spurious target_marks, and tricky matching spans, additional logic may be needed.
Too lazy to think about it ;-)

Regards,
Bengt Richter



More information about the Python-list mailing list