[Tutor] alternation regex--greedy?

Kent Johnson kent37 at tds.net
Fri Nov 4 19:38:35 CET 2005


Chris or Leslie Smith wrote:
> I found something unexpected about regex patterns
> that use alternation: the match is not greedy. The result depends on
> which pattern is listed first.

Yes, this is exactly as described in the docs for "|":
As the target string is scanned, REs separated by "|" are tried from left to right. When one pattern completely matches, that branch is accepted. This means that once A matches, B will not be tested further, even if it would produce a longer overall match. In other words, the "|" operator is never greedy.
> 
> ###
> 
>>>> import re pat=re.compile('this|th') s='thistle' 
>>>> pat.match(s).group(0)
> 
> 'this'
> 
>>>> pat=re.compile('th|this') pat.match(s).group(0)
> 
> 'th' ###
> 
> 
> I read the regex docs about the 'match' function and they say that
> the 'leftmost non-overlapping occurrences of pattern in string' will
> be returned. I can understand why the pattern 'light|sli' (or
> 'sli|light') will match 'sli' in text 'slight' (because 'sli' is more
> to the left than 'light'). But why isn't the greedy result of 'this'
> obtained for either of the patterns given in the example above
> (between the ###)? Is this because it is trying to return a pattern
> that overlaps as little as possible with anything else?

The first pattern does return 'this'.

> Basically what this is meaning for me so far is that a single-pass
> solution to replacing certain text sequences with others cannot
> always be done unless the patterns to be found are non-overlapping
> themselves. In the case of 'light|sli' even though 'light' appears
> first in the alternation, 'sli' will be found with higher priority in
> sequences that had 'slight' in them since 'sli' is more to the left.

If you want to replace 'light' rather than 'sli' in 'slight' then a multi-pass replace is the simplest solution. You could make a pattern that will do this but it won't scale well to many replacements.

You can probably divide your replacements into a few groups such that none of the patterns in group A can overlap, and they all must be done before the replacements in group B, etc. Then apply the replacements in groups rather than each separately.

Kent

-- 
http://www.kentsjohnson.com



More information about the Tutor mailing list