[Tutor] alternation regex--greedy?

Chris or Leslie Smith smiles at worksmail.net
Fri Nov 4 10:05:36 CET 2005


Hello,

I am writing a script to convert from one font set (used to write in devanagari) to another. This consists of converting certain key sequences to others, e.g. Of --> k. To do the conversion process, I've looked at some of the single-pass multireplacement suggestions posted in the Python Cookbook which build a large regex string with find patterns separated with "|", i.e. a large string of alternations (and then they use a dictionary containing the values of the keys that were put in the regex string). 

In the process, I found something unexpected about regex patterns that use alternation: the match is not greedy. The result depends on which pattern is listed first.

###
>>> import re
>>> pat=re.compile('this|th')
>>> s='thistle'
>>> pat.match(s).group(0)
'this'
>>> pat=re.compile('th|this')
>>> pat.match(s).group(0)
'th'
###


I read the regex docs about the 'match' function and they say that the 'leftmost non-overlapping occurrences of pattern in string' will be returned. I can understand why the pattern 'light|sli' (or 'sli|light') will match 'sli' in text 'slight' (because 'sli' is more to the left than 'light'). But why isn't the greedy result of 'this' obtained for either of the patterns given in the example above (between the ###)? Is this because it is trying to return a pattern that overlaps as little as possible with anything else? 

Basically what this is meaning for me so far is that a single-pass solution to replacing certain text sequences with others cannot always be done unless the patterns to be found are non-overlapping themselves. In the case of 'light|sli' even though 'light' appears first in the alternation, 'sli' will be found with higher priority in sequences that had 'slight' in them since 'sli' is more to the left.

Any comments on the non-greediness of the alternation pattern or on the multi-replacement process in general are appreciated.

/c


More information about the Tutor mailing list