Overlapping Regular Expression Matches With findall()

Thu Dec 15 16:31:58 EST 2005

On Thu, 15 Dec 2005 20:33:42 +0000, Simon Brunning <simon at brunningonline.net> wrote:

>On 15 Dec 2005 12:26:07 -0800, Mystilleef <mystilleef at gmail.com> wrote:
>> I want a pattern that scans the entire string but avoids
>> returning duplicate matches. For example "cat", "cate",
>> "cater" may all well be valid matches, but I don't want
>> duplicate matches of any of them. I know I can filter the
>> list containing found matches myself, but that is somewhat
>> expensive for a list containing thousands of matches.
>
>Probably the cheapest way of de-duping the list would be to dump it
>straight into a set, provided that you aren't concerned about the
>order.
>
Or, if the order does matter, maybe try combining finditer() with a set of already-seen matches, like:

 >>> s = """\
 ... I want a pattern that scans the entire string but avoids
 ... returning duplicate matches. For example "cat", "cate",
 ... "cater" may all well be valid matches, but I don't want
 ... duplicate matches of any of them. I know I can filter the
 ... list containing found matches myself, but that is somewhat
 ... expensive for a list containing thousands of matches.
 ... """
 >>> import re
 >>> rxo = re.compile(r'cat(?:er|e)?')
 >>> rxo.findall(s)
 ['cate', 'cat', 'cate', 'cater', 'cate']
 >>> seen = set()
 >>> [w for w in (m.group(0) for m in rxo.finditer(s)) if w not in seen and not seen.add(w)]
 ['cate', 'cat', 'cater']
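
The one-liner works because set.add() returns None, so "not seen.add(w)" is always true and only serves to record w. If that reads as too clever, the same idea can be spelled out as a plain loop. This is just a sketch: the function name unique_matches and the short sample string are made up for illustration.

```python
import re

def unique_matches(pattern, text):
    """Return each distinct match of `pattern` in `text`, in first-seen order."""
    seen = set()      # matches already emitted; membership test is O(1) on average
    result = []
    for m in pattern.finditer(text):
        word = m.group(0)
        if word not in seen:
            seen.add(word)
            result.append(word)
    return result

# Hypothetical sample input, shortened from the session above.
rxo = re.compile(r'cat(?:er|e)?')
sample = 'duplicate "cat", "cate", "cater" duplicate'
print(unique_matches(rxo, sample))
```

Each match costs one set lookup, so the filtering stays roughly linear in the number of matches, which should be fine for a list of thousands.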

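BTW, note that when one alternative is a prefix of another, the longer one has to come first in the re: alternatives are tried left to right, so with r'cat(?:e|er)?' the 'e' branch would succeed first and 'cater' would only ever match as 'cate'.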
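
A small check of that ordering effect, assuming nothing beyond the standard re module (the variable names are only illustrative):

```python
import re

text = 'cater'

# Longer alternative first: 'er' is tried before 'e'.
longest_first = re.compile(r'cat(?:er|e)?')
# Shorter alternative first: 'e' matches and the engine never tries 'er'.
shortest_first = re.compile(r'cat(?:e|er)?')

print(longest_first.findall(text))   # ['cater']
print(shortest_first.findall(text))  # ['cate']
```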

Regards,
Bengt Richter