Extracting repeated words

Ian Kelly ian.g.kelly at gmail.com
Fri Apr 1 18:42:51 EDT 2011


On Fri, Apr 1, 2011 at 2:54 PM, candide <candide at free.invalid> wrote:
> Another question about regular expressions.
>
> How can I extract all duplicated words in a given text using regular
> expression methods?  To make the question concrete, if the text is
>
> ------------------
> Now is better than never.
> Although never is often better than *right* now.
> ------------------
>
> the duplicates are:
>
> ------------------------
> better is now than never
> ------------------------
>
> Some code can solve the question, for instance
>
> # ------------------
> import re
>
> regexp=r"\w+"
>
> c=re.compile(regexp, re.IGNORECASE)
>
> text="""
> Now is better than never.
> Although never is often better than *right* now."""
>
> z=[s.lower() for s in c.findall(text)]
>
> for d in set([s for s in z if z.count(s)>1]):
>    print d,
> # ------------------
>
> but I'm in search of "plain" re code.

You could use a look-ahead assertion with a captured group:

>>> regexp = r'\b(?P<dup>\w+)\b(?=.+\b(?P=dup)\b)'
>>> c = re.compile(regexp, re.IGNORECASE | re.DOTALL)
>>> c.findall(text)
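
For reference, here is the same idea as a self-contained script. It is
only a sketch, written for Python 3 (the quoted code above is Python 2).
The names dup_re, matches and duplicates are mine, and the final
de-duplication step is an addition: the pattern alone reports a word
once for every occurrence that has a later repeat, so a word appearing
three times would show up twice.

```python
import re

text = """
Now is better than never.
Although never is often better than *right* now."""

# A word is captured only if the look-ahead finds the same word later on.
# DOTALL lets ".+" cross line breaks; IGNORECASE lets "Now" match "now".
dup_re = re.compile(r'\b(?P<dup>\w+)\b(?=.+\b(?P=dup)\b)',
                    re.IGNORECASE | re.DOTALL)

# findall returns the text captured by the single group, one per match.
matches = dup_re.findall(text)

# Fold case and drop repeated entries while keeping first-seen order
# (dict keys preserve insertion order in Python 3.7+).
duplicates = list(dict.fromkeys(word.lower() for word in matches))

print(duplicates)  # ['now', 'is', 'better', 'than', 'never']
```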

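But note that this is computationally expensive: for every word, the
look-ahead may have to scan the whole remainder of the text.  The code
that you posted is probably more efficient, especially if you count the
words once with a collections.Counter object instead of calling z.count
for each word.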
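
A minimal sketch of the Counter variant, again assuming Python 3. The
names words, counts and duplicates are only illustrative, and sorting
the result is my choice to make the output order predictable.

```python
import re
from collections import Counter

text = """
Now is better than never.
Although never is often better than *right* now."""

# Lower-case the text first so that "Now" and "now" count as one word.
words = re.findall(r"\w+", text.lower())

# Count every word in a single pass instead of calling list.count()
# once per word.
counts = Counter(words)

# Keep the words seen more than once.
duplicates = sorted(word for word, n in counts.items() if n > 1)

print(duplicates)  # ['better', 'is', 'never', 'now', 'than']
```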

Cheers,
Ian


