Detecting repeated subsequences of identical items

Thu Apr 21 01:37:38 EDT 2016

On Thu, Apr 21, 2016 at 1:07 PM, Steven D'Aprano <steve at pearwood.info> wrote:
> Now I want to group subsequences. For example, I have:
>
> "ABCABCABCDEABCDEFABCABCABCB"
>
> and I want to group it into repeating subsequences. I can see two ways to
> group it:
>
> ABC ABC ABCDE ABCDE F ABC ABC ABC B
>
> or:
>
> ABC ABC ABC D E A B C D E F ABC ABC ABC B

Interesting. I've *almost* managed to (ab)use re.split for this
purpose. A one-step solution can be done with re.match:

>>> txt = "ABCABCABCDEABCDEFABCABCABCB"
>>> re.match(r'(.+)\1+', txt)
<_sre.SRE_Match object; span=(0, 9), match='ABCABCABC'>

But split then returns only the grouped part:

>>> re.split(r'(.+)\1+', txt)
['', 'ABC', 'DEABCDEF', 'ABC', 'B']

or *all* the grouped parts:

>>> re.split(r'((.+)\2+)', txt)
['', 'ABCABCABC', 'ABC', 'DEABCDEF', 'ABCABCABC', 'ABC', 'B']

There's definitely a partial solution happening here, but I can't
quite make it work.

And no, I don't know if there's a standard name for it.

ChrisA