Detecting repeated subsequences of identical items

Thu Apr 21 02:54:04 EDT 2016

2016-04-21 5:07 GMT+02:00 Steven D'Aprano <steve at pearwood.info>:
> I want to group repeated items in a sequence. For example, I can group
> repeated sequences of a single item at a time using groupby:
>
>
> from itertools import groupby
> for key, group in groupby("AAAABBCDDEEEFFFF"):
>     group = list(group)
>     print(key, "count =", len(group))
>
>
> outputs:
>
> A count = 4
> B count = 2
> C count = 1
> D count = 2
> E count = 3
> F count = 4
>
>
> Now I want to group subsequences. For example, I have:
>
> "ABCABCABCDEABCDEFABCABCABCB"
>
> and I want to group it into repeating subsequences. I can see two ways to
> group it:
>
> ABC ABC ABCDE ABCDE F ABC ABC ABC B
>
> giving counts:
>
> (ABC) count = 2
> (ABCDE) count = 2
> F count = 1
> (ABC) count = 3
> B repeats 1 time
>
>
> or:
>
> ABC ABC ABC D E A B C D E F ABC ABC ABC B
>
> giving counts:
>
> (ABC) count = 3
> D count = 1
> E count = 1
> A count = 1
> B count = 1
> C count = 1
> D count = 1
> E count = 1
> F count = 1
> (ABC) count = 3
> B count = 1
>
>
>
> How can I do this? Does this problem have a standard name and/or solution?
>
>
>
>
> --
> Steven
>
> --
> https://mail.python.org/mailman/listinfo/python-list

Hi,
if I am not missing something, the latter form of grouping might be
achieved with the following regex:

t="ABCABCABCDEABCDEFABCABCABCB"
grouped = re.findall(r"((?:(\w+?)\2+)|\w+?)", t)
print(grouped)
for grp, subseq in grouped:
    if subseq:
        print(subseq, grp.count(subseq))
    else:
        print(grp, "1")

the printed output is:

[('ABCABCABC', 'ABC'), ('D', ''), ('E', ''), ('A', ''), ('B', ''),
('C', ''), ('D', ''), ('E', ''), ('F', ''), ('ABCABCABC', 'ABC'),
('B', '')]
ABC 3
D 1
E 1
A 1
B 1
C 1
D 1
E 1
F 1
ABC 3
B 1

The former one seems to be more tricky...

hth,
   vbr