Detecting repeated subsequences of identical items

Steven D'Aprano steve+comp.lang.python at pearwood.info
Thu Apr 21 04:05:40 EDT 2016


On Thursday 21 April 2016 16:35, Michael Selik wrote:

> On Wed, Apr 20, 2016 at 11:11 PM Steven D'Aprano <steve at pearwood.info>
> wrote:
> 
>> I want to group [repeated] subsequences. For example, I have:
>> "ABCABCABCDEABCDEFABCABCABCB"
>> and I want to group it into repeating subsequences. I can see two
>> ways... How can I do this? Does this problem have a standard name and/or
>> solution?
>>
> 
> I'm not aware of a standard name. This sounds like an unsupervised
> learning problem. There's no objectively correct answer unless you add
> more specificity to the problem statement.
> 
> Regexes may sound tempting at first, 

Ah, I may have mislead you all. I cannot use regexes, since the *actual* 
sequences I'm working on are sequences (lists, tuples, etc) or iterators of 
arbitrary items. The items themselves may be strings, but the sequence 
itself is definitely not a string. I just showed a string for convenience. 
Sorry about that.

So no regexes.


> but because a repeating subsequence
> may have nested repeating subsequences and this can go on infinitely, I
> think we at least need a push-down automata.

Fortunately, for my *immediate* problem, I would be good with some fairly 
arbitrary restrictions on subsequence detection.


> I checked out some links for clustering algorithms that work on series
> subsequences and I found some fun results.
> 
> Clustering is meaningless!
> http://www.cs.ucr.edu/~eamonn/meaningless.pdf
> 
> I think you're in "no free lunch" territory. "Clustering of subsequence
> time series remains an open issue in time series clustering"
> http://www.hindawi.com/journals/tswj/2014/312521/
> 
> Any more detail on the problem to add constraints?

The specific problem I am trying to solve is that I have a sequence of 
strings (in this case, error messages from a Python traceback) and I'm 
looking for repeated groups that may indicate mutually recursive calls. E.g. 
suppose I have a function f which calls g, and g calls h, and h calls f 
again, and there's an exception, you will see a traceback in part:


  File "<stdin>", line 2, in f
  File "<stdin>", line 5, in g
  File "<stdin>", line 9, in h
  File "<stdin>", line 2, in f
  File "<stdin>", line 5, in g
  File "<stdin>", line 9, in h
  File "<stdin>", line 2, in f
  File "<stdin>", line 5, in g
  File "<stdin>", line 9, in h
  File "<stdin>", line 7, in f
  File "<stdin>", line 5, in g
  File "<stdin>", line 9, in h

etc. Note that I only care about lines which are identical, e.g. if the line 
numbers differ (as in the last call to f), they will be treated as distinct 
items. So I'd like to group the above as:

  File "<stdin>", line 2, in f
  File "<stdin>", line 5, in g
  File "<stdin>", line 9, in h
  *** above 3 calls repeated 3 times ***
  File "<stdin>", line 7, in f
  File "<stdin>", line 5, in g
  File "<stdin>", line 9, in h



-- 
Steve




More information about the Python-list mailing list