finding repeated data sequences in a column

Wed May 20 13:53:35 EDT 2009

bearophileHUGS at lycos.com wrote:
> yadin:
>> How can I build up a program that tells me that this sequence
>> 1000028706
>> 1000028707
>> 1000028708
>> is repeated somewhere in the column, and how can I know where?
> 
> Can such patterns nest? That is, can you have a repeated pattern made
> of an already seen pattern plus something else?
> If you don't want a complex program, then you may need to specify the
> problem better.
> 
> You may want something like LZ77 or related (LZ78, etc):
> http://en.wikipedia.org/wiki/LZ77
> This may have a bug:
> http://code.activestate.com/recipes/117226/
> 
> Bye,
> bearophile
============================================
index (sort) on column content
Ndx1 is set to index #1
Ndx2 is set to index #2
test Ndx1 against Ndx2
   if equal, write line number and column content of both to a file
           (that's two things on one line:  15 1000028706
                                           283 1000028706 )
   Ndx1 is set to Ndx2
   Ndx2 is set to index #next
loop back to test, writing out each duplicate set

Then use the outfile and index on line number
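
The first pass above could be sketched in Python roughly like this. It is
only an illustration under some assumptions: the column is already in memory
as a list, line numbers are 1-based, and the name duplicate_rows is made up
for the example.

```python
def duplicate_rows(column):
    """Return (line_number, value) pairs for every value seen more than once.

    Assumption: `column` is a list holding one value per line, and line
    numbers start at 1.
    """
    # "index on column": positions ordered by the value they hold
    order = sorted(range(len(column)), key=lambda i: column[i])

    hits = set()
    # walk the index with Ndx1/Ndx2 as neighbours
    for ndx1, ndx2 in zip(order, order[1:]):
        if column[ndx1] == column[ndx2]:
            hits.update((ndx1, ndx2))

    # "index on line number": return the subset in file order
    return [(i + 1, column[i]) for i in sorted(hits)]


column = ["aa", "bb", "cc", "aa", "bb", "ddd", "cc", "ff"]
print(duplicate_rows(column))
# [(1, 'aa'), (2, 'bb'), (3, 'cc'), (4, 'aa'), (5, 'bb'), (7, 'cc')]
```

Sorting puts equal values next to each other, so one comparison per
neighbouring pair is enough to find every duplicate.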

In a similar manner, check whether the current line and the next line have
sequential line numbers. If so, scan forward for a later line whose column
content matches that of the lower line number, then check whether that
matched line and the one following it are sequential as well and hold the
same content as the first pair. Print them out if so.

Everything in the outfile has one or more duplicates.

4  aa               |--
5  bb         |--      |      thus 4/5 match 100/101
6  cc            |     |
.                |     |
100 aa           |  |--
101 bb         |--
102 ddd
103 cc                  there is a duplicate but not a sequence
200 ff

mark duplicate sequences as tested and proceed on through
   seq1 may have more than one matching seq in the file.
   the progress is from start to finish without looking back,
   thus each step forward has fewer lines to test.
   marking already-known sequences eliminates redundant sequence testing.
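
A rough Python sketch of this second pass, using the line numbers from the
little diagram above. The assumptions here are mine: the subset is held as a
dict of line number -> content, a "sequence" means at least two consecutive
lines, and repeated_runs, min_len and the "seen" set (which plays the role of
the "tested" marks) are just illustrative names.

```python
from collections import defaultdict


def repeated_runs(rows, min_len=2):
    """Find runs of consecutive lines that reappear later in the file.

    rows: dict mapping line number -> column content (assumed layout).
    Returns a list of (first_start, second_start, length) tuples.
    """
    # line numbers grouped by content, each group in ascending order
    by_value = defaultdict(list)
    for line_no in sorted(rows):
        by_value[rows[line_no]].append(line_no)

    runs = []
    seen = set()  # (earlier, later) line pairs already marked as tested
    for i in sorted(rows):             # start to finish, never looking back
        for j in by_value[rows[i]]:
            if j <= i or (i, j) in seen:
                continue
            length = 0
            # extend while both lines exist, are consecutive, and match
            while (i + length < j
                   and (i + length) in rows
                   and rows[i + length] == rows.get(j + length)):
                seen.add((i + length, j + length))
                length += 1
            if length >= min_len:
                runs.append((i, j, length))
    return runs


rows = {4: "aa", 5: "bb", 6: "cc", 100: "aa", 101: "bb",
        102: "ddd", 103: "cc", 200: "ff"}
print(repeated_runs(rows))
# [(4, 100, 2)]   i.e. 4/5 match 100/101; cc at 6/103 is only a lone duplicate
```

Because every matched pair goes into "seen", the run starting at 5/101 is not
tested again once 4/100 has been extended over it.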

By subsetting on pass 1, the expensive testing is greatly reduced.
If you know your subset data won't exceed memory, then the "outfile"
can be held in memory to speed things up considerably.
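
If everything does fit in memory, one possible shortcut (not the sort-based
indexing described above, but a hash count swapped in for it) is to build the
subset directly as a dict. The function name duplicate_subset and the
one-value-per-line input are assumptions for the sake of the example.

```python
from collections import Counter


def duplicate_subset(lines):
    """Keep only lines whose content occurs more than once.

    lines: iterable of text lines, one column value per line (assumed).
    Returns a dict of 1-based line number -> value, standing in for the
    "outfile".
    """
    values = [line.strip() for line in lines]
    counts = Counter(values)  # how many times each value occurs
    return {n: v for n, v in enumerate(values, start=1) if counts[v] > 1}


sample = ["1000028706", "1000028707", "1000028708", "42",
          "1000028706", "1000028707", "1000028708"]
print(sorted(duplicate_subset(sample)))
# [1, 2, 3, 5, 6, 7]
```

Line 4 drops out on this pass, so the later sequence testing never has to
look at it.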

Today is: 20090520
no code

Steve