finding repeated data sequences in a column

Thu May 21 03:55:45 EDT 2009

On May 20, 6:53 pm, norseman <norse... at hughes.net> wrote:
> bearophileH... at lycos.com wrote:
> > yadin:
> >> How can I build up a program that tells me that this sequence
> >> 1000028706
> >> 1000028707
> >> 1000028708
> >> is repeated somewhere in the column, and how can i know where?
>
> > Can such patterns nest? That is, can you have a repeated pattern made
> > of an already seen pattern plus something else?
> > If you don't want a complex program, then you may need to specify the
> > problem better.
>
> > You may want something like LZ77 or releated (LZ78, etc):
> >http://en.wikipedia.org/wiki/LZ77
> > This may have a bug:
> >http://code.activestate.com/recipes/117226/
>
> > Bye,
> > bearophile
>
> ============================================
> index on column
> Ndx1 is set to index #1
> Ndx2 is set to index #2
> test Ndx1 against Ndx2
>    if equal write line number and column content to a file
>            (that's two things on one line:  15 1000028706
>                                            283 1000028706 )
>    Ndx1 is set to Ndx2
>    Ndx2 is set to index #next
> loop to test    writing out each duplicate set
>
> Then use the outfile and index on line number
>
> In similar manor, check if line current and next line line numbers are
> sequential.  If so scan forward to match column content of lower line
> number and check first matched column's line number and next for
> sequential.  Print them out if so
>
> everything in outfile has 1 or more duplicates
>
> 4  aa               |--
> 5  bb         |--      |      thus 4/5 match 100/101
> 6  cc            |     |
> .                |     |
> 100 aa           |  |--
> 101 bb         |--
> 102 ddd
> 103 cc                  there is a duplicate but not a sequence
> 200 ff
>
> mark duplicate sequences as tested and proceed on through
>    seq1 may have more than one other seq in file.
>    the progress is from start to finish without looking back
>    thus each step forward has fewer lines to test.
>    marking already knowns eliminates redundant sequence testing.
>
> By subseting on pass1 the expensive testing is greatly reduced.
> If you know your subset data won't exceed memory then the "outfile"
> can be held in memory to speed things up considerably.
>
> Today is: 20090520
> no code
>
> Steve- Hide quoted text -
>
> - Show quoted text -

this is the program...I wrote but is not working
I have a list of valves, and another of pressures;
If I am ask to find out which ones are the valves that are using all
this set of pressures, wanted best pressures
this is the program i wrote but is not working properly, it suppossed
to return in the case
find all the valves that are using pressures 1 "and" 2 "and" 3.
It returns me A, A2, A35....
The correct answer supposed to be A and A2...
if I were asked for pressures 56 and 78 the correct answer supossed to
be valves G and G2...

Valves = ['A','A','A','G', 'G', 'G',
'C','A2','A2','A2','F','G2','G2','G2','A35','A345','A4'] ##valve names
pressures = [1,2,3,4235,56,78,12, 1, 2, 3, 445, 45,56,78,1, 23,7] ##
valve pressures
result = []

bestpress = [1,2,3] ##wanted base pressures
print bestpress,'len bestpress is' , len(bestpress)

print len(Valves)
print len(Valves)
for j in range(len(Valves)):
#for i in range(len(bestpress)):
    #for j in range(len(Valves)):
    for i in range(len(bestpress)-2):
            if pressures [j]== bestpress[i] and bestpress [i+1]
==pressures [j+1] and bestpress [i+2]==pressures [j+2]:
                result.append(Valves[j])
                #i = i+1
                #j = j+1
            # print i, j, bestpress[i]
                print "common PSVs are", result