finding repeated data sequences in a column

Thu May 21 19:27:07 EDT 2009

yadin wrote:
> On May 20, 6:53 pm, norseman <norse... at hughes.net> wrote:
>> bearophileH... at lycos.com wrote:
>>> yadin:
>>>> How can I build up a program that tells me that this sequence
>>>> 1000028706
>>>> 1000028707
>>>> 1000028708
>>>> is repeated somewhere in the column, and how can I know where?
>>> Can such patterns nest? That is, can you have a repeated pattern made
>>> of an already seen pattern plus something else?
>>> If you don't want a complex program, then you may need to specify the
>>> problem better.
>>> You may want something like LZ77 or related (LZ78, etc):
>>> http://en.wikipedia.org/wiki/LZ77
>>> This may have a bug:
>>> http://code.activestate.com/recipes/117226/
>>> Bye,
>>> bearophile
>> ============================================
>> index on column
>> Ndx1 is set to index #1
>> Ndx2 is set to index #2
>> test Ndx1 against Ndx2
>>    if equal write line number and column content to a file
>>            (that's two things on one line:  15 1000028706
>>                                            283 1000028706 )
>>    Ndx1 is set to Ndx2
>>    Ndx2 is set to index #next
>> loop to test    writing out each duplicate set
>>
>> Then use the outfile and index on line number
>>
>> In similar manner, check whether the line numbers of the current line
>> and the next line are sequential.  If so, scan forward to match the column
>> content of the lower line number, and check whether the first matched
>> column's line number and the next are sequential.  Print them out if so.
>>
>> everything in outfile has 1 or more duplicates
>>
>> 4  aa               |--
>> 5  bb         |--      |      thus 4/5 match 100/101
>> 6  cc            |     |
>> .                |     |
>> 100 aa           |  |--
>> 101 bb         |--
>> 102 ddd
>> 103 cc                  there is a duplicate but not a sequence
>> 200 ff
>>
>> mark duplicate sequences as tested and proceed on through
>>    seq1 may have more than one other seq in file.
>>    the progress is from start to finish without looking back
>>    thus each step forward has fewer lines to test.
>>    marking already knowns eliminates redundant sequence testing.
>>
>> By subsetting on pass1 the expensive testing is greatly reduced.
>> If you know your subset data won't exceed memory then the "outfile"
>> can be held in memory to speed things up considerably.
>>
>> Today is: 20090520
>> no code
>>
>> Steve
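
A minimal sketch of the "find the repeated sequence and where it is"
question quoted above, in Python 3 syntax. It is not the index-and-compare
procedure outlined in the quoted text, and it is not LZ77: it simply assumes
the run length is known in advance and records where each run of that length
starts. The name repeated_runs and the sample column are illustrative only.

```python
from collections import defaultdict

def repeated_runs(column, length):
    """Map each run of `length` consecutive values to its 0-based start
    positions, keeping only the runs that occur more than once."""
    positions = defaultdict(list)
    for start in range(len(column) - length + 1):
        run = tuple(column[start:start + length])
        positions[run].append(start)
    return {run: starts for run, starts in positions.items() if len(starts) > 1}

# Sample column loosely following the aa/bb/cc diagram above (assumed data).
column = ["aa", "bb", "cc", "xx", "aa", "bb", "ddd", "cc", "ff"]
print(repeated_runs(column, 2))
```

Here "cc" is a duplicate value but is not reported, because it never
repeats as part of a two-item run. Runs of unknown length would need a
different approach, such as trying several lengths or an LZ-style scan.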
> 
> This is the program I wrote, but it is not working.
> I have a list of valves and another of pressures.
> I am asked to find which valves are using every pressure in a given
> set of wanted ("best") pressures.
> The program is supposed to find all the valves that are using
> pressures 1 "and" 2 "and" 3, but it is not working properly.
> It returns A, A2, A35....

looking at the data that seems correct.
there are 3 '1's in the list, 1-A, 1-A2, 1-A35
there are 2 '2's in the list, 2-A, 2-A2
there are 2 '3's in the list, 3-A, 3-A2
and so on

after the two sets are paired
indexing on the right yields 1-A,2-A,3-A,1-A2,2-A2,3-A2,7-A4...
indexing on the left  yields 1-A,1-A2,1-A35,2-A,2-A2,3-A,3-A2,7-A4...
and the two 78s would pair with a G and with a G2  (78-G, 78-G2)
beyond that I'm a bit lost.
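
A minimal sketch of the pairing and left-hand indexing described above,
in Python 3 syntax (the quoted program uses Python 2 print statements).
The data is copied from the quoted program; the names pairs, by_pressure
and valves_at are illustrative and do not appear in it.

```python
from collections import defaultdict

valves = ['A', 'A', 'A', 'G', 'G', 'G', 'C', 'A2', 'A2', 'A2',
          'F', 'G2', 'G2', 'G2', 'A35', 'A345', 'A4']
pressures = [1, 2, 3, 4235, 56, 78, 12, 1, 2, 3, 445, 45, 56, 78, 1, 23, 7]

# Pair each pressure with the valve at the same position.
pairs = list(zip(pressures, valves))

# "Indexing on the left": order the pairs by pressure, then by valve name.
by_pressure = sorted(pairs)

# Group the valve names under each pressure value.
valves_at = defaultdict(list)
for pressure, valve in pairs:
    valves_at[pressure].append(valve)

print(valves_at[1])   # valves recorded at pressure 1
print(valves_at[78])  # valves recorded at pressure 78
```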

20090521 Steve

> The correct answer is supposed to be A and A2...
> if I were asked for pressures 56 and 78, the correct answer is supposed to
> be valves G and G2...
> 
> Valves = ['A','A','A','G', 'G', 'G',
> 'C','A2','A2','A2','F','G2','G2','G2','A35','A345','A4'] ##valve names
> pressures = [1,2,3,4235,56,78,12, 1, 2, 3, 445, 45,56,78,1, 23,7] ## valve pressures
> result = []
> 
> bestpress = [1,2,3] ##wanted base pressures
> print bestpress,'len bestpress is' , len(bestpress)
> 
> print len(Valves)
> print len(Valves)
> for j in range(len(Valves)):
> #for i in range(len(bestpress)):
>     #for j in range(len(Valves)):
>     for i in range(len(bestpress)-2):
>             if pressures [j]== bestpress[i] and bestpress [i+1] ==pressures [j+1] and bestpress [i+2]==pressures [j+2]:
>                 result.append(Valves[j])
>                 #i = i+1
>                 #j = j+1
>             # print i, j, bestpress[i]
>                 print "common PSVs are", result
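
One possible way to get the answers asked for above, as a hedged sketch in
Python 3 syntax. It assumes "using all the wanted pressures" means that every
wanted value appears somewhere among a valve's readings, in any order, rather
than as a consecutive run in the list. The function name valves_using_all is
made up for this example; the data is copied from the quoted program.

```python
def valves_using_all(valves, pressures, wanted):
    """Return the valves whose recorded pressures include every wanted value."""
    seen = {}  # valve name -> set of pressures recorded for that valve
    for valve, pressure in zip(valves, pressures):
        seen.setdefault(valve, set()).add(pressure)
    wanted = set(wanted)
    # Dicts keep insertion order (Python 3.7+), so valves come out in the
    # order they first appear in the input list.
    return [valve for valve, used in seen.items() if wanted <= used]

valves = ['A', 'A', 'A', 'G', 'G', 'G', 'C', 'A2', 'A2', 'A2',
          'F', 'G2', 'G2', 'G2', 'A35', 'A345', 'A4']
pressures = [1, 2, 3, 4235, 56, 78, 12, 1, 2, 3, 445, 45, 56, 78, 1, 23, 7]

print("common PSVs are", valves_using_all(valves, pressures, [1, 2, 3]))
print("common PSVs are", valves_using_all(valves, pressures, [56, 78]))
```

With this data the first call gives A and A2, and the second gives G and G2.
A35 is left out because it only ever records pressure 1.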