finding repeated data sequences in a column

Thu May 21 04:41:44 EDT 2009

yadin wrote:

> On May 20, 6:53 pm, norseman <norse... at hughes.net> wrote:
>> bearophileH... at lycos.com wrote:
>> > yadin:
>> >> How can I build up a program that tells me that this sequence
>> >> 1000028706
>> >> 1000028707
>> >> 1000028708
>> >> is repeated somewhere in the column, and how can i know where?
>>
>> > Can such patterns nest? That is, can you have a repeated pattern made
>> > of an already seen pattern plus something else?
>> > If you don't want a complex program, then you may need to specify the
>> > problem better.
>>
>> > You may want something like LZ77 or releated (LZ78, etc):
>> >http://en.wikipedia.org/wiki/LZ77
>> > This may have a bug:
>> >http://code.activestate.com/recipes/117226/
>>
>> > Bye,
>> > bearophile
>>
>> ============================================
>> index on column
>> Ndx1 is set to index #1
>> Ndx2 is set to index #2
>> test Ndx1 against Ndx2
>> if equal write line number and column content to a file
>> (that's two things on one line:  15 1000028706
>> 283 1000028706 )
>> Ndx1 is set to Ndx2
>> Ndx2 is set to index #next
>> loop to test    writing out each duplicate set
>>
>> Then use the outfile and index on line number
>>
>> In similar manor, check if line current and next line line numbers are
>> sequential.  If so scan forward to match column content of lower line
>> number and check first matched column's line number and next for
>> sequential.  Print them out if so
>>
>> everything in outfile has 1 or more duplicates
>>
>> 4  aa               |--
>> 5  bb         |--      |      thus 4/5 match 100/101
>> 6  cc            |     |
>> .                |     |
>> 100 aa           |  |--
>> 101 bb         |--
>> 102 ddd
>> 103 cc                  there is a duplicate but not a sequence
>> 200 ff
>>
>> mark duplicate sequences as tested and proceed on through
>> seq1 may have more than one other seq in file.
>> the progress is from start to finish without looking back
>> thus each step forward has fewer lines to test.
>> marking already knowns eliminates redundant sequence testing.
>>
>> By subseting on pass1 the expensive testing is greatly reduced.
>> If you know your subset data won't exceed memory then the "outfile"
>> can be held in memory to speed things up considerably.
>>
>> Today is: 20090520
>> no code
>>
>> Steve- Hide quoted text -
>>
>> - Show quoted text -
> 
> this is the program...I wrote but is not working
> I have a list of valves, and another of pressures;
> If I am ask to find out which ones are the valves that are using all
> this set of pressures, wanted best pressures
> this is the program i wrote but is not working properly, it suppossed
> to return in the case
> find all the valves that are using pressures 1 "and" 2 "and" 3.
> It returns me A, A2, A35....
> The correct answer supposed to be A and A2...
> if I were asked for pressures 56 and 78 the correct answer supossed to
> be valves G and G2...
> 
> Valves = ['A','A','A','G', 'G', 'G',
> 'C','A2','A2','A2','F','G2','G2','G2','A35','A345','A4'] ##valve names
> pressures = [1,2,3,4235,56,78,12, 1, 2, 3, 445, 45,56,78,1, 23,7] ##
> valve pressures
> result = []
> 
> bestpress = [1,2,3] ##wanted base pressures
> print bestpress,'len bestpress is' , len(bestpress)
> 
> print len(Valves)
> print len(Valves)
> for j in range(len(Valves)):
> #for i in range(len(bestpress)):
>     #for j in range(len(Valves)):
>     for i in range(len(bestpress)-2):
>             if pressures [j]== bestpress[i] and bestpress [i+1]
> ==pressures [j+1] and bestpress [i+2]==pressures [j+2]:
>                 result.append(Valves[j])
>                 #i = i+1
>                 #j = j+1
>             # print i, j, bestpress[i]
>                 print "common PSVs are", result

If I understand your explanation correctly you don't actually care about the 
sequence, you just want all valves that show the pressures specified in 
bestpress. If that's correct your problem becomes easier. You can make a 
lookup table that maps a given pressure to a set of all valves that had it 
and then calculate the intersection.

from collections import defaultdict

def intersection(sets):
    return reduce(set.intersection, sets)

valves = ['A', 'A', 'A', 'G', 'G', 'G', 'C', 'A2', 'A2', 'A2', 'F', 'G2', 
'G2', 'G2', 'A35', 'A345', 'A4']
pressures = [1, 2, 3, 4235, 56, 78, 12, 1, 2, 3, 445, 45, 56, 78, 1, 23, 7]

lookup = defaultdict(set)
for v, p in zip(valves, pressures):
    lookup[p].add(v)

result = []

wanted_pressures = [1, 2, 3]
print wanted_pressures, "-->", intersection(lookup[p] for p in 
wanted_pressures)

Peter