Struggling for inspiration with lists

Tue Dec 17 23:50:36 EST 2013

On 2013-12-18 03:51, Denis McMahon wrote:
> I need to keep the timestamp / data association because I need to 
> generate output that identifies (a) the longest repeated sequence
> (b) how many elements in the longest repeated sequence (c) at what
> timestamps each occurrence started.
> 
> I'm not sure of the best way, programatically, to aproach this
> task, which means I'm unsure whether eg a list of tuples ( time,
> data ) or an OrderedDict keyed on the timestamp is the best
> starting point.
[snip] 
> Is there a better way to do this?

You might have to define your criteria for "best".  Things that
might play into it:

- what happens if there's more than one "longest" sequence?

- can you specify a minimum length that you don't have to track? (a
  maximum sequence of length-1 might be uninteresting)

- how large can your input data grow to be?  If it's huge, you're
  looking at searching a problem space something like O(N^2) or
  possibly O(N log N) if you're lucky.  Which in turn can involve
  lots of memory consumption.

- do you just want to know what that sequence is, or do you want to
  know all the locations/offsets where it the longest sequence was
  found?

I've played with two simple versions, the first is naive, while the
second tries to use a little bit of smarts to eliminate possibilities
when you know you have at least one match of a certain length.  The
result of the function does allow for finding a plurality of longest
sequences, but my demo code just snags one with the max() function.
I leave the "find multiple maxes" as an exercise for the reader. :)

-tkc

def build1(d):
    sequences = {} # sequence -> [list of start offsets]
    for i in range(len(d)):
        for j in range(i+1, len(d)):
            key = tuple(value for timestamp, value in d[i:j])
            sequences.setdefault(key, []).append(i)
    return sequences

def build2(d):
    sequences = {} # sequence -> [list of start offsets]
    longest = 1
    for i in range(len(d)):
        for j in range(i+longest, len(d)):
            key = tuple(value for timestamp, value in d[i:j])
            if key in sequences:
                sequences[key].append(i)
                if len(sequences[key]) > longest:
                    longest = len(sequences[key])
                    print longest
            else:
                sequences[key] = [i]
    return sequences

data = {
    0: "d", 1: "x", 2: "y", 3: "t", 4: "d", 5: "y", 77: "g",
    78: "h", 79: "x", 80: "y", 206: "t", 210: "d", 211: "x",
    }
d = data.items()
d.sort()

for fn in (build1, build2):
    print fn.__name__
    sequences = fn(d)
    longest_key, offsets = max((
        pair
        for pair in sequences.iteritems()
        if len(pair[1]) > 1
        ), key=lambda (k,v): len(k))
    print "Longest sequence %r found at %s" % (
        longest_key,
        [d[i][0] for i in offsets],
        )
    for k,v in sequences.iteritems():
        if len(v) > 1:
            print "%s: %r" % (k,v)
    print '=' * 50

.